How to remove rows from pyspark dataframe using pattern matching?


I have a pyspark dataframe read from a CSV file that has a value column which contains hexadecimal values.

| date     | part  | feature | value"       |
|----------|-------|---------|--------------|
| 20190503 | par1  | feat2   | 0x0          |
| 20190503 | par1  | feat3   | 0x01         |
| 20190501 | par2  | feat4   | 0x0f32       |
| 20190501 | par5  | feat9   | 0x00         |
| 20190506 | par8  | feat2   | 0x00f45      |
| 20190507 | par1  | feat6   | 0x0e62300000 |
| 20190501 | par11 | feat3   | 0x000000000  |
| 20190501 | par21 | feat5   | 0x03efff     |
| 20190501 | par3  | feat9   | 0x000        |
| 20190501 | par6  | feat5   | 0x000000     |
| 20190506 | par5  | feat8   | 0x034edc45   |
| 20190506 | par8  | feat1   | 0x00000      |
| 20190508 | par3  | feat6   | 0x00000000   |
| 20190503 | par4  | feat3   | 0x0c0deffe21 |
| 20190503 | par6  | feat4   | 0x0000000000 |
| 20190501 | par3  | feat6   | 0x0123fe     |
| 20190501 | par7  | feat4   | 0x00000d0    |

The requirement is to remove rows whose value column contains values such as 0x0, 0x00, 0x000, etc., which all evaluate to decimal 0 (zero). The number of zeros after '0x' varies across the dataframe. I tried to remove them through pattern matching, but wasn't successful.

myFile = sc.textFile("file.txt")
header = myFile.first()

fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]

myFile_header = myFile.filter(lambda l: "date" in l)
myFile_NoHeader = myFile.subtract(myFile_header)

myFile_df = myFile_NoHeader.map(lambda line: line.split(",")).toDF(schema)

## this is the pattern match I tried 
result = myFile_df.withColumn('Test', regexp_extract(col('value'), '(0x)(0\1*\1*)',2 ))
result.show()
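
For reference, the same load can be done with the DataFrame reader, and regexp_extract can be used to flag the all-zero values directly; a minimal sketch assuming Spark 2.x and the same "file.txt" as above (the is_zero helper column is only illustrative):

from pyspark.sql.functions import col, regexp_extract

# header=True lets Spark consume the header row directly
df = spark.read.csv("file.txt", header=True)

# group 1 of '^0x(0+)$' is non-empty only when the value is 0x followed solely by zeros
flagged = df.withColumn('is_zero', regexp_extract(col('value'), r'^0x(0+)$', 1))
result = flagged.filter(flagged['is_zero'] == '').drop('is_zero')
result.show()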

The other approach I tried was a udf:

def convert_value(x):
    return int(x,16)

Using this udf in pyspark gives me

ValueError: invalid literal for int() with base 16: value
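
The error suggests that the literal header string "value" reached int(), for example because the header row was still present in the data, or because the plain Python function was applied to the column name instead of being wrapped with udf. A minimal sketch of a udf-based filter, assuming the header row is consumed by the reader (Python's int() with base 16 accepts the 0x prefix):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import LongType

# wrap the conversion so Spark passes each cell value, not the column name
convert_value = udf(lambda x: int(x, 16), LongType())

df = spark.read.csv("file.txt", header=True)
result = df.filter(convert_value(col('value')) != 0)
result.show()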

pyspark
asked on Stack Overflow May 12, 2019 by user10565646 • edited May 13, 2019 by Gino Mempin

1 Answer


I don't really understand your regular expression, but if you want to match all strings that consist of 0x followed by any number of zeros, you can use ^0x0+$. Filtering with a regular expression can be done with rlike, and the tilde negates the match.

l = [('20190503', 'par1', 'feat2', '0x0'),
     ('20190503', 'par1', 'feat3', '0x01'),
     ('20190501', 'par2', 'feat4', '0x0f32'),
     ('20190501', 'par5', 'feat9', '0x00'),
     ('20190506', 'par8', 'feat2', '0x00f45'),
     ('20190507', 'par1', 'feat6', '0x0e62300000'),
     ('20190501', 'par11', 'feat3', '0x000000000'),
     ('20190501', 'par21', 'feat5', '0x03efff'),
     ('20190501', 'par3', 'feat9', '0x000'),
     ('20190501', 'par6', 'feat5', '0x000000'),
     ('20190506', 'par5', 'feat8', '0x034edc45'),
     ('20190506', 'par8', 'feat1', '0x00000'),
     ('20190508', 'par3', 'feat6', '0x00000000'),
     ('20190503', 'par4', 'feat3', '0x0c0deffe21'),
     ('20190503', 'par6', 'feat4', '0x0000000000'),
     ('20190501', 'par3', 'feat6', '0x0123fe'),
     ('20190501', 'par7', 'feat4', '0x00000d0')]

columns = ['date', 'part', 'feature', 'value']

df = spark.createDataFrame(l, columns)

expr = "^0x0+$"
df.filter(~ df["value"].rlike(expr)).show()

Output:

+--------+-----+-------+------------+ 
|    date| part|feature|       value| 
+--------+-----+-------+------------+ 
|20190503| par1|  feat3|        0x01| 
|20190501| par2|  feat4|      0x0f32| 
|20190506| par8|  feat2|     0x00f45| 
|20190507| par1|  feat6|0x0e62300000| 
|20190501|par21|  feat5|    0x03efff| 
|20190506| par5|  feat8|  0x034edc45| 
|20190503| par4|  feat3|0x0c0deffe21| 
|20190501| par3|  feat6|    0x0123fe| 
|20190501| par7|  feat4|   0x00000d0| 
+--------+-----+-------+------------+
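
If the values might also appear in upper case, or you prefer to filter on the numeric value rather than the string pattern, the built-in conv function offers an alternative; a sketch against the same df, where the (?i) flag and the comparison to "0" are assumptions about the data:

from pyspark.sql import functions as F

# strip the 0x prefix (case-insensitively), convert base 16 -> base 10, keep non-zero rows
hex_digits = F.regexp_replace(F.col('value'), '(?i)^0x', '')
df.filter(F.conv(hex_digits, 16, 10) != '0').show()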
answered on Stack Overflow May 12, 2019 by cronoik

User contributions licensed under CC BY-SA 3.0