PySpark: Regex Replace Group

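You can do this with the expr function, which lets you write the call as a SQL expression so the replacement can reference another column.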

Here regexp_extract pulls the first 4 digits out of the dataset column, and regexp_replace swaps the last 4 digits of the topic column for that extracted value.

Regex for first 4 digits: (^[0-9]{4})
Regex for last 4 digits: ([0-9]{4}$)
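
To sanity-check the two patterns outside Spark, here is a minimal sketch using Python's built-in re module. Spark evaluates these patterns with Java regex, but for patterns this simple the behaviour should be the same. The helper name replace_year and the sample topic value ending in 2018 are assumptions for illustration only; the original topic values are not shown in this answer.

```python
import re

FIRST_FOUR = r"(^[0-9]{4})"  # four digits anchored at the start of the string
LAST_FOUR = r"([0-9]{4}$)"   # four digits anchored at the end of the string


def replace_year(dataset: str, topic: str) -> str:
    """Hypothetical helper: mimic regexp_extract followed by regexp_replace."""
    match = re.search(FIRST_FOUR, dataset)
    # Like regexp_extract, fall back to an empty string when nothing matches.
    year = match.group(1) if match else ""
    # A callable replacement keeps `year` literal (no backreference parsing).
    return re.sub(LAST_FOUR, lambda _: year, topic)


# Assumed sample input: dataset "2020A", topic with a different trailing year.
print(replace_year("2020A", "athensEarthquake2018"))  # athensEarthquake2020
```

A topic with no trailing 4 digits comes back unchanged, and a dataset value that does not start with 4 digits causes the trailing digits of topic to be removed, which is worth keeping in mind before running this on real data.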

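from pyspark.sql.functions import expr

# Pull the leading 4-digit year out of `dataset` into a temporary column
# (group index 1 is the default, written out here for clarity), then
# substitute it for the trailing 4 digits of `topic` and drop the helper column.
(
    df.withColumn("dataset_year", expr("regexp_extract(dataset, '(^[0-9]{4})', 1)"))
    .withColumn("topic", expr("regexp_replace(topic, '([0-9]{4}$)', dataset_year)"))
    .drop("dataset_year")
    .show(truncate=False)
)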

+-------+-------------------+----------------------------+
|dataset|id                 |topic                       |
+-------+-------------------+----------------------------+
|2020A  |1128290566331031552|papuaNewguineaEarthquake2020|
|2020A  |1128293303659716608|papuaNewguineaEarthquake2020|
|2020A  |1152200235847966726|athensEarthquake2020        |
|2020A  |1152204892083281920|athensEarthquake2020        |
|2020A  |1152220394008522753|athensEarthquake2020        |
+-------+-------------------+----------------------------+
