You can use the expr function. I'm using regexp_extract to extract the first 4 digits from the dataset column, and regexp_replace to replace the last 4 digits of the topic column with the output of regexp_extract.
Regex for first 4 digits: (^[0-9]{4})
Regex for last 4 digits: ([0-9]{4}$)
from pyspark.sql.functions import expr
# Extract the leading year into a temporary column, use it as the replacement, then drop it.
df.withColumn("dataset_year", expr("regexp_extract(dataset, '(^[0-9]{4})', 1)")) \
    .withColumn("topic", expr("regexp_replace(topic, '([0-9]{4}$)', dataset_year)")) \
    .drop("dataset_year") \
    .show(truncate=False)
+-------+-------------------+----------------------------+
|dataset|id                 |topic                       |
+-------+-------------------+----------------------------+
|2020A  |1128290566331031552|papuaNewguineaEarthquake2020|
|2020A  |1128293303659716608|papuaNewguineaEarthquake2020|
|2020A  |1152200235847966726|athensEarthquake2020        |
|2020A  |1152204892083281920|athensEarthquake2020        |
|2020A  |1152220394008522753|athensEarthquake2020        |
+-------+-------------------+----------------------------+
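If you want to check what the two patterns match without starting a Spark session, here is a minimal sketch in plain Python using the re module. Spark evaluates these patterns with Java regex rather than Python's, but for simple patterns like these the behaviour should be the same. The helper name replace_trailing_year and the input topic value are hypothetical, chosen only for illustration.

```python
import re

FIRST_FOUR = r"(^[0-9]{4})"  # same pattern passed to regexp_extract
LAST_FOUR = r"([0-9]{4}$)"   # same pattern passed to regexp_replace


def replace_trailing_year(dataset: str, topic: str) -> str:
    """Replace the last 4 digits of topic with the first 4 digits of dataset."""
    match = re.search(FIRST_FOUR, dataset)
    # regexp_extract yields an empty string when nothing matches; mirror that here.
    year = match.group(1) if match else ""
    return re.sub(LAST_FOUR, year, topic)


# Hypothetical input row: the year in topic differs from the year in dataset.
print(replace_trailing_year("2020A", "athensEarthquake2019"))
```

A topic that does not end in 4 digits is returned unchanged, because the second pattern finds nothing to replace.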