Add columns to a PySpark DataFrame if they do not exist

Check whether each column name is in row_df.columns, and add it with withColumn only if it is missing:

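from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

# Reuse the active session (e.g. in a notebook or shell) or start a new one.
spark = SparkSession.builder.getOrCreate()

# Build a sample DataFrame that already has a 'gender' column.
row = Row('id', 'Name', 'age', 'gender')
row_df = spark.createDataFrame(
    [row(1, 'Test', '12', 'Male'), row(2, 'Test2', '15', 'Female')])
row_df.show()

# Add each column only if it is missing. F.lit(None) creates an untyped
# null column; append .cast('string') (or another type) for a concrete type.
if 'gender' not in row_df.columns:
    row_df = row_df.withColumn('gender', F.lit(None))
if 'city' not in row_df.columns:
    row_df = row_df.withColumn('city', F.lit(None))
if 'contact' not in row_df.columns:
    row_df = row_df.withColumn('contact', F.lit(None))

row_df.show()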

Output (before and after adding the missing columns):

+---+-----+---+------+
| id| Name|age|gender|
+---+-----+---+------+
|  1| Test| 12|  Male|
|  2|Test2| 15|Female|
+---+-----+---+------+

+---+-----+---+------+----+-------+
| id| Name|age|gender|city|contact|
+---+-----+---+------+----+-------+
|  1| Test| 12|  Male|null|   null|
|  2|Test2| 15|Female|null|   null|
+---+-----+---+------+----+-------+
