Join different DataFrames using a loop in PySpark

Several things look wrong with the code, or perhaps you have not posted all of it.

  1. Have you defined fullpath?
  2. You have set header=False, so how will Spark know that there is an “id” column? Without a header row, Spark names the columns _c0, _c1, and so on.
  3. Your indentation looks wrong under the for loop.
  4. full_data has not been defined yet, so how can you use it on the right-hand side of the assignment inside the for loop? I suspect you initialized it to the first CSV file and are then joining it with that same first CSV again.

I ran a small test with the code below, which worked for me and addresses the points raised above. You can adjust it to your needs.

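from pyspark.sql import SparkSession

# Reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

fullpath = '/content/sample_data/'

# Start from the first file so full_data exists before the loop
full_data = spark.read.csv(fullpath + 'Book1.csv', header=True, inferSchema=True)

# Join each remaining file onto the running result on the shared "id" column
name_file = ['Book2', 'Book3']
for name in name_file:
    df = spark.read.csv(fullpath + name + '.csv', header=True, inferSchema=True)
    full_data = full_data.join(df, ['id'])

full_data.show(5)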
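
If you would rather not manage the running full_data variable by hand, the same left-to-right join can be written as a fold with functools.reduce. This is only a sketch of an alternative, not something from your original code: join_all is a hypothetical helper name, and the how="inner" default is my assumption (it matches what join does when no join type is given).

```python
from functools import reduce


def join_all(frames, on, how="inner"):
    """Join a sequence of DataFrames left to right on the given column(s).

    join_all is a hypothetical helper; it only relies on each item having a
    DataFrame-style .join(other, on=..., how=...) method.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame to join")
    # reduce starts with the first frame and joins each later one onto it,
    # which is what the explicit for loop does with full_data
    return reduce(lambda left, right: left.join(right, on=on, how=how), frames)
```

With the names used in the test, you would read the three files into a list first, for example [spark.read.csv(fullpath + n + '.csv', header=True, inferSchema=True) for n in ['Book1', 'Book2', 'Book3']], and then call join_all(that_list, ['id']). Note that an inner join drops any id that is missing from one of the files, so pass a different how if you need to keep those rows.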
