How can I efficiently select partial data from RDBMS tables in PySpark?

It turns out that if I specify only the table name in table_source, Spark loads the entire table into the cluster.

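For reference, a minimal sketch of that "table name only" read (the connection values here are placeholders, the same ones used in the answer below):

# Sketch of the naive approach: dbtable is just the table name, so the whole
# table is transferred before any work happens on the cluster
df_all = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", "get_specific_data") \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()
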
To select only the specific data I need, I can pass an aliased subquery as the dbtable option, like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Wrap the SELECT in parentheses and give it an alias so it is valid as "dbtable";
# the WHERE clause is then executed by PostgreSQL, not by Spark
last_update = '2020-10-15 00:00:00'
table_source = "(SELECT employee_id, employee_name, department, created_at, updated_at " \
               "FROM get_specific_data WHERE created_at > '{0}') t1".format(last_update)

# user_source and password_source hold the database credentials;
# {ip_address}/{database} are placeholders for the real connection values
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()

# And then write the filtered data to a parquet file
df.write.parquet("gs://{bucket_name}/{target_directory}/")
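
As a side note, on Spark 2.4 and later the same pushdown can also be expressed with the query option instead of wrapping the statement as an aliased subquery. A minimal sketch with the same placeholder connection values (query cannot be combined with dbtable or with the partitioning options):

# Sketch only: pass the SELECT directly via "query" (Spark 2.4+)
query_source = "SELECT employee_id, employee_name, department, created_at, updated_at " \
               "FROM get_specific_data WHERE created_at > '{0}'".format(last_update)

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("query", query_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()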
