It turns out that if I specify only the table name in table_source, Spark loads the entire table into the cluster. To pull in only the specific data I need, I can pass a subquery as the dbtable option instead, like this:
last_update = '2020-10-15 00:00:00'
table_source = "(SELECT employee_id, employee_name, department, created_at, updated_at " \
"FROM get_specific_data WHERE created_at > '{0}') t1".format(last_update)
df = spark.read.format("jdbc") \
.option("url", "jdbc:postgresql://{ip_address}/{database}") \
.option("dbtable", table_source) \
.option("user", user_source) \
.option("password", password_source) \
.option("driver", "org.postgresql.Driver") \
.load()
# Then write the data to a parquet file
df.write.parquet("gs://{bucket_name}/{target_directory}/")
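As an aside, on Spark 2.4 or newer the JDBC reader also accepts a query option, so the same filter can be pushed down to Postgres without wrapping the SELECT in a derived table. The sketch below is only illustrative: the query_source name is mine, and the connection placeholders are the same assumptions as above. Note that query and dbtable cannot be used at the same time, and query does not work with partitionColumn.

# Minimal sketch (assumes Spark 2.4+): pass the SELECT via the "query" option
query_source = "SELECT employee_id, employee_name, department, created_at, updated_at " \
               "FROM get_specific_data WHERE created_at > '{0}'".format(last_update)

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://{ip_address}/{database}") \
    .option("query", query_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "org.postgresql.Driver") \
    .load()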