Use cases for multiple queries in Spark Structured Streaming

You can use a regex pattern (the subscribePattern option) to consume data from multiple Kafka topics.

Let's say you have topic names like “topic-ingestion1” and “topic-ingestion2” – you can then use a regex pattern such as “topic-ingestion.*” to consume data from all topics matching that pattern.

Once a new topic matching your regex pattern gets created, Spark will automatically start streaming data from it.
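Here is a minimal sketch of such a read; the broker address, app name, and the exact pattern are placeholders, not values from the original post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-topic-ingestion")
  .getOrCreate()

// subscribePattern takes a Java regex; "topic-ingestion.*" matches
// topic-ingestion1, topic-ingestion2, and any future topic with that prefix.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribePattern", "topic-ingestion.*")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```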

Reference: [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#consumer-caching]

You can use the “spark.kafka.consumer.cache.timeout” parameter to specify the consumer cache timeout.

From the Spark documentation:

spark.kafka.consumer.cache.timeout – The minimum amount of time a consumer may sit idle in the pool before it is eligible for eviction by the evictor.
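A sketch of setting it when building the session; the app name and the 10-minute value are illustrative, not a recommendation:

```scala
import org.apache.spark.sql.SparkSession

// The timeout accepts Spark time strings such as "5m" or "10m".
val spark = SparkSession.builder()
  .appName("kafka-consumer-cache-tuning")
  .config("spark.kafka.consumer.cache.timeout", "10m") // illustrative value
  .getOrCreate()
```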

Let's say you have multiple sinks – you read from Kafka and write into two different locations such as HDFS and HBase – then you can branch your application logic into two writeStream queries, as sketched below.
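A sketch of that branching, reusing the kafkaDf defined earlier; the paths, checkpoint locations, and the writeToHBase helper are all hypothetical (Spark has no built-in HBase sink, so that branch goes through foreachBatch with whatever HBase connector you use):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: write one micro-batch to HBase with your connector of choice.
def writeToHBase(batchDf: DataFrame): Unit = {
  // connector-specific write logic goes here
}

// Branch 1: stream into HDFS as parquet files.
val hdfsQuery = kafkaDf.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/ingestion")          // placeholder path
  .option("checkpointLocation", "hdfs:///chk/hdfs")  // each query needs its own checkpoint
  .start()

// Branch 2: stream into HBase.
val hbaseQuery = kafkaDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    writeToHBase(batchDf)
  }
  .option("checkpointLocation", "hdfs:///chk/hbase") // placeholder checkpoint
  .start()

spark.streams.awaitAnyTermination()
```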

If the sink (Greenplum) supports batch-mode operations, you can look at the foreachBatch() function in Spark Structured Streaming. It allows you to reuse the same batchDF for both operations, as in the sketch below.
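A sketch of that pattern, again continuing from the kafkaDf above: the same cached batchDF feeds a JDBC write into Greenplum and an archive write into HDFS. The JDBC URL, credentials, table name, and paths are placeholders; Greenplum speaks the PostgreSQL wire protocol, so a PostgreSQL JDBC driver is assumed here.

```scala
import org.apache.spark.sql.DataFrame

val query = kafkaDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Cache the micro-batch so both writes reuse it instead of recomputing it.
    batchDf.persist()

    // Write 1: batch-mode JDBC write into Greenplum (placeholder connection details).
    batchDf.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://gp-master:5432/analytics")
      .option("dbtable", "ingestion_events")
      .option("user", "spark")
      .option("password", "changeme")
      .mode("append")
      .save()

    // Write 2: archive the same batch to HDFS.
    batchDf.write.mode("append").parquet("hdfs:///archive/ingestion")

    batchDf.unpersist()
  }
  .option("checkpointLocation", "hdfs:///chk/greenplum") // placeholder checkpoint
  .start()
```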
