Apache Flink – FsStateBackend – How state is recovered in case of Task Manager failure which stores state in its local file system

You should always use a distributed file system for checkpointing. Something like HDFS, S3, GFS, NFS, Ceph, etc. Furthermore, the storage path used must be accessible from all participating processes/nodes (i.e. all Task Managers and Job Managers).

Otherwise, as you’ve pointed out, the checkpoint data would be lost if a local disk failed.

The Job Manager has complete knowledge concerning checkpointing, and if you have HA configured, this information is stored in the configured HA storage provider in order to enable Job Manager failover.

CLICK HERE to find out more related problems solutions.

Leave a Comment

Your email address will not be published.

Scroll to Top