
[Flink Operations] How to handle restarting a Flink application
Closed, ResolvedPublic5 Estimated Story Points


User Story
As an event platform engineer, I need to understand how I can restart a Flink application from the point in time that it failed
  • So that restarts can be handled cleanly, with minimal impact and with minimal manual intervention
Done is:
  • Process for restarts is documented in runbook (major potential failure points are documented and process at each step is documented)
  • Storage requirements for state and approach is documented


Reference: repos/data-engineering/mediawiki-event-enrichment!11 — "Enable s3-fs-presto plugin" (author: gmodena; source branch: enable-flink-plugins → dest branch: main)

Event Timeline

lbowmaker renamed this task from How to handle restarting a Flink application to [Flink Operations] How to handle restarting a Flink application.Feb 1 2023, 2:56 PM
lbowmaker created this task.
lbowmaker moved this task from Backlog to To be Estimated/To be discussed on the Event-Platform board.
gmodena moved this task from Next Up to In Progress on the Event-Platform (Sprint 09) board.
gmodena set the point value for this task to 5.

I have a working setup on minikube that manages restarts and HA using the flink k8s operator, minio (for checkpointing) and the helm template.
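To make the moving parts concrete, a minimal sketch of what such a `FlinkDeployment` resource might look like for the Flink Kubernetes Operator. All names (deployment, image, buckets, jar path, minio endpoint) are hypothetical placeholders, and the CRD `apiVersion` may differ by operator version:

```yaml
# Hypothetical FlinkDeployment for the Flink Kubernetes Operator
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: mediawiki-event-enrichment        # hypothetical name
spec:
  image: my-registry/mediawiki-event-enrichment:latest   # hypothetical image
  flinkVersion: v1_16
  flinkConfiguration:
    high-availability: kubernetes         # leader election + job metadata stored in ConfigMaps
    high-availability.storageDir: s3://flink-ha/recovery                     # hypothetical bucket
    state.checkpoints.dir: s3://flink-checkpoints/mediawiki-event-enrichment # hypothetical bucket
    s3.endpoint: http://minio.default.svc:9000   # in-cluster minio, hypothetical endpoint
    s3.path.style.access: "true"
  serviceAccount: flink                   # must be allowed to create/edit/delete ConfigMaps for HA
  jobManager:
    resource: {memory: "2048m", cpu: 1}
  taskManager:
    resource: {memory: "2048m", cpu: 1}
  job:
    jarURI: local:///opt/flink/usrlib/enrichment.jar   # hypothetical path
    upgradeMode: last-state               # restart from the last checkpoint on redeploy
```

With `upgradeMode: last-state` the operator recovers the job from its latest checkpoint rather than starting from a clean state, which is the behaviour this task is after.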

As a next step, I'd like to scale up experiments to DSE. @Ottomata, I need to touch base and validate some assumptions of our current deployment.
Flink HA services require that _the JobManager and TaskManager pods be started with a service account which has the permissions to create, edit, and delete ConfigMaps._

Does our service account meet these requirements?
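For reference, the RBAC that Flink's Kubernetes HA services need looks roughly like this (role, service account, and namespace names are hypothetical):

```yaml
# Hypothetical Role granting the ConfigMap permissions Flink Kubernetes HA requires
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-ha
  namespace: event-enrichment     # hypothetical namespace
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-ha
  namespace: event-enrichment
subjects:
  - kind: ServiceAccount
    name: flink                   # hypothetical service account name
    namespace: event-enrichment
roleRef:
  kind: Role
  name: flink-ha
  apiGroup: rbac.authorization.k8s.io
```

A quick way to verify what our existing service account can do is `kubectl auth can-i create configmaps --as=system:serviceaccount:<namespace>:<serviceaccount>`.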

We'll also need storage for application checkpointing. This is required for HA services (e.g. the restart strategy), and we won't be able to fall back to the JobManager heap. I reached out to SRE Data Persistence for info on onboarding to Swift.
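The checkpointing and restart-strategy settings involved would be along these lines (a sketch only; interval, backoff values, and bucket name are placeholders to be tuned):

```yaml
# Hypothetical flink-conf.yaml fragment: externalized checkpoints + restart strategy
execution.checkpointing.interval: 30 s
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
state.backend: hashmap
state.checkpoints.dir: s3://flink-checkpoints/mediawiki-event-enrichment  # object storage (minio/Swift/S3), hypothetical bucket
restart-strategy: exponential-delay
restart-strategy.exponential-delay.initial-backoff: 10 s
restart-strategy.exponential-delay.max-backoff: 2 min
```

Retaining externalized checkpoints means a failed or cancelled job can be resumed from durable storage rather than from the JobManager heap, which is what makes the restart strategy useful across pod failures.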

permissions to create, edit, delete ConfigMaps

Yes, we got it!

FWIW, we MAYYYBE will want to use ZooKeeper for the HA state. This would make k8s cluster restarts easier on Service Ops: modified ConfigMap state is not restored on a k8s cluster restart, so if we need to persist it, we have to do so manually whenever they want to restart k8s.
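If we went that route, switching the HA metadata backend from ConfigMaps to ZooKeeper would be a configuration change along these lines (quorum address, paths, and bucket are hypothetical):

```yaml
# Hypothetical flink-conf.yaml fragment: ZooKeeper-backed HA instead of Kubernetes ConfigMaps
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181   # hypothetical quorum
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /mediawiki-event-enrichment        # hypothetical cluster id
high-availability.storageDir: s3://flink-ha/recovery             # large state still lives in object storage
```

Note that only the small leader/metadata records move to ZooKeeper; checkpoint data itself still needs the object storage configured above, so this would not remove the Swift/minio dependency.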