Per T328561, we need to consider and document the different aspects of operating a Flink application (for example, how to handle restarts and upgrades) in the new Flink operator/k8s/Flink ZK environment we are currently building.
- Perform/document relevant operations already identified by @gmodena.
- How to handle restarting a Flink application (ref https://phabricator.wikimedia.org/T328563)
- Automate Replay of Failed Events (ref https://phabricator.wikimedia.org/T328565). Doesn't seem relevant to us.
- Handle app upgrades (ref https://phabricator.wikimedia.org/T328569)
Anything that's already done today:
- Initial deployment of the Flink job
- Version upgrade of the Flink job
- Restart the jobs without upgrading: we ran kubectl rollout restart deployment flink-app-wdqs on dse-k8s, and the Flink jobmanager recovered without human intervention.
- Recovery on Flink failure (restarts in the same place): we have successfully restored from both checkpoints and savepoints. Documented here.
- Test running multiple Flink apps controlled by the same Flink operator, using the same ZooKeeper instance for Flink HA. Crossing out as this doesn't seem to be a blocker: we have confirmation from the Flink mailing list and personal observation that the Flink Kubernetes Operator uses the name of the Helm deployment as the cluster ID, so namespace collisions shouldn't be a problem.
- Stop the operator and see what happens to the application.
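
To illustrate the point about multiple apps sharing one ZooKeeper instance, here is a minimal sketch of why distinct deployment names should not collide. It assumes Flink's ZooKeeper HA services keep each cluster's data under `<high-availability.zookeeper.path.root>/<high-availability.cluster-id>` and that the cluster ID equals the Helm deployment name, as described above. The helper names and the second deployment name are hypothetical and only there for illustration.

```python
from collections import Counter


def ha_znode(path_root: str, cluster_id: str) -> str:
    """Build the ZooKeeper path assumed to hold one cluster's HA data.

    Assumption: the layout is <path root>/<cluster id>; surrounding
    slashes are normalised so "/flink/" and "flink" behave the same.
    """
    parts = [p.strip("/") for p in (path_root, cluster_id)]
    return "/" + "/".join(p for p in parts if p)


def colliding_znodes(path_root: str, deployment_names: list[str]) -> list[str]:
    """Return the HA paths that more than one deployment would share."""
    counts = Counter(ha_znode(path_root, name) for name in deployment_names)
    return sorted(path for path, n in counts.items() if n > 1)


if __name__ == "__main__":
    # "flink-app-wdqs" comes from the task; "flink-app-other" is hypothetical.
    deployments = ["flink-app-wdqs", "flink-app-other"]
    for name in deployments:
        print(name, "->", ha_znode("/flink", name))
    print("collisions:", colliding_znodes("/flink", deployments))
```

Under these assumptions a collision only appears when two deployments share the same name and path root, which is worth keeping in mind if the same deployment name is ever reused across Kubernetes namespaces.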