Per T328561, we need to consider and document the different aspects of operating a Flink application (for example, how to handle restarts and upgrades) in the [[ https://phabricator.wikimedia.org/T326409 | new Flink operator/k8s/Flink ZK environment we are currently building ]].
AC:
- Perform/document relevant operations already identified by @gmodena.
-- How to handle restarting a Flink application (ref https://phabricator.wikimedia.org/T328563 )
-- Automate replay of failed events (ref https://phabricator.wikimedia.org/T328565 )
-- Handle app upgrades (ref https://phabricator.wikimedia.org/T328569 )
- Perform/document anything that's already done today:
-- Initial deployment of the Flink job
-- Version upgrade of the Flink job
-- Restart the jobs without upgrading
-- Recovery on Flink failure (restarts in the same place)
-- Test running multiple Flink apps controlled by the same Flink operator, using the same ZooKeeper instance for Flink HA. We may have multiple applications (rdf-streaming-updater and search-update-pipeline) managed by the same Flink operator and using the same ZK cluster, so we want to make sure that the operator doesn't use the same ZK namespace (znode?) for different Flink apps. For context, it appears [[ https://github.com/apache/flink-kubernetes-operator/blob/dec1205beaf9edfed9a5bc974241fadc866a1e96/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/validation/DefaultValidator.java#L68 | the Flink k8s operator sets this value automatically ]] (thanks @tchin for the link).
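
One possible way to check the shared-ZK concern above is to compare the HA znode each application would use. The sketch below is only an illustration, not the operator's own logic: it assumes the znode is the concatenation of Flink's `high-availability.zookeeper.path.root` and `high-availability.cluster-id` options (with Flink's documented defaults), and the per-app config values shown are hypothetical placeholders for whatever the operator actually sets.

```python
from collections import defaultdict

# Flink's documented defaults for the two options assumed to form the ZK HA path.
DEFAULT_ROOT = "/flink"
DEFAULT_CLUSTER_ID = "/default"


def ha_znode(conf):
    """Return the ZooKeeper znode an application's HA data would live under."""
    root = conf.get("high-availability.zookeeper.path.root", DEFAULT_ROOT)
    cluster_id = conf.get("high-availability.cluster-id", DEFAULT_CLUSTER_ID)
    # Normalise slashes so "/flink" + "/app" and "/flink/" + "app" compare equal.
    parts = [p for p in (root + "/" + cluster_id).split("/") if p]
    return "/" + "/".join(parts)


def find_collisions(apps):
    """Map each znode shared by more than one application to those applications."""
    by_znode = defaultdict(list)
    for name, conf in apps.items():
        by_znode[ha_znode(conf)].append(name)
    return {z: sorted(names) for z, names in by_znode.items() if len(names) > 1}


# Hypothetical effective configs, e.g. as read from each running JobManager.
apps = {
    "rdf-streaming-updater": {"high-availability.cluster-id": "rdf-streaming-updater"},
    "search-update-pipeline": {"high-availability.cluster-id": "search-update-pipeline"},
}
collisions = find_collisions(apps)
print(collisions or "no shared znodes")
```

If both apps were left on the defaults, `find_collisions` would report them under the same znode, which is the situation this test is meant to rule out.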