The WDQS streaming updater currently runs in session cluster mode, but the work done in T324576 should allow us to use a better approach based on the flink-k8s-operator.
Once the operator is available, we should migrate the WDQS streaming updater to this new deployment model.
Preparation steps:
- T304914 fully remove any references to swift:// resources in eqiad
- T289836 deploy flink 1.16 to the existing flink-session-cluster deployments in wikikube and run the streaming updater on top of it
- Test the updater job on the dse-k8s cluster
- create a namespace for the rdf-streaming-updater on the dse-k8s cluster: https://gerrit.wikimedia.org/r/c/operations/puppet/+/882748
- T328675 create a helmfile service using the FlinkDeployment resource via the flink-app helm chart
- T341792 Provision Zookeeper Cluster for storing Flink HA data
- T344614 Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster
- test various maintenance operations: taking a savepoint, job upgrade, HA recovery (kill pods manually), k8s upgrade (wipe out the namespace, T293063), etc. (see also T342149, T328561)
- Enable the k8s-operator on the staging wikikube cluster for the rdf-streaming-updater namespace (might need a dedicated task)
- Enable the k8s-operator on the production wikikube cluster for the rdf-streaming-updater namespace (might need a dedicated task)
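As a rough illustration of what the FlinkDeployment resource with ZooKeeper HA settings from the steps above could look like, here is a minimal Python sketch that builds such a manifest as a plain dict. It describes the raw operator resource, not the values layout of the flink-app helm chart. The function name and every value in the usage example (image, jar path, ZooKeeper hosts, storage locations) are hypothetical placeholders. A deployable manifest would also need a service account and job manager/task manager resources, which are omitted here.

```python
import json


def flink_deployment(name, image, jar_uri, savepoint_path, zk_quorum, ha_storage_dir):
    """Build a FlinkDeployment manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": name, "namespace": "rdf-streaming-updater"},
        "spec": {
            "image": image,
            "flinkVersion": "v1_16",
            "flinkConfiguration": {
                # ZooKeeper-backed HA, as planned in T341792 / T344614
                "high-availability": "zookeeper",
                "high-availability.zookeeper.quorum": zk_quorum,
                "high-availability.storageDir": ha_storage_dir,
            },
            "job": {
                # Application jar bundled in the image (see AC below)
                "jarURI": jar_uri,
                # Take a savepoint on upgrade and restore from it
                "upgradeMode": "savepoint",
                "state": "running",
                # Savepoint taken from the old session-cluster job
                "initialSavepointPath": savepoint_path,
            },
        },
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    manifest = flink_deployment(
        name="wdqs",
        image="example/flink-rdf-streaming-updater:latest",
        jar_uri="local:///opt/flink/usrlib/streaming-updater.jar",
        savepoint_path="s3://example-bucket/savepoints/savepoint-123",
        zk_quorum="zk1.example:2181,zk2.example:2181",
        ha_storage_dir="s3://example-bucket/ha",
    )
    print(json.dumps(manifest, indent=2))
```

With `upgradeMode` set to `savepoint`, the operator takes a savepoint before redeploying, which is the behaviour the maintenance tests above (savepoint, job upgrade) are meant to exercise.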
Migration steps:
- eqiad:
  - depool WDQS&WCQS read traffic in eqiad
  - stop the production jobs with a savepoint using the python tools
  - update the k8s chart to set the right savepoint path for both jobs
  - deploy the jobs
  - repool WDQS&WCQS in eqiad
- codfw:
  - depool WDQS&WCQS read traffic in codfw
  - stop the production jobs with a savepoint using the python tools
  - update the k8s chart to set the right savepoint path for both jobs
  - deploy the jobs
  - repool WDQS&WCQS in codfw
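The ordering of the migration steps above can be sketched as a small checklist generator. This only encodes the sequence listed in this task and does not call any real tooling. The function name `migration_plan` is hypothetical.

```python
def migration_plan(datacenters=("eqiad", "codfw")):
    """Return the ordered migration steps, one datacenter at a time."""
    steps = []
    for dc in datacenters:
        steps += [
            f"depool WDQS & WCQS read traffic in {dc}",
            f"stop the production jobs in {dc} with a savepoint",
            f"update the k8s chart with the savepoint path for both jobs in {dc}",
            f"deploy the jobs in {dc}",
            f"repool WDQS & WCQS in {dc}",
        ]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(migration_plan(), start=1):
        print(f"{number}. {step}")
```

Each datacenter is fully migrated and repooled before the next one is depooled, so only one datacenter is out of rotation at any point.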
Cleanups: moved to T350784
AC:
- flink-rdf-streaming-updater repo should be updated to use the base flink image and bundle the application jar in it
- the rdf-streaming-updater chart is updated (or a new one is created) to use the flink-k8s-operator
- the flink-session-cluster chart is removed
- runbooks and cookbooks are adapted (T293063: Write and adapt Runbooks and cookbooks related to the WDQS Streaming Updater and kubernetes)