As a maintainer of a flink session cluster I want to stop using the presto client for swift present in the flink image so that I can migrate to a newer version of flink, where this client was removed.
This is a follow-up to T302494, where we dropped this dependency from the jobs running in the flink session cluster. This task is about dropping the swift client from the image itself.
Existing flink session clusters rely on this swift client to store their H/A-related data (e.g. job jars). This means we must migrate existing clusters to s3, as a simple drop-in replacement is unlikely to work.
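On the flink side the change is mostly a matter of pointing the H/A storage at s3 instead of swift. A minimal sketch of the relevant flink-conf.yaml keys follows (the option names are standard flink options; the bucket and endpoint values are assumptions for illustration, and in practice they would land via the chart values rather than a hand-edited file):

```
# Illustrative only: standard flink options for s3-backed H/A storage.
# The bucket and endpoint values are assumptions; adjust to the real deployment.
cat >> flink-conf.yaml <<'EOF'
high-availability.storageDir: s3://rdf-streaming-updater-codfw/flink_ha_storage
s3.endpoint: https://thanos-swift.discovery.wmnet
s3.path.style.access: true
EOF
```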
Suggested migration procedure:
- For codfw:
  - route wdqs & wcqs traffic to eqiad only
  - adapt the wikidata maxlag calculation to poll eqiad only
  - stop (with a savepoint) all the jobs (WDQS & WCQS) running on the codfw k8s wikikube cluster (5m, see the CLI sketch after this list)
  - undeploy all the k8s deployments under the rdf-streaming-updater namespace; dropping all flink-generated configmaps might also be necessary, e.g. by recreating the k8s namespace (10m)
  - delete the flink_ha_storage folder on the corresponding s3 bucket (rdf-streaming-updater-codfw) (TODO: good time estimate; see the s3 sketch after this list)
  - drop presto-swift from https://gerrit.wikimedia.org/g/wikidata/query/flink-rdf-streaming-updater and build a new image (0m)
  - adapt the patch generated by PipelineLib when merging the patch above and remove all mentions of swift from deployment-charts (possibly by adapting the existing patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/766123) (10m)
  - deploy the chart to the rdf-streaming-updater namespace in codfw (which should be empty) (10m, see the redeploy sketch after this list)
  - deploy the flink jobs (WCQS & WDQS) from their corresponding savepoints (10m)
  - repool codfw & resume polling codfw for the wikidata maxlag calculation (10m)
- For eqiad (same as above, swapping eqiad and codfw):
  - stop (with a savepoint) all the jobs (WDQS & WCQS) running on the eqiad k8s wikikube cluster (5m)
  - undeploy all the k8s deployments under the rdf-streaming-updater namespace; dropping all flink-generated configmaps might also be necessary, e.g. by recreating the k8s namespace (10m)
  - delete the flink_ha_storage folder on the corresponding s3 bucket (rdf-streaming-updater-eqiad) (???)
  - merge & deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889155 (10m)
  - deploy the flink jobs (WCQS & WDQS) from their corresponding savepoints (10m)
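A rough sketch of the stop/teardown commands for one datacenter (codfw shown; the deployment name is an assumption, while the flink CLI flags and the configmap-type=high-availability label are standard flink):

```
# List the running jobs, then stop each one with a savepoint; flink prints
# the savepoint path, which is needed again at redeploy time.
kubectl -n rdf-streaming-updater exec deploy/flink-session-cluster-main -- flink list
kubectl -n rdf-streaming-updater exec deploy/flink-session-cluster-main -- \
  flink stop --savepointPath s3://rdf-streaming-updater-codfw/savepoints <job-id>

# Undeploy the releases, then drop the flink-generated H/A configmaps
# (flink's k8s HA mode labels them configmap-type=high-availability);
# recreating the namespace achieves the same result.
helmfile -e codfw destroy
kubectl -n rdf-streaming-updater delete configmap -l configmap-type=high-availability
```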
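Deleting the old H/A folder could look like this (assuming s3cmd with the endpoint and credentials already configured; double-check the bucket name before running):

```
# Recursively remove the swift-era H/A state from the per-DC bucket.
s3cmd rm --recursive s3://rdf-streaming-updater-codfw/flink_ha_storage/
```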
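And the redeploy side, assuming the usual helmfile layout in deployment-charts (the path and the jar placeholder are assumptions; flink run -s is the standard way to restore from a savepoint):

```
# From a checkout of operations/deployment-charts (path is an assumption):
cd helmfile.d/services/rdf-streaming-updater
helmfile -e codfw -i apply

# Resubmit each job from the savepoint recorded when it was stopped.
kubectl -n rdf-streaming-updater exec deploy/flink-session-cluster-main -- \
  flink run -d -s <savepoint-path> <job-jar>
```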
Note that most of this procedure can be tested against the staging cluster (omitting the parts about routing live traffic and the wikidata maxlag).
AC:
- none of the flink session clusters are using the presto swift client