Remove the presto client for swift from the flink image
Closed, Resolved · Public · 5 Estimated Story Points
Assigned To

Authored By

	• dcausse
	Mar 29 2022, 8:15 AM

Description

As a maintainer of a flink session cluster, I want to stop using the presto client for swift present in the flink image so that I can migrate to a newer version of flink, from which this client was removed.

This is a followup of T302494 where we dropped this dependency from the jobs running in the flink session cluster. This task is about dropping this swift client from the image.

Existing flink session clusters rely on this swift client to store their H/A related data (e.g. job jars). This means we must explicitly migrate existing clusters to s3, because a simple drop-in replacement is unlikely to work.

Suggested migration procedure:

For codfw
- route wdqs & wcqs to eqiad only
- adapt the wikidata maxlag to poll eqiad only
- stop (with a savepoint) all the jobs (WDQS & WCQS) running on the codfw k8s wikikube cluster (estimate: 5m)
- undeploy all the k8s deployments under the rdf-streaming-updater namespace (dropping all flink-generated configmaps might be necessary, e.g. by recreating the k8s namespace) (estimate: 10m)
- delete the flink_ha_storage folder on the corresponding s3 bucket (rdf-streaming-updater-codfw) (TODO: good time estimate)
- drop presto-swift from https://gerrit.wikimedia.org/g/wikidata/query/flink-rdf-streaming-updater and create a new image (estimate: 0m)
- adapt the patch generated by PipelineLib when merging the patch above and remove all mentions of swift from deployment-charts (possibly adapting the existing patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/766123) (estimate: 10m)
- deploy the chart to the rdf-streaming-updater namespace in codfw (which should be empty) (estimate: 10m)
- deploy the flink jobs (WCQS & WDQS) from their corresponding savepoints (estimate: 10m)
- repool codfw & resume polling codfw for the wikidata maxlag calculation (estimate: 10m)

For eqiad (do all of the above, swapping codfw and eqiad)
- stop (with a savepoint) all the jobs (WDQS & WCQS) running on the eqiad k8s wikikube cluster (estimate: 5m)
- undeploy all the k8s deployments under the rdf-streaming-updater namespace (dropping all flink-generated configmaps might be necessary, e.g. by recreating the k8s namespace) (estimate: 10m)
- delete the flink_ha_storage folder on the corresponding s3 bucket (rdf-streaming-updater-eqiad) (estimate: ???)
- merge & deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889155 (estimate: 10m)
- deploy the flink jobs (WCQS & WDQS) from their corresponding savepoints (estimate: 10m)
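
The "stop (with a savepoint)" step in the lists above could be scripted against the Flink REST API. The following is only a sketch under assumptions: it relies on the `POST /jobs/<job_id>/stop` endpoint and the shape of the savepoint status response as I understand them from the Flink REST documentation, and the function names (`build_stop_request`, `trigger_stop`, `savepoint_location`) are hypothetical helpers, not part of the existing deploy tooling. The base URL, job id and savepoint directory must be supplied by the operator.

```python
import json
import urllib.request


def build_stop_request(base_url: str, job_id: str, savepoint_dir: str) -> urllib.request.Request:
    """Build the POST /jobs/<job_id>/stop request (stop with savepoint, no drain)."""
    body = json.dumps({"targetDirectory": savepoint_dir, "drain": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jobs/{job_id}/stop",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_stop(base_url: str, job_id: str, savepoint_dir: str) -> str:
    """Send the stop request and return the trigger id used to poll for completion."""
    request = build_stop_request(base_url, job_id, savepoint_dir)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["request-id"]


def savepoint_location(status_payload: dict):
    """Interpret a GET /jobs/<job_id>/savepoints/<trigger_id> response.

    Returns None while the operation is still in progress, the savepoint
    location once it completed, and raises if Flink reported a failure.
    """
    if status_payload.get("status", {}).get("id") != "COMPLETED":
        return None
    operation = status_payload.get("operation", {})
    if "failure-cause" in operation:
        raise RuntimeError(f"savepoint failed: {operation['failure-cause']}")
    return operation.get("location")
```

The returned location is what the later "deploy the flink jobs from their corresponding savepoints" step would need, so it is worth recording it for each job (WDQS and WCQS) before undeploying anything.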

Note that most of this procedure can be tested against the staging cluster (omitting the parts about routing live traffic and the wikidata maxlag).
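
The flink_ha_storage cleanup step is a good candidate for rehearsal on staging first. Below is a minimal sketch, assuming the bucket is reachable through an S3-compatible endpoint and that `client` behaves like a boto3 S3 client (for example one created with `boto3.client("s3", endpoint_url=...)`). The helper name `delete_prefix` and its `dry_run` flag are hypothetical; only `get_paginator("list_objects_v2")` and `delete_objects` are boto3 calls.

```python
def delete_prefix(client, bucket: str, prefix: str, dry_run: bool = True) -> list:
    """List every object under `prefix` in `bucket` and optionally delete them.

    Returns the keys that were found. With dry_run=True (the default) nothing
    is deleted, so the output can be reviewed before the real run.
    """
    keys = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" entry when nothing matches the prefix.
        batch = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not batch:
            continue
        keys.extend(item["Key"] for item in batch)
        if not dry_run:
            # Each listing page holds at most 1000 keys, which is also the
            # per-request limit of delete_objects, so one call per page is enough.
            client.delete_objects(Bucket=bucket, Delete={"Objects": batch})
    return keys
```

Calling it with the prefix `flink_ha_storage/` (note the trailing slash, so that sibling folders sharing the same name prefix are not matched) and the bucket of the datacenter being migrated first prints what would be removed; passing `dry_run=False` performs the deletion.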

AC:

none of the flink session clusters are using the presto swift client

Details

Subject	Repo	Branch	Lines +/-
rdf-streaming-updater: Use S3 instead of Swift for bucket access	operations/deployment-charts	master	+1 -1
deploy: add options for different k8s staging envs	wikidata/query/deploy	master	+5 -2
flink-rdf-streaming-updater: use S3 instead of swift	operations/deployment-charts	master	+1 -1
flink-rdf-streaming-updater: use S3 instead of swift	operations/deployment-charts	master	+1 -1
flink-rdf-streaming-updater: use S3 instead of swift	operations/deployment-charts	master	+1 -1

Related Objects

Mentioned In: T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model
T314835: wdqs space usage on thanos-swift
T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol
Mentioned Here: T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol

Event Timeline

• dcausse created this task. · Mar 29 2022, 8:15 AM

Restricted Application added a subscriber: Aklapper. · Mar 29 2022, 8:15 AM

• dcausse mentioned this in T302494: The WDQS Streaming Updater should use S3 to access thanos-swift instead of the native swift protocol. · Mar 29 2022, 8:17 AM

Maintenance_bot added a project: Wikidata. · Mar 29 2022, 8:45 AM

• MPhamWMF triaged this task as Medium priority. · Apr 4 2022, 3:44 PM

• MPhamWMF moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

• dcausse mentioned this in T314835: wdqs space usage on thanos-swift. · Aug 9 2022, 2:11 PM

bking subscribed. · Aug 22 2022, 4:34 PM

EBernhardson subscribed. · Aug 22 2022, 4:38 PM

• dcausse moved this task from Operations/SRE to Current work on the Wikidata-Query-Service board. · Dec 19 2022, 10:52 AM

• dcausse added a project: Discovery-Search (Current work).

• dcausse mentioned this in T326409: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model. · Jan 24 2023, 2:04 PM

Change 885365 had a related patch set uploaded (by Bking; author: Bking):