Remove the presto client for swift from the flink image
Closed, Resolved | Public | 5 Estimated Story Points

Description

As a maintainer of a flink session cluster, I want to stop using the presto client for swift present in the flink image so that I can migrate to a newer version of flink, from which this client was removed.

This is a follow-up to T302494, where we dropped this dependency from the jobs running in the flink session cluster. This task is about dropping the swift client from the image itself.

Existing flink session clusters rely on this swift client to store their H/A related data (e.g. job jars). A simple drop-in replacement is unlikely to work, so existing clusters must be migrated to s3.

Suggested migration procedure:

  • For codfw
    • route wdqs & wcqs to eqiad only
    • adapt the wikidata maxlag to poll eqiad only
    • stop (with a savepoint) all the jobs (WDQS & WCQS) running on the codfw k8s wikikube cluster (estimate: 5m)
    • undeploy all the k8s deployments under the rdf-streaming-updater namespace; dropping all flink-generated configmaps might be necessary, e.g. by recreating the k8s namespace (estimate: 10m)
    • delete the flink_ha_storage folder on the corresponding s3 bucket (rdf-streaming-updater-codfw) (TODO: get a good time estimate)
    • drop presto-swift from https://gerrit.wikimedia.org/g/wikidata/query/flink-rdf-streaming-updater and create a new image (estimate: 0m)
    • adapt the patch generated by PipelineLib when the patch above is merged, and remove all mentions of swift from deployment-charts (possibly adapting the existing patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/766123) (estimate: 10m)
    • deploy the chart to the rdf-streaming-updater namespace in codfw (which should be empty) (estimate: 10m)
    • deploy the flink jobs (WCQS & WDQS) from their corresponding savepoints (estimate: 10m)
    • repool codfw & resume polling codfw for wikidata maxlag calculation (estimate: 10m)
  • For eqiad (repeat all of the above, swapping codfw and eqiad)
    • stop (with a savepoint) all the jobs (WDQS & WCQS) running on the eqiad k8s wikikube cluster (estimate: 5m)
    • undeploy all the k8s deployments under the rdf-streaming-updater namespace; dropping all flink-generated configmaps might be necessary, e.g. by recreating the k8s namespace (estimate: 10m)
    • delete the flink_ha_storage folder on the corresponding s3 bucket (rdf-streaming-updater-eqiad) (estimate: ???)
    • merge & deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/889155 (estimate: 10m)
    • deploy the flink jobs (WCQS & WDQS) from their corresponding savepoints (estimate: 10m)
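
To get a rough idea of the maintenance window per datacenter, the codfw steps above can be written down as data and the known estimates added up. This is only a sketch: the `Step` type and the `summarize` helper are hypothetical names introduced here, the step descriptions are shortened, and steps with no estimate in the task (the two routing steps and the bucket deletion marked TODO) are recorded as `None` rather than guessed.

```python
from typing import List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    description: str
    minutes: Optional[int]  # None when the task gives no usable estimate


# Steps and estimates transcribed from the codfw list in the description.
CODFW_STEPS = [
    Step("route wdqs & wcqs to eqiad only", None),
    Step("adapt the wikidata maxlag to poll eqiad only", None),
    Step("stop all jobs with a savepoint", 5),
    Step("undeploy the rdf-streaming-updater deployments", 10),
    Step("delete flink_ha_storage from the s3 bucket", None),  # marked TODO
    Step("drop presto-swift and create a new image", 0),
    Step("remove swift mentions from deployment-charts", 10),
    Step("deploy the chart to the empty namespace", 10),
    Step("deploy the flink jobs from their savepoints", 10),
    Step("repool codfw & resume maxlag polling", 10),
]


def summarize(steps: List[Step]) -> Tuple[int, List[str]]:
    """Return the sum of known estimates and the steps lacking one."""
    known = sum(s.minutes for s in steps if s.minutes is not None)
    unknown = [s.description for s in steps if s.minutes is None]
    return known, unknown


if __name__ == "__main__":
    known, unknown = summarize(CODFW_STEPS)
    print(f"known estimates: {known}m")
    for description in unknown:
        print(f"no estimate yet: {description}")
```

The total is a lower bound only, since the bucket deletion time is still an open question in this task.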

Note that most of this procedure can be tested against the staging cluster (omitting the parts about routing live traffic and the wikidata maxlag).

AC:

  • none of the flink session clusters are using the presto swift client

Event Timeline

MPhamWMF moved this task from Incoming to Operations/SRE on the Wikidata-Query-Service board.

Change 885365 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-rdf-streaming-updater: use S3 instead of swift

https://gerrit.wikimedia.org/r/885365

Change 885365 merged by Bking:

[operations/deployment-charts@master] flink-rdf-streaming-updater: use S3 instead of swift

https://gerrit.wikimedia.org/r/885365

Change 885392 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-rdf-streaming-updater: use S3 instead of swift

https://gerrit.wikimedia.org/r/885392

Change 885394 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-rdf-streaming-updater: use S3 instead of swift

https://gerrit.wikimedia.org/r/885394

Change 885392 abandoned by Bking:

[operations/deployment-charts@master] flink-rdf-streaming-updater: use S3 instead of swift

Reason:

unplanned merge conflict, discarding

https://gerrit.wikimedia.org/r/885392

Change 885394 merged by Bking:

[operations/deployment-charts@master] flink-rdf-streaming-updater: use S3 instead of swift

https://gerrit.wikimedia.org/r/885394

Change 885425 had a related patch set uploaded (by Bking; author: Bking):

[wikidata/query/deploy@master] deploy: add options for different k8s staging envs

https://gerrit.wikimedia.org/r/885425

Change 885425 abandoned by Bking:

[wikidata/query/deploy@master] deploy: add options for different k8s staging envs

Reason:

not needed

https://gerrit.wikimedia.org/r/885425

MPhamWMF assigned this task to bking.
MPhamWMF set the point value for this task to 5.

Reminder to self: get a better estimate of how long it would take to delete the flink HA storage path from swift.

Change 889155 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: Use S3 instead of Swift for bucket access

https://gerrit.wikimedia.org/r/889155

Mentioned in SAL (#wikimedia-operations) [2023-02-14T15:50:33Z] <inflatador> bking@deploy1002 'deploying rdf-streaming-updater prod eqiad T304914'

We attempted the procedure today, but WCQS cannot create a savepoint without hitting an OOM condition:
kubectl get pods -o json | jq '.items[].status.containerStatuses[0].lastState'

{
  "terminated": {
    "containerID": "docker://bdb8d22ec09eecf26fbed2c97453a3323bd33e7fc105497f41910665aa9bbe8a",
    "exitCode": 137,
    "finishedAt": "2023-02-14T16:00:01Z",
    "reason": "OOMKilled",
    "startedAt": "2023-01-24T11:01:18Z"
  }
}
{
  "terminated": {
    "containerID": "docker://e4c60e609170c01717851a6eaf05652f2080ee83fb3eb11a33c3f87fcf293710",
    "exitCode": 137,
    "finishedAt": "2023-02-14T16:00:50Z",
    "reason": "OOMKilled",
    "startedAt": "2023-01-24T11:01:19Z"
  }
}
{
  "terminated": {
    "containerID": "docker://dc24a25acac9d32ebd80c4957c6dfe0d0e9bcd010e142dadcff846d6dc66e93b",
    "exitCode": 137,
    "finishedAt": "2023-02-14T16:03:05Z",
    "reason": "OOMKilled",
    "startedAt": "2023-01-24T11:01:19Z"
  }
}
{
  "terminated": {
    "containerID": "docker://7a229041ff346bcdf9c75a0a0813252645845cafb379a8b6af08f3809280d064",
    "exitCode": 137,
    "finishedAt": "2023-02-14T16:00:27Z",
    "reason": "OOMKilled",
    "startedAt": "2023-01-24T11:01:18Z"
  }
}
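
For reference, output like the above can be summarized with a short script. This is a minimal sketch, assuming the jq output (a stream of concatenated JSON values, not one array) has been captured as text; `iter_json_stream` and `oom_kills` are hypothetical helper names, and `SAMPLE` reuses only the fields of the first two entries above that the script reads.

```python
import json
from typing import Iterator, List, Tuple


def iter_json_stream(text: str) -> Iterator[object]:
    """Yield each JSON value from a whitespace-separated stream."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


def oom_kills(text: str) -> List[Tuple[str, int]]:
    """Return (finishedAt, exitCode) for every OOMKilled state, oldest first."""
    kills = []
    for state in iter_json_stream(text):
        # A container that never restarted has an empty or null lastState.
        terminated = (state or {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            kills.append((terminated["finishedAt"], terminated["exitCode"]))
    return sorted(kills)


SAMPLE = """
{"terminated": {"exitCode": 137, "finishedAt": "2023-02-14T16:00:01Z", "reason": "OOMKilled"}}
{"terminated": {"exitCode": 137, "finishedAt": "2023-02-14T16:00:50Z", "reason": "OOMKilled"}}
{}
"""

if __name__ == "__main__":
    for finished_at, exit_code in oom_kills(SAMPLE):
        print(finished_at, exit_code)
```

Exit code 137 corresponds to the container being killed with SIGKILL (128 + 9), which is consistent with the OOMKilled reason reported here.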

Change 889155 merged by jenkins-bot:

[operations/deployment-charts@master] rdf-streaming-updater: Use S3 instead of Swift for bucket access

https://gerrit.wikimedia.org/r/889155

This is complete; moving to "needs reporting".