
Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model
Closed, Resolved | Public | 0 Estimated Story Points

Description

The WDQS streaming updater currently runs in Flink session cluster mode, but the work done on T324576 should allow us to use a better approach based on the flink-k8s-operator.

Once the operator is available, we should migrate the WDQS streaming updater to this new deployment model.

Preparation steps:

  • T304914 fully remove any references to swift:// resources in eqiad
  • T289836 deploy flink 1.16 to the existing flink-session-cluster deployments in wikikube and run the streaming updater on top of it
  • Test the updater job on the dse-k8s cluster
    • create a namespace for the rdf-streaming-updater on the dse-k8s cluster: https://gerrit.wikimedia.org/r/c/operations/puppet/+/882748
    • T328675 create a helmfile service using the FlinkDeployment resource via the flink-app helm chart
    • T341792 Provision Zookeeper Cluster for storing Flink HA data
    • T344614 Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster
    • test various maintenance operations: taking savepoint, job upgrade, H/A recoveries (kill pods manually), k8s upgrade (wipe out the namespace, T293063), ... (see also T342149, T328561)
  • Enable the k8s-operator on the staging wikikube cluster for the rdf-streaming-updater namespace (might need a dedicated task)
    • test various maintenance operations on staging wikikube: taking savepoint, job upgrade, H/A recoveries (kill pods manually), k8s upgrade (wipe out the namespace, T293063), ... (see also T328561)
  • Enable the k8s-operator on the production wikikube cluster for the rdf-streaming-updater namespace (might need a dedicated task)
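
For orientation, here is a minimal sketch of the kind of FlinkDeployment resource the operator reconciles, built as a plain Python dict. It is not the output of the flink-app helm chart, which this task does not show. The field names follow the upstream Flink Kubernetes Operator CRD and the Flink ZooKeeper HA options. Every value passed in (names, image, jar URI, quorum, storage paths, the "flink" service account, resource sizes) is a hypothetical placeholder.

```python
import json
from typing import Optional


def flink_deployment_manifest(
    name: str,
    namespace: str,
    image: str,
    jar_uri: str,
    zk_quorum: str,
    ha_storage_dir: str,
    initial_savepoint: Optional[str] = None,
) -> dict:
    """Build a FlinkDeployment manifest (application mode) with ZooKeeper HA.

    All argument values are supplied by the caller; nothing here reflects
    the real rdf-streaming-updater configuration.
    """
    job = {
        "jarURI": jar_uri,
        "parallelism": 1,            # placeholder value
        "upgradeMode": "savepoint",  # take a savepoint before each upgrade
        "state": "running",
    }
    if initial_savepoint is not None:
        # Used only when the job is first created, e.g. to resume from a
        # savepoint taken on the old session cluster.
        job["initialSavepointPath"] = initial_savepoint

    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "image": image,
            "flinkVersion": "v1_16",
            "flinkConfiguration": {
                "high-availability": "zookeeper",
                "high-availability.zookeeper.quorum": zk_quorum,
                "high-availability.storageDir": ha_storage_dir,
            },
            "serviceAccount": "flink",  # assumed service account name
            "jobManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "taskManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "job": job,
        },
    }


if __name__ == "__main__":
    manifest = flink_deployment_manifest(
        name="example-updater",
        namespace="rdf-streaming-updater",
        image="example.invalid/flink-app:latest",
        jar_uri="local:///opt/flink/usrlib/example-job.jar",
        zk_quorum="zk1.example.invalid:2181",
        ha_storage_dir="s3://example-bucket/flink/ha",
    )
    print(json.dumps(manifest, indent=2))
```

Because the job is declared inside the resource, routine operations such as upgrades become edits to this spec, which the operator then reconciles. That is the main difference from submitting jobs to a long-lived session cluster.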

Migration steps:

  • eqiad:
    • depool WDQS&WCQS read traffic in eqiad
    • stop the production jobs with a savepoint using the python tools
    • update the k8s chart to set the right savepoint path for both jobs
    • deploy the jobs
    • repool WDQS&WCQS in eqiad
  • codfw:
    • depool WDQS&WCQS read traffic in codfw
    • stop the production jobs with a savepoint using the python tools
    • update the k8s chart to set the right savepoint path for both jobs
    • deploy the jobs
    • repool WDQS&WCQS in codfw
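
The per-datacenter sequence above can be sketched as a small dry-run script. This is only an illustration of the ordering, not the tooling actually used. The `DryRunOps` class, its method names, the job names "wdqs" and "wcqs", and the savepoint paths are all hypothetical stand-ins for the real depool, savepoint, chart update and deploy steps.

```python
from typing import Dict, List


class DryRunOps:
    """Records each operation instead of touching real infrastructure."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def depool(self, dc: str) -> None:
        self.calls.append(f"depool {dc}")

    def stop_with_savepoint(self, dc: str, job: str) -> str:
        self.calls.append(f"stop {dc}/{job}")
        # Hypothetical path; a real stop would report the actual location.
        return f"savepoints/{dc}/{job}/latest"

    def set_savepoint_path(self, dc: str, job: str, path: str) -> None:
        self.calls.append(f"chart {dc}/{job} <- {path}")

    def deploy(self, dc: str, job: str) -> None:
        self.calls.append(f"deploy {dc}/{job}")

    def repool(self, dc: str) -> None:
        self.calls.append(f"repool {dc}")


def migrate_datacenter(dc: str, jobs: List[str], ops) -> Dict[str, str]:
    """Run the migration steps for one datacenter, in the listed order.

    Any exception raised by a step propagates to the caller, so repool is
    never reached after a failure and the datacenter stays depooled.
    """
    ops.depool(dc)
    # Stop every job first so each savepoint path is known before the
    # chart is updated.
    savepoints = {job: ops.stop_with_savepoint(dc, job) for job in jobs}
    for job, path in savepoints.items():
        ops.set_savepoint_path(dc, job, path)
    for job in jobs:
        ops.deploy(dc, job)
    ops.repool(dc)
    return savepoints


if __name__ == "__main__":
    ops = DryRunOps()
    for dc in ("eqiad", "codfw"):
        migrate_datacenter(dc, ["wdqs", "wcqs"], ops)
    print("\n".join(ops.calls))
```

Handling one datacenter at a time means the other one keeps serving read traffic while the first is depooled, which matches the eqiad-then-codfw order in the list.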

Cleanups: moved to T350784

AC:


Event Timeline


What namespace strategy should we use for flink jobs? A single one for all wmf flink jobs, per team, per project?

One per application is what I'm expecting.

Change 882748 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] dse-k8s: add rdf-streaming-updater namespace

https://gerrit.wikimedia.org/r/882748

Change 882748 merged by Bking:

[operations/puppet@production] dse-k8s: add rdf-streaming-updater namespace

https://gerrit.wikimedia.org/r/882748

create a namespace for the rdf-streaming-updater on the dse-k8s cluster

BTW, I _think_ there is more involved than just this Puppet patch? Reach out to @BTullis?

Gehel set the point value for this task to 0. May 1 2023, 3:28 PM
dcausse updated the task description.

Removing this as a dependency of deploying the Search Update Pipeline. We have proven that things work enough. This migration will still need to happen, but can be done independently from the Search Update Pipeline.

Gehel triaged this task as Medium priority. Nov 3 2023, 10:28 AM

I'm happy to say the flink operator migration is complete. Commons and Wikidata are stable in both CODFW and EQIAD. As such, I'm resolving this ticket. Post-migration cleanup work continues in T350784.

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

Change #1015343 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Remove flink RBAC snowflakes

https://gerrit.wikimedia.org/r/1015343

Change #1015343 merged by jenkins-bot:

[operations/deployment-charts@master] Remove flink RBAC snowflakes

https://gerrit.wikimedia.org/r/1015343

Change #1017789 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Revert "Remove flink RBAC snowflakes"

https://gerrit.wikimedia.org/r/1017789

Change #1017789 merged by JMeybohm:

[operations/deployment-charts@master] Revert "Remove flink RBAC snowflakes"

https://gerrit.wikimedia.org/r/1017789

Change #1018214 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Revert "Revert "Remove flink RBAC snowflakes""

https://gerrit.wikimedia.org/r/1018214

Change #1018214 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "Revert "Remove flink RBAC snowflakes""

https://gerrit.wikimedia.org/r/1018214