
Migrate staging rdf-streaming-updater to flink operator
Closed, Resolved | Public | 13 Estimated Story Points

Description

We have tested the flink operator mode in dse-k8s. Our next step is to migrate the staging application to flink operator mode.

Creating this ticket to:

  • Migrate the application
  • Confirm operation

Event Timeline

@bking and I have been discussing this, and we think the best course of action would be to deploy this in multiple steps, something like:

  • Create a savepoint by incrementing the nonce value in the helmfile.d/dse-k8s-services/values.yaml and deploy
  • Destroy the deployment on the dse-k8s cluster
/srv/deployment-charts/helmfile.d/dse-k8s-services/rdf-streaming-updater/$ helmfile -e dse-k8s-eqiad -i destroy
  • Merge the change to delete the deployment from the dse-k8s cluster
  • Destroy the deployment on the staging cluster
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i destroy
  • Merge the change to change the chart in use for the staging deployment, including:
    • the new savepoint location
    • the updated chart
    • the options for zookeeper-ha
    • TBD
  • Deploy the updated service to staging
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i apply --context=5

We could also deploy via a new namespace, but I wonder what implications that would have for our monitoring/tooling etc. Open to feedback/suggestions on this one.

  • Create a savepoint by incrementing the nonce value in the helmfile.d/dse-k8s-services/values.yaml and deploy
  • Destroy the deployment on the dse-k8s cluster
/srv/deployment-charts/helmfile.d/dse-k8s-services/rdf-streaming-updater/$ helmfile -e dse-k8s-eqiad -i destroy
  • Merge the change to delete the deployment from the dse-k8s cluster
  • Destroy the deployment on the staging cluster
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i destroy
  • Clone the deployment-charts repo into homedir BEFORE merging the changes. That way, we can cleanly undeploy the production environments.
  • Merge the change to change the chart in use for the staging deployment, including:
    • the new savepoint location
    • the updated chart
    • the options for zookeeper-ha
    • TBD
  • Deploy the updated service to staging
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i apply --context=5
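
The ordering in the plan above can be written down as data, so it can be printed and reviewed as a dry run before anyone touches a cluster. This is only a minimal sketch, not the team's tooling: the `Step` class and the `migration_plan` and `render` functions are hypothetical names introduced for illustration, and only the directories and `helmfile` invocations are taken from the steps above. Manual steps (merging Gerrit changes, cloning the repo) carry no command. Nothing here executes anything.

```python
from dataclasses import dataclass
from typing import List, Optional

# Root of the helmfile tree, as it appears in the commands in the plan.
CHARTS_ROOT = "/srv/deployment-charts/helmfile.d"


@dataclass(frozen=True)
class Step:
    """One step of the plan (hypothetical helper, for illustration only)."""
    description: str
    workdir: Optional[str] = None  # directory the command is run from
    command: Optional[str] = None  # None marks a manual step


def migration_plan() -> List[Step]:
    """Return the steps in the order they are listed in the plan."""
    dse = f"{CHARTS_ROOT}/dse-k8s-services/rdf-streaming-updater"
    staging = f"{CHARTS_ROOT}/services/rdf-streaming-updater"
    return [
        Step("Create a savepoint by incrementing the nonce and deploying"),
        Step("Destroy the deployment on the dse-k8s cluster",
             dse, "helmfile -e dse-k8s-eqiad -i destroy"),
        Step("Merge the change to delete the deployment from dse-k8s"),
        Step("Destroy the deployment on the staging cluster",
             staging, "helmfile -e staging -i destroy"),
        Step("Clone deployment-charts into homedir before merging"),
        Step("Merge the change that updates the staging chart and values"),
        Step("Deploy the updated service to staging",
             staging, "helmfile -e staging -i apply --context=5"),
    ]


def render(plan: List[Step]) -> str:
    """Format the plan as a numbered checklist; commands are shown, not run."""
    lines = []
    for number, step in enumerate(plan, start=1):
        lines.append(f"{number}. {step.description}")
        if step.command is not None:
            lines.append(f"   (cd {step.workdir} && {step.command})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(migration_plan()))
```

Keeping the manual steps in the same list as the commands makes it harder to run a `destroy` before the corresponding clone or merge has been accounted for.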

Mentioned in SAL (#wikimedia-operations) [2023-10-18T15:43:33Z] <inflatador> bking@deploy2002 destroy dse-k8s-services instance of rdf-streaming-updater T349095

Change 966902 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] dse-k8s: remove rdf-streaming-updater service

https://gerrit.wikimedia.org/r/966902

Change 966921 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] dse-k8s: don't watch rdf-streaming-updater namespace

https://gerrit.wikimedia.org/r/966921

Change 967229 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rdf-streaming-updater: update staging values

https://gerrit.wikimedia.org/r/967229

EBernhardson set the point value for this task to 8. Oct 23 2023, 3:37 PM
EBernhardson changed the point value for this task from 8 to 13.

Change 971221 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] admin_ng: Activate flink-operator for rdf-streaming-updater

https://gerrit.wikimedia.org/r/971221

Gehel triaged this task as Medium priority. Nov 3 2023, 10:27 AM

Change 971221 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Activate flink-operator for rdf-streaming-updater

https://gerrit.wikimedia.org/r/971221

Change 972005 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] rbac: permit deploy-flink user to create flinkdeployments

https://gerrit.wikimedia.org/r/972005

Change 972005 merged by Bking:

[operations/deployment-charts@master] rbac: permit deploy-flink user to create flinkdeployments

https://gerrit.wikimedia.org/r/972005

Current status:

  • flink-operator is listening for rdf-streaming-updater
  • The rdf-streaming-updater job deploys, but it seems like it can't connect to Kafka:

{"@timestamp":"2023-11-06T23:03:13.111Z","log.level": "INFO","message":"[AdminClient clientId=wcqs_streaming_updater_test:KafkaSource:eqiad.mediawiki.page-suppress-enumerator-admin-client] Disconnecting from node -1 due to socket connection setup timeout. The timeout value is 21318 ms.", "ecs.version": "1.2.0","process.thread.name":"kafka-admin-client-thread | wcqs_streaming_updater_test:KafkaSource:eqiad.mediawiki.page-suppress-enumerator-admin-client","log.logger":"org.apache.kafka.clients.NetworkClient"}

Will pick up tomorrow.
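
The log line above reports a socket connection setup timeout, meaning the TCP connection to the broker was never established. One quick way to separate a network-level block from a Kafka client configuration problem is a plain TCP probe run from the same network context as the job. The following is a minimal sketch under stated assumptions: the broker hostname and port are hypothetical placeholders (neither appears in this ticket), and `can_connect` is an illustrative helper, not part of any Kafka client.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical placeholder values; substitute a real bootstrap server.
    broker_host, broker_port = "kafka-broker.example.org", 9092
    state = "reachable" if can_connect(broker_host, broker_port) else "NOT reachable"
    print(f"{broker_host}:{broker_port} is {state}")
```

A probe like this is only informative when it runs where the Flink pods run; a successful connection from a deploy host would not rule out a restriction that applies only to the namespace.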

Change 972483 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

https://gerrit.wikimedia.org/r/972483

Change 972483 abandoned by Bking:

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

Reason:

superseded by changes in 967229

https://gerrit.wikimedia.org/r/972483

Change 972483 restored by Bking:

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

https://gerrit.wikimedia.org/r/972483

Change 972483 merged by jenkins-bot:

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

https://gerrit.wikimedia.org/r/972483

Change 973242 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

https://gerrit.wikimedia.org/r/973242

Change 973242 merged by jenkins-bot:

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

https://gerrit.wikimedia.org/r/973242

Both apps (commons and wikidata) are stable in staging-eqiad now:

bking@deploy2002:~/deployment-charts$ kubectl get flinkdeployments.flink.apache.org
NAME                 JOB STATUS   LIFECYCLE STATE
flink-app-commons    RUNNING      STABLE
flink-app-wikidata   RUNNING      STABLE

Assuming the service remains stable, we should be able to migrate the production rdf-streaming-updater shortly.
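
The stability check above can be automated by parsing the table that `kubectl get flinkdeployments.flink.apache.org` prints. This is a minimal sketch, assuming the three-column layout shown in the output above; `parse_flinkdeployments` and `all_stable` are hypothetical helper names. Columns are split on runs of two or more spaces because the headers themselves contain single spaces, so a row with an empty column would be misaligned and is not handled here.

```python
import re
from typing import Dict, List


def parse_flinkdeployments(table: str) -> List[Dict[str, str]]:
    """Turn the plain-text kubectl table into one dict per deployment."""
    lines = [line.strip() for line in table.splitlines() if line.strip()]
    if not lines:
        return []
    # Header names such as "JOB STATUS" contain a single space,
    # so split only on two or more consecutive spaces.
    headers = re.split(r"\s{2,}", lines[0])
    return [dict(zip(headers, re.split(r"\s{2,}", line))) for line in lines[1:]]


def all_stable(rows: List[Dict[str, str]]) -> bool:
    """True only if there is at least one row and every row is RUNNING and STABLE."""
    return bool(rows) and all(
        row.get("JOB STATUS") == "RUNNING" and row.get("LIFECYCLE STATE") == "STABLE"
        for row in rows
    )


if __name__ == "__main__":
    # Sample copied from the output in this ticket.
    sample = """\
NAME                 JOB STATUS   LIFECYCLE STATE
flink-app-commons    RUNNING      STABLE
flink-app-wikidata   RUNNING      STABLE
"""
    rows = parse_flinkdeployments(sample)
    for row in rows:
        print(row["NAME"], row["JOB STATUS"], row["LIFECYCLE STATE"])
    print("all stable:", all_stable(rows))
```

Treating an empty table as not stable is a deliberate choice in this sketch: a missing deployment should not pass a check that gates a production migration.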

Change 975289 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

https://gerrit.wikimedia.org/r/975289

Change 975289 merged by Bking:

[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota

https://gerrit.wikimedia.org/r/975289

Change 978617 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] admin_ng: tell flink-operator to listen to rdf-streaming-updater ns

https://gerrit.wikimedia.org/r/978617

Change 978617 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: tell flink-operator to listen to rdf-streaming-updater ns

https://gerrit.wikimedia.org/r/978617

Change 978634 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: Activate codfw hosts

https://gerrit.wikimedia.org/r/978634

Change 978634 merged by Bking:

[operations/puppet@production] flink-zk: Activate codfw hosts

https://gerrit.wikimedia.org/r/978634

Change 978639 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] flink-zk: Add codfw flink-zk cluster info

https://gerrit.wikimedia.org/r/978639

Change 978639 merged by Bking:

[operations/puppet@production] flink-zk: Add codfw flink-zk cluster info

https://gerrit.wikimedia.org/r/978639

Change 967229 merged by jenkins-bot:

[operations/deployment-charts@master] rdf-streaming-updater: update values for application mode

https://gerrit.wikimedia.org/r/967229

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

Apologies for the confusion. We have already migrated the rdf-streaming-updater to production, so I'm closing this ticket (which is focused on staging) as well.

Change #966902 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s: remove rdf-streaming-updater service

https://gerrit.wikimedia.org/r/966902

Change #966921 merged by Bking:

[operations/deployment-charts@master] dse-k8s: don't watch rdf-streaming-updater namespace

https://gerrit.wikimedia.org/r/966921