We have tested the flink operator mode in dse-k8s. Our next step is to migrate the staging application to flink operator mode.
Creating this ticket to:
- Migrate the application
- Confirm operation
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | bking | T326409 Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model |
| Resolved | | bking | T349095 Migrate staging rdf-streaming-updater to flink operator |
@bking and I have been discussing this, and we think the best course of action would be to deploy this in multiple steps, something like:
/srv/deployment-charts/helmfile.d/dse-k8s-services/rdf-streaming-updater/$ helmfile -e dse-k8s-eqiad -i destroy
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i destroy
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i apply --context=5
We could also deploy via a new namespace, but I wonder what implications that would have for our monitoring/tooling etc. Open to feedback/suggestions on this one.
Create a savepoint by incrementing the nonce value in the helmfile.d/dse-k8s-services/values.yaml and deploy
Destroy the deployment on the dse-k8s cluster
/srv/deployment-charts/helmfile.d/dse-k8s-services/rdf-streaming-updater/$ helmfile -e dse-k8s-eqiad -i destroy
Merge the change to delete the deployment from the dse-k8s cluster
Destroy the deployment on the staging cluster
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i destroy
Clone the deployment-charts repo into homedir BEFORE merging the changes. That way, we can cleanly undeploy the production environments.
Merge the change to change the chart in use for the staging deployment, including:
the new savepoint location
the updated chart
the options for zookeeper-ha
TBD
Deploy the updated service to staging
/srv/deployment-charts/helmfile.d/services/rdf-streaming-updater/$ helmfile -e staging -i apply --context=5
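The ordering of the helmfile commands above can be sketched as a small script. This is only an illustration, not what was actually run: `run_in`, `migrate_staging`, `CHARTS` and `DRY_RUN` are hypothetical names, and the merge steps between the commands still have to be done by hand in Gerrit. By default it only prints what it would run.

```shell
# Sketch of the ordered helmfile steps from the plan above.
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
CHARTS=/srv/deployment-charts/helmfile.d

# run_in <directory> <command...>: run (or print) a command inside a directory.
run_in() {
  dir=$1; shift
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "(cd $dir && $*)"
  else
    (cd "$dir" && "$@")
  fi
}

migrate_staging() {
  # 1. Destroy the deployment on the dse-k8s cluster (after taking the savepoint).
  run_in "$CHARTS/dse-k8s-services/rdf-streaming-updater" helmfile -e dse-k8s-eqiad -i destroy
  # 2. Destroy the deployment on the staging cluster.
  run_in "$CHARTS/services/rdf-streaming-updater" helmfile -e staging -i destroy
  # 3. After the chart change is merged, deploy the updated service to staging.
  run_in "$CHARTS/services/rdf-streaming-updater" helmfile -e staging -i apply --context=5
}
```

Calling `migrate_staging` with the default settings prints the three commands in order, which makes it easy to review the sequence before running anything for real.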
Mentioned in SAL (#wikimedia-operations) [2023-10-18T15:43:33Z] <inflatador> bking@deploy2002 destroy dse-k8s-services instance of rdf-streaming-updater T349095
Change 966902 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] dse-k8s: remove rdf-streaming-updater service
Change 966921 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] dse-k8s: don't watch rdf-streaming-updater namespace
Change 967229 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] rdf-streaming-updater: update staging values
Change 971221 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] admin_ng: Activate flink-operator for rdf-streaming-updater
Change 971221 merged by jenkins-bot:
[operations/deployment-charts@master] admin_ng: Activate flink-operator for rdf-streaming-updater
Change 972005 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] rbac: permit deploy-flink user to create flinkdeployments
Change 972005 merged by Bking:
[operations/deployment-charts@master] rbac: permit deploy-flink user to create flinkdeployments
Current status:
flink-operator is listening for rdf-streaming-updater
rdf-streaming-updater job deploys, but it seems like it can't connect to kafka:
{"@timestamp":"2023-11-06T23:03:13.111Z","log.level": "INFO","message":"[AdminClient clientId=wcqs_streaming_updater_test:KafkaSource:eqiad.mediawiki.page-suppress-enumerator-admin-client] Disconnecting from node -1 due to socket connection setup timeout. The timeout value is 21318 ms.", "ecs.version": "1.2.0","process.thread.name":"kafka-admin-client-thread | wcqs_streaming_updater_test:KafkaSource:eqiad.mediawiki.page-suppress-enumerator-admin-client","log.logger":"org.apache.kafka.clients.NetworkClient"}
Will pick up tomorrow.
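A socket connection setup timeout like the one in the log above can be narrowed down with a plain TCP reachability check from wherever the job runs. The following is a minimal sketch, assuming `bash` and coreutils `timeout` are available there; `check_tcp` is a hypothetical helper, and the broker host and port are placeholders since the ticket does not name them.

```shell
# check_tcp <host> <port> [timeout_seconds]
# Returns 0 if a TCP connection can be opened within the timeout, non-zero otherwise.
# Relies on bash's /dev/tcp redirection and coreutils `timeout`.
check_tcp() {
  host=$1 port=$2 secs=${3:-5}
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `check_tcp <broker-host> <broker-port> 5 || echo "unreachable"` returns quickly on a refused connection and after five seconds on a silently dropped one. It only tests TCP reachability, not TLS or Kafka authentication.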
Change 972483 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Change 972483 abandoned by Bking:
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Reason:
superseded by changes in 967229
Change 972483 restored by Bking:
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Change 972483 merged by jenkins-bot:
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Change 973242 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Change 973242 merged by jenkins-bot:
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Both apps (commons and wikidata) are stable in staging-eqiad now:
bking@deploy2002:~/deployment-charts$ kubectl get flinkdeployments.flink.apache.org
NAME                 JOB STATUS   LIFECYCLE STATE
flink-app-commons    RUNNING      STABLE
flink-app-wikidata   RUNNING      STABLE
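The same check can be scripted rather than read by eye. A rough sketch, assuming the three-column table layout shown above (name, job status, lifecycle state); `all_flink_stable` is a hypothetical helper name, not part of any existing tooling.

```shell
# Succeed only if every FlinkDeployment row reports RUNNING / STABLE.
# Reads the table printed by `kubectl get flinkdeployments.flink.apache.org` on stdin.
all_flink_stable() {
  awk '
    NR > 1 {                       # skip the header row
      rows++
      if ($2 != "RUNNING" || $3 != "STABLE") bad++
    }
    END {
      if (rows > 0 && bad == 0) exit 0
      exit 1                       # no deployments listed, or at least one not stable
    }'
}
```

It could be used as `kubectl get flinkdeployments.flink.apache.org | all_flink_stable && echo "all stable"`. An empty listing is treated as a failure so that a missing deployment is not mistaken for a healthy one.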
Assuming the service remains stable, we should be able to migrate the production rdf-streaming-updater shortly.
Change 975289 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Change 975289 merged by Bking:
[operations/deployment-charts@master] staging-eqiad: raise rdf-streaming-updater quota
Change 978617 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] admin_ng: tell flink-operator to listen to rdf-streaming-updater ns
Change 978617 merged by jenkins-bot:
[operations/deployment-charts@master] admin_ng: tell flink-operator to listen to rdf-streaming-updater ns
Change 978634 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] flink-zk: Activate codfw hosts
Change 978634 merged by Bking:
[operations/puppet@production] flink-zk: Activate codfw hosts
Change 978639 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] flink-zk: Add codfw flink-zk cluster info
Change 978639 merged by Bking:
[operations/puppet@production] flink-zk: Add codfw flink-zk cluster info
Change 967229 merged by jenkins-bot:
[operations/deployment-charts@master] rdf-streaming-updater: update values for application mode
Apologies for the confusion. We have already migrated the rdf-streaming-updater to production, so I'm closing this ticket (which is focused on staging) as well.
Change #966902 merged by jenkins-bot:
[operations/deployment-charts@master] dse-k8s: remove rdf-streaming-updater service
Change #966921 merged by Bking:
[operations/deployment-charts@master] dse-k8s: don't watch rdf-streaming-updater namespace