Description
Provide an image based on version 1.12.0 (June 2025) of the operator. This version already supports the new config file format (config.yaml, formerly flink-conf.yaml), which will be enforced from Flink 2.0 onwards, so we can check compatibility early on.
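For context, the legacy flink-conf.yaml only accepted flat, dot-separated keys via a custom parser, while the new config.yaml is parsed as standard YAML and also allows nested keys. A minimal sketch of the same setting in both formats (the key and value are illustrative, not taken from our deployment):

# Legacy flink-conf.yaml (flat keys, custom parser):
taskmanager.numberOfTaskSlots: 2

# New config.yaml (standard YAML; nesting allowed, flat dotted keys still accepted):
taskmanager:
  numberOfTaskSlots: 2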
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T376812 EPIC: Update flink jobs to support Flink 1.20 |
| Resolved | | brouberol | T398162 Flink: Update k8s operator to 1.12.0 |
Event Timeline
This is becoming increasingly important, as we have an incident affecting some of our flink applications: T397330: mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs
We would like to be able to upgrade flink to version 1.20 before troubleshooting further, for which we need this updated flink-operator version.
Change #1172351 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/docker-images/production-images@master] Bump the flink-operator image to version 1.12.0
Change #1172351 merged by Btullis:
[operations/docker-images/production-images@master] Bump the flink-operator image to version 1.12.0
The new operator image has been built and published.
(base) btullis@barracuda:~$ docker run --entrypoint=/bin/bash docker-registry.wikimedia.org/flink-kubernetes-operator:1.12.0-wmf1-20250728
Unable to find image 'docker-registry.wikimedia.org/flink-kubernetes-operator:1.12.0-wmf1-20250728' locally
1.12.0-wmf1-20250728: Pulling from flink-kubernetes-operator
7b9a184c33c9: Pull complete
1ed89d7564b9: Pull complete
49e765ce2954: Pull complete
ab0b3b9d2fef: Pull complete
e561c1bbe6af: Pull complete
91c1bbeda7c8: Pull complete
2e1dd8fa1f91: Pull complete
ef2ef1cfb61b: Pull complete
Digest: sha256:64be51c06690f3cf3634564ab676f058fe2994533174fdc74c6803655ee1b853
flink@00d724aea59f:/opt/flink-kubernetes-operator$ find .
./lib
./lib/flink-kubernetes-standalone-1.12.0.jar
./lib/flink-kubernetes-operator-1.12.0-shaded.jar
./lib/ecs-logging-core-1.5.0.jar
./lib/log4j2-ecs-layout-1.5.0.jar
./webhook
./webhook/flink-kubernetes-webhook-1.12.0-shaded.jar
flink@00d724aea59f:/opt/flink-kubernetes-operator$
I'll now work on updating the CRDs and the helm chart.
Change #1173403 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/deployment-charts@master] Upgrade the flink-operator CRDs to match the upstream release v1.12
Change #1173407 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/deployment-charts@master] Update flink-operator helm chart to match the upstream release 1.12
Change #1173403 merged by jenkins-bot:
[operations/deployment-charts@master] Upgrade the flink-operator CRDs to match the upstream release v1.12
Change #1173407 merged by jenkins-bot:
[operations/deployment-charts@master] Update flink-operator helm chart to match the upstream release v1.12
The flink-operator has been upgraded to version 1.12.0 in both the staging-codfw and staging-eqiad clusters.
NAME: flink-operator
LAST DEPLOYED: Wed Aug 6 11:14:21 2025
NAMESPACE: flink-operator
STATUS: deployed
REVISION: 3
TEST SUITE: None
Listing releases matching ^flink-operator$
flink-operator  flink-operator  3  2025-08-06 11:14:21.558594916 +0000 UTC  deployed  flink-kubernetes-operator-2.4.2  1.12.0
UPDATED RELEASES:
NAME             NAMESPACE        CHART                                  VERSION   DURATION
flink-operator   flink-operator   wmf-stable/flink-kubernetes-operator   2.4.2     57s
So it's now available for testing FlinkDeployments running Flink versions 1.18 through 2.0.
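As a reminder, the Flink version of each application is declared on the FlinkDeployment resource itself via spec.flinkVersion. A minimal illustrative sketch (the resource name and image are hypothetical, not from our charts):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-app                      # hypothetical
spec:
  image: example-registry/flink:1.20     # hypothetical
  flinkVersion: v1_20                    # enum values such as v1_18, v1_19, v1_20, v2_0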
I will attempt to test the mw-page-content-change-enrich flink application on it.
@Ottomata mentioned in this thread:
mw-page-content-change-enrich should be good for that in staging. It uses kafka-test clusters, so you'd have to pipe some fake data to the source topics to get it to do anything real.
@gmodena also shared this:
The easiest way to test is to just kafkacat mediawiki.page_change.v1 from kafka-main to mediawiki.page_change.v1 in kafka-test. The app will pick up new messages and produce into mediawiki.page_content_change.v1 (on kafka-test). You'll hit production Action APIs, but a reasonable load (around 20-25 rps) should be fine.
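If we go that route, the piping step could look something like this (broker addresses are placeholders, and -o -1000 -e just copies the last 1000 messages and exits):

# Copy recent page_change events from kafka-main into kafka-test
kafkacat -C -b kafka-main-broker-placeholder:9092 -t mediawiki.page_change.v1 -o -1000 -e | \
  kafkacat -P -b kafka-test-broker-placeholder:9092 -t mediawiki.page_change.v1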
The first test is to make sure that the operator restarts it correctly if the flinkdeployment goes away.
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get flinkdeployment
NAME             JOB STATUS   LIFECYCLE STATE
flink-app-main   RUNNING      STABLE
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl delete flinkdeployment flink-app-main
flinkdeployment.flink.apache.org "flink-app-main" deleted
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get events -w
LAST SEEN   TYPE      REASON      OBJECT                               MESSAGE
21s         Normal    Cleanup     flinkdeployment/flink-app-main       Cleaning up FlinkDeployment
20s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-app-main-tls-proxy
20s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-main-container
18s         Warning   Unhealthy   pod/flink-app-main-b8f5ffcb5-2qgvg   Readiness probe failed: Get "http://10.64.64.34:1667/healthz": dial tcp 10.64.64.34:1667: connect: connection refused
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-app-main-tls-proxy
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-main-container
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-app-main-tls-proxy
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-main-container
^C
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get pods
No resources found in mw-page-content-change-enrich namespace.
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get events -w
LAST SEEN   TYPE      REASON      OBJECT                               MESSAGE
48s         Normal    Cleanup     flinkdeployment/flink-app-main       Cleaning up FlinkDeployment
47s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-app-main-tls-proxy
47s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-main-container
45s         Warning   Unhealthy   pod/flink-app-main-b8f5ffcb5-2qgvg   Readiness probe failed: Get "http://10.64.64.34:1667/healthz": dial tcp 10.64.64.34:1667: connect: connection refused
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-app-main-tls-proxy
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-main-container
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-app-main-tls-proxy
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-main-container
^C
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$
It didn't automatically restart the app, but I'm not sure whether it should have. As far as I understand, the FlinkDeployment custom resource is the operator's source of truth, so once the resource itself is deleted there is nothing left for the operator to reconcile; it takes a Helm deployment to recreate it.
I roll-restarted it using helmfile sync and it started correctly.
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ helmfile -e staging --state-values-set roll_restart=1 sync
Upgrading release=main, chart=wmf-stable/flink-app, namespace=mw-page-content-change-enrich
Release "main" has been upgraded. Happy Helming!
NAME: main
LAST DEPLOYED: Wed Aug 6 11:44:39 2025
NAMESPACE: mw-page-content-change-enrich
STATUS: deployed
REVISION: 33
TEST SUITE: None
NOTES:
Thank you for installing flink-app.
Your release is named flink-app-main.
To learn more about the release, try:
  helm status flink-app-main
  helm get flink-app-main
Listing releases matching ^main$
main  mw-page-content-change-enrich  33  2025-08-06 11:44:39.773114119 +0000 UTC  deployed  flink-app-0.4.25
UPDATED RELEASES:
NAME   NAMESPACE                       CHART                  VERSION   DURATION
main   mw-page-content-change-enrich   wmf-stable/flink-app   0.4.25    2s
Looking good so far.
I'm happy that the flinkdeployment object was recreated correctly by the operator.
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl describe flinkdeployments.flink.apache.org flink-app-main
Name: flink-app-main
Namespace: mw-page-content-change-enrich
Labels: app=flink-app
app.kubernetes.io/managed-by=Helm
chart=flink-app-0.4.25
heritage=Helm
release=main
Annotations: meta.helm.sh/release-name: main
meta.helm.sh/release-namespace: mw-page-content-change-enrich
API Version: flink.apache.org/v1beta1
Kind: FlinkDeployment
Metadata:
Creation Timestamp: 2025-08-06T11:44:40Z
Finalizers:
flinkdeployments.flink.apache.org/finalizer
Generation: 2
Resource Version: 357961607
UID: d8c14712-d88a-486b-96b2-44f4c1e029f2
It's in a running state at the moment.
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get flinkdeployment
NAME             JOB STATUS   LIFECYCLE STATE
flink-app-main   RUNNING      STABLE
I'm not really sure that I need to go as far as testing the app with dummy data, since the operator is clearly doing its bit and the app is running.
I think I can go ahead and deploy the operator to production.
@BTullis we were wondering whether the new image of the flink-operator was actually deployed everywhere. I checked, and I still see this in helmfile.d/admin_ng/flink-operator/values.yaml:
repository: docker-registry.discovery.wmnet/flink-kubernetes-operator
tag: 1.4.0-wmf1-20241013
which I suspect is too old?
I can't connect to kube_env admin to actually check the container images.
On dse-k8s-eqiad:
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
Indeed, that is odd.
I'll check eqiad and codfw as well.
root@deploy1003:~# kube-env admin staging-codfw
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
root@deploy1003:~# kube-env admin staging-eqiad
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
root@deploy1003:~# kube-env admin eqiad
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
root@deploy1003:~# kube-env admin codfw
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
Ok, this is indeed odd: all Flink operator deployments are running an image that is about 1.5 years old.
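The fix should just be a values bump; a sketch of the kind of change needed (the patches below apply it per cluster before changing the default), using the image tag built earlier in this task:

repository: docker-registry.discovery.wmnet/flink-kubernetes-operator
tag: 1.12.0-wmf1-20250728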
Change #1182515 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-codfw
Change #1182516 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-eqiad
Change #1182517 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in codfw
Change #1182518 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in eqiad
Change #1182519 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in dse-k8s-eqiad
Change #1182520 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: update operator to 1.12 by default
@dcausse I've submitted a stack of patches to update the operator cluster by cluster (and finally to set the default value to 1.12 everywhere, assuming all the upgrades go fine).
Change #1182515 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-codfw
Change #1182532 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs
Change #1182536 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] WIP
Change #1182536 abandoned by Brouberol:
[operations/deployment-charts@master] WIP
Reason:
I messed up my local git commands.
We were also missing RBAC permissions for the view/deploy ClusterRoles for the flinksessionjobs and flinkstatesnapshots CRDs created in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1173403
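For reference, the missing rules look roughly like this (a sketch, not the literal merged patch; the deploy role would get the full verb set, the view role only the read verbs):

- apiGroups: ["flink.apache.org"]
  resources: ["flinksessionjobs", "flinkstatesnapshots"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]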
Change #1182532 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs
Change #1182549 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] flink-operator: grant the required permissions to manage its own CRDs
Change #1182549 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: grant the required permissions to manage its own CRDs
Change #1182516 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-eqiad
Note: the operator was emitting logs such as
{"@timestamp":"2025-08-27T12:28:02.906Z","log.level": "WARN","message":"FlinkStateSnapshot CRD was not installed, snapshot resources will be disabled!", "ecs.version": "1.2.0","process.thread.name":"main","log.logger":"org.apache.flink.kubernetes.operator.config.FlinkConfigManager"}
caused by the operator's own ServiceAccount lacking permissions on these CRDs.
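One quick way to check for this kind of gap is kubectl auth can-i with ServiceAccount impersonation (the ServiceAccount name below is an assumption):

# Does the operator's ServiceAccount (name assumed) have access to the new CRD?
kubectl auth can-i list flinkstatesnapshots.flink.apache.org \
  --as=system:serviceaccount:flink-operator:flink-operator -n flink-operator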
After having redeployed the new operator to staging-eqiad, I noticed the following two recurring warnings in the logs:
{"@timestamp":"2025-08-27T13:45:42.218Z","log.level": "WARN","message":"Config uses deprecated configuration key 'state.savepoints.dir' instead of proper key 'execution.checkpointing.savepoint-dir'", "ecs.version": "1.2.0","process.thread.name":"ReconcilerExecutor-flinkdeploymentcontroller-227","log.logger":"org.apache.flink.configuration.Configuration","resource.apiVersion":"flink.apache.org/v1beta1","resource.generation":"2","resource.kind":"FlinkDeployment","resource.name":"flink-app-wikidata","resource.namespace":"rdf-streaming-updater","resource.resourceVersion":"366455785","resource.uid":"84d65da7-4737-49a3-841b-d1ec5fc70c3e"}{"@timestamp":"2025-08-27T13:45:43.513Z","log.level": "WARN","message":"Config uses deprecated configuration key 'state.checkpoints.dir' instead of proper key 'execution.checkpointing.dir'", "ecs.version": "1.2.0","process.thread.name":"ReconcilerExecutor-flinkdeploymentcontroller-227","log.logger":"org.apache.flink.configuration.Configuration","resource.apiVersion":"flink.apache.org/v1beta1","resource.generation":"2","resource.kind":"FlinkDeployment","resource.name":"flink-app-wikidata","resource.namespace":"rdf-streaming-updater","resource.resourceVersion":"366455785","resource.uid":"84d65da7-4737-49a3-841b-d1ec5fc70c3e"}@dcausse I think this would mandate a configuration update before we proceed with production upgrades.
Unfortunately we can't update this config value yet: the jobs are mostly running Flink 1.17, where the new key doesn't exist, so we'll have to wait until we upgrade those jobs to 1.20. I think it's fine to continue with this deprecation warning; I'm not sure we have another choice.
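For whenever the jobs do reach 1.20, the key mapping is the one spelled out in the operator's warnings above (directory values below are placeholders):

# Deprecated keys (still required while the jobs run Flink 1.17):
state.savepoints.dir: s3://placeholder/savepoints
state.checkpoints.dir: s3://placeholder/checkpoints
# Flink 1.20+ replacements:
execution.checkpointing.savepoint-dir: s3://placeholder/savepoints
execution.checkpointing.dir: s3://placeholder/checkpoints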
Change #1182517 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in codfw
Change #1182518 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in eqiad
Change #1182519 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: update operator to 1.12 in dse-k8s-eqiad
Change #1182520 merged by Brouberol:
[operations/deployment-charts@master] flink-operator: update operator to 1.12 by default
@dcausse and I upgraded the flink operator to 1.12 in all kube clusters without hiccups.