Flink: Update k8s operator to 1.12.0
Closed, Resolved · Public · 3 Estimated Story Points

Description

Provide an image based on version 1.12.0 (June 25) of the operator. This version already supports the new config file format (config.yaml, formerly flink-conf.yaml), which will be enforced from Flink 2.0 on, so we can check compatibility early.
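
For illustration, a minimal sketch of that format change (the key shown is an example, not our actual configuration):

# Legacy flink-conf.yaml: flat, dot-separated keys
taskmanager.numberOfTaskSlots: 2

# New config.yaml: standard nested YAML, mandatory from Flink 2.0
taskmanager:
  numberOfTaskSlots: 2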

Event Timeline

This is becoming increasingly important, as we have an incident affecting some of our Flink applications: T397330: mediawiki.content_history: flink applications experiencing frequent restarts due to JobManager OOMs
We would like to upgrade Flink to version 1.20 before troubleshooting further, and for that we need this updated flink-operator version.

Change #1172351 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/docker-images/production-images@master] Bump the flink-operator image to version 1.12.0

https://gerrit.wikimedia.org/r/1172351

Change #1172351 merged by Btullis:

[operations/docker-images/production-images@master] Bump the flink-operator image to version 1.12.0

https://gerrit.wikimedia.org/r/1172351

The new operator image has been built and published.

(base) btullis@barracuda:~$ docker run --entrypoint=/bin/bash docker-registry.wikimedia.org/flink-kubernetes-operator:1.12.0-wmf1-20250728
Unable to find image 'docker-registry.wikimedia.org/flink-kubernetes-operator:1.12.0-wmf1-20250728' locally
1.12.0-wmf1-20250728: Pulling from flink-kubernetes-operator
7b9a184c33c9: Pull complete 
1ed89d7564b9: Pull complete 
49e765ce2954: Pull complete 
ab0b3b9d2fef: Pull complete 
e561c1bbe6af: Pull complete 
91c1bbeda7c8: Pull complete 
2e1dd8fa1f91: Pull complete 
ef2ef1cfb61b: Pull complete 
Digest: sha256:64be51c06690f3cf3634564ab676f058fe2994533174fdc74c6803655ee1b853

flink@00d724aea59f:/opt/flink-kubernetes-operator$ find
.
./lib
./lib/flink-kubernetes-standalone-1.12.0.jar
./lib/flink-kubernetes-operator-1.12.0-shaded.jar
./lib/ecs-logging-core-1.5.0.jar
./lib/log4j2-ecs-layout-1.5.0.jar
./webhook
./webhook/flink-kubernetes-webhook-1.12.0-shaded.jar
flink@00d724aea59f:/opt/flink-kubernetes-operator$

I'll now work on updating the CRDs and the helm chart.
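
For the CRD update, the upstream definitions can be pulled from the official chart for comparison; a sketch, assuming the release repo layout documented by the Flink operator project:

# Add the upstream chart repo for this release and extract its CRDs
helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.12.0/
helm show crds flink-operator-repo/flink-kubernetes-operator > flink-operator-crds-1.12.0.yaml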

Change #1173403 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Upgrade the flink-operator CRDs to match the upstream release v1.12

https://gerrit.wikimedia.org/r/1173403

Change #1173407 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update flink-operator helm chart to match the upstream release 1.12

https://gerrit.wikimedia.org/r/1173407

Change #1173403 merged by jenkins-bot:

[operations/deployment-charts@master] Upgrade the flink-operator CRDs to match the upstream release v1.12

https://gerrit.wikimedia.org/r/1173403

Change #1173407 merged by jenkins-bot:

[operations/deployment-charts@master] Update flink-operator helm chart to match the upstream release v1.12

https://gerrit.wikimedia.org/r/1173407

The flink-operator has been upgraded to version 1.12.0 in both the staging-codfw and staging-eqiad clusters.

NAME: flink-operator
LAST DEPLOYED: Wed Aug  6 11:14:21 2025
NAMESPACE: flink-operator
STATUS: deployed
REVISION: 3
TEST SUITE: None

Listing releases matching ^flink-operator$
NAME          	NAMESPACE     	REVISION	UPDATED                                	STATUS  	CHART                          	APP VERSION
flink-operator	flink-operator	3       	2025-08-06 11:14:21.558594916 +0000 UTC	deployed	flink-kubernetes-operator-2.4.2	1.12.0


UPDATED RELEASES:
NAME             NAMESPACE        CHART                                  VERSION   DURATION
flink-operator   flink-operator   wmf-stable/flink-kubernetes-operator   2.4.2          57s

So it's now available for testing FlinkDeployments running Flink versions 1.18 through 2.0.
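
For reference, the Flink version is selected per application in the FlinkDeployment spec; a minimal, hypothetical fragment (the name and image are placeholders):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-app-example   # placeholder
spec:
  flinkVersion: v1_20   # anywhere in the supported v1_18..v2_0 range
  image: docker-registry.discovery.wmnet/flink-app-example:latest   # placeholder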

I will attempt to test the mw-page-content-change-enrich flink application on it.
@Ottomata mentioned in this thread:

mw-page-content-change-enrich should be good for that in staging. it uses kafka-test clusters, so you’d have to pipe some fake data to the source topics to get it to do anything real

@gmodena also shared this:

the easiest way to test is to just kafkacat mediawiki.page_change.v1 from kafka-main to mediawiki.page_change.v1 in kafka-test. The app will pick up new messages and produce into mediawiki.page_content_change.v1 (on kafka-test). You'll hit production Action APIs, but a reasonable load (around 20-25 rps) should be fine
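
A sketch of that copy pipe with kafkacat (the broker addresses are placeholders, not our actual brokers):

# Copy the most recent 1000 page_change events from kafka-main into kafka-test
kafkacat -C -b KAFKA_MAIN_BROKER:9092 -t mediawiki.page_change.v1 -o -1000 -e \
  | kafkacat -P -b KAFKA_TEST_BROKER:9092 -t mediawiki.page_change.v1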

The first test is to make sure that the operator restarts it correctly if the flinkdeployment goes away.

btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get flinkdeployment
NAME             JOB STATUS   LIFECYCLE STATE
flink-app-main   RUNNING      STABLE

btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl delete flinkdeployment flink-app-main
flinkdeployment.flink.apache.org "flink-app-main" deleted

btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get events -w
LAST SEEN   TYPE      REASON      OBJECT                               MESSAGE
21s         Normal    Cleanup     flinkdeployment/flink-app-main       Cleaning up FlinkDeployment
20s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-app-main-tls-proxy
20s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-main-container
18s         Warning   Unhealthy   pod/flink-app-main-b8f5ffcb5-2qgvg   Readiness probe failed: Get "http://10.64.64.34:1667/healthz": dial tcp 10.64.64.34:1667: connect: connection refused
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-app-main-tls-proxy
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-main-container
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-app-main-tls-proxy
20s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-main-container
^Cbtullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get pods
No resources found in mw-page-content-change-enrich namespace.
btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get events -w
LAST SEEN   TYPE      REASON      OBJECT                               MESSAGE
48s         Normal    Cleanup     flinkdeployment/flink-app-main       Cleaning up FlinkDeployment
47s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-app-main-tls-proxy
47s         Normal    Killing     pod/flink-app-main-b8f5ffcb5-2qgvg   Stopping container flink-main-container
45s         Warning   Unhealthy   pod/flink-app-main-b8f5ffcb5-2qgvg   Readiness probe failed: Get "http://10.64.64.34:1667/healthz": dial tcp 10.64.64.34:1667: connect: connection refused
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-app-main-tls-proxy
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-1   Stopping container flink-main-container
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-app-main-tls-proxy
47s         Normal    Killing     pod/flink-app-main-taskmanager-1-2   Stopping container flink-main-container
^Cbtullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$

It didn't automatically restart the app, but I'm not sure whether it should have.

I roll-restarted it using helmfile sync and it started correctly.

btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ helmfile -e staging --state-values-set roll_restart=1 sync
Upgrading release=main, chart=wmf-stable/flink-app, namespace=mw-page-content-change-enrich
Release "main" has been upgraded. Happy Helming!
NAME: main
LAST DEPLOYED: Wed Aug  6 11:44:39 2025
NAMESPACE: mw-page-content-change-enrich
STATUS: deployed
REVISION: 33
TEST SUITE: None
NOTES:
Thank you for installing flink-app.

Your release is named flink-app-main.

To learn more about the release, try:

  helm status flink-app-main
  helm get flink-app-main

Listing releases matching ^main$
NAME	NAMESPACE                    	REVISION	UPDATED                                	STATUS  	CHART           	APP VERSION
main	mw-page-content-change-enrich	33      	2025-08-06 11:44:39.773114119 +0000 UTC	deployed	flink-app-0.4.25	


UPDATED RELEASES:
NAME   NAMESPACE                       CHART                  VERSION   DURATION
main   mw-page-content-change-enrich   wmf-stable/flink-app   0.4.25          2s

Looking good so far.

I'm happy that the flinkdeployment object was recreated correctly by the operator.

btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl describe flinkdeployments.flink.apache.org flink-app-main 
Name:         flink-app-main
Namespace:    mw-page-content-change-enrich
Labels:       app=flink-app
              app.kubernetes.io/managed-by=Helm
              chart=flink-app-0.4.25
              heritage=Helm
              release=main
Annotations:  meta.helm.sh/release-name: main
              meta.helm.sh/release-namespace: mw-page-content-change-enrich
API Version:  flink.apache.org/v1beta1
Kind:         FlinkDeployment
Metadata:
  Creation Timestamp:  2025-08-06T11:44:40Z
  Finalizers:
    flinkdeployments.flink.apache.org/finalizer
  Generation:        2
  Resource Version:  357961607
  UID:               d8c14712-d88a-486b-96b2-44f4c1e029f2

It's in a running state at the moment.

btullis@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich$ kubectl get flinkdeployment
NAME             JOB STATUS   LIFECYCLE STATE
flink-app-main   RUNNING      STABLE

I'm not really sure that I need to go as far as testing the app with dummy data, since the operator is clearly doing its bit and the app is running.

I think I can go ahead and deploy the operator to production.

Upgraded on codfw, eqiad, and dse-k8s-eqiad.

@BTullis we were wondering whether the new image of the flink-operator was actually deployed everywhere. I checked, and I still see the following in helmfile.d/admin_ng/flink-operator/values.yaml:

repository: docker-registry.discovery.wmnet/flink-kubernetes-operator
tag: 1.4.0-wmf1-20241013

which I suspect is too old.
I can't connect via kube_env admin to actually check the container images.

On dse-k8s-eqiad:

root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"

Indeed, that is odd.

I'll check eqiad and codfw as well.

root@deploy1003:~# kube-env admin staging-codfw
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
root@deploy1003:~# kube-env admin staging-eqiad
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
root@deploy1003:~# kube-env admin eqiad
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
root@deploy1003:~# kube-env admin codfw
root@deploy1003:~# kubectl get pod -n flink-operator -ojson | jq '.items[].spec.containers[0].image'
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"
"docker-registry.discovery.wmnet/flink-kubernetes-operator:1.4.0-wmf1-20241013"

OK, this is indeed odd. All Flink operator deployments are running an image that is about 1.5 years old.

Change #1182515 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-codfw

https://gerrit.wikimedia.org/r/1182515

Change #1182516 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-eqiad

https://gerrit.wikimedia.org/r/1182516

Change #1182517 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in codfw

https://gerrit.wikimedia.org/r/1182517

Change #1182518 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in eqiad

https://gerrit.wikimedia.org/r/1182518

Change #1182519 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1182519

Change #1182520 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: update operator to 1.12 by default

https://gerrit.wikimedia.org/r/1182520

@dcausse I've submitted a stack of patches to update the operator cluster by cluster (and finally set the default value to 1.12 everywhere, assuming all upgrades went fine).
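
For illustration, each per-cluster override amounts to an image tag bump in that cluster's values file; a hedged sketch (the exact file path is an assumption, the tag is the image built earlier in this task):

# e.g. helmfile.d/admin_ng/flink-operator/values-staging-codfw.yaml (path assumed)
repository: docker-registry.discovery.wmnet/flink-kubernetes-operator
tag: 1.12.0-wmf1-20250728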

Change #1182515 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-codfw

https://gerrit.wikimedia.org/r/1182515

Change #1182532 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs

https://gerrit.wikimedia.org/r/1182532

Change #1182536 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] WIP

https://gerrit.wikimedia.org/r/1182536

Change #1182536 abandoned by Brouberol:

[operations/deployment-charts@master] WIP

Reason:

I messed up my local git commands.

https://gerrit.wikimedia.org/r/1182536

We were also missing RBAC permissions for the view/deploy ClusterRoles for the flinksessionjobs and flinkstatesnapshots CRDs created in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1173403
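
A hedged sketch of the kind of RBAC rule involved (the role name and verb list are illustrative, not the chart's exact templating):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deploy-flink   # assumed name, standing in for the view/deploy ClusterRoles
rules:
  - apiGroups: ["flink.apache.org"]
    resources: ["flinksessionjobs", "flinkstatesnapshots"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]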

Change #1182532 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: grant RBAC permissions to view/deploy ClusterRoles on new flink CRDs

https://gerrit.wikimedia.org/r/1182532

Change #1182549 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] flink-operator: grant the required permissions to manage its own CRDs

https://gerrit.wikimedia.org/r/1182549

Change #1182549 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: grant the required permissions to manage its own CRDs

https://gerrit.wikimedia.org/r/1182549

Change #1182516 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in staging-eqiad

https://gerrit.wikimedia.org/r/1182516

Note: the operator was emitting logs such as

{"@timestamp":"2025-08-27T12:28:02.906Z","log.level": "WARN","message":"FlinkStateSnapshot CRD was not installed, snapshot resources will be disabled!", "ecs.version": "1.2.0","process.thread.name":"main","log.logger":"org.apache.flink.kubernetes.operator.config.FlinkConfigManager"}

caused by the operator's own ServiceAccount lacking permissions on these CRDs.
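
A hedged sketch of the corresponding rule for the operator's own Role/ClusterRole (the resource names come from the warning above; the verb list is an assumption):

- apiGroups: ["flink.apache.org"]
  resources: ["flinkstatesnapshots", "flinkstatesnapshots/status"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]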

After redeploying the new operator to staging-eqiad, I noticed the following two recurring warnings in the logs:

{"@timestamp":"2025-08-27T13:45:42.218Z","log.level": "WARN","message":"Config uses deprecated configuration key 'state.savepoints.dir' instead of proper key 'execution.checkpointing.savepoint-dir'", "ecs.version": "1.2.0","process.thread.name":"ReconcilerExecutor-flinkdeploymentcontroller-227","log.logger":"org.apache.flink.configuration.Configuration","resource.apiVersion":"flink.apache.org/v1beta1","resource.generation":"2","resource.kind":"FlinkDeployment","resource.name":"flink-app-wikidata","resource.namespace":"rdf-streaming-updater","resource.resourceVersion":"366455785","resource.uid":"84d65da7-4737-49a3-841b-d1ec5fc70c3e"}
{"@timestamp":"2025-08-27T13:45:43.513Z","log.level": "WARN","message":"Config uses deprecated configuration key 'state.checkpoints.dir' instead of proper key 'execution.checkpointing.dir'", "ecs.version": "1.2.0","process.thread.name":"ReconcilerExecutor-flinkdeploymentcontroller-227","log.logger":"org.apache.flink.configuration.Configuration","resource.apiVersion":"flink.apache.org/v1beta1","resource.generation":"2","resource.kind":"FlinkDeployment","resource.name":"flink-app-wikidata","resource.namespace":"rdf-streaming-updater","resource.resourceVersion":"366455785","resource.uid":"84d65da7-4737-49a3-841b-d1ec5fc70c3e"}

@dcausse I think this warrants a configuration update before we proceed with production upgrades.
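
Concretely, the renames implied by those warnings are:

state.savepoints.dir   ->  execution.checkpointing.savepoint-dir
state.checkpoints.dir  ->  execution.checkpointing.dir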

Unfortunately we can't update these config values yet: the jobs are mostly running Flink 1.17, where the new keys don't exist, so we'll have to wait until we upgrade those jobs to 1.20. I think it's fine to continue with this deprecation warning; I'm not sure we have another choice.

Change #1182517 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in codfw

https://gerrit.wikimedia.org/r/1182517

Change #1182518 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in eqiad

https://gerrit.wikimedia.org/r/1182518

Change #1182519 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: update operator to 1.12 in dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1182519

Change #1182520 merged by Brouberol:

[operations/deployment-charts@master] flink-operator: update operator to 1.12 by default

https://gerrit.wikimedia.org/r/1182520

@dcausse and I upgraded the flink operator to 1.12 in all kube clusters without hiccups.