We've encountered a strange behavior of the flink-kubernetes-operator in the staging k8s cluster. After running for a while, the operator appears to stop syncing changes to the resources it manages.
As far as we can tell, this only happens in k8s staging.
If we try to do a helmfile -e staging deploy (for mw-page-content-change-enrich), regular k8s resources (Roles, NetworkPolicies, etc.) are deleted, but FlinkDeployment and pods are not.
A manual kubectl delete flinkdeployment ... will hang.
We found that deleting the active flink-operator pod (causing a new pod to be created and a different pod to be elected as leader) gets things unstuck.
For now, if you find that you can't sync flink operator related changes in staging, here is a workaround:
kube_env admin staging

# Get the active leader pod (the one that holds the election lease):
kubectl -n flink-operator get lease flink-operator-lease -o yaml
# ...
#   holderIdentity: flink-kubernetes-operator-86b888d6b6-vwgcz
# ...

# Delete the active leader pod:
kubectl -n flink-operator delete pod flink-kubernetes-operator-86b888d6b6-vwgcz
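The leader lookup above boils down to reading holderIdentity out of the lease YAML. A minimal sketch of that extraction, assuming the lease looks like the (trimmed, hypothetical) sample below, so the pod name doesn't have to be copied by eye:

```shell
# Trimmed sample of what the lease YAML might look like, as returned by:
#   kubectl -n flink-operator get lease flink-operator-lease -o yaml
# (the holderIdentity value is the example pod name from above)
lease_yaml='apiVersion: coordination.k8s.io/v1
kind: Lease
spec:
  holderIdentity: flink-kubernetes-operator-86b888d6b6-vwgcz
  leaseDurationSeconds: 15'

# Extract the leader pod name, i.e. the pod to delete to get unstuck.
leader=$(printf '%s\n' "$lease_yaml" | awk '$1 == "holderIdentity:" {print $2}')
echo "$leader"
```

In practice you could skip the YAML parsing entirely with kubectl's jsonpath output, e.g. `kubectl -n flink-operator get lease flink-operator-lease -o jsonpath='{.spec.holderIdentity}'`, and feed the result straight into the delete command.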
After this, you should be able to run helmfile -e staging destroy and/or apply again.