
Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments
Closed, ResolvedPublic

Description

We've encountered a strange behavior of the flink-kubernetes-operator in the staging k8s cluster. After running for some period, it seems the flink operator fails to sync changes to the resources it manages.

As far as we can tell, this only happens in k8s staging.

If we try to do a helmfile -e staging deploy (for mw-page-content-change-enrich), regular k8s resources (Roles, NetworkPolicies, etc.) are deleted, but FlinkDeployment and pods are not.

A manual kubectl delete flinkdeployment ... will hang.
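When the delete hangs, one thing worth checking (a rough sketch; the resource and namespace names are placeholders) is whether the FlinkDeployment still carries a finalizer that the operator is supposed to remove:

# placeholders, substitute the real namespace / FlinkDeployment name
kubectl -n <app-namespace> get flinkdeployment <name> -o jsonpath='{.metadata.finalizers}'
# if a finalizer is still listed, the operator has not reconciled the deletion,
# which would explain the hanging kubectl delete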

We found that deleting the active flink-operator pod (causing a new one to be created and a different pod to be elected as leader) gets things unstuck.


For now, if you find that you can't sync flink operator related changes in staging, here is a workaround:

kube_env admin staging
# get the active leader pod (the one that holds the election lease):
kubectl -n flink-operator get lease flink-operator-lease -o yaml
# ...
  holderIdentity: flink-kubernetes-operator-86b888d6b6-vwgcz
# ...


# delete the active leader pod:
kubectl -n flink-operator delete pod flink-kubernetes-operator-86b888d6b6-vwgcz

After this, you should be able to helmfile -e staging destroy and/or apply.
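To confirm the workaround took effect (assuming the same lease as above), check that a new pod now holds the election lease before retrying the helmfile operation:

# holderIdentity should now show a different pod name
kubectl -n flink-operator get lease flink-operator-lease -o yaml | grep holderIdentity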

Event Timeline


Did that happen in DSE as well? Are there logs (from the operator, k8s events etc.)?

If we try to do a helmfile -e staging deploy (for mw-page-content-change-enrich), regular k8s resources (Roles, NetworkPolicies, etc.) are deleted, but FlinkDeployment and pods are not.

Those are not part of the deployment and therefore not known to helm. It's the operator's responsibility to deal with them.

No, we've only seen it in staging. There are some suspicious events in both the flink-operator and the app namespaces about ConfigMap / VolumeMount sync timeouts, although I don't have them on hand right now to paste. I'll collect them the next time this happens.
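For reference, a minimal sketch of how those events could be collected next time (the app namespace name is illustrative):

# recent events in the operator namespace, newest last
kubectl -n flink-operator get events --sort-by=.lastTimestamp
# and in the application namespace
kubectl -n mw-page-content-change-enrich get events --sort-by=.lastTimestamp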

Did that happen in DSE as well? Are there logs (from the operator, k8s events etc.)?

Follow-up to what @Ottomata said above.
We never had issues in DSE, and we have been able to deploy to codfw and eqiad.

The problems so far seem limited to staging.

@JMeybohm did something change in the staging operator deployment?

Right now the system is behaving as expected (I can deploy / destroy mw-page-content-change-enrich), and I can't reproduce the erroneous behavior anymore.

I did not check git/deployments, but I don't think anybody apart from you is working with it :)

Right now staging deployments are working, but mw-page-content-change-enrich is failing to start up in eqiad and codfw.

mw-page-content-change-enrich pods went offline in both codfw and eqiad on June 30 and failed to resume. I was not able to restart them manually and had to destroy both deployments and reapply the changes. TaskManagers now fail to start with:

{"@timestamp":"2023-07-04T14:59:16.935Z","log.level": "WARN","message":"Discard registration from TaskExecutor flink-app-main-taskmanager-1-1 at (akka.tcp://flink@10.67.161.40:6122/user/rpc/taskmanager_0) because the framework did not recognize it", "ecs.version": "1.2.0","process.thread.name":"flink-akka.actor.default-dispatcher-17","log.logger":"org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager"}

I need to dig deeper into logstash, but it looks like Flink is unable to restore from HA: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.26?id=sQxjDYkB8zcd3ouB51Sq

Investigating.
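As a starting point, a hedged sketch of how the HA metadata could be inspected, assuming the job uses Flink's native Kubernetes HA services (namespace and ConfigMap names are illustrative):

# Flink's Kubernetes HA services keep leader / job metadata in ConfigMaps;
# if restore from HA fails, these are a reasonable place to look
kubectl -n mw-page-content-change-enrich get configmaps | grep flink-app
# inspect one of them for stale pointers to checkpoints / job graphs
kubectl -n mw-page-content-change-enrich get configmap <flink-app-...> -o yaml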

@dcausse I thought it might have been related to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/935153, but:

  1. The patch was successfully deployed in `staging`.
  2. I rolled back to v1.23.0 and the issue in eqiad/codfw still persists.
  3. HA restore failures started appearing on June 30.

I'll keep you posted. I'll open a dedicated phab task if needed.

I'll keep you posted. I'll open a dedicated phab task if needed.

This HA restore failure is an application issue (not operator related).
See https://phabricator.wikimedia.org/T341096