
Flink k8s operator in staging sometimes will not sync changes to FlinkDeployments
Closed, ResolvedPublic

Description

We've encountered a strange behavior of the flink-kubernetes-operator in the staging k8s cluster. After running for some period, it seems the flink operator fails to sync changes to the resources it manages.

As far as we can tell, this only happens in k8s staging.

If we try to do a helmfile -e staging deploy (for mw-page-content-change-enrich), regular k8s resources (Roles, NetworkPolicies, etc.) are deleted, but FlinkDeployment and pods are not.

A manual kubectl delete flinkdeployment ... will hang.
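When the delete hangs, one thing worth checking (a rough sketch; the resource and namespace names are placeholders) is whether the FlinkDeployment still carries a finalizer that the operator is supposed to remove:

# placeholders, substitute the real namespace / FlinkDeployment name
kubectl -n <app-namespace> get flinkdeployment <name> -o jsonpath='{.metadata.finalizers}'
# if a finalizer is still listed, the operator has not reconciled the deletion,
# which would explain the hanging kubectl delete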

We found that deleting the active flink-operator pod (causing a new one to be created and a different pod to be elected as leader) gets things unstuck.


For now, if you find that you can't sync flink operator related changes in staging, here is a workaround:

kube_env admin staging
# get the active leader pod (the one that holds the election lease):
kubectl -n flink-operator get lease flink-operator-lease -o yaml
# ...
  holderIdentity: flink-kubernetes-operator-86b888d6b6-vwgcz
# ...


# delete the active leader pod:
kubectl -n flink-operator delete pod flink-kubernetes-operator-86b888d6b6-vwgcz

After this, you should be able to helmfile -e staging destroy and/or apply.
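To confirm the workaround took effect (assuming the same lease as above), check that a new pod now holds the election lease before retrying the helmfile operation:

# holderIdentity should now show a different pod name
kubectl -n flink-operator get lease flink-operator-lease -o yaml | grep holderIdentity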

Event Timeline


Did that happen in DSE as well? Are there logs (from the operator, k8s events etc.)?

If we try to do a helmfile -e staging deploy (for mw-page-content-change-enrich), regular k8s resources (Roles, NetworkPolicies, etc.) are deleted, but FlinkDeployment and pods are not.

Those are not part of the deployment and therefore not known to helm. It's the operator's responsibility to deal with them.

No, we've only seen it in staging. There are some suspicious events in both the flink-operator and the app namespaces about ConfigMap / VolumeMount sync timeouts, although I don't have them on hand right now to paste. I'll collect them the next time this happens.
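For reference, a minimal sketch of how those events could be collected next time (the app namespace name is illustrative):

# recent events in the operator namespace, newest last
kubectl -n flink-operator get events --sort-by=.lastTimestamp
# and in the application namespace
kubectl -n mw-page-content-change-enrich get events --sort-by=.lastTimestamp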

Did that happen in DSE as well? Are there logs (from the operator, k8s events etc.)?

Follow-up to what @Ottomata said above.
We never had issues in DSE, and we have been able to deploy to codfw and eqiad.

The problems so far seem limited to staging.

@JMeybohm did something change in the staging operator deployment?

Right now the system is behaving as expected (I can deploy / destroy mw-page-content-change-enrich), and I can't reproduce the erroneous behavior anymore.

I did not check git/deployments, but I don't think anybody apart from you is working with it :)

Right now staging deployments are working, but mw-page-content-change-enrich is failing to start up in eqiad and codfw.

mw-page-content-change-enrich pods went offline in both codfw and eqiad on June 30 and failed to resume. I was not able to restart them manually and had to destroy both deployments and reapply the changes. TaskManagers now fail to start with:

{"@timestamp":"2023-07-04T14:59:16.935Z","log.level": "WARN","message":"Discard registration from TaskExecutor flink-app-main-taskmanager-1-1 at (akka.tcp://flink@10.67.161.40:6122/user/rpc/taskmanager_0) because the framework did not recognize it", "ecs.version": "1.2.0","process.thread.name":"flink-akka.actor.default-dispatcher-17","log.logger":"org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager"}

I need to dig deeper into logstash, but it looks like Flink is unable to restore from HA: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.26?id=sQxjDYkB8zcd3ouB51Sq

Investigating.
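As a starting point, a hedged sketch of how the HA metadata could be inspected, assuming the job uses Flink's native Kubernetes HA services (namespace and ConfigMap names are illustrative):

# Flink's Kubernetes HA services keep leader / job metadata in ConfigMaps;
# if restore from HA fails, these are a reasonable place to look
kubectl -n mw-page-content-change-enrich get configmaps | grep flink-app
# inspect one of them for stale pointers to checkpoints / job graphs
kubectl -n mw-page-content-change-enrich get configmap <flink-app-...> -o yaml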

@dcausse I thought it might have been related to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/935153, but:

  1. The patch was successfully deployed in `staging`.
  2. I rolled back to v1.23.0 and the issue in eqiad/codfw still persists.
  3. HA restore failures started appearing on June 30.

I'll keep you posted. I'll open a dedicated phab task if needed.

I'll keep you posted. I'll open a dedicated phab task if needed.

This HA restore failure is an application issue (not operator related).
See https://phabricator.wikimedia.org/T341096