
Troubleshoot mw-page-content-change-enrich and flink-operator
Closed, Resolved · Public

Description

Per #wikimedia-k8s-sig IRC conversation with @gmodena, @BTullis, @dcausse, and others:

The mw-page-content-change-enrich app died last night with this error and is not coming back up.

This log from the flink k8s operator container might be related.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-09-27T19:35:41Z] <inflatador> bking@deploy2002 deleting flink-operator leader pod to force failover T347521
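For reference, a minimal sketch of what forcing that failover can look like, assuming the operator runs with Kubernetes-based leader election; the namespace, label selector, and lease name below are assumptions, not the values actually used (see T340059 for the exact commands):

    # Namespace and label selector are assumptions.
    kubectl -n flink-operator get pods -l app.kubernetes.io/name=flink-kubernetes-operator
    # With HA enabled, the current leader is recorded in a coordination.k8s.io
    # Lease; holderIdentity names the leader pod (lease name is an assumption).
    kubectl -n flink-operator get lease flink-operator-lease -o jsonpath='{.spec.holderIdentity}'
    # Deleting the leader pod makes the Deployment recreate it while a standby
    # replica takes over leadership.
    kubectl -n flink-operator delete pod <leader-pod-name>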

Operational steps taken so far:

  • Staging
    • helmfile -e staging -i destroy + helmfile -e staging -i apply (see the sketch after this list)
      • Motivation: I noticed that the flinkdeployment resource in staging was in a failed state.
      • Result: flinkdeployment is healthy again.
  • Production eqiad
    • helmfile -e eqiad -i apply
      • Motivation: get the service running again.
      • Result: helmfile prints "comparing release" and exits. No resources are created.
    • Failover flink operator pod (see T340059 for exact commands)
      • Motivation: Unstick anything that might be stuck. Is that a technical description? Sure.
      • Result: The operator failed over cleanly, but it did not fix the deployment problem.
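For reference, a minimal sketch of the destroy-and-reapply sequence, assuming the standard deployment-charts layout on the deploy host; the service directory path below is an assumption:

    # Run from the service's helmfile directory on the deploy host
    # (path is an assumption).
    cd /srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich
    # Tear down all releases for the environment, then recreate them from
    # scratch; -i prompts for confirmation before acting.
    helmfile -e staging -i destroy
    helmfile -e staging -i apply

A plausible reading of the eqiad symptom above: if helm's recorded release state already says everything is deployed, a plain apply finds nothing to change and exits after "comparing release", even when resources are actually missing from the cluster; destroying first clears that recorded state.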

I haven't found anything useful in the logs yet. At this point, I'm tempted to increment the helm chart version and try to redeploy. @gmodena, let me know if you're interested and we can try it tomorrow.

@bking Gabriele is currently on sick leave, but yes, let's try incrementing the helm chart version.
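For context, a minimal sketch of what that bump looks like in operations/deployment-charts; the chart path and version numbers are assumptions, not taken from the actual patch:

    # Chart path and version numbers are illustrative assumptions.
    cd operations/deployment-charts
    grep '^version:' charts/flink-app/Chart.yaml    # e.g. version: 1.2.3
    # Bump the version so helm treats the next deploy as a new chart release,
    # even if the rendered templates are unchanged.
    sed -i 's/^version: 1\.2\.3$/version: 1.2.4/' charts/flink-app/Chart.yaml
    git commit -a -m "flink-app: increment chart version number"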

Change 961806 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-app: increment chart version number

https://gerrit.wikimedia.org/r/961806

Change 961806 abandoned by Bking:

[operations/deployment-charts@master] flink-app: increment chart version number

Reason:

We got the app to run without changing the chart version

https://gerrit.wikimedia.org/r/961806

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

Per IRC conversation with @dcausse, the application was in a partially-deployed state (he was able to find this via kubectl get networkpolicy).
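A minimal sketch of that kind of check, with an assumed namespace: list what actually exists for the service and compare it against the desired state.

    # Namespace is an assumption; use the service's actual namespace.
    kubectl -n mw-page-content-change-enrich get networkpolicy,deployments,services,configmaps
    # Compare the cluster against the rendered charts (helmfile's diff
    # subcommand requires the helm-diff plugin).
    helmfile -e eqiad diff
    # A release that helm records as deployed, but whose resources are partially
    # missing or stale, indicates a partially-applied state.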

Destroying and re-applying the helmfile fixed the issue, and the application is back up and running in eqiad. Thus, I'm closing out this ticket. Thanks to David and everyone else who helped out!