
Troubleshoot mw-page-content-change-enrich and flink-operator
Closed, Resolved · Public

Description

Per #wikimedia-k8s-sig IRC conversation with @gmodena, @BTullis, @dcausse, and others:

The mw-page-content-change-enrich app died last night with this error and is not coming back up.

This log from the flink k8s operator container might be related.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-09-27T19:35:41Z] <inflatador> bking@deploy2002 deleting flink-operator leader pod to force failover T347521
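For reference, a minimal sketch of what forcing that failover can look like, assuming the operator runs with Kubernetes-based leader election; the namespace, label selector, and lease name below are assumptions, not the values actually used (see T340059 for the exact commands):

    # Namespace and label selector are assumptions.
    kubectl -n flink-operator get pods -l app.kubernetes.io/name=flink-kubernetes-operator
    # With HA enabled, the current leader is recorded in a coordination.k8s.io
    # Lease; holderIdentity names the leader pod (lease name is an assumption).
    kubectl -n flink-operator get lease flink-operator-lease -o jsonpath='{.spec.holderIdentity}'
    # Deleting the leader pod makes the Deployment recreate it while a standby
    # replica takes over leadership.
    kubectl -n flink-operator delete pod <leader-pod-name>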

Operational steps taken so far:

  • Staging
    • helmfile -e staging -i destroy + helmfile -e staging -i apply (see the sketch after this list)
      • Motivation: I noticed that the flinkdeployment resource in staging was in a failed state.
      • Result: flinkdeployment is healthy again.
  • Production eqiad
    • helmfile -e eqiad -i apply
      • Motivation: get the service running again.
      • Result: helmfile prints "comparing release" and exits. No resources are created.
    • Failover flink operator pod (see T340059 for exact commands)
      • Motivation: Unstick anything that might be stuck. Is that a technical description? Sure.
      • Result: The operator failed over cleanly, but it did not fix the deployment problem.
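For reference, a minimal sketch of the destroy-and-reapply sequence, assuming the standard deployment-charts layout on the deploy host; the service directory path below is an assumption:

    # Run from the service's helmfile directory on the deploy host
    # (path is an assumption).
    cd /srv/deployment-charts/helmfile.d/services/mw-page-content-change-enrich
    # Tear down all releases for the environment, then recreate them from
    # scratch; -i prompts for confirmation before acting.
    helmfile -e staging -i destroy
    helmfile -e staging -i apply

A plausible reading of the eqiad symptom above: if helm's recorded release state already says everything is deployed, a plain apply finds nothing to change and exits after "comparing release", even when resources are actually missing from the cluster; destroying first clears that recorded state.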

I haven't found anything useful in the logs yet. At this point, I'm tempted to increment the helm chart version and try to redeploy. @gmodena, let me know if you're interested and we can try it tomorrow.

@bking Gabriele is currently on sick leave, but yes, let's try incrementing the helm chart version.
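For context, a minimal sketch of what that bump looks like in operations/deployment-charts; the chart path and version numbers are assumptions, not taken from the actual patch:

    # Chart path and version numbers are illustrative assumptions.
    cd operations/deployment-charts
    grep '^version:' charts/flink-app/Chart.yaml    # e.g. version: 1.2.3
    # Bump the version so helm treats the next deploy as a new chart release,
    # even if the rendered templates are unchanged.
    sed -i 's/^version: 1\.2\.3$/version: 1.2.4/' charts/flink-app/Chart.yaml
    git commit -a -m "flink-app: increment chart version number"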

Change 961806 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] flink-app: increment chart version number

https://gerrit.wikimedia.org/r/961806

Change 961806 abandoned by Bking:

[operations/deployment-charts@master] flink-app: increment chart version number

Reason:

We got the app to run without changing the chart version

https://gerrit.wikimedia.org/r/961806

bking moved this task from In Progress to Done on the Data-Platform-SRE board.

Per IRC conversation with @dcausse, the application was in a partially-deployed state (he was able to find this via kubectl get networkpolicy).
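A minimal sketch of that kind of check, with an assumed namespace: list what actually exists for the service and compare it against the desired state.

    # Namespace is an assumption; use the service's actual namespace.
    kubectl -n mw-page-content-change-enrich get networkpolicy,deployments,services,configmaps
    # Compare the cluster against the rendered charts (helmfile's diff
    # subcommand requires the helm-diff plugin).
    helmfile -e eqiad diff
    # A release that helm records as deployed, but whose resources are partially
    # missing or stale, indicates a partially-applied state.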

Destroying and re-applying the helmfile fixed the issue, and the application is back up and running in eqiad. Thus, I'm closing out this ticket. Thanks to David and everyone else who helped out!