Per the #wikimedia-k8s-sig IRC conversation with @gmodena, @BTullis, @dcausse and others:
The mw-page-content-change-enrich app died last night with this error and is not coming back up.
This log in the flink k8s operator container might be related
Subject | Repo | Branch | Lines +/- |
---|---|---|---|
flink-app: increment chart version number | operations/deployment-charts | master | +1 -1 |
Mentioned in SAL (#wikimedia-operations) [2023-09-27T19:35:41Z] <inflatador> bking@deploy2002 deleting flink-operator leader pod to force failover T347521
Operational steps taken so far:
I haven't found anything useful in the logs yet. At this point, I'm tempted to increment the helm chart version and try to redeploy. @gmodena let me know if you are interested and we can try it tomorrow.
@bking Gabriele is currently on sick leave, but yes, let's try incrementing the helm chart version.
Change 961806 had a related patch set uploaded (by Bking; author: Bking):
[operations/deployment-charts@master] flink-app: increment chart version number
Change 961806 abandoned by Bking:
[operations/deployment-charts@master] flink-app: increment chart version number
Reason:
We got the app to run without changing the chart version
Per an IRC conversation with @dcausse, the application was in a partially deployed state (he found this via kubectl get networkpolicy).
Destroying and re-applying the helmfile fixed the issue, and the application is back up and running in eqiad. Thus, I'm closing out this ticket. Thanks to David and everyone else who helped out!
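For future reference, the recovery described above can be sketched as a small dry-run script. This is only an illustration, not the exact commands that were run: the namespace name and the helmfile environment are assumptions based on the app name and "eqiad" mentioned in this task, and the helmfile commands are assumed to be run from the service's helmfile directory on the deploy host.

```python
import shlex
import subprocess

# Assumptions (not confirmed in this task): the namespace matches the app
# name, and the helmfile environment is named after the datacenter.
NAMESPACE = "mw-page-content-change-enrich"
ENVIRONMENT = "eqiad"


def recovery_commands(namespace=NAMESPACE, environment=ENVIRONMENT):
    """Return the inspection and redeploy commands, in order, as argument lists."""
    return [
        # Inspect leftover resources that hint at a partial deployment.
        ["kubectl", "get", "networkpolicy", "-n", namespace],
        # Tear down the release, then deploy it again from scratch.
        ["helmfile", "-e", environment, "destroy"],
        ["helmfile", "-e", environment, "apply"],
    ]


def run_recovery(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in recovery_commands():
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_recovery(dry_run=True)
```

By default the script only prints the commands so they can be reviewed first; passing dry_run=False executes them and stops at the first failure.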