From @tchin via Slack:
The mw-content-history-reconcile-enrich Flink job failed. It looks like the taskmanagers OOM’d. Giving it a restart for now to see if it’ll fix things, but we might want to increase taskmanager replicas again. (Flink HA doesn’t help in this case because it protects against JobManager failures, not taskmanager OOMs. I just realized we should increase JobManager replicas as well.)
Worker flink-app-production-taskmanager-1-3 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=137, reason=OOMKilled, message=null)]
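For anyone triaging similar failures: exit code 137 follows the common 128 + signal-number convention, so 137 - 128 = 9, which is SIGKILL, the signal the kernel OOM killer sends. The snippet below is a minimal sketch of pulling that out of a diagnostics line like the one above. The regex and the helper names (`decode_exit_code`, `parse_termination`) are hypothetical and only illustrate the idea; they are not part of Flink or of our tooling.

```python
import re
import signal

# Hypothetical pattern for the "container(exitCode=N, reason=R, ...)" fragment
# seen in the diagnostics message; not an official Flink log format spec.
STATUS_RE = re.compile(
    r"(?P<container>[\w-]+)\(exitCode=(?P<code>\d+), reason=(?P<reason>\w+)"
)


def decode_exit_code(exit_code: int) -> str:
    """Describe an exit code using the 128 + signal-number convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {exit_code}"


def parse_termination(diagnostics: str):
    """Return container, exit code, reason and a decoded summary, or None."""
    match = STATUS_RE.search(diagnostics)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "container": match.group("container"),
        "exit_code": code,
        "reason": match.group("reason"),
        "summary": decode_exit_code(code),
    }


if __name__ == "__main__":
    line = (
        "Worker flink-app-production-taskmanager-1-3 is terminated. "
        "Diagnostics: Pod terminated, container termination statuses: "
        "[flink-main-container(exitCode=137, reason=OOMKilled, message=null)]"
    )
    print(parse_termination(line))
```

The `reason=OOMKilled` field is the authoritative signal here; decoding the exit code is just a cross-check that the container was killed rather than exiting on its own.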
Incident report @tchin put together: https://wikitech.wikimedia.org/wiki/Incidents/2025-03-01_mw-content-history-reconcile-enrich
In this task we should:
- Document the root cause and finish the incident report linked above.
- Attach any MRs / patchsets related to the incident.