
Investigate why the mw-content-history-reconcile-enrich Flink job failed.
Closed, Resolved (Public)

Description

From @tchin via Slack:

The mw-content-history-reconcile-enrich flink job failed. It looks like the taskmanagers OOM’d. Giving it a restart for now to see if it’ll fix things, but we might want to increase taskmanager replicas again. (Flink HA doesn’t help in this case because it protects against JobManager failures, which I just realized we should also increase replicas of)

Worker flink-app-production-taskmanager-1-3 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=137, reason=OOMKilled, message=null)]
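
For triage, the diagnostics line above can be parsed mechanically. The following is a minimal Python sketch, assuming the message format stays as shown in that line; the function name `parse_termination_statuses` and the regular expression are illustrative and not part of Flink or its Kubernetes operator. Exit code 137 is 128 + 9, meaning the container was killed with SIGKILL, which is consistent with the `OOMKilled` reason.

```python
import re

# Matches entries such as:
#   flink-main-container(exitCode=137, reason=OOMKilled, message=null)
# The pattern is an assumption based on the single log line in this task.
_STATUS_RE = re.compile(
    r"(?P<container>[\w.-]+)"
    r"\(exitCode=(?P<exit_code>\d+), "
    r"reason=(?P<reason>\w+), "
    r"message=(?P<message>[^)]*)\)"
)


def parse_termination_statuses(diagnostics: str) -> list[dict]:
    """Extract per-container termination details from a diagnostics string."""
    statuses = []
    for match in _STATUS_RE.finditer(diagnostics):
        exit_code = int(match.group("exit_code"))
        statuses.append(
            {
                "container": match.group("container"),
                "exit_code": exit_code,
                "reason": match.group("reason"),
                # Exit codes above 128 conventionally encode 128 + signal number.
                "signal": exit_code - 128 if exit_code > 128 else None,
                "oom_killed": match.group("reason") == "OOMKilled",
            }
        )
    return statuses


if __name__ == "__main__":
    line = (
        "Worker flink-app-production-taskmanager-1-3 is terminated. "
        "Diagnostics: Pod terminated, container termination statuses: "
        "[flink-main-container(exitCode=137, reason=OOMKilled, message=null)]"
    )
    for status in parse_termination_statuses(line):
        print(status)
```

A check like this only confirms that the container was killed for exceeding its memory limit; it does not say which part of the job consumed the memory.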

Original Slack thread.

Incident report @tchin put together: https://wikitech.wikimedia.org/wiki/Incidents/2025-03-01_mw-content-history-reconcile-enrich.

On this task we should:

  • Document root cause and finish incident above.
  • Attach any MRs / Patchsets related to the incident.

Details

Other Assignee
xcollazo
Title: analytics: Keep inconsistent_rows data for 180 days.
Reference: repos/data-engineering/airflow-dags!1098
Author: xcollazo
Source Branch: bump-inconsistencies-retention
Dest Branch: main

Event Timeline

Change #1124504 had a related patch set uploaded (by TChin; author: TChin):

[operations/deployment-charts@master] mw-content-history-reconcile-enrich: Bump taskmanager memory

https://gerrit.wikimedia.org/r/1124504

Mentioned in SAL (#wikimedia-operations) [2025-03-04T19:32:46Z] <xcollazo@deploy2002> Started deploy [airflow-dags/analytics@10615c9]: Deploy latest DAGs for analytics Airflow instance. T387906.

Mentioned in SAL (#wikimedia-analytics) [2025-03-04T19:33:12Z] <xcollazo> Deploy latest DAGs for analytics Airflow instance. T387906.

Mentioned in SAL (#wikimedia-operations) [2025-03-04T19:33:20Z] <xcollazo@deploy2002> Finished deploy [airflow-dags/analytics@10615c9]: Deploy latest DAGs for analytics Airflow instance. T387906. (duration: 00m 34s)

Change #1124504 merged by jenkins-bot:

[operations/deployment-charts@master] mw-content-history-reconcile-enrich: Bump taskmanager memory

https://gerrit.wikimedia.org/r/1124504