
Figure root cause of silent failures when computing metrics for mediawiki_content_history_v1
Closed, ResolvedPublic

Description

On T386114 we fixed lock issues related to generating metrics.

However, after fixing those, two separate issues emerged.

First, on the DAG mw_content_reconcile_mw_content_history_daily:

sudo -u analytics yarn logs -appOwner analytics -applicationId application_1734703658237_1564943 | grep ERROR

25/02/20 14:07:56 ERROR YarnScheduler: Lost executor 26 on an-worker1084.eqiad.wmnet: Container killed by YARN for exceeding physical memory limits. 9.0 GB of 8.8 GB physical memory used. Consider boosting spark.executor.memoryOverhead.
25/02/20 14:09:54 ERROR YarnScheduler: Lost executor 38 on an-worker1168.eqiad.wmnet: Container killed by YARN for exceeding physical memory limits. 9.0 GB of 8.8 GB physical memory used. Consider boosting spark.executor.memoryOverhead.
25/02/20 14:30:58 ERROR YarnScheduler: Lost executor 25 on analytics1077.eqiad.wmnet: Container killed by YARN for exceeding physical memory limits. 8.8 GB of 8.8 GB physical memory used. Consider boosting spark.executor.memoryOverhead.

Seems simple enough to fix, so I will do that as part of this ticket. I have overridden the DagProperties of mw_content_reconcile_mw_content_history_daily to temporarily bump executor_memory from 8GB to 12GB, to see if that fixes things.

If so, I'll open an MR with a proper fix.
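As a sanity check on the numbers in those log lines, here is a back-of-the-envelope sketch of the container limit YARN enforces. It uses Spark's documented defaults (10% overhead factor, 384 MB floor), not values read from this DAG's config. With 8GB executors, the limit lands right at the 8.8 GB the logs report:

```python
# Illustrative only: YARN kills a container when executor memory plus
# memoryOverhead is exceeded. Constants mirror Spark's documented
# defaults, not this DAG's actual configuration.
def container_limit_mb(executor_memory_mb, overhead_factor=0.10, overhead_min_mb=384):
    """Container size YARN enforces: executor memory plus overhead,
    where overhead defaults to max(384 MB, 10% of executor memory)."""
    overhead = max(overhead_min_mb, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead

# 8 GB executors: 8192 + 819 = 9011 MB, i.e. ~8.8 GiB, matching the
# "8.8 GB physical memory used" limit in the log excerpts above.
print(container_limit_mb(8192))
```

Bumping executor_memory raises this limit, but so would raising spark.executor.memoryOverhead on its own, which is what the YARN message suggests.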

And the second issue:

...

Although mw_content_reconcile_mw_content_history_daily was able to make progress on the one instance where it was failing with OOM, compute_metrics continues to fail, now with a seemingly silent issue:

cat application_1734703658237_1583772.log | grep ERROR | wc -l
       0

I will investigate this separately.

The OOM issue has a simple fix, to be delivered as part of this ticket. However, we need to root cause and fix the silent failures.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
analytics: bump driver mem for mw_content_reconcile_mw_content_history_daily-compute_metrics. | repos/data-engineering/airflow-dags!1206 | xcollazo | bump-metrics-driver-mem | main
analytics: Bump driver_memory to avoid OOM for mw content daily metrics. | repos/data-engineering/airflow-dags!1089 | xcollazo | fix-oom-on-metrics-v2 | main
analytics: Fix OOM on compute_metrics from mw_content_reconcile_mw_content_history_daily. | repos/data-engineering/airflow-dags!1083 | xcollazo | fix-oom-on-metrics | main

Event Timeline

xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1083

analytics: Fix OOM on compute_metrics from mw_content_reconcile_mw_content_history_daily.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1083

analytics: Fix OOM on compute_metrics from mw_content_reconcile_mw_content_history_daily.

xcollazo renamed this task from Figure root cause of silent failures of when computing metrics for mediawiki_content_history_v1 to Figure root cause of silent failures when computing metrics for mediawiki_content_history_v1.Feb 21 2025, 3:44 PM

Mentioned in SAL (#wikimedia-operations) [2025-02-21T18:33:30Z] <xcollazo@deploy2002> Started deploy [airflow-dags/analytics@60223e2]: Deploying latest DAGs for the analytics Airflow instance. T387033.

Mentioned in SAL (#wikimedia-operations) [2025-02-21T18:34:15Z] <xcollazo@deploy2002> Finished deploy [airflow-dags/analytics@60223e2]: Deploying latest DAGs for the analytics Airflow instance. T387033. (duration: 00m 45s)

Mentioned in SAL (#wikimedia-analytics) [2025-02-21T18:34:28Z] <xcollazo> Deployed latest DAGs for the analytics Airflow instance. T387033.

xcollazo changed the task status from Open to In Progress.Feb 21 2025, 6:44 PM
xcollazo triaged this task as High priority.

After further inspection, the error appears to be a skein application OOM:

sudo -u analytics yarn logs -appOwner analytics -applicationId application_1734703658237_1636365 | tail -n 100

...
25/02/23 04:08:01 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 143. This is often due to the application master memory limit being exceeded. See the diagnostics for more information.
25/02/23 04:08:01 INFO skein.ApplicationMaster: Unregistering application with status FAILED
25/02/23 04:08:01 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
25/02/23 04:08:01 INFO skein.ApplicationMaster: Deleted application directory hdfs://analytics-hadoop/user/analytics/.skein/application_1734703658237_1636365
25/02/23 04:08:01 INFO skein.ApplicationMaster: WebUI server shut down
25/02/23 04:08:01 INFO skein.ApplicationMaster: gRPC server shut down
...

For some reason the skein app logs this as INFO. Add it to the list of reasons I love skein...

I have bumped metrics_driver_memory from 8GB to 16GB and this solves the issue. Will now make this the default via an MR.
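This also explains why grepping for ERROR earlier returned zero hits: the fatal shutdown is logged at INFO level. A small triage sketch (a hypothetical helper, not part of the DAG code) that matches the skein failure text itself rather than the log level:

```python
# Hypothetical log-triage helper, not part of the DAG code.
# skein's ApplicationMaster reports the fatal shutdown at INFO level,
# so a plain `grep ERROR` finds nothing. Match the failure text instead.
# Note exit code 143 is 128 + 15, i.e. the driver received SIGTERM,
# consistent with YARN killing the container over its memory limit.
FAILURE_SIGNATURES = (
    "Application driver failed with exit code",
    "Unregistering application with status FAILED",
)

def find_silent_failures(log_lines):
    """Return log lines that signal failure even when logged as INFO."""
    return [line for line in log_lines
            if any(sig in line for sig in FAILURE_SIGNATURES)]
```

Run against the yarn logs output quoted above, this flags the "exit code 143" and "status FAILED" lines that the ERROR grep misses.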

Reopening as we had another instance of skein OOM:

xcollazo@an-launcher1002:~$ sudo -u analytics yarn logs -appOwner analytics -applicationId application_1741864027385_383042 | grep "Application driver failed" -B 2 -A 2
...
25/03/31 13:44:16 INFO skein.ApplicationMaster: Registering application with resource manager
25/03/31 13:44:16 INFO skein.ApplicationMaster: Starting application driver
25/03/31 18:00:10 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 143. This is often due to the application master memory limit being exceeded. See the diagnostics for more information.
25/03/31 18:00:10 INFO skein.ApplicationMaster: Unregistering application with status FAILED
25/03/31 18:00:10 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.

Verified that driver has 16GB as per Airflow task details.

Bumping to 24GB manually to rerun, although 24GB for a driver for metrics seems silly. I wonder what deequ is doing in there...

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1206

analytics: bump driver mem for mw_content_reconcile_mw_content_history_daily-compute_metrics.

I wonder if this line is a factor. withColumn copies the entire df so we're getting 3x the dataframes here.

Hmm.. a heavy df indeed, but I wouldn't expect that to affect the driver. We can definitely try to make the whole job faster with the below:

# Analyzers can only be run on columns, so flatten revision_content_slots to support T382953
flattened_df = (df
    .withColumn("revision_content_slots.main.content_body",
                df["revision_content_slots"]["main"]["content_body"])
    .withColumn("revision_content_slots.main.content_format",
                df["revision_content_slots"]["main"]["content_format"])
    .drop("revision_content_slots")  # <-- suggested addition: drop the heavy struct once flattened
)