Memory errors break the expected output of some Airflow tasks
Closed, Resolved (Public)

Description

The commonswiki_file.py and cassandra.py Spark tasks fail in Airflow, although they run fine on the Analytics clients: YARN kills their containers for exceeding memory limits.

commonswiki_file.py

Execution log excerpt:

[2022-04-29 15:27:04,818] {subprocess.py:78} INFO -      diagnostics: Application application_1650893206204_20962 failed 6 times due to AM Container for appattempt_1650893206204_20962_000006 exited with  exitCode: -104
[2022-04-29 15:27:04,819] {subprocess.py:78} INFO - Failing this attempt.Diagnostics: [2022-04-29 15:27:04.067]Container [pid=8628,containerID=container_e36_1650893206204_20962_06_000001] is running beyond physical memory limits. Current usage: 2.4 GB of 2.4 GB physical memory used; 23.0 GB of 5.0 GB virtual memory used. Killing container.

More details at an-airflow1003.eqiad.wmnet:/home/mfossati/commonswiki_file_failure.log.
This results in several broken delta parquets (i.e., directories with one _temporary file), while lead_image_data_latest and wikidata_data_latest look fine (the _SUCCESS file and the snappy ones are there; I quickly checked the Spark DataFrame with count() and show()).
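
To compare failures across runs, the memory figures in the YARN diagnostics can be pulled out of the log programmatically. This is a minimal sketch, not part of the DAG code: the function name parse_memory_usage is hypothetical, and the pattern assumes the figures are always reported in GB, as in the excerpt above.

```python
import re

# Pattern for the figures in a YARN "running beyond physical memory limits"
# diagnostic. Assumption: the units are always GB, as in this task's logs.
_USAGE = re.compile(
    r"Current usage: (?P<phys_used>[\d.]+) GB of (?P<phys_limit>[\d.]+) GB "
    r"physical memory used; "
    r"(?P<virt_used>[\d.]+) GB of (?P<virt_limit>[\d.]+) GB virtual memory used"
)


def parse_memory_usage(diagnostics):
    """Return the four GB figures as floats, or None if nothing matches."""
    match = _USAGE.search(diagnostics)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


line = (
    "Container [pid=8628,containerID=container_e36_1650893206204_20962_06_000001] "
    "is running beyond physical memory limits. Current usage: 2.4 GB of 2.4 GB "
    "physical memory used; 23.0 GB of 5.0 GB virtual memory used. Killing container."
)
usage = parse_memory_usage(line)
```

Here usage["phys_used"] equals usage["phys_limit"], which matches the -104 exit: the container hit its physical memory cap.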

cassandra.py

Execution log excerpt:

         diagnostics: Application application_1650893206204_21375 failed 6 times due to AM Container for appattempt_1650893206204_21375_000006 exited with  exitCode: -104
Failing this attempt.Diagnostics: [2022-04-29 17:55:34.282]Container [pid=39484,containerID=container_e36_1650893206204_21375_06_000002] is running beyond physical memory limits. Current usage: 2.4 GB of 2.4 GB physical memory used; 6.1 GB of 5.0 GB virtual memory used. Killing container.

More details at an-airflow1003.eqiad.wmnet:/home/mfossati/cassandra_failure.log.
Only the analytics_platform_eng.suggestions Hive table seems present.
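
One possible mitigation, sketched here under assumptions rather than taken from the DAG code, is to request more memory when the job is submitted. The helper name spark_submit_args and all the sizes are hypothetical placeholders. The sketch also assumes the job runs on YARN in cluster mode, where the driver lives in the AM container; in client mode the AM is sized by spark.yarn.am.memory instead.

```python
def spark_submit_args(script, driver_memory="4g", driver_overhead="1g",
                      executor_memory="4g", executor_overhead="1g"):
    """Build a spark-submit command line with explicit memory settings.

    All sizes are illustrative placeholders, not values from this task.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",  # assumption: the driver runs in the AM
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        # Off-heap headroom that YARN adds on top of the JVM heap.
        "--conf", f"spark.driver.memoryOverhead={driver_overhead}",
        "--conf", f"spark.executor.memoryOverhead={executor_overhead}",
        script,
    ]


args = spark_submit_args("commonswiki_file.py")
```

The resulting list can be passed to subprocess.run or joined into a shell command; whether these settings actually resolve the failures reported here is untested.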

Event Timeline

Where can we look at the DAG code, please?

mfossati changed the task status from Open to In Progress. May 5 2022, 8:53 AM