This week, as an Ops week alert, we received the following one:
Airflow alert: <TaskInstance: hdfs_usage_weekly.aggregate_and_extract_fsimage_data_to_parquet scheduled__2024-04-29T00:00:00+00:00 [failed] Try 6 out of 6 Exception: application_1713453355160_190101 is not running. Application state: FAILED Log: Link Host: an-launcher1002.eqiad.wmnet Mark success: Link
I took a look at the application log and I only found the following:
24/05/06 06:38:07 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library 24/05/06 06:38:07 INFO compress.CodecPool: Got brand-new decompressor [.deflate] Container: container_e121_1713453355160_192254_01_000001 on an-worker1112.eqiad.wmnet_8041_1714977369600 LogAggregationType: AGGREGATED ======================================================================================================== LogType:container-localizer-syslog LogLastModifiedTime:Mon May 06 06:36:09 +0000 2024 LogLength:184 LogContents: 2024-05-06 06:36:07,617 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded. End of LogType:container-localizer-syslog *******************************************************************************************
There is no specific details about the error but the DAG tried 6 out of 6. I retried one more time but it failed again with the same information.