We're starting to see this problem on the search Airflow instance (an-airflow1001).
A simple sensor task can fail with the following error (from the failure email, dated Wed, 14 Dec 2022 17:00:19 +0000):
Try 0 out of 5
Exception: Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally?
Log: Link
Host: an-airflow1001.eqiad.wmnet
Log file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00.log
Mark success: Link
When looking at the actual state of this task, however, it has succeeded:
*** Reading local file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00/1.log
[2022-12-14 16:28:13,261] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:28:13,309] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:28:13,310] {taskinstance.py:841} INFO - --------------------------------------------------------------------------------
[2022-12-14 16:28:13,310] {taskinstance.py:842} INFO - Starting attempt 1 of 5
[2022-12-14 16:28:13,310] {taskinstance.py:843} INFO - --------------------------------------------------------------------------------
[2022-12-14 16:28:13,361] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00
[2022-12-14 16:28:13,365] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310416', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpzo8pxyn_']
[2022-12-14 16:28:14,038] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,038] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=32615
[2022-12-14 16:28:14,790] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,788] {__init__.py:51} INFO - Using executor LocalExecutor
[2022-12-14 16:28:14,791] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,790] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py
[2022-12-14 16:28:14,925] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,923] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2022-12-14 16:28:15,203] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,202] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:28:15,205] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,205] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:28:15,209] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15
[2022-12-14 16:28:15,262] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15
[2022-12-14 16:28:15,387] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2022-12-14 16:28:18,218] {logging_mixin.py:112} INFO - [2022-12-14 16:28:18,217] {local_task_job.py:124} WARNING - Time since last heartbeat(0.03 s) < heartrate(5.0 s), sleeping for 4.97169 s
[2022-12-14 16:28:23,199] {logging_mixin.py:112} INFO - [2022-12-14 16:28:23,196] {local_task_job.py:103} INFO - Task exited with return code 0
[2022-12-14 16:31:27,704] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:31:27,765] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:31:27,765] {taskinstance.py:841} INFO - [...]
--------------------------------------------------------------------------------
[2022-12-14 16:56:59,970] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00
[2022-12-14 16:56:59,970] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310665', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmp03vdcbvm']
[2022-12-14 16:57:00,662] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:00,661] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=25095
[2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,595] {__init__.py:51} INFO - Using executor LocalExecutor
[2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,596] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py
[2022-12-14 16:57:01,696] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,695] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2022-12-14 16:57:01,987] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,986] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:57:01,989] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,988] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:57:01,993] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15
[2022-12-14 16:57:02,064] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15
[2022-12-14 16:57:02,219] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2022-12-14 16:57:04,388] {logging_mixin.py:112} INFO - [2022-12-14 16:57:04,387] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.960811 s
[2022-12-14 16:57:09,356] {logging_mixin.py:112} INFO - [2022-12-14 16:57:09,354] {local_task_job.py:103} INFO - Task exited with return code 0
[suspicious hole here, time when the mail was sent]
[2022-12-14 17:15:26,675] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 17:15:26,724] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 17:15:26,724] {taskinstance.py:841} INFO - --------------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------------
[2022-12-14 18:25:45,431] {taskinstance.py:842} INFO - Starting attempt 1 of 5
[2022-12-14 18:25:45,431] {taskinstance.py:843} INFO - --------------------------------------------------------------------------------
[2022-12-14 18:25:45,479] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00
[2022-12-14 18:25:45,479] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2311270', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpymzjynuf']
[2022-12-14 18:25:46,170] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:46,169] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=32566
[2022-12-14 18:25:47,164] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,163] {__init__.py:51} INFO - Using executor LocalExecutor
[2022-12-14 18:25:47,165] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,164] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py
[2022-12-14 18:25:47,282] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,281] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2022-12-14 18:25:47,558] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,557] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2022-12-14 18:25:47,559] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,559] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2022-12-14 18:25:47,563] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15
[2022-12-14 18:25:47,641] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15
[2022-12-14 18:25:47,712] {base_sensor_operator.py:123} INFO - Success criteria met. Exiting.
[2022-12-14 18:25:50,354] {logging_mixin.py:112} INFO - [2022-12-14 18:25:50,353] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.955724 s
[2022-12-14 18:25:55,320] {logging_mixin.py:112} INFO - [2022-12-14 18:25:55,316] {local_task_job.py:103} INFO - Task exited with return code 0
Note the hole between 2022-12-14 16:57:09 and 2022-12-14 17:15:26 (the window during which the email was sent). With a poke interval of 3 minutes, the sensor should have been re-queued around 2022-12-14 17:00:09, so something prevented it from running at the expected time.
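For reference, the arithmetic behind that expected re-queue time (a minimal sketch; the 3-minute poke interval and the 16:57:09 exit time come from the paragraph and log above):

```python
from datetime import datetime, timedelta

poke_interval = timedelta(minutes=3)  # the sensor's poke interval, per above
last_exit = datetime(2022, 12, 14, 16, 57, 9)  # "Task exited with return code 0"

# In reschedule mode the sensor frees its slot between pokes, and the
# scheduler should re-queue it roughly one poke_interval after it exits.
print(last_exit + poke_interval)  # 2022-12-14 17:00:09 -- inside the log hole,
                                  # just before the 17:00:19 failure mail
```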
Full logs: P42712
Possible cause: https://github.com/apache/airflow/issues/10790 (an upstream report of this same "killed externally" error on queued task instances)
Possible workarounds:
- increase poke_interval to 5 minutes or more (a sketch follows this list)
- reduce the load on the machine (provision a bigger instance?)
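For the first workaround, a minimal sketch of what the change could look like, assuming an Airflow 1.10-style DAG definition. The DAG shell, start date, and partition list here are illustrative; only the dag_id, task_id, sensor class, and reschedule mode come from the logs above (the real DAG in wikimedia/discovery/analytics presumably templates the partitions per hour):

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.named_hive_partition_sensor import NamedHivePartitionSensor

# Hypothetical shell; the real definition lives in
# airflow/dags/mediawiki_revision_recommendation_create.py.
with DAG(
    dag_id='mediawiki_revision_recommendation_create_hourly',
    start_date=datetime(2022, 12, 1),
    schedule_interval='@hourly',
) as dag:
    wait_for_data = NamedHivePartitionSensor(
        task_id='wait_for_data',
        # Partition names copied from the "Poking for ..." log lines;
        # the real DAG computes these from the execution date.
        partition_names=[
            'event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15',
            'event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15',
        ],
        mode='reschedule',     # matches the UP_FOR_RESCHEDULE entries in the log
        poke_interval=60 * 5,  # raised from 3 minutes to 5, per the workaround
    )
```

A longer poke_interval means fewer re-queue cycles per sensor, so fewer chances for the scheduler/executor race to hit, at the cost of up to a few extra minutes of latency before the partition is detected.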