
[airflow] Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally?
Open, Low, Public

Description

We're starting to see this problem on the search airflow instance (an-airflow1001).
A simple sensor task might fail with the following context (mail date: Wed, 14 Dec 2022 17:00:19 +0000):

Try 0 out of 5
Exception:
Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally?
Log: Link
Host: an-airflow1001.eqiad.wmnet
Log file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00.log
Mark success: Link

When looking at the actual state of this task, it has succeeded:

*** Reading local file: /var/log/airflow/mediawiki_revision_recommendation_create_hourly/wait_for_data/2022-12-14T15:00:00+00:00/1.log
[2022-12-14 16:28:13,261] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:28:13,309] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:28:13,310] {taskinstance.py:841} INFO - 
--------------------------------------------------------------------------------
[2022-12-14 16:28:13,310] {taskinstance.py:842} INFO - Starting attempt 1 of 5
[2022-12-14 16:28:13,310] {taskinstance.py:843} INFO - 
--------------------------------------------------------------------------------
[2022-12-14 16:28:13,361] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00
[2022-12-14 16:28:13,365] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310416', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpzo8pxyn_']
[2022-12-14 16:28:14,038] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,038] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=32615
[2022-12-14 16:28:14,790] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,788] {__init__.py:51} INFO - Using executor LocalExecutor
[2022-12-14 16:28:14,791] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,790] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py
[2022-12-14 16:28:14,925] {base_task_runner.py:115} INFO - Job 2310416: Subtask wait_for_data [2022-12-14 16:28:14,923] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2022-12-14 16:28:15,203] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,202] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:28:15,205] {logging_mixin.py:112} INFO - [2022-12-14 16:28:15,205] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:28:15,209] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15
[2022-12-14 16:28:15,262] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15
[2022-12-14 16:28:15,387] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2022-12-14 16:28:18,218] {logging_mixin.py:112} INFO - [2022-12-14 16:28:18,217] {local_task_job.py:124} WARNING - Time since last heartbeat(0.03 s) < heartrate(5.0 s), sleeping for 4.97169 s
[2022-12-14 16:28:23,199] {logging_mixin.py:112} INFO - [2022-12-14 16:28:23,196] {local_task_job.py:103} INFO - Task exited with return code 0
[2022-12-14 16:31:27,704] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:31:27,765] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 16:31:27,765] {taskinstance.py:841} INFO - 

[...]

--------------------------------------------------------------------------------
[2022-12-14 16:56:59,970] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00
[2022-12-14 16:56:59,970] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2310665', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmp03vdcbvm']
[2022-12-14 16:57:00,662] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:00,661] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=25095
[2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,595] {__init__.py:51} INFO - Using executor LocalExecutor
[2022-12-14 16:57:01,597] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,596] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py
[2022-12-14 16:57:01,696] {base_task_runner.py:115} INFO - Job 2310665: Subtask wait_for_data [2022-12-14 16:57:01,695] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2022-12-14 16:57:01,987] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,986] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:57:01,989] {logging_mixin.py:112} INFO - [2022-12-14 16:57:01,988] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2022-12-14 16:57:01,993] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15
[2022-12-14 16:57:02,064] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15
[2022-12-14 16:57:02,219] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2022-12-14 16:57:04,388] {logging_mixin.py:112} INFO - [2022-12-14 16:57:04,387] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.960811 s
[2022-12-14 16:57:09,356] {logging_mixin.py:112} INFO - [2022-12-14 16:57:09,354] {local_task_job.py:103} INFO - Task exited with return code 0

[suspicious hole here, time when the mail was sent]
 
[2022-12-14 17:15:26,675] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 17:15:26,724] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [queued]>
[2022-12-14 17:15:26,724] {taskinstance.py:841} INFO - 
--------------------------------------------------------------------------------

[...]

--------------------------------------------------------------------------------
[2022-12-14 18:25:45,431] {taskinstance.py:842} INFO - Starting attempt 1 of 5
[2022-12-14 18:25:45,431] {taskinstance.py:843} INFO - 
--------------------------------------------------------------------------------
[2022-12-14 18:25:45,479] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-14T15:00:00+00:00
[2022-12-14 18:25:45,479] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-14T15:00:00+00:00', '--job_id', '2311270', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpymzjynuf']
[2022-12-14 18:25:46,170] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:46,169] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=32566
[2022-12-14 18:25:47,164] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,163] {__init__.py:51} INFO - Using executor LocalExecutor
[2022-12-14 18:25:47,165] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,164] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py
[2022-12-14 18:25:47,282] {base_task_runner.py:115} INFO - Job 2311270: Subtask wait_for_data [2022-12-14 18:25:47,281] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-14T15:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2022-12-14 18:25:47,558] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,557] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2022-12-14 18:25:47,559] {logging_mixin.py:112} INFO - [2022-12-14 18:25:47,559] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2022-12-14 18:25:47,563] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=14/hour=15
[2022-12-14 18:25:47,641] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=14/hour=15
[2022-12-14 18:25:47,712] {base_sensor_operator.py:123} INFO - Success criteria met. Exiting.
[2022-12-14 18:25:50,354] {logging_mixin.py:112} INFO - [2022-12-14 18:25:50,353] {local_task_job.py:124} WARNING - Time since last heartbeat(0.04 s) < heartrate(5.0 s), sleeping for 4.955724 s
[2022-12-14 18:25:55,320] {logging_mixin.py:112} INFO - [2022-12-14 18:25:55,316] {local_task_job.py:103} INFO - Task exited with return code 0

Note the hole between 2022-12-14 16:57:09 and 2022-12-14 17:15:26 (when the email was sent): with a poke interval of 3 minutes the sensor should have been queued around 2022-12-14 17:00:09, so something prevented it from running at the expected time.
Full logs: P42712

Possible cause: https://github.com/apache/airflow/issues/10790
Possible workarounds:

  • increase poke_interval to 5 minutes or more (see the sketch after this list)
  • reduce the load on the machine (provision a bigger instance?)
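
For illustration, a minimal sketch of what widening the poke interval could look like on one of these sensors (the partition name is elided and the exact values are assumptions, not the current configuration):

# Hypothetical sketch: widen the poke interval so the hourly sensors
# check less often and are less likely to pile up at the top of the hour.
from airflow.sensors.named_hive_partition_sensor import NamedHivePartitionSensor

wait_for_data = NamedHivePartitionSensor(
    task_id='wait_for_data',
    partition_names=['event.mediawiki_revision_recommendation_create/...'],
    poke_interval=60 * 5,   # 5 minutes instead of the current 3
    mode='reschedule',      # matches the UP_FOR_RESCHEDULE behaviour seen in the logs
)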

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcausse renamed this task from [airfflow] Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally? to [airflow] Executor reports task instance finished (failed) although the task says its queued. Was the task killed externally?. Dec 15 2022, 1:55 PM
dcausse updated the task description.

Might also be related to T325278: the number of allowed connections has been increased from 250 to 350, so I suppose we could wait a bit to see if it improves things.

dcausse reopened this task as Open.
dcausse claimed this task.
dcausse removed dcausse as the assignee of this task.
dcausse moved this task from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.

Happened again with an alert sent on Tue, 20 Dec 2022 04:47:36 +0000 for mediawiki_revision_recommendation_create_hourly on run 2022-12-20T02:00:00+00:00.
Nothing suspicious was found in the logs, and unlike the case mentioned in the description, the sensor was scheduled and launched at the right time:

[2022-12-20 04:47:19,524] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-20T02:00:00+00:00 [queued]>
[2022-12-20 04:47:19,577] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-20T02:00:00+00:00 [queued]>
[2022-12-20 04:47:19,577] {taskinstance.py:841} INFO - 
--------------------------------------------------------------------------------
[2022-12-20 04:47:19,578] {taskinstance.py:842} INFO - Starting attempt 1 of 5
[2022-12-20 04:47:19,578] {taskinstance.py:843} INFO - 
--------------------------------------------------------------------------------
[2022-12-20 04:47:19,625] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2022-12-20T02:00:00+00:00
[2022-12-20 04:47:19,626] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2022-12-20T02:00:00+00:00', '--job_id', '2376340', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpund_nhnu']
[2022-12-20 04:47:20,172] {base_task_runner.py:115} INFO - Job 2376340: Subtask wait_for_data [2022-12-20 04:47:20,171] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=24322
[2022-12-20 04:47:21,008] {base_task_runner.py:115} INFO - Job 2376340: Subtask wait_for_data [2022-12-20 04:47:21,006] {__init__.py:51} INFO - Using executor LocalExecutor
[2022-12-20 04:47:21,009] {base_task_runner.py:115} INFO - Job 2376340: Subtask wait_for_data [2022-12-20 04:47:21,008] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py
[2022-12-20 04:47:21,102] {base_task_runner.py:115} INFO - Job 2376340: Subtask wait_for_data [2022-12-20 04:47:21,102] {cli.py:545} INFO - Running <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-20T02:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2022-12-20 04:47:21,351] {logging_mixin.py:112} INFO - [2022-12-20 04:47:21,350] {hive_hooks.py:554} INFO - Trying to connect to analytics-hive.eqiad.wmnet:9083
[2022-12-20 04:47:21,353] {logging_mixin.py:112} INFO - [2022-12-20 04:47:21,352] {hive_hooks.py:556} INFO - Connected to analytics-hive.eqiad.wmnet:9083
[2022-12-20 04:47:21,355] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=eqiad/year=2022/month=12/day=20/hour=2
[2022-12-20 04:47:21,423] {named_hive_partition_sensor.py:92} INFO - Poking for event.mediawiki_revision_recommendation_create/datacenter=codfw/year=2022/month=12/day=20/hour=2
[2022-12-20 04:47:21,563] {taskinstance.py:1054} INFO - Rescheduling task, marking task as UP_FOR_RESCHEDULE
[2022-12-20 04:47:24,476] {logging_mixin.py:112} INFO - [2022-12-20 04:47:24,475] {local_task_job.py:124} WARNING - Time since last heartbeat(0.03 s) < heartrate(5.0 s), sleeping for 4.965594 s
[2022-12-20 04:47:29,449] {logging_mixin.py:112} INFO - [2022-12-20 04:47:29,447] {local_task_job.py:103} INFO - Task exited with return code 0
[2022-12-20 04:50:32,110] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-20T02:00:00+00:00 [queued]>
[2022-12-20 04:50:32,161] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2022-12-20T02:00:00+00:00 [queued]>

Nothing interesting was found in /var/log/airflow/scheduler/2022-12-20/mediawiki_revision_recommendation_create.py.log.
Checking the status of the MySQL host and the Airflow host does not bring much insight; the machines do not seem to be particularly overloaded at this time...

@EBernhardson I'm out of clues here; I feel we're being hit by a race condition that might arise from time to time no matter what we do. Perhaps tuning the poke_interval might help a bit (and perhaps spreading the hourly sensors so they don't all start at the same minute?).

Poked at the scheduler log as well to see if anything interesting is there. Starting at around 17:00, many DAG processing attempts result in failure:

[2022-12-14 17:00:11,162] {dag_processing.py:1286} INFO - Finding 'running' jobs without a recent heartbeat
[2022-12-14 17:00:11,162] {dag_processing.py:1290} INFO - Failing jobs without heartbeat after 2022-12-14 16:55:11.162802+00:00
[2022-12-14 17:00:14,228] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/glent_weekly.py exited with return code 1.
[2022-12-14 17:00:14,228] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/process_sparql_query.py exited with return code 1.
[2022-12-14 17:00:17,297] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/fulltext_head_queries.py exited with return code 1.
[2022-12-14 17:00:21,386] {dag_processing.py:1286} INFO - Finding 'running' jobs without a recent heartbeat
[2022-12-14 17:00:21,387] {dag_processing.py:1290} INFO - Failing jobs without heartbeat after 2022-12-14 16:55:21.387129+00:00
[2022-12-14 17:00:21,406] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/ores_predictions.py exited with return code 1.
[2022-12-14 17:00:25,483] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/subgraph_and_query_mapping.py exited with return code 1.
[2022-12-14 17:00:29,573] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/export_queries_to_relforge.py exited with return code 1.
[2022-12-14 17:00:30,597] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/ores_predictions.py exited with return code 1.
[2022-12-14 17:00:31,621] {dag_processing.py:1286} INFO - Finding 'running' jobs without a recent heartbeat
[2022-12-14 17:00:31,622] {dag_processing.py:1290} INFO - Failing jobs without heartbeat after 2022-12-14 16:55:31.622287+00:00
[2022-12-14 17:00:31,642] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/rdf_streaming_updater_reconcile.py exited with return code 1.
[2022-12-14 17:00:32,657] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/process_sparql_query.py exited with return code 1.
[2022-12-14 17:00:33,692] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/subgraph_and_query_metrics.py exited with return code 1.
[2022-12-14 17:00:33,692] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/cirrus_namespace_map_daily.py exited with return code 1.
[2022-12-14 17:00:34,715] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/subgraph_and_query_mapping.py exited with return code 1.
[2022-12-14 17:00:34,716] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/popularity_score.py exited with return code 1.
[2022-12-14 17:00:35,740] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/import_ttl.py exited with return code 1.
[2022-12-14 17:00:35,741] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/fulltext_head_queries.py exited with return code 1.
[2022-12-14 17:00:36,765] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/search_satisfaction.py exited with return code 1.
[2022-12-14 17:00:36,766] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mjolnir.py exited with return code 1.
[2022-12-14 17:00:37,787] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/transfer_to_es.py exited with return code 1.
[2022-12-14 17:00:37,788] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mediawiki_revision_recommendation_create.py exited with return code 1.
[2022-12-14 17:00:38,812] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/import_cirrus_indexes.py exited with return code 1.
[2022-12-14 17:00:38,813] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/export_queries_to_relforge.py exited with return code 1.
[2022-12-14 17:00:39,837] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/ores_predictions.py exited with return code 1.
[2022-12-14 17:00:39,838] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/query_clicks.py exited with return code 1.
[2022-12-14 17:00:40,861] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/rdf_streaming_updater_reconcile.py exited with return code 1.
[2022-12-14 17:00:40,862] {dag_processing.py:1191} WARNING - Processor for /srv/deployment/wikimedia/discovery/analytics/airflow/dags/drop_old_data_daily.py exited with return code 1.

Then at 17:02 we get a 'Too many connections' error from MySQL:

[2022-12-14 17:02:52,070] {dag_processing.py:920} ERROR - Error removing old import errors
Traceback (most recent call last):
  File "/srv/deployment/search/airflow-cache/revs/4c49df756f561b47f4ce46f74720f4e2265741d0/venv/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2275, in _wrap_pool_connect
    return fn()
  
  [ excluded lengthy middle of stack trace ]

  File "/usr/lib/python3/dist-packages/MySQLdb/connections.py", line 204, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (1040, 'Too many connections')

This is the only time this appears in our dag_processor_manager logs (which cover Dec 4 through Jan 5), but it's a surprisingly common error in the individual DAG logs:

(ebernhardson@an-airflow1001)-/var/log/airflow$ grep -cr 'Too many connections' . | awk -F : '{sum += $NF} END {print sum}'
18080

Unclear if this is the problem here, but it is probably a problem we should take care of. The suggestion to extend the poke interval and spread the start times within that interval seems like it would help in this case. I'm not sure of the best way to accomplish that; we could manually set the start times, but over time it will become tedious to ensure things stay reasonably spread out. Maybe something generic can be done to generate a random offset seeded by some aspect of the task (see the sketch below).
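
A minimal sketch of that idea, assuming the offset is derived from the dag_id (the helper name and the 5-minute spread are made up for illustration):

# Hypothetical helper: a stable per-DAG offset so hourly sensors don't all
# wake up at the same minute. Seeding from the dag_id keeps the offset
# deterministic across scheduler restarts.
import zlib

def sensor_offset_seconds(dag_id, spread=300):
    """Return a stable offset in [0, spread) seconds for this DAG."""
    return zlib.crc32(dag_id.encode('utf-8')) % spread

# e.g. poke_interval = 300 + sensor_offset_seconds(dag.dag_id)
print(sensor_offset_seconds('mediawiki_revision_recommendation_create_hourly'))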

Poking at the raw log data a bit with pandas (a rough sketch of the aggregation is included after the table), we can also see the errors are not equally distributed; rather, they are all clustered:

2022-11-27       0
2022-11-28       0
2022-11-29    7330
2022-11-30       0
2022-12-02       0
2022-12-03       0
2022-12-04       0
2022-12-05       0
2022-12-06       0
2022-12-07       1
2022-12-08       3
2022-12-09       0
2022-12-10       0
2022-12-11       0                                    
2022-12-12       0
2022-12-13      32
2022-12-14    1667
2022-12-15       0
2022-12-16       0
2022-12-17       0
2022-12-18       0
2022-12-19       0
2022-12-20       0
2022-12-21       0
2022-12-22       0
2022-12-23       0
2022-12-24       0
2022-12-25       0
2022-12-26       0
2022-12-27       0
2022-12-28       0
2022-12-29       0
2022-12-30       0
2022-12-31       0
2023-01-01       0
2023-01-02       0
2023-01-03       0
2023-01-04       0
2023-01-05       0
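
The rough pandas sketch mentioned above (the dump file path and the assumption that each matching line starts with a [YYYY-MM-DD ...] timestamp are illustrative, not the exact commands used):

# Hypothetical sketch: count 'Too many connections' lines per day, assuming
# the matches were first dumped with something like:
#   grep -rh 'Too many connections' /var/log/airflow > /tmp/too_many.txt
import re
import pandas as pd

dates = []
with open('/tmp/too_many.txt') as f:
    for line in f:
        m = re.match(r'\[(\d{4}-\d{2}-\d{2})', line)
        if m:
            dates.append(m.group(1))

counts = pd.Series(1, index=pd.to_datetime(dates)).resample('D').sum()
print(counts)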

Notably, while the errors happen on the 14th, they don't happen on the 20th where we saw a similar issue. It's also worth remembering this isn't a dedicated SQL instance; this is the analytics cluster SQL server. So while we might have some effect on the too-many-connections problem, it wouldn't be limited to us: all the different airflow instances are taking up connections, along with whatever Hadoop does with the instance. Airflow does have configuration for how many connections it will make, but as far as I can tell there is no cross-process coordination there, so each dag processor has its own pool of SQL connections and each executed task again has its own pool of connections. Limiting there probably isn't going to be much of a solution (see the back-of-envelope sketch below). It doesn't look like an-coord1001 sends its MySQL logs to logstash, so I'm unable to get much detail on how frequently this occurs outside our own use case.
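
To illustrate why limiting the per-process pool doesn't buy much, a back-of-envelope calculation (only pool_size=5 / max_overflow=10 come from the settings.configure_orm() lines in the logs above; the process counts are made up):

# Rough illustration: each Airflow process keeps its own SQLAlchemy pool,
# so the per-process cap multiplies by the number of processes.
pool_size, max_overflow = 5, 10          # from the logs above
per_process_cap = pool_size + max_overflow

dag_processors = 20   # assumption: roughly one per DAG file being parsed
running_tasks = 10    # assumption: concurrently running task processes

worst_case = (dag_processors + running_tasks) * per_process_cap
print(worst_case)     # 450, already past a 350 server-wide limit on its own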

MariaDB offers settings to limit the number of concurrent connections both globally and per user, either as a system-wide property or as individual user grants. The overall max connections looks to be set to 350 (in modules/profile/templates/analytics/database/meta/analytics-meta.my.cnf.erb); it looks like it was increased on Dec 15th in response to the problems on the 14th. Per-user connection limits don't seem to be set. If our underlying problem was the MySQL connections then this will probably resolve the issue for a year or two.

From our side of things, we could put sensors in their own execution pool. We could have a small pool, say 5 slots, and put all sensors into that pool; Airflow would then make sure we never run too many sensors at the same time. The airflow docs say the tasks get queued, and a look at the relevant code (SchedulerJob._find_executable_task_instances) shows they sort by priority_weight and then by execution date. Since we don't set priority weights it should prioritize running the oldest sensors before the newer ones, which is probably reasonable. This would be reasonably easy to do (see the sketch below), but I'm unsure if it would have any meaningful impact going forward.
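
A minimal sketch of routing a sensor through a dedicated pool (the pool name and slot count are assumptions; the pool itself would still need to be created via the Airflow UI or CLI):

# Hypothetical sketch: assign the sensor to a small dedicated pool so only
# a handful of sensor tasks can run at once; 'reschedule' mode frees the
# slot between pokes.
from airflow.sensors.named_hive_partition_sensor import NamedHivePartitionSensor

wait_for_data = NamedHivePartitionSensor(
    task_id='wait_for_data',
    partition_names=['event.mediawiki_revision_recommendation_create/...'],
    pool='sensors',          # assumed pool name, created with e.g. 5 slots
    mode='reschedule',
    poke_interval=60 * 3,
)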

Overall I'd be tempted to close this as resolved by T325278, and perhaps ponder per-user connection limits in MariaDB in the future to limit the impact of one user of the db on another.

Indeed, fixing T325278 did help a lot, but we've seen 3 instances of this problem since then:

  • transfer_to_es_hourly.wait_for_recommendations 2022-12-20 02:00:00+00:00
  • transfer_to_es_hourly.wait_for_ores_predictions 2023-01-02 00:00:00+00:00
  • mediawiki_revision_recommendation_create_hourly.wait_for_data 2023-01-05 07:00:00+00:00

And those don't seem to be caused by a particular problem (I found nothing particularly interesting; see for instance T325279#8480841, but I might have missed something).
The first instance of this problem I could find in my emails is from August 2022, so it's relatively "new". If configuring a dedicated executor pool is easy enough, I guess this might be something we could try to see if it helps?
But happy to close and live with it too; they're easily identifiable by the 0 in "Try 0 out of 5" in the email body :)

Sorry, I think I took the wrong sensor logs when looking at T325279#8480841. Looking at another instance I see:

(email sent on Thu, 05 Jan 2023 08:46:01 +0000 for mediawiki_revision_recommendation_create_hourly.wait_for_data 2023-01-05 07:00:00+00:00)

mediawiki_revision_recommendation_create_hourly/wait_for_data/2023-01-05T07:00:00+00:00/1.log:

[2023-01-05 08:42:20,862] {logging_mixin.py:112} INFO - [2023-01-05 08:42:20,858] {local_task_job.py:103} INFO - Task exited with return code 0
[2023-01-05 08:45:25,327] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2023-01-05T07:00:00+00:00 [queued]>
[2023-01-05 08:45:25,379] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2023-01-05T07:00:00+00:00 [queued]>
[2023-01-05 08:45:25,379] {taskinstance.py:841} INFO -
--------------------------------------------------------------------------------
[2023-01-05 08:45:25,379] {taskinstance.py:842} INFO - Starting attempt 1 of 5
[2023-01-05 08:45:25,379] {taskinstance.py:843} INFO -

[ Task is not even started?? ]

--------------------------------------------------------------------------------
[2023-01-05 09:00:44,065] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2023-01-05T07:00:00+00:00 [queued]>
[2023-01-05 09:00:44,123] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mediawiki_revision_recommendation_create_hourly.wait_for_data 2023-01-05T07:00:00+00:00 [queued]>
[2023-01-05 09:00:44,123] {taskinstance.py:841} INFO -
--------------------------------------------------------------------------------
[2023-01-05 09:00:44,123] {taskinstance.py:842} INFO - Starting attempt 1 of 5
[2023-01-05 09:00:44,123] {taskinstance.py:843} INFO -
--------------------------------------------------------------------------------
[2023-01-05 09:00:44,184] {taskinstance.py:862} INFO - Executing <Task(NamedHivePartitionSensor): wait_for_data> on 2023-01-05T07:00:00+00:00
[2023-01-05 09:00:44,185] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mediawiki_revision_recommendation_create_hourly', 'wait_for_data', '2023-01-05T07:00:00+00:00', '--job_id', '2569540', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mediawiki_revision_recommendation_create.py', '--cfg_path', '/tmp/tmpxy2o88yr']

And I don't see anything in particular in scheduler/2023-01-05/mediawiki_revision_recommendation_create.py.log...

Indeed, this still seems to be going on regardless of the SQL connection errors. I'm equally finding a dearth of relevant logging. Attempting to correlate the times when logs should have shown up across the various available logs isn't turning up anything of note. I poked through the airflow bits here, but I don't see obvious reasons it could exit. Best guess would be that this bit doesn't seem to catch and log exceptions, so it could have exited the code path somewhere, but I would still expect that to have been logged somewhere, and I turned up nothing. Maybe something could be in syslog if it was killed by the kernel or segfaulted, but the journald instance only seems to have 8 hours of logs, and I don't have root to read the raw system logs.

For the moment I'm not sure what we can really do with this other than ignore it.

Update documentation to make the failure mode clear and then let's close

Removing from current work as we're not going to investigate further; added a quick note to https://wikitech.wikimedia.org/wiki/Discovery/Analytics#Airflow_Alerts.

dcausse triaged this task as Low priority.
dcausse moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.