
Gigantic query results cause a SIGKILL and the query status does not update
Open, Low, Public

Description

Tracked down this query:

SELECT *
FROM categorylinks

MariaDB [quarry]> select * from query where id = 20534;
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
| id    | user_id | title | latest_rev_id | last_touched        | parent_id | published | description               |
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
| 20534 |    2021 | NULL  |        197241 | 2017-07-27 12:21:21 |      NULL |         0 | Select all category links |
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
1 row in set (0.00 sec)

MariaDB [quarry]> select * from query_run where query_rev_id = 197241;
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| id     | query_rev_id | status | timestamp           | task_id                              | extra_info |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| 193235 |       197241 |      2 | 2017-07-27 12:21:21 | 20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6 | NULL       |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
1 row in set (0.00 sec)

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/syslog*.gz | grep 20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6 -A 5
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: [2017-07-27 12:34:05,006: ERROR/MainProcess] Task worker.run_query[20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6] raised unexpected: Terminated(9,)
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: Traceback (most recent call last):
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: File "/usr/lib/python2.7/dist-packages/billiard/pool.py", line 1673, in _set_terminated
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: raise Terminated(-(signum or 0))
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: Terminated: 9
Jul 27 12:35:01 quarry-runner-01 CRON[21232]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)

This is probably an OOM kill due to the excessive result size. Unfortunately, SIGKILL cannot be caught by the inner worker process, so the query run was left stuck in a 'Running' state.
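
For illustration only (not Quarry's actual code), a minimal sketch of why that happens: nothing inside the killed process, not even a finally block, gets to run, and only the parent process can observe the -9 exit code.

import multiprocessing
import os
import signal
import time

def run_query():
    try:
        time.sleep(60)  # stand-in for executing the SQL and storing the rows
    finally:
        # Never reached on SIGKILL: the kernel removes the process outright,
        # so a "mark query as failed" step placed here could not run either.
        print("worker cleanup")

if __name__ == "__main__":
    child = multiprocessing.Process(target=run_query)
    child.start()
    time.sleep(1)
    os.kill(child.pid, signal.SIGKILL)   # roughly what the OOM killer does
    child.join()
    print("exit code:", child.exitcode)  # -9; only the parent sees this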

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 31 2017, 4:35 AM

Oops, sorry, I forgot to add the tags.

Recent OOMs:

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep python2.7
Jul 21 05:11:05 quarry-runner-01 kernel: [2033363.352203] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:38:48 quarry-runner-01 kernel: [1135026.829074] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 20:58:04 quarry-runner-01 kernel: [1139782.184795] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 12 03:13:04 quarry-runner-01 kernel: [1248682.298552] python2.7 invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
zhuyifei1999@quarry-runner-02:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep python2.7
Jul 29 00:22:07 quarry-runner-02 kernel: [2707221.497408] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 21 05:12:05 quarry-runner-02 kernel: [2033426.048519] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 11 13:41:04 quarry-runner-02 kernel: [1199972.725694] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0

Uh,

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep invoked
Jul 27 12:34:04 quarry-runner-01 kernel: [2578342.319551] prometheus-pupp invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 29 15:08:06 quarry-runner-01 kernel: [2760384.263448] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 21 05:11:05 quarry-runner-01 kernel: [2033363.352203] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:28:04 quarry-runner-01 kernel: [1134382.335763] prometheus-pupp invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 19:38:48 quarry-runner-01 kernel: [1135026.829074] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 20:58:04 quarry-runner-01 kernel: [1139782.184795] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 12 03:13:04 quarry-runner-01 kernel: [1248682.298552] python2.7 invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
zhuyifei1999@quarry-runner-02:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep invoked
Jul 29 00:22:07 quarry-runner-02 kernel: [2707221.497408] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 21 05:12:05 quarry-runner-02 kernel: [2033426.048519] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:40:04 quarry-runner-02 kernel: [1135113.126689] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 11 13:41:04 quarry-runner-02 kernel: [1199972.725694] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul  6 22:49:04 quarry-runner-02 kernel: [800856.951173] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0

There was a kernel OOM kill one second prior to the query kill, but it wasn't Celery that got killed. Weird.

The current design, where the query status is stored by the same process that stores the query results, isn't going to work. SIGKILL cannot be caught, so only the Celery master process knows that the worker was lost, but AFAICT the master only records the status in the result backend, which is not where we store the status.

We could try to migrate to the Celery result backend, but that may be quite a lot of work. Alternatively, we could fork and store the results in a child process, but that would partially reinvent the wheel.
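
For illustration, a rough sketch of the fork-a-child idea. The two helper functions are placeholders standing in for Quarry's actual result writing and status update, not real code:

import os

def execute_and_store(query_run_id, sql):
    # Placeholder: run the SQL and write the result rows somewhere.
    pass

def set_status(query_run_id, status):
    # Placeholder: UPDATE query_run SET status = ... WHERE id = ...
    print(query_run_id, status)

def run_and_record(query_run_id, sql):
    pid = os.fork()
    if pid == 0:
        # Child: do the memory-hungry work; an OOM SIGKILL only takes this
        # process down, never the supervising parent.
        execute_and_store(query_run_id, sql)
        os._exit(0)
    # Parent: wait for the child and translate its fate into a status update,
    # which is exactly what the child itself can never do after a SIGKILL.
    _, wait_status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(wait_status):
        set_status(query_run_id, 'killed')
    elif os.WEXITSTATUS(wait_status) == 0:
        set_status(query_run_id, 'complete')
    else:
        set_status(query_run_id, 'failed')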

Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-cloud) [2018-03-01T00:37:23Z] <zhuyifei1999_> UPDATE query join query_revision on query.latest_rev_id = query_revision.id join query_run on latest_run_id = query_run.id SET status=1 where (status = 2 or status = 1) and query_run.timestamp <= DATE_ADD(NOW(), INTERVAL -1 HOUR); 251 rows affected (1.81 sec) T139162 T172086 T188564

zhuyifei1999 triaged this task as Low priority. · Mar 1 2018, 6:48 PM

With T188564 fixed, hopefully this won't be necessary.