
Gigantic query results cause a SIGKILL and the query status does not update
Open, Low, Public, Bug Report

Description

Tracked down this query:

SELECT *
FROM categorylinks

MariaDB [quarry]> select * from query where id = 20534;
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
| id    | user_id | title | latest_rev_id | last_touched        | parent_id | published | description               |
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
| 20534 |    2021 | NULL  |        197241 | 2017-07-27 12:21:21 |      NULL |         0 | Select all category links |
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
1 row in set (0.00 sec)

MariaDB [quarry]> select * from query_run where query_rev_id = 197241;
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| id     | query_rev_id | status | timestamp           | task_id                              | extra_info |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| 193235 |       197241 |      2 | 2017-07-27 12:21:21 | 20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6 | NULL       |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
1 row in set (0.00 sec)

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/syslog*.gz | grep 20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6 -A 5
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: [2017-07-27 12:34:05,006: ERROR/MainProcess] Task worker.run_query[20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6] raised unexpected: Terminated(9,)
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: Traceback (most recent call last):
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: File "/usr/lib/python2.7/dist-packages/billiard/pool.py", line 1673, in _set_terminated
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: raise Terminated(-(signum or 0))
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: Terminated: 9
Jul 27 12:35:01 quarry-runner-01 CRON[21232]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)

This is probably an OOM kill due to the excessive result size. Unfortunately, SIGKILL cannot be caught by the inner worker process, so the run stayed in a 'Running' state.
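
For reference, SIGKILL (like SIGSTOP) can never be delivered to a user-space handler, so the worker has no chance to flip the run out of 'Running' before it dies. A tiny illustration (not Quarry code):

import signal

# Attempting to install a handler for SIGKILL fails at the libc level:
# the kernel reserves the signal and the process simply dies.
try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
except (RuntimeError, OSError, ValueError) as exc:
    # Python 2.7 raises RuntimeError here, Python 3 raises OSError.
    print('cannot trap SIGKILL: %s' % exc)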

Event Timeline

Recent OOMs:

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep python2.7
Jul 21 05:11:05 quarry-runner-01 kernel: [2033363.352203] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:38:48 quarry-runner-01 kernel: [1135026.829074] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 20:58:04 quarry-runner-01 kernel: [1139782.184795] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 12 03:13:04 quarry-runner-01 kernel: [1248682.298552] python2.7 invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
zhuyifei1999@quarry-runner-02:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep python2.7
Jul 29 00:22:07 quarry-runner-02 kernel: [2707221.497408] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 21 05:12:05 quarry-runner-02 kernel: [2033426.048519] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 11 13:41:04 quarry-runner-02 kernel: [1199972.725694] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0

Uh,

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep invoked
Jul 27 12:34:04 quarry-runner-01 kernel: [2578342.319551] prometheus-pupp invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 29 15:08:06 quarry-runner-01 kernel: [2760384.263448] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 21 05:11:05 quarry-runner-01 kernel: [2033363.352203] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:28:04 quarry-runner-01 kernel: [1134382.335763] prometheus-pupp invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 19:38:48 quarry-runner-01 kernel: [1135026.829074] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 20:58:04 quarry-runner-01 kernel: [1139782.184795] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 12 03:13:04 quarry-runner-01 kernel: [1248682.298552] python2.7 invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
zhuyifei1999@quarry-runner-02:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep invoked
Jul 29 00:22:07 quarry-runner-02 kernel: [2707221.497408] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 21 05:12:05 quarry-runner-02 kernel: [2033426.048519] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:40:04 quarry-runner-02 kernel: [1135113.126689] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 11 13:41:04 quarry-runner-02 kernel: [1199972.725694] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul  6 22:49:04 quarry-runner-02 kernel: [800856.951173] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0

There was a kernel OOM kill one second prior to the query kill, but it wasn't Celery that got killed. Weird.

The current design of storing the query status from the same process that stores the query results isn't going to work. SIGKILL cannot be caught, so only the Celery master process knows the worker was lost, but AFAICT the master only records the failure in the result backend, which is not where we store the status.
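
For illustration, if Quarry did configure a Celery result backend (it currently does not; the Redis URL below is purely an assumption), the lost task would at least be visible there:

from celery import Celery
from celery.result import AsyncResult

# Hypothetical setup: a Redis result backend is an assumption,
# not Quarry's actual configuration.
app = Celery('quarry', backend='redis://localhost:6379/0')

res = AsyncResult('20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6', app=app)
print(res.state)   # the master marks the task FAILURE after Terminated(9,)
print(res.result)  # the Terminated exception instance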

We could try to migrate to the Celery result backend, but that may be quite some work. Alternatively, we could fork and store the results in a child process, but that is partially reinventing the wheel.
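
A minimal sketch of the fork-based alternative, assuming hypothetical helpers run_query_and_store_results(), mark_run_failed() and mark_run_complete() (not the actual Quarry worker API): the parent survives the child's SIGKILL and can always write a final status.

import os

def execute_run(query_run_id):
    pid = os.fork()
    if pid == 0:
        # Child: does the memory-hungry fetch/serialization and is the
        # process the oom-killer would SIGKILL.
        run_query_and_store_results(query_run_id)   # hypothetical helper
        os._exit(0)

    # Parent: waits for the child and records the outcome even when the
    # child was killed by an uncatchable signal.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        mark_run_failed(query_run_id,               # hypothetical helper
                        'killed by signal %d (likely OOM)' % os.WTERMSIG(status))
    elif os.WEXITSTATUS(status) != 0:
        mark_run_failed(query_run_id,
                        'exited with status %d' % os.WEXITSTATUS(status))
    else:
        mark_run_complete(query_run_id)             # hypothetical helper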

Stashbot subscribed.

Mentioned in SAL (#wikimedia-cloud) [2018-03-01T00:37:23Z] <zhuyifei1999_> UPDATE query join query_revision on query.latest_rev_id = query_revision.id join query_run on latest_run_id = query_run.id SET status=1 where (status = 2 or status = 1) and query_run.timestamp <= DATE_ADD(NOW(), INTERVAL -1 HOUR); 251 rows affected (1.81 sec) T139162 T172086 T188564

With T188564 fixed, hopefully this won't be necessary.

Aklapper changed the subtype of this task from "Task" to "Bug Report". Jul 18 2023, 12:20 PM