
Gigantic query results cause a SIGKILL and the query status does not update
Open, Low, Public, Bug Report

Description

Tracked down this query:

SELECT *
FROM categorylinks

MariaDB [quarry]> select * from query where id = 20534;
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
| id    | user_id | title | latest_rev_id | last_touched        | parent_id | published | description               |
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
| 20534 |    2021 | NULL  |        197241 | 2017-07-27 12:21:21 |      NULL |         0 | Select all category links |
+-------+---------+-------+---------------+---------------------+-----------+-----------+---------------------------+
1 row in set (0.00 sec)

MariaDB [quarry]> select * from query_run where query_rev_id = 197241;
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| id     | query_rev_id | status | timestamp           | task_id                              | extra_info |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
| 193235 |       197241 |      2 | 2017-07-27 12:21:21 | 20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6 | NULL       |
+--------+--------------+--------+---------------------+--------------------------------------+------------+
1 row in set (0.00 sec)

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/syslog*.gz | grep 20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6 -A 5
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: [2017-07-27 12:34:05,006: ERROR/MainProcess] Task worker.run_query[20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6] raised unexpected: Terminated(9,)
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: Traceback (most recent call last):
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: File "/usr/lib/python2.7/dist-packages/billiard/pool.py", line 1673, in _set_terminated
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: raise Terminated(-(signum or 0))
Jul 27 12:34:05 quarry-runner-01 celery-quarry-worker[591]: Terminated: 9
Jul 27 12:35:01 quarry-runner-01 CRON[21232]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)

This is probably an OOM kill due to the excessive result size. Unfortunately, SIGKILL cannot be caught by the inner worker process, so the run stayed in a 'Running' state.
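
For reference, SIGKILL (like SIGSTOP) can never be delivered to a user-space handler, so the worker has no chance to flip the run out of 'Running' before it dies. A tiny illustration (not Quarry code):

import signal

# Attempting to install a handler for SIGKILL fails at the libc level:
# the kernel reserves the signal and the process simply dies.
try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
except (RuntimeError, OSError, ValueError) as exc:
    # Python 2.7 raises RuntimeError here, Python 3 raises OSError.
    print('cannot trap SIGKILL: %s' % exc)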

Event Timeline

Recent OOMs:

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep python2.7
Jul 21 05:11:05 quarry-runner-01 kernel: [2033363.352203] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:38:48 quarry-runner-01 kernel: [1135026.829074] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 20:58:04 quarry-runner-01 kernel: [1139782.184795] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 12 03:13:04 quarry-runner-01 kernel: [1248682.298552] python2.7 invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
zhuyifei1999@quarry-runner-02:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep python2.7
Jul 29 00:22:07 quarry-runner-02 kernel: [2707221.497408] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 21 05:12:05 quarry-runner-02 kernel: [2033426.048519] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 11 13:41:04 quarry-runner-02 kernel: [1199972.725694] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0

Uh,

zhuyifei1999@quarry-runner-01:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep invoked
Jul 27 12:34:04 quarry-runner-01 kernel: [2578342.319551] prometheus-pupp invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 29 15:08:06 quarry-runner-01 kernel: [2760384.263448] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 21 05:11:05 quarry-runner-01 kernel: [2033363.352203] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:28:04 quarry-runner-01 kernel: [1134382.335763] prometheus-pupp invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 19:38:48 quarry-runner-01 kernel: [1135026.829074] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 10 20:58:04 quarry-runner-01 kernel: [1139782.184795] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 12 03:13:04 quarry-runner-01 kernel: [1248682.298552] python2.7 invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
zhuyifei1999@quarry-runner-02:~$ zcat /var/log/messages*.gz | cat /var/log/messages* - | grep oom | grep invoked
Jul 29 00:22:07 quarry-runner-02 kernel: [2707221.497408] python2.7 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
Jul 21 05:12:05 quarry-runner-02 kernel: [2033426.048519] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 10 19:40:04 quarry-runner-02 kernel: [1135113.126689] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul 11 13:41:04 quarry-runner-02 kernel: [1199972.725694] python2.7 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Jul  6 22:49:04 quarry-runner-02 kernel: [800856.951173] prometheus-pupp invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0

There was a kernel OOM kill one second prior to the query kill, but it wasn't Celery that got killed. Weird.

The current design of storing the query status from the same process that stores the query results isn't going to work. SIGKILL cannot be caught, so only the Celery master process knows the worker was lost, but AFAICT the master only records the failure in the result backend, which is not where we store the status.
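
For illustration, if Quarry did configure a Celery result backend (it currently does not; the Redis URL below is purely an assumption), the lost task would at least be visible there:

from celery import Celery
from celery.result import AsyncResult

# Hypothetical setup: a Redis result backend is an assumption,
# not Quarry's actual configuration.
app = Celery('quarry', backend='redis://localhost:6379/0')

res = AsyncResult('20f5aad9-64c5-4f5b-8360-5fcc4e4de5c6', app=app)
print(res.state)   # the master marks the task FAILURE after Terminated(9,)
print(res.result)  # the Terminated exception instance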

We could try to migrate to the Celery result backend, but that may be quite some work. Alternatively, we could fork and store the results in a child process, but that is partially reinventing the wheel.
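
A minimal sketch of the fork-based alternative, assuming hypothetical helpers run_query_and_store_results(), mark_run_failed() and mark_run_complete() (not the actual Quarry worker API): the parent survives the child's SIGKILL and can always write a final status.

import os

def execute_run(query_run_id):
    pid = os.fork()
    if pid == 0:
        # Child: does the memory-hungry fetch/serialization and is the
        # process the oom-killer would SIGKILL.
        run_query_and_store_results(query_run_id)   # hypothetical helper
        os._exit(0)

    # Parent: waits for the child and records the outcome even when the
    # child was killed by an uncatchable signal.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        mark_run_failed(query_run_id,               # hypothetical helper
                        'killed by signal %d (likely OOM)' % os.WTERMSIG(status))
    elif os.WEXITSTATUS(status) != 0:
        mark_run_failed(query_run_id,
                        'exited with status %d' % os.WEXITSTATUS(status))
    else:
        mark_run_complete(query_run_id)             # hypothetical helper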

Stashbot subscribed.

Mentioned in SAL (#wikimedia-cloud) [2018-03-01T00:37:23Z] <zhuyifei1999_> UPDATE query join query_revision on query.latest_rev_id = query_revision.id join query_run on latest_run_id = query_run.id SET status=1 where (status = 2 or status = 1) and query_run.timestamp <= DATE_ADD(NOW(), INTERVAL -1 HOUR); 251 rows affected (1.81 sec) T139162 T172086 T188564

With T188564 fixed, hopefully this won't be necessary.

Aklapper changed the subtype of this task from "Task" to "Bug Report". Jul 18 2023, 12:20 PM