I've killed the job and disabled the cronjob temporarily. It will need a discussion with the DBAs about optimizing the report so it doesn't consume so many DB resources before it can be re-enabled.
The very long-running query was:
| Id       | User         | Host              | db       | Command | Time   | State        |
| 21844654 | p50380g50440 | 10.68.xx.xx:54190 | enwiki_p | Execute | 577307 | Sending data |

/* confirmedusers.py SLOW_OK */
SELECT CONVERT(user_name USING utf8), user_editcount, rev_timestamp
FROM user
JOIN user_groups ON ug_user = user_id
JOIN revision ON rev_user = user_id
    AND ug_group = 'confirmed'
    AND user_editcount > 9
    AND (SELECT MIN(rev_timestamp) FROM revision WHERE rev_user = user_id)
        < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 4 DAY), '%Y%m%d%H%i%s')
    AND rev_timestamp = (SELECT MIN(rev_timestamp) FROM revision WHERE rev_user = user_id)
Here is the EXPLAIN; note the two DEPENDENT SUBQUERY rows, each estimated to scan roughly 322 million index entries:
+------+--------------------+-------------+--------+------------------+--------------------+---------+----------------------------+-----------+--------------------------+
| id   | select_type        | table       | type   | possible_keys    | key                | key_len | ref                        | rows      | Extra                    |
+------+--------------------+-------------+--------+------------------+--------------------+---------+----------------------------+-----------+--------------------------+
|    1 | PRIMARY            | user_groups | ref    | PRIMARY,ug_group | ug_group           | 257     | const                      |       497 | Using where; Using index |
|    1 | PRIMARY            | user        | eq_ref | PRIMARY          | PRIMARY            | 4       | enwiki.user_groups.ug_user |         1 | Using where              |
|    1 | PRIMARY            | revision    | ref    | rev_timestamp    | rev_timestamp      | 16      | func                       |         1 | Using where              |
|    3 | DEPENDENT SUBQUERY | revision    | index  | NULL             | usertext_timestamp | 541     | NULL                       | 322795979 | Using where; Using index |
|    2 | DEPENDENT SUBQUERY | revision    | index  | NULL             | usertext_timestamp | 541     | NULL                       | 322795979 | Using where; Using index |
+------+--------------------+-------------+--------+------------------+--------------------+---------+----------------------------+-----------+--------------------------+
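The two identical correlated MIN(rev_timestamp) subqueries are what hurt here: they are re-evaluated for every joined row. Purely as a sketch (untested, and not the script's actual query), the same first-edit lookup could be computed once per user in a derived table; it still scans revision once, so the per-user approach suggested below is probably the bigger win:

-- Illustrative rewrite only: fold the two identical correlated subqueries
-- into a single derived table that computes each user's first edit once.
SELECT CONVERT(u.user_name USING utf8) AS user_name,
       u.user_editcount,
       f.first_edit
FROM user_groups ug
JOIN user u
    ON u.user_id = ug.ug_user
JOIN (SELECT rev_user, MIN(rev_timestamp) AS first_edit
      FROM revision
      GROUP BY rev_user) AS f
    ON f.rev_user = u.user_id
WHERE ug.ug_group = 'confirmed'
  AND u.user_editcount > 9
  AND f.first_edit < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 4 DAY), '%Y%m%d%H%i%s');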
MariaDB [enwiki_p]> select count(*) from user_groups where ug_group = 'confirmed';
+----------+
| count(*) |
+----------+
|      495 |
+----------+
1 row in set (0.00 sec)
Presumably we could just run 495 individual queries against enwiki_p.revision_userindex to get the rev_timestamps; the report would then generate a lot faster and be significantly cheaper to update.
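For illustration only, a sketch of what that could look like, assuming the report first fetches the confirmed users with more than 9 edits and then looks up each one's first edit separately (the user_id 12345 is just a placeholder):

-- Hypothetical driver query: the ~495 confirmed users the report would loop over.
SELECT user_id, CONVERT(user_name USING utf8) AS user_name, user_editcount
FROM user
JOIN user_groups ON ug_user = user_id
WHERE ug_group = 'confirmed'
  AND user_editcount > 9;

-- Hypothetical per-user lookup, run once per user_id from the query above;
-- revision_userindex exposes the user index, so this should be a cheap indexed read.
SELECT MIN(rev_timestamp) AS first_edit
FROM enwiki_p.revision_userindex
WHERE rev_user = 12345;   -- placeholder user_id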
tools.dbreps@tools-bastion-03:~$ crontab -l | grep -B1 confirmedusers
# Disabled for using too much Db Resource - YuviPanda (T131956)
# 0 1 * * 1,4 jsub -once -j y -mem 280m -N general.confirmedusers -o ~/var/log -quiet ~/bin/dbreps -r general.confirmedusers -s enwiki
Thank you for disabling, @yuvipanda. Sorry about the trouble. :-(
p50380g50440 was running several queries that were never going to finish executing, causing one day of replication lag on labsdb1001:
      Id: 3243394
    User: p50380g50440
    Host: 10.68.16.126:59189
      db: enwiki_p
 Command: Execute
    Time: 554131
   State: Copying to tmp table
    Info: /* linkedredlinkedcats.py SLOW_OK */
          SELECT CONVERT(cl_to USING utf8), COUNT(*)
          FROM categorylinks
          JOIN pagelinks ON pl_title = cl_to AND pl_namespace = 14
          JOIN page AS p1 ON pl_from = p1.page_id AND p1.page_namespace IN (0, 6, 10, 12, 14, 100)
          LEFT JOIN page AS p2 ON cl_to = p2.page_title AND p2.page_namespace = 14
          WHERE p2.page_title IS NULL
          GROUP BY 1
          LIMIT 1000
Progress: 0.000

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: range
possible_keys: PRIMARY,name_title
          key: name_title
      key_len: 4
          ref: NULL
         rows: 24632789
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: ref
possible_keys: PRIMARY,pl_from,pl_namespace,pl_backlinks_namespace
          key: PRIMARY
      key_len: 8
          ref: enwiki.page.page_id,const
         rows: 2
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: categorylinks
         type: ref
possible_keys: cl_timestamp,cl_sortkey
          key: cl_timestamp
      key_len: 257
          ref: enwiki.pagelinks.pl_title
         rows: 20
        Extra: Using index
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: name_title
          key: name_title
      key_len: 261
          ref: const,enwiki.pagelinks.pl_title
         rows: 1
        Extra: Using where; Using index

      Id: 10706891
    User: p50380g50440
    Host: 10.68.18.208:39267
      db: commonswiki_p
 Command: Execute
    Time: 7075
   State: Queried about 5850000 rows
    Info: /* selfcatcats.py SLOW_OK */
          SELECT CONVERT(page_title USING utf8), cat_pages, cat_subcats
          FROM page
          JOIN categorylinks ON cl_to = page_title
          RIGHT JOIN category ON cat_title = page_title
          WHERE page_id = cl_from AND page_namespace = 14
Progress: 0.000
(in particular, note the 6-day query doing a very expensive sort).
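For context, that report is an anti-join: it counts, per category, how many content-namespace pages link to a category that is in use but has no category page. A semantically equivalent way to write it is sketched below purely for illustration (untested, and not necessarily any cheaper, since the GROUP BY over the large join is what forces the temporary table and filesort):

-- Illustrative equivalent of the LEFT JOIN ... IS NULL anti-join above (untested).
SELECT CONVERT(cl_to USING utf8), COUNT(*)
FROM categorylinks
JOIN pagelinks ON pl_title = cl_to AND pl_namespace = 14
JOIN page AS p1 ON pl_from = p1.page_id
               AND p1.page_namespace IN (0, 6, 10, 12, 14, 100)
WHERE NOT EXISTS (SELECT 1
                  FROM page AS p2
                  WHERE p2.page_title = cl_to
                    AND p2.page_namespace = 14)
GROUP BY 1
LIMIT 1000;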
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1001&var-port=9104&panelId=11&fullscreen&from=now-7d&to=now
I have banned p50380g50440 from labsdb1001 (only from that host, not from the other replicas or toolsdb, and only after a warning had been sent in advance) until there is a guarantee that these queries will not happen again, and to let the replica catch up on replication.
Any objection to closing this ticket?
It is pretty old; the problematic job was disabled more than a year ago, and the old labsdb servers are going away.