Per iOS team, the bot request is timing out (https://en.wikipedia.org/w/rest.php/v1/page/Dog/history/counts/bot)
Some pages with long histories have no results, but pages with short histories do return quickly.
Per iOS team, the bot request is timing out (https://en.wikipedia.org/w/rest.php/v1/page/Dog/history/counts/bot)
Some pages with long histories have no results, but pages with short histories do return quickly.
I tried running this request from eval.php, but I was not patient enough to find out how long it would have taken. I killed the query after about 10 minutes. SHOW FULL PROCESSLIST before killing:
*************************** 15. row *************************** Id: 14346441862 User: wikiadmin Host: 10.64.32.16:40424 db: enwiki Command: Query Time: 724 State: Sending data Info: SELECT /* MediaWiki\Rest\Handler\PageHistoryCountHandler::getBotCount www-data@deploy... */ COUNT(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) WHERE rev_page = '4269567' AND (EXISTS(SELECT 1 FROM `user_groups` WHERE (actor.actor_user = ug_user) AND ug_group = 'bot' AND (ug_expiry IS NULL OR ug_expiry >= '20191115022503') )) AND ((rev_deleted & 4) != 4) LIMIT 1 Progress: 0.000
I accidentally ran the same query without finding out the DB server it ran on, and killed PHP with ctrl-C, which didn't kill the query. By the time I found the query and killed it, it had been running on db1119 for 1180 seconds.
Explain seems innocuous enough:
MariaDB [enwiki]> explain SELECT /* MediaWiki\Rest\Handler\PageHistoryCountHandler::getBotCount www-data@deploy... */ COUNT(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) WHERE rev_page = '4269567' AND (EXISTS(SELECT 1 FROM `user_groups` WHERE (actor.actor_user = ug_user) AND ug_group = 'bot' AND (ug_expiry IS NULL OR ug_expiry >= '20191115022503') )) AND ((rev_deleted & 4) != 4) LIMIT 1\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: user_groups type: ref possible_keys: PRIMARY,ug_group,ug_expiry key: ug_group key_len: 257 ref: const rows: 303 Extra: Using index condition; Using where *************************** 2. row *************************** id: 1 select_type: PRIMARY table: actor type: ref possible_keys: PRIMARY,actor_user key: actor_user key_len: 5 ref: enwiki.user_groups.ug_user rows: 1 Extra: Using index *************************** 3. row *************************** id: 1 select_type: PRIMARY table: revision_actor_temp type: ref possible_keys: PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp key: actor_timestamp key_len: 8 ref: enwiki.actor.actor_id rows: 6 Extra: *************************** 4. row *************************** id: 1 select_type: PRIMARY table: revision type: eq_ref possible_keys: PRIMARY,rev_page_id,page_timestamp,page_user_timestamp key: PRIMARY key_len: 4 ref: enwiki.revision_actor_temp.revactor_rev rows: 1 Extra: Using where
Starting from simple queries and working up until we have a problem:
MariaDB [enwiki]> select count(*) from revision where rev_page=4269567; +----------+ | count(*) | +----------+ | 11646 | +----------+ 1 row in set (0.01 sec) MariaDB [enwiki]> SELECT COUNT(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) WHERE rev_page = '4269567'; +----------+ | COUNT(*) | +----------+ | 11646 | +----------+ 1 row in set (3.81 sec) MariaDB [enwiki]> SELECT COUNT(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) WHERE rev_page = '4269567' AND (EXISTS(SELECT 1 FROM `user_groups` WHERE (actor.actor_user = ug_user)));
Killed after 81 seconds. So is EXISTS() the problem, or contains the problem. Reformulating with a join instead:
SELECT count(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON actor.actor_user = ug_user AND ug_group='bot' WHERE rev_page=4269567; +----------+ | count(*) | +----------+ | 178 | +----------+ 1 row in set (0.01 sec)
Seems to work. I see from https://gerrit.wikimedia.org/r/c/mediawiki/core/+/538133/19/includes/Rest/Handler/PageHistoryHandler.php#189 that the EXISTS() was added in case there is more than one group to search for, and a given user is in multiple groups. Here reproduced locally, along with an alternate fix, which is to use COUNT(DISTINCT rev_id):
MariaDB [mw]> SELECT count(distinct rev_id),count(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON actor.actor_user = ug_user AND ug_group in ('sysop','bureaucrat') WHERE rev_page=3; +------------------------+----------+ | count(distinct rev_id) | count(*) | +------------------------+----------+ | 11 | 22 | +------------------------+----------+ 1 row in set (0.00 sec)
The performance in production is the same.
So the plot thickened when I wrote a patch and tested the resulting query in production. It was slow. So I isolated the relevant difference:
MariaDB [enwiki]> SELECT count(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON ((actor.actor_user = ug_user) AND ug_group='bot') WHERE rev_page=4269567; +----------+ | count(*) | +----------+ | 178 | +----------+ 1 row in set (0.00 sec) MariaDB [enwiki]> SELECT count(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN `user_groups` ON ((actor.actor_user = ug_user) AND ug_group='bot') WHERE rev_page = '4269567'; ^CCtrl-C -- query killed. Continuing normally. ERROR 1317 (70100): Query execution was interrupted
Adding quotes to the page_id causes it to take forever. Yeah, you read that right. The query plan changes: for revision_actor_temp the slow query uses the actor_timestamp index, whereas the fast query uses the page_actor_timestamp index:
MariaDB [enwiki]> explain SELECT count(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON ((actor.actor_user = ug_user) AND ug_group='bot') WHERE rev_page='4269567'; +------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------+---------+-----------------------------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------+---------+-----------------------------------------+------+--------------------------+ | 1 | SIMPLE | user_groups | ref | PRIMARY,ug_group | ug_group | 257 | const | 303 | Using where; Using index | | 1 | SIMPLE | actor | ref | PRIMARY,actor_user | actor_user | 5 | enwiki.user_groups.ug_user | 1 | Using index | | 1 | SIMPLE | revision_actor_temp | ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | actor_timestamp | 8 | enwiki.actor.actor_id | 6 | | | 1 | SIMPLE | revision | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | PRIMARY | 4 | enwiki.revision_actor_temp.revactor_rev | 1 | Using where | +------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------+---------+-----------------------------------------+------+--------------------------+ 4 rows in set (0.00 sec) MariaDB [enwiki]> explain SELECT count(*) FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON ((actor.actor_user = ug_user) AND ug_group='bot') WHERE rev_page=4269567; +------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+ | 1 | SIMPLE | user_groups | ref | PRIMARY,ug_group | ug_group | 257 | const | 303 | Using where; Using index | | 1 | SIMPLE | actor | ref | PRIMARY,actor_user | actor_user | 5 | enwiki.user_groups.ug_user | 1 | Using index | | 1 | SIMPLE | revision_actor_temp | ref | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | page_actor_timestamp | 12 | const,enwiki.actor.actor_id | 1 | Using index | | 1 | SIMPLE | revision | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | PRIMARY | 4 | enwiki.revision_actor_temp.revactor_rev | 1 | Using where | +------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+ 4 rows in set (0.00 sec)
I returned to the original query with EXISTS(), and removed the quotes, and it ran quickly.
Adding quotes to numeric parameters was a decision made by Brion in 2004, rSVN4984, to fix T2255. Really the problem at the time was incorrect use of is_numeric(). If is_int() was used then it would have been OK. But revisiting that decision now is pretty scary.
Change 551024 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Have Database::addQuotes() pass through bare integers without quoting
Interesting the spike it caused on db1119: https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1119&var-port=9104&from=1573767770675&to=1573798922867
Adding quotes to the page_id causes it to take forever. Yeah, you read that right. The query plan changes: for revision_actor_temp the slow query uses the actor_timestamp index, whereas the fast query uses the page_actor_timestamp index:
Good catch, and pretty scary indeed.
What is also very interesting is how the explain doesn't really change if you have the quotes or not. Not only the EXPLAIN SELECT but the SHOW EXPLAIN FOR which sometimes can disagree and show different things from the theoretical explain than the _actual_ explain.
Change 551037 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Workaround for MariaDB query plan bug
I've run into something like that once before (T140302), although in that case it was only out-of-range values that caused the problem.
This was impressive to watch.
Is this patch applied in production, or do we need to do something about it?
Change 551037 merged by jenkins-bot:
[mediawiki/core@master] Workaround for MariaDB query plan bug
Change 551353 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@wmf/1.35.0-wmf.5] [wmf/1.35.0-wmf.5] Workaround for MariaDB query plan bug
Change 551353 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.5] [wmf/1.35.0-wmf.5] Workaround for MariaDB query plan bug
Mentioned in SAL (#wikimedia-operations) [2019-11-18T00:44:24Z] <tstarling@deploy1001> Synchronized php-1.35.0-wmf.5/includes/Rest/Handler/PageHistoryCountHandler.php: fix extremely slow query T238378 (duration: 00m 59s)
Change 551024 merged by jenkins-bot:
[mediawiki/core@master] Have Database::addQuotes() pass through bare integers without quoting
Seems like this has caused T239877 because of a matching but opposite error in the query planner. My guess is that WHERE foo=1 skips the index for foo if foo is a string type, perhaps because it wants to convert each foo string to an integer individually to account for 01 and -0 and similar isomorphisms.