Page MenuHomePhabricator

Bot edit count endpoint is timing out
Closed, ResolvedPublicBUG REPORT

Description

Per iOS team, the bot request is timing out (https://en.wikipedia.org/w/rest.php/v1/page/Dog/history/counts/bot)

Some pages with long histories have no results, but pages with short histories do return quickly.

Details

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptThu, Nov 14, 9:55 PM
eprodromou triaged this task as High priority.Thu, Nov 14, 10:13 PM
eprodromou updated the task description. (Show Details)Thu, Nov 14, 10:48 PM
WDoranWMF changed the subtype of this task from "Task" to "Bug Report".Thu, Nov 14, 10:48 PM

I tried running this request from eval.php, but I was not patient enough to find out how long it would have taken. I killed the query after about 10 minutes. SHOW FULL PROCESSLIST before killing:

*************************** 15. row ***************************
      Id: 14346441862
    User: wikiadmin
    Host: 10.64.32.16:40424
      db: enwiki
 Command: Query
    Time: 724
   State: Sending data
    Info: SELECT /* MediaWiki\Rest\Handler\PageHistoryCountHandler::getBotCount www-data@deploy... */  COUNT(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id))   WHERE rev_page = '4269567' AND (EXISTS(SELECT  1  FROM `user_groups`    WHERE (actor.actor_user = ug_user) AND ug_group = 'bot' AND (ug_expiry IS NULL OR ug_expiry >= '20191115022503')  )) AND ((rev_deleted & 4) != 4)  LIMIT 1
Progress: 0.000

I accidentally ran the same query without finding out the DB server it ran on, and killed PHP with ctrl-C, which didn't kill the query. By the time I found the query and killed it, it had been running on db1119 for 1180 seconds.

Explain seems innocuous enough:

MariaDB [enwiki]> explain SELECT /* MediaWiki\Rest\Handler\PageHistoryCountHandler::getBotCount www-data@deploy... */  COUNT(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id))   WHERE rev_page = '4269567' AND (EXISTS(SELECT  1  FROM `user_groups`    WHERE (actor.actor_user = ug_user) AND ug_group = 'bot' AND (ug_expiry IS NULL OR ug_expiry >= '20191115022503')  )) AND ((rev_deleted & 4) != 4)  LIMIT 1\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: user_groups
         type: ref
possible_keys: PRIMARY,ug_group,ug_expiry
          key: ug_group
      key_len: 257
          ref: const
         rows: 303
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: actor
         type: ref
possible_keys: PRIMARY,actor_user
          key: actor_user
      key_len: 5
          ref: enwiki.user_groups.ug_user
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: revision_actor_temp
         type: ref
possible_keys: PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp
          key: actor_timestamp
      key_len: 8
          ref: enwiki.actor.actor_id
         rows: 6
        Extra: 
*************************** 4. row ***************************
           id: 1
  select_type: PRIMARY
        table: revision
         type: eq_ref
possible_keys: PRIMARY,rev_page_id,page_timestamp,page_user_timestamp
          key: PRIMARY
      key_len: 4
          ref: enwiki.revision_actor_temp.revactor_rev
         rows: 1
        Extra: Using where

Starting from simple queries and working up until we have a problem:

MariaDB [enwiki]> select count(*) from revision where rev_page=4269567;
+----------+
| count(*) |
+----------+
|    11646 |
+----------+
1 row in set (0.01 sec)

MariaDB [enwiki]> SELECT COUNT(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id))   WHERE rev_page = '4269567';
+----------+
| COUNT(*) |
+----------+
|    11646 |
+----------+
1 row in set (3.81 sec)

MariaDB [enwiki]> SELECT COUNT(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id))   WHERE rev_page = '4269567' AND (EXISTS(SELECT  1  FROM `user_groups`    WHERE (actor.actor_user = ug_user)));

Killed after 81 seconds. So is EXISTS() the problem, or contains the problem. Reformulating with a join instead:

SELECT count(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON actor.actor_user = ug_user AND ug_group='bot' WHERE rev_page=4269567;
+----------+
| count(*) |
+----------+
|      178 |
+----------+
1 row in set (0.01 sec)

Seems to work. I see from https://gerrit.wikimedia.org/r/c/mediawiki/core/+/538133/19/includes/Rest/Handler/PageHistoryHandler.php#189 that the EXISTS() was added in case there is more than one group to search for, and a given user is in multiple groups. Here reproduced locally, along with an alternate fix, which is to use COUNT(DISTINCT rev_id):

MariaDB [mw]> SELECT count(distinct rev_id),count(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON actor.actor_user = ug_user AND ug_group in ('sysop','bureaucrat')  WHERE rev_page=3;
+------------------------+----------+
| count(distinct rev_id) | count(*) |
+------------------------+----------+
|                     11 |       22 |
+------------------------+----------+
1 row in set (0.00 sec)

The performance in production is the same.

So the plot thickened when I wrote a patch and tested the resulting query in production. It was slow. So I isolated the relevant difference:

MariaDB [enwiki]> SELECT count(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON ((actor.actor_user = ug_user) AND ug_group='bot') WHERE rev_page=4269567;
+----------+
| count(*) |
+----------+
|      178 |
+----------+
1 row in set (0.00 sec)

MariaDB [enwiki]> SELECT count(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN `user_groups` ON ((actor.actor_user = ug_user) AND ug_group='bot')   WHERE rev_page = '4269567';
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted

Adding quotes to the page_id causes it to take forever. Yeah, you read that right. The query plan changes: for revision_actor_temp the slow query uses the actor_timestamp index, whereas the fast query uses the page_actor_timestamp index:

MariaDB [enwiki]> explain SELECT count(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON ((actor.actor_user = ug_user) AND ug_group='bot') WHERE rev_page='4269567';
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------+---------+-----------------------------------------+------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key             | key_len | ref                                     | rows | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------+---------+-----------------------------------------+------+--------------------------+
|    1 | SIMPLE      | user_groups         | ref    | PRIMARY,ug_group                                          | ug_group        | 257     | const                                   |  303 | Using where; Using index |
|    1 | SIMPLE      | actor               | ref    | PRIMARY,actor_user                                        | actor_user      | 5       | enwiki.user_groups.ug_user              |    1 | Using index              |
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | actor_timestamp | 8       | enwiki.actor.actor_id                   |    6 |                          |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | PRIMARY         | 4       | enwiki.revision_actor_temp.revactor_rev |    1 | Using where              |
+------+-------------+---------------------+--------+-----------------------------------------------------------+-----------------+---------+-----------------------------------------+------+--------------------------+
4 rows in set (0.00 sec)

MariaDB [enwiki]> explain SELECT count(*)  FROM `revision_actor_temp` JOIN `revision` ON ((revactor_rev = rev_id AND revactor_page = rev_page)) JOIN `actor` ON ((revactor_actor = actor_id)) JOIN user_groups ON ((actor.actor_user = ug_user) AND ug_group='bot') WHERE rev_page=4269567;
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+
| id   | select_type | table               | type   | possible_keys                                             | key                  | key_len | ref                                     | rows | Extra                    |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+
|    1 | SIMPLE      | user_groups         | ref    | PRIMARY,ug_group                                          | ug_group             | 257     | const                                   |  303 | Using where; Using index |
|    1 | SIMPLE      | actor               | ref    | PRIMARY,actor_user                                        | actor_user           | 5       | enwiki.user_groups.ug_user              |    1 | Using index              |
|    1 | SIMPLE      | revision_actor_temp | ref    | PRIMARY,revactor_rev,actor_timestamp,page_actor_timestamp | page_actor_timestamp | 12      | const,enwiki.actor.actor_id             |    1 | Using index              |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp    | PRIMARY              | 4       | enwiki.revision_actor_temp.revactor_rev |    1 | Using where              |
+------+-------------+---------------------+--------+-----------------------------------------------------------+----------------------+---------+-----------------------------------------+------+--------------------------+
4 rows in set (0.00 sec)

I returned to the original query with EXISTS(), and removed the quotes, and it ran quickly.

Adding quotes to numeric parameters was a decision made by Brion in 2004, rSVN4984, to fix T2255. Really the problem at the time was incorrect use of is_numeric(). If is_int() was used then it would have been OK. But revisiting that decision now is pretty scary.

Change 551024 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Have Database::addQuotes() pass through bare integers without quoting

https://gerrit.wikimedia.org/r/551024

Interesting the spike it caused on db1119: https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1119&var-port=9104&from=1573767770675&to=1573798922867

Adding quotes to the page_id causes it to take forever. Yeah, you read that right. The query plan changes: for revision_actor_temp the slow query uses the actor_timestamp index, whereas the fast query uses the page_actor_timestamp index:

Good catch, and pretty scary indeed.
What is also very interesting is how the explain doesn't really change if you have the quotes or not. Not only the EXPLAIN SELECT but the SHOW EXPLAIN FOR which sometimes can disagree and show different things from the theoretical explain than the _actual_ explain.

Change 551037 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Workaround for MariaDB query plan bug

https://gerrit.wikimedia.org/r/551037

Adding quotes to the page_id causes it to take forever. Yeah, you read that right. The query plan changes: for revision_actor_temp the slow query uses the actor_timestamp index, whereas the fast query uses the page_actor_timestamp index:

I've run into something like that once before (T140302), although in that case it was only out-of-range values that caused the problem.

This was impressive to watch.

Is this patch applied in production, or do we need to do something about it?

I'm going to deploy it now.

Change 551037 merged by jenkins-bot:
[mediawiki/core@master] Workaround for MariaDB query plan bug

https://gerrit.wikimedia.org/r/551037

Change 551353 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@wmf/1.35.0-wmf.5] [wmf/1.35.0-wmf.5] Workaround for MariaDB query plan bug

https://gerrit.wikimedia.org/r/551353

Change 551353 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.5] [wmf/1.35.0-wmf.5] Workaround for MariaDB query plan bug

https://gerrit.wikimedia.org/r/551353

Mentioned in SAL (#wikimedia-operations) [2019-11-18T00:44:24Z] <tstarling@deploy1001> Synchronized php-1.35.0-wmf.5/includes/Rest/Handler/PageHistoryCountHandler.php: fix extremely slow query T238378 (duration: 00m 59s)

tstarling closed this task as Resolved.Mon, Nov 18, 12:47 AM

Deployed now. The link in the task description now returns quickly.

Change 551024 merged by jenkins-bot:
[mediawiki/core@master] Have Database::addQuotes() pass through bare integers without quoting

https://gerrit.wikimedia.org/r/551024

cscott added a subscriber: cscott.Thu, Dec 5, 8:06 PM

Seems like this has caused T239877 because of a matching but opposite error in the query planner. My guess is that WHERE foo=1 skips the index for foo if foo is a string type, perhaps because it wants to convert each foo string to an integer individually to account for 01 and -0 and similar isomorphisms.