Page MenuHomePhabricator

prop=revisions API timing out for a specific user and pages they edited
Closed, ResolvedPublic

Description

This was reported to me because someone was having trouble using Twinkle. I looked into it and this API endpoint is timing out:

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=sha1%7Cids%7Ctimestamp%7Ccomment&rvlimit=100&rvstartid=845694425&rvexcludeuser=Tyagiaman&indexpageids=true&redirects=true&titles=tyagi

{

  "error": {

    "code": "internal_api_error_DBQueryTimeoutError",

    "info": "[WyPx5ApAMD8AADNTBqgAAABR] Database query error."

  },

  "servedby": "mw1228"

}

If I change it to MusikAnimal, it works:

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=sha1%7Cids%7Ctimestamp%7Ccomment&rvlimit=100&rvstartid=845694425&rvexcludeuser=MusikAnimal&indexpageids=true&redirects=true&titles=tyagi

So it seems this is an issue with User:Tyagiaman, specifically.

Event Timeline

Paladox raised the priority of this task from High to Unbreak Now!.Jun 15 2018, 5:08 PM
Paladox subscribed.

Affects prod so UBN. Also it loads very very very slow that url.

MusikAnimal renamed this task from prop=revisions API timing out for a specific user to prop=revisions API timing out for a specific user and pages they edited.Jun 15 2018, 5:20 PM

rvexcludeuser may be the problem here. Queries with explicitly exclude something are usually expensive (where not exists, where X not in, or where A <> B). Maybe this is causing the query to not using the proper indices.

The exception is indeed a slow query getting killed. An example query is:

SELECT  rev_id,rev_page,rev_text_id,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len  FROM `revision` LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `page` ON ((page_id = rev_page))   WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = '6097020' AND (NOT((rev_user = '2619640'))) AND ((rev_deleted & 12) != 12)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;

We can simplify the query to still reproduce the bug:

wikiadmin@10.64.16.13(enwiki)>explain SELECT * FROM revision WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+----------+-------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key            | key_len | ref  | rows     | Extra       |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+----------+-------------+
|    1 | SIMPLE      | revision | index | PRIMARY,rev_page_id,page_timestamp,rev_timestamp,user_timestamp,page_user_timestamp | page_timestamp | 20      | NULL | 41239284 | Using where |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+----------+-------------+

The odd thing here is that even using page_timestamp it should only have to touch a maximum of about 1676 rows when using the page_timestamp index, since that's how many rows pass rev_page = 6097020. This reminds me of T196526: Geodata running long running queries on Commons, and in fact I find the same oddities here that I described in T196526#4276656.

wikiadmin@10.64.16.13(enwiki)>explain SELECT * FROM revision ignore index(page_user_timestamp) WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
+------+-------------+----------+-------+-----------------------------------------------------------------+----------------+---------+------+------+-------------+
| id   | select_type | table    | type  | possible_keys                                                   | key            | key_len | ref  | rows | Extra       |
+------+-------------+----------+-------+-----------------------------------------------------------------+----------------+---------+------+------+-------------+
|    1 | SIMPLE      | revision | range | PRIMARY,rev_page_id,page_timestamp,rev_timestamp,user_timestamp | page_timestamp | 24      | NULL | 1666 | Using where |
+------+-------------+----------+-------+-----------------------------------------------------------------+----------------+---------+------+------+-------------+
wikiadmin@10.64.16.13(enwiki)>explain SELECT * FROM revision ignore index(page_timestamp) WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
+------+-------------+----------+-------+----------------------------------------------------------------------+---------------------+---------+------+------+----------------------------------------------------+
| id   | select_type | table    | type  | possible_keys                                                        | key                 | key_len | ref  | rows | Extra                                              |
+------+-------------+----------+-------+----------------------------------------------------------------------+---------------------+---------+------+------+----------------------------------------------------+
|    1 | SIMPLE      | revision | range | PRIMARY,rev_page_id,rev_timestamp,user_timestamp,page_user_timestamp | page_user_timestamp | 8       | NULL | 1660 | Using index condition; Using where; Using filesort |
+------+-------------+----------+-------+----------------------------------------------------------------------+---------------------+---------+------+------+----------------------------------------------------+

And if I try it on db1052 which is still on MariaDB 10.0.28, it seems the bug isn't present there either:

wikiadmin@db1052(enwiki)>explain SELECT * FROM revision WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+---------------------+---------+------+------+----------------------------------------------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key                 | key_len | ref  | rows | Extra                                              |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+---------------------+---------+------+------+----------------------------------------------------+
|    1 | SIMPLE      | revision | range | PRIMARY,rev_timestamp,page_timestamp,user_timestamp,page_user_timestamp,rev_page_id | page_user_timestamp | 8       | NULL | 1660 | Using index condition; Using where; Using filesort |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+---------------------+---------+------+------+----------------------------------------------------+

It's starting to look like MariaDB 10.1.33 has a planner bug where multiple candidate indexes cause it to choose a completely stupid plan that wants to scan the whole table using the index and filesort it.

I think what it is trying to do is to use one optimization for sorting when limit is used (priority queue). I will try to see if I can disable that on optimizer_switch.

jcrespo lowered the priority of this task from Unbreak Now! to High.Jun 15 2018, 6:10 PM

I don't think this is unbreak now, only a single combination of values have a performance issue- the functionalitly in general is working.

Vvjjkkii renamed this task from prop=revisions API timing out for a specific user and pages they edited to ltaaaaaaaa.Jul 1 2018, 1:03 AM
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from ltaaaaaaaa to prop=revisions API timing out for a specific user and pages they edited.Jul 1 2018, 8:07 PM
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description. (Show Details)

For what is worth, this is still happening on 10.1.36

root@db1119.eqiad.wmnet[(none)]> show explain for 8203632;
+------+-------------+---------------------+--------+-------------------------------------------------------------------------------------+----------------+---------+-----------------------------------------------+-----------+-----------------------------+
| id   | select_type | table               | type   | possible_keys                                                                       | key            | key_len | ref                                           | rows      | Extra                       |
+------+-------------+---------------------+--------+-------------------------------------------------------------------------------------+----------------+---------+-----------------------------------------------+-----------+-----------------------------+
|    1 | SIMPLE      | revision            | index  | PRIMARY,rev_page_id,page_timestamp,rev_timestamp,user_timestamp,page_user_timestamp | page_timestamp | 20      | NULL                                          | 676646177 | Using where; Using filesort |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                                              | PRIMARY        | 4       | enwiki.revision.rev_id                        |         1 | Using index                 |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                                             | PRIMARY        | 8       | enwiki.temp_rev_comment.revcomment_comment_id |         1 | Using where                 |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                                             | PRIMARY        | 4       | enwiki.revision.rev_page                      |         1 |                             |
+------+-------------+---------------------+--------+-------------------------------------------------------------------------------------+----------------+---------+-----------------------------------------------+-----------+-----------------------------+
4 rows in set, 1 warning (0.01 sec)

Interesting that this doesn't happen on RC slaves, which obviously have a different schema for revision table.

As discussed via email, this bug was fixed on 10.1.37: https://jira.mariadb.org/browse/MDEV-17155 which could end up "fixing" this issue. We'll see once 10.1.37 is released and we can test it with these queries.

Not sure if we're on the new MariaDB that was meant to fix the issue, but the issue apparently still stands:

https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&rvprop=sha1%7Cids%7Ctimestamp%7Ccomment&rvlimit=100&rvstartid=865870092&rvexcludeuser=RLGoodwin&indexpageids=true&redirects=true&titles=Summation

This time I get a full blown "Our servers are currently under maintenance or experiencing a technical problem" HTML response, not the JSON error response I got when I reported this bug.

I'm pretty sure we're not on 10.1.37 yet.

https://downloads.mariadb.org/ 10.1.37 not yet considered stable at the time of writing this. While we could deploy something from the tree, that is a big no for database code (unlike other kinds of code I don't have a problem with doing that) unless there is an unbreak now bug. Once it is officially released I will build it and test it.

Yeah, I am curious to see if 10.1.37 really fixes this :-)

As discussed via email, this bug was fixed on 10.1.37: https://jira.mariadb.org/browse/MDEV-17155 which could end up "fixing" this issue. We'll see once 10.1.37 is released and we can test it with these queries.

Looks like we were indeed hit by this bug. Upgrading to 10.1.37 shows a decent query plan now:
I have tested upgrading db2071 (enwiki) to 10.1.37

10.1.32:

+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+----------+-------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key            | key_len | ref  | rows     | Extra       |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+----------+-------------+
|    1 | SIMPLE      | revision | index | PRIMARY,rev_timestamp,page_timestamp,user_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 20      | NULL | 48743320 | Using where |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+----------+-------------+

Upgrade to 10.1.37:

+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key            | key_len | ref  | rows | Extra       |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+
|    1 | SIMPLE      | revision | range | PRIMARY,rev_timestamp,page_timestamp,user_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 24      | NULL | 1666 | Using where |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+

Looks like this only happens on enwiki (I guess because of the size of the revision table there).
I have upgraded db1080 to 10.1.37:

root@db1080.eqiad.wmnet[enwiki]> select @@hostname; select version();
+------------+
| @@hostname |
+------------+
| db1080     |
+------------+
1 row in set (0.00 sec)

+-----------------+
| version()       |
+-----------------+
| 10.1.33-MariaDB |
+-----------------+
1 row in set (0.00 sec)

root@db1080.eqiad.wmnet[enwiki]> nopager;
PAGER set to stdout
root@db1080.eqiad.wmnet[enwiki]> SELECT * FROM revision WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
^CCtrl-C -- query killed. Continuing normally.
^CCtrl-C -- query killed. Continuing normally.
^CCtrl-C -- exit!
Aborted
(killed after 5 minutes)

root@db1080.eqiad.wmnet[enwiki]> show explain for 12533151048;
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+-----------+-----------------------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key            | key_len | ref  | rows      | Extra                       |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+-----------+-----------------------------+
|    1 | SIMPLE      | revision | index | PRIMARY,rev_timestamp,page_timestamp,user_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 20      | NULL | 717569532 | Using where; Using filesort |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+-----------+-----------------------------+
1 row in set, 1 warning (0.00 sec)

After the upgrade to 10.1.37:

root@db1080.eqiad.wmnet[(none)]> select @@hostname; select version();
+------------+
| @@hostname |
+------------+
| db1080     |
+------------+
1 row in set (0.01 sec)

+-----------------+
| version()       |
+-----------------+
| 10.1.37-MariaDB |
+-----------------+
1 row in set (0.00 sec)

root@db1080.eqiad.wmnet[enwiki]> SELECT * FROM revision WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;

<snip>

101 rows in set (0.11 sec)

root@db1080.eqiad.wmnet[enwiki]> explain SELECT * FROM revision WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key            | key_len | ref  | rows | Extra       |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+
|    1 | SIMPLE      | revision | range | PRIMARY,rev_timestamp,page_timestamp,user_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 24      | NULL | 1666 | Using where |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+

Also happens on eswiki (s7 - the biggest revision table of all those wikis on s7, 44GB) (db1090:3317 - 10.1.33):

root@db1090.eqiad.wmnet[eswiki]> explain SELECT * FROM revision WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+-----------+-------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key            | key_len | ref  | rows      | Extra       |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+-----------+-------------+
|    1 | SIMPLE      | revision | index | PRIMARY,rev_timestamp,page_timestamp,user_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 20      | NULL | 100378501 | Using where |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+-----------+-------------+

And after upgrading db1090:3317, it gets fixed:

root@db1090.eqiad.wmnet[eswiki]> select @@hostname; select version();
+------------+
| @@hostname |
+------------+
| db1090     |
+------------+
1 row in set (0.00 sec)

+-----------------+
| version()       |
+-----------------+
| 10.1.37-MariaDB |
+-----------------+
1 row in set (0.00 sec)

root@db1090.eqiad.wmnet[eswiki]> explain SELECT * FROM revision WHERE (rev_timestamp < '20180613142145' OR rev_timestamp = '20180613142145' AND rev_id <= 845694425) AND rev_page = 6097020 AND NOT(rev_user = 2619640)  ORDER BY rev_timestamp DESC,rev_id DESC LIMIT 101;
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+
| id   | select_type | table    | type  | possible_keys                                                                       | key            | key_len | ref  | rows | Extra       |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+
|    1 | SIMPLE      | revision | range | PRIMARY,rev_timestamp,page_timestamp,user_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 24      | NULL |    9 | Using where |
+------+-------------+----------+-------+-------------------------------------------------------------------------------------+----------------+---------+------+------+-------------+
1 row in set (0.03 sec)
Marostegui claimed this task.

All the core replicas that receive this query are now running > 10.1.36 which doesn't have this optimizer "bug".
The masters aren't running those version, but they are not receiving (or shouldn't be) this queries so this is pretty much solved.