Page MenuHomePhabricator

Slow query ApiQueryRevisions on enwiki
Closed, ResolvedPublic

Description

While troubleshooting T223952: Increased instability in MediaWiki backends (according to load balancers) I noticed the following slow query on enwiki (we don't know if it is related, but it needs fixing anyways, hence this task)

Note the explain and then the real execution plan

1root@db1089.eqiad.wmnet[enwiki]> explain SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2 /* d4b1615c1ca7cc6e58b19616c200b731 db1118 enwiki 4s */;
2+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
3| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
4+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
5| 1 | SIMPLE | revision | index | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | rev_timestamp | 16 | NULL | 12088 | Using where |
6| 1 | SIMPLE | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.revision.rev_page | 1 | |
7| 1 | SIMPLE | temp_rev_comment | ref | PRIMARY,revcomment_rev | PRIMARY | 4 | enwiki.revision.rev_id | 1 | Using index |
8| 1 | SIMPLE | comment_rev_comment | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.temp_rev_comment.revcomment_comment_id | 1 | |
9| 1 | SIMPLE | temp_rev_user | ref | PRIMARY,revactor_rev,actor_timestamp | PRIMARY | 4 | enwiki.revision.rev_id | 1 | Using index |
10| 1 | SIMPLE | actor_rev_user | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.temp_rev_user.revactor_actor | 1 | |
11| 1 | SIMPLE | user | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.actor_rev_user.actor_user | 1 | Using where |
12+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
137 rows in set (0.00 sec)
14
15
16
17
18root@db1089.eqiad.wmnet[enwiki]> show explain for 9049892927;
19+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+
20| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
21+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+
22| 1 | SIMPLE | revision | index | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | rev_timestamp | 16 | const | 817012512 | Using where; Using filesort |
23| 1 | SIMPLE | page | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.revision.rev_page | 1 | |
24| 1 | SIMPLE | temp_rev_comment | ref | PRIMARY,revcomment_rev | PRIMARY | 4 | enwiki.revision.rev_id | 1 | Using index |
25| 1 | SIMPLE | comment_rev_comment | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.temp_rev_comment.revcomment_comment_id | 1 | |
26| 1 | SIMPLE | temp_rev_user | ref | PRIMARY,revactor_rev,actor_timestamp | PRIMARY | 4 | enwiki.revision.rev_id | 1 | Using index |
27| 1 | SIMPLE | actor_rev_user | eq_ref | PRIMARY | PRIMARY | 8 | enwiki.temp_rev_user.revactor_actor | 1 | |
28| 1 | SIMPLE | user | eq_ref | PRIMARY | PRIMARY | 4 | enwiki.actor_rev_user.actor_user | 1 | Using where |
29+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+
307 rows in set, 1 warning (0.01 sec)

The query takes around 8 seconds to run:

2 rows in set (8.43 sec)

That query arriving to the three enwiki API DBs (as expected): db1080, db1118,db1119

When using page_timestamp index, the query flies:

root@db1134.eqiad.wmnet[enwiki]> explain SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2 /* d4b1615c1ca7cc6e58b19616c200b731 db1118 enwiki 4s */;
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
| id   | select_type | table               | type   | possible_keys                                          | key            | key_len | ref                                           | rows   | Extra       |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 4       | const                                         | 151648 | Using where |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.revision.rev_page                      |      1 |             |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY        | 4       | enwiki.revision.rev_id                        |      1 | Using index |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY        | 8       | enwiki.temp_rev_comment.revcomment_comment_id |      1 |             |
|    1 | SIMPLE      | temp_rev_user       | ref    | PRIMARY,revactor_rev,actor_timestamp                   | PRIMARY        | 4       | enwiki.revision.rev_id                        |      1 | Using index |
|    1 | SIMPLE      | actor_rev_user      | eq_ref | PRIMARY                                                | PRIMARY        | 8       | enwiki.temp_rev_user.revactor_actor           |      1 |             |
|    1 | SIMPLE      | user                | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.actor_rev_user.actor_user              |      1 | Using where |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
7 rows in set (0.00 sec)
root@db1134.eqiad.wmnet[enwiki]> SELECT /* ApiQueryRevisions::run */ rev_id, rev_page.....snipped
2 rows in set (0.00 sec)

Event Timeline

For the record, @Anomie has checked the query on db1083 and it works fine there.
I am running an analyze table on db2103 for revision and let's see if the query plan changes once the indexes stats have been updated.

The analyze table didn't help on db2103 so we need to think about different strategy to get this query to run faster:

+-----------------+---------+----------+-----------------------------------------+
| Table           | Op      | Msg_type | Msg_text                                |
+-----------------+---------+----------+-----------------------------------------+
| enwiki.revision | analyze | status   | Engine-independent statistics collected |
| enwiki.revision | analyze | status   | OK                                      |
+-----------------+---------+----------+-----------------------------------------+
2 rows in set (5 hours 17 min 49.98 sec)
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key           | key_len | ref                                           | rows      | Extra                       |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+
|    1 | SIMPLE      | revision            | index  | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | rev_timestamp | 16      | const                                         | 830575333 | Using where; Using filesort |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                | PRIMARY       | 4       | enwiki.revision.rev_page                      |         1 |                             |
|    1 | SIMPLE      | temp_rev_user       | ref    | PRIMARY,revactor_rev,actor_timestamp                   | PRIMARY       | 4       | enwiki.revision.rev_id                        |         1 | Using index                 |
|    1 | SIMPLE      | actor_rev_user      | eq_ref | PRIMARY                                                | PRIMARY       | 8       | enwiki.temp_rev_user.revactor_actor           |         1 |                             |
|    1 | SIMPLE      | user                | eq_ref | PRIMARY                                                | PRIMARY       | 4       | enwiki.actor_rev_user.actor_user              |         1 | Using where                 |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY       | 4       | enwiki.revision.rev_id                        |         1 | Using index                 |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY       | 8       | enwiki.temp_rev_comment.revcomment_comment_id |         1 |                             |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+

It looks very random:
db1083 runs 10.1.37 and the query runs fine and db1080 runs 10.1.38 and the query runs badly.
It also works fine on db1134 (a recently new host which runs 10.1.39 and which was cloned from db1083 and works bad on db1089 (10.1.37) and db1106 (10.1.38), it seems very random....
The config is the same between db1083 and db1080.

This is the index the explain reports as being chosen on all the s1 hosts:

dbstore1003.eqiad.wmnet:3311
rev_timestamp
db2116.codfw.wmnet:3306
rev_timestamp
db2112.codfw.wmnet:3306
rev_timestamp
db2103.codfw.wmnet:3306
rev_timestamp
db2092.codfw.wmnet:3306
rev_timestamp
db2088.codfw.wmnet:3311
page_timestamp
db2085.codfw.wmnet:3311
page_timestamp
db2072.codfw.wmnet:3306
rev_timestamp
db2071.codfw.wmnet:3306
rev_timestamp
db2070.codfw.wmnet:3306
rev_timestamp
db2062.codfw.wmnet:3306
rev_timestamp
db2055.codfw.wmnet:3306
rev_timestamp
db2048.codfw.wmnet:3306
rev_timestamp
db1139.eqiad.wmnet:3311
rev_timestamp
db1134.eqiad.wmnet:3306
page_timestamp
db1119.eqiad.wmnet:3306
rev_timestamp
db1118.eqiad.wmnet:3306
rev_timestamp
db1106.eqiad.wmnet:3306
rev_timestamp
db1105.eqiad.wmnet:3311
page_timestamp
db1099.eqiad.wmnet:3311
page_timestamp
db1089.eqiad.wmnet:3306
rev_timestamp
db1083.eqiad.wmnet:3306
page_timestamp
db1080.eqiad.wmnet:3306
rev_timestamp
db1067.eqiad.wmnet:3306
rev_timestamp

I am running an: alter table revision engine=InnoDB, force; on db2103 to see if rebuilding the table makes any difference.

Change 511804 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1118 from API, pool db1134

https://gerrit.wikimedia.org/r/511804

Change 511804 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1118 from API, pool db1134

https://gerrit.wikimedia.org/r/511804

Mentioned in SAL (#wikimedia-operations) [2019-05-22T05:09:05Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Remove db1118 from s1 api and pool db1134 instead T224017 (duration: 00m 57s)

Rebuilding the table worked and now the query plan has changed to page_timestamp

root@db2103.codfw.wmnet[enwiki]> alter table revision engine=InnoDB, force;
Query OK, 0 rows affected (13 hours 52 min 47.92 sec)
Records: 0  Duplicates: 0  Warnings: 0

root@db2103.codfw.wmnet[enwiki]>
root@db2103.codfw.wmnet[enwiki]> explain  SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2 /* d4b1615c1ca7cc6e58b19616c200b731 db1118 enwiki 4s */;
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
| id   | select_type | table               | type   | possible_keys                                          | key            | key_len | ref                                           | rows   | Extra       |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 4       | const                                         | 153670 | Using where |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.revision.rev_page                      |      1 |             |
|    1 | SIMPLE      | temp_rev_user       | ref    | PRIMARY,revactor_rev,actor_timestamp                   | PRIMARY        | 4       | enwiki.revision.rev_id                        |      1 | Using index |
|    1 | SIMPLE      | actor_rev_user      | eq_ref | PRIMARY                                                | PRIMARY        | 8       | enwiki.temp_rev_user.revactor_actor           |      1 |             |
|    1 | SIMPLE      | user                | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.actor_rev_user.actor_user              |      1 | Using where |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY        | 4       | enwiki.revision.rev_id                        |      1 | Using index |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY        | 8       | enwiki.temp_rev_comment.revcomment_comment_id |      1 |             |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
7 rows in set (0.04 sec)

And now the query is almost instant:

root@db2103.codfw.wmnet[enwiki]> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
root@db2103.codfw.wmnet[enwiki]> SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2 /* d4b1615c1ca7cc6e58b19616c200b731 db1118 enwiki 4s */;
2 rows in set (0.03 sec)

I am going to go ahead and rebuild the table on the enwiki s1 api hosts, but ideally this should happen on all enwiki hosts really.

Change 511860 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1080

https://gerrit.wikimedia.org/r/511860

Marostegui triaged this task as High priority.
Marostegui edited projects, added DBA; removed User-Marostegui.
Marostegui moved this task from Triage to In progress on the DBA board.

Change 511860 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1080

https://gerrit.wikimedia.org/r/511860

Mentioned in SAL (#wikimedia-operations) [2019-05-22T12:29:27Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1080 to rebuild revision table T224017 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2019-05-22T12:34:01Z] <marostegui> Stop replication on db1080 to rebuild revision table - T224017

Mentioned in SAL (#wikimedia-operations) [2019-05-22T12:39:17Z] <marostegui> Stop replication on db2048 (s1 codfw master) to rebuild revision table - this will generate lag on codfw - T224017

@Anomie bad news again, db1080 didn't change its query plan after rebuilding the table - I am now running an analyze there again just to have the same combination of things we ran on db2103 and that made the query plan to change and pick the right index.

@Anomie db1080 didn't change its query plan after: rebuilding the table, analyzing the table, stopping and starting mysql (which is odd, as it worked fine on db2103, so the optimizer is definitely being silly)
I am afraid we might need to go for IGNORE INDEX for rev_timestamp?

root@db1080.eqiad.wmnet[enwiki]> explain SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name FROM `revision` JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2 /* d4b1615c1ca7cc6e58b19616c200b731 db1118 enwiki 4s */;
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
| id   | select_type | table               | type   | possible_keys                                          | key           | key_len | ref                                           | rows  | Extra       |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
|    1 | SIMPLE      | revision            | index  | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | rev_timestamp | 16      | NULL                                          | 10369 | Using where |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                | PRIMARY       | 4       | enwiki.revision.rev_page                      |     1 |             |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY       | 4       | enwiki.revision.rev_id                        |     1 | Using index |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY       | 8       | enwiki.temp_rev_comment.revcomment_comment_id |     1 |             |
|    1 | SIMPLE      | temp_rev_user       | ref    | PRIMARY,revactor_rev,actor_timestamp                   | PRIMARY       | 4       | enwiki.revision.rev_id                        |     1 | Using index |
|    1 | SIMPLE      | actor_rev_user      | eq_ref | PRIMARY                                                | PRIMARY       | 8       | enwiki.temp_rev_user.revactor_actor           |     1 |             |
|    1 | SIMPLE      | user                | eq_ref | PRIMARY                                                | PRIMARY       | 4       | enwiki.actor_rev_user.actor_user              |     1 | Using where |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-------+-------------+
7 rows in set (0.00 sec)

This is a show explain

+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key           | key_len | ref                                           | rows      | Extra                       |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+
|    1 | SIMPLE      | revision            | index  | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | rev_timestamp | 16      | const                                         | 830807594 | Using where; Using filesort |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                | PRIMARY       | 4       | enwiki.revision.rev_page                      |         1 |                             |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY       | 4       | enwiki.revision.rev_id                        |         1 | Using index                 |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY       | 8       | enwiki.temp_rev_comment.revcomment_comment_id |         1 |                             |
|    1 | SIMPLE      | temp_rev_user       | ref    | PRIMARY,revactor_rev,actor_timestamp                   | PRIMARY       | 4       | enwiki.revision.rev_id                        |         1 | Using index                 |
|    1 | SIMPLE      | actor_rev_user      | eq_ref | PRIMARY                                                | PRIMARY       | 8       | enwiki.temp_rev_user.revactor_actor           |         1 |                             |
|    1 | SIMPLE      | user                | eq_ref | PRIMARY                                                | PRIMARY       | 4       | enwiki.actor_rev_user.actor_user              |         1 | Using where                 |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-----------------------------+

Change 512106 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1080

https://gerrit.wikimedia.org/r/512106

With the IGNORE INDEX it works instantly:

root@db1080.eqiad.wmnet[enwiki]> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
root@db1080.eqiad.wmnet[enwiki]> SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name  FROM `revision` IGNORE INDEX (rev_timestamp) JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2 /* d4b1615c1ca7cc6e58b19616c200b731 db1118 enwiki 4s */;
2 rows in set (0.00 sec)

Change 512106 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1080

https://gerrit.wikimedia.org/r/512106

I got slightly nerd-sniped into this and had a look. First I found that the optimizer choose a full table scan even without joins, as soon as we select a field from the revision table that is not included in the index.

So for example this query is immediate:

MariaDB [enwiki]> SELECT rev_id, rev_page, rev_timestamp FROM `revision` WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;
2 rows in set (0.00 sec)

While this one shows the same issue:

MariaDB [enwiki]> SELECT rev_id, rev_page, rev_timestamp, rev_minor_edit FROM `revision` WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;
2 rows in set (5.61 sec)

and its explain for is:

MariaDB [(none)]> show explain for 29769;
+------+-------------+----------+-------+------------------------------------------------+---------------+---------+-------+-----------+-----------------------------+
| id   | select_type | table    | type  | possible_keys                                  | key           | key_len | ref   | rows      | Extra                       |
+------+-------------+----------+-------+------------------------------------------------+---------------+---------+-------+-----------+-----------------------------+
|    1 | SIMPLE      | revision | index | page_timestamp,page_user_timestamp,rev_page_id | rev_timestamp | 16      | const | 830807594 | Using where; Using filesort |
+------+-------------+----------+-------+------------------------------------------------+---------------+---------+-------+-----------+-----------------------------+

Also removing the order by rev_id DESC doesn't change the results.

Instead of realizing that there is a limit that it would be so much better to select on the index and then go read the table for the selected elements.

Then I thought to count how many rows that ID matches, and I got:

MariaDB [enwiki]> select count(*) from revision where rev_page = '35159700';
+----------+
| count(*) |
+----------+
|    80040 |
+----------+

So I start thinking that this might be related to the return set. I tried other IDs that have much less revisions and the query works fine. I checked with @Marostegui and seems that we got only this specific ID failing, so it might explain that the issue is only when there are many revisions.

I've also had a theory that this might be related to the number of approximate rows reported by the explain in the different hosts, given that the first bits of the count and the number of rows are similar, see P8557. But together with @Marostegui so far we couldn't confirm this theory.

[EDIT] note that the Paste above was done with another query that is a full table scan by definition everywhere

So I start thinking that this might be related to the return set. I tried other IDs that have much less revisions and the query works fine. I checked with @Marostegui and seems that we got only this specific ID failing, so it might explain that the issue is only when there are many revisions.

I've also had a theory that this might be related to the number of approximate rows reported by the explain in the different hosts, given that the first bits of the count and the number of rows are similar, see P8557. But together with @Marostegui so far we couldn't confirm this theory.

The theory doesn't match I have seen, as:
db1118 report less examined rows but still has the bad query plan
db2103 has more rows examined but works fine
db1083 behaves the same way

By looking at the query reports, it is always the same 'rev_page'= '35159700' this query doesn't seem to get reported as slow for any other value so far.
The mystery still remains about why some hosts (just a few) have the good query plan and others (the majority doesn't).
Also why one of them changed the query plan after rebuilding the table and the other one didn't.
I am putting up a bug to file to mariadb

Some more details:
'rev_page'= '35159700' is https://en.wikipedia.org/wiki/User:Cyberpower678/RfX_Report and that page has around 80k revisions

root@db2103.codfw.wmnet[enwiki]> select count(*) from revision where rev_page='35159700';
+----------+
| count(*) |
+----------+
|    80040 |
+----------+
1 row in set (0.07 sec)

Also:
https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/User:Cyberpower678/RfX_Report

To add more randomness: db2062 (10.1.38) finished its table rebuilt and now it uses page_timestamp instead of rev_timestmap as it used to, so the query goes fine.

root@db2062.codfw.wmnet[enwiki]> explain SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name  FROM `revision`  JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
| id   | select_type | table               | type   | possible_keys                                          | key            | key_len | ref                                           | rows   | Extra       |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
|    1 | SIMPLE      | revision            | ref    | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | page_timestamp | 4       | const                                         | 151376 | Using where |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.revision.rev_page                      |      1 |             |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY        | 4       | enwiki.revision.rev_id                        |      1 | Using index |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY        | 8       | enwiki.temp_rev_comment.revcomment_comment_id |      1 |             |
|    1 | SIMPLE      | temp_rev_user       | ref    | PRIMARY,revactor_rev,actor_timestamp                   | PRIMARY        | 4       | enwiki.revision.rev_id                        |      1 | Using index |
|    1 | SIMPLE      | actor_rev_user      | eq_ref | PRIMARY                                                | PRIMARY        | 8       | enwiki.temp_rev_user.revactor_actor           |      1 |             |
|    1 | SIMPLE      | user                | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.actor_rev_user.actor_user              |      1 | Using where |
+------+-------------+---------------------+--------+--------------------------------------------------------+----------------+---------+-----------------------------------------------+--------+-------------+
7 rows in set (0.04 sec)

root@db2062.codfw.wmnet[enwiki]> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
root@db2062.codfw.wmnet[enwiki]> SELECT /* ApiQueryRevisions::run */ rev_id, rev_page, rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1, comment_rev_comment.comment_text AS `rev_comment_text`, comment_rev_comment.comment_data AS `rev_comment_data`, comment_rev_comment.comment_id AS `rev_comment_cid`, actor_rev_user.actor_user AS `rev_user`, actor_rev_user.actor_name AS `rev_user_text`, temp_rev_user.revactor_actor AS `rev_actor`, page_namespace, page_title, page_id, page_latest, page_is_redirect, page_len, user_name  FROM `revision`  JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `page` ON ((page_id = rev_page)) LEFT JOIN `user` ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '35159700' ORDER BY rev_timestamp DESC, rev_id DESC LIMIT 2;
2 rows in set (0.04 sec)

db2055 finished its rebuilt and its query plan changed to page_timestamp (10.1.38)
db2048 finished too and its query plan remains on rev_timestamp (10.1.37)

db2071 finished its rebuilt and its query plan changed to page_timestamp (10.1.38)
db2092 finished its rebuilt and its query plan changed to page_timestamp (10.1.38)

Change 512191 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] API: Add IGNORE INDEX to avoid bad plan in ApiQueryRevisions

https://gerrit.wikimedia.org/r/512191

db2112 finished its rebuilt and its query plan remained on rev_timestamp (10.1.39)
db2085:311 finished its rebuilt and its query plan changed to page_timestamp (10.1.38).

To sum up all the hosts that got the table rebuilt, their version and the result:

db1080 (10.1.38) plan remained on rev_timestamp (later upgraded to 10.1.39 and nothing changed)
db2048 (10.1.37) plan remained on rev_timestamp
db2055 (10.1.38) plan changed to page_timestamp
db2071 (10.1.38) plan changed to page_timestamp
db2085:3311 (10.1.38) plan changed to page_timestamp
db2092 (10.1.38) plan changed to page_timestamp
db2103 (10.1.39) plan changed to page_timestmap
db2112 (10.1.39) plan remained on rev_timestamp

Change 512191 merged by jenkins-bot:
[mediawiki/core@master] API: Add IGNORE INDEX to avoid bad plan in ApiQueryRevisions

https://gerrit.wikimedia.org/r/512191

I have been talking to some MariaDB people that work on the optimizer here at Percona Live, and they've given me some ideas to test.
They are mostly related to analyze persistent and to play around with optimizer_use_condition_selectivity.
The optimizer variable looks good, but obviously they might trigger unexpected query plan changes (generally, MariaDB believes that will be better, but it is not guaranteed). So I want to play around with them and see how they impact that query. However, I discard doing changes globally as the impact might be unexpected and it is not worth risking it just to save this (or some other few queries like T71222#4998012).

There are lots of improvements on the optimizer on 10.4 (aside from the optimizer trace itself which will be a great debugging tool) and from my conversations here with them it looks like they are not going to do any bugfixing or backporting stuff to 10.1 in regards with the optimizer (unless there is some VERY critical things).
So I will play around with db2112.

This will of course get fixed on code with the merged patch at T224017#5211575

(aside from the optimizer trace itself which will be a great debugging tool)

!!

I look forward to seeing that in action, it sounds like it could be extremely useful for these sorts of bugs.

(aside from the optimizer trace itself which will be a great debugging tool)

!!

I look forward to seeing that in action, it sounds like it could be extremely useful for these sorts of bugs.

Yeah, but it is on 10.4 :(

Change 512191 merged by jenkins-bot:
[mediawiki/core@master] API: Add IGNORE INDEX to avoid bad plan in ApiQueryRevisions

https://gerrit.wikimedia.org/r/512191

@Anomie this is live now, right? According to: https://tools.wmflabs.org/versions/

@Anomie this is live now, right? According to: https://tools.wmflabs.org/versions/

Yes, it is. Checking the logs on mwlog1001, I see the messages in hhvm.log about this query stopped at the same time as the SAL entry for all wikis going to wmf.7.

Marostegui added a project: User-Marostegui.

Yeah, I cannot see it on tendril either.
Closing this then, I will play a bit with the optimizer and report back the results just for posterity.