Page MenuHomePhabricator

Special:ProtectedPages times out on enwiki for Module namespace
Closed, ResolvedPublicPRODUCTION ERROR

Description

Selecting the Module namespace (828) on Special:ProtectedPages (https://en.wikipedia.org/wiki/Special:ProtectedPages?namespace=828) times out on enwiki. Other namespaces work fine, including 10, and the search works elsewhere on a random smattering of wikis (meta, de.wikipedia, fr.wikiquote, he.wikipedia).

Error message is

PHP fatal error:
entire web request took longer than 60 seconds and timed out

Event Timeline

I believe this is the query, which is too slow and gets killed by the query killer:

root@db1106.eqiad.wmnet[enwiki]> explain SELECT /* IndexPager::buildQueryInfo (ProtectedPagesPager) */ pr_id, page_namespace, page_title, page_len, pr_type, pr_level, pr_expiry, pr_cascade, log_timestamp, log_deleted, comment_log_comment.comment_text AS `log_comment_text`, comment_log_comment.comment_data AS `log_comment_data`, comment_log_comment.comment_id AS `log_comment_cid`, log_user, log_user_text, NULL AS `log_actor` FROM `page`, `page_restrictions` LEFT JOIN `log_search` ON (ls_field = 'pr_id' AND (ls_value = pr_id)) LEFT JOIN (`logging` JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id))) ON ((ls_log_id = log_id)) WHERE (pr_expiry > '20190214194211' OR pr_expiry IS NULL) AND (page_id=pr_page) AND (pr_type='edit') AND (page_namespace='828') ORDER BY pr_id LIMIT 101 ;
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+--------+---------------------------------+
| id   | select_type | table               | type   | possible_keys                | key        | key_len | ref                           | rows   | Extra                           |
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+--------+---------------------------------+
|    1 | SIMPLE      | page                | ref    | PRIMARY,name_title           | name_title | 4       | const                         |  25438 | Using temporary; Using filesort |
|    1 | SIMPLE      | page_restrictions   | eq_ref | PRIMARY,pr_page,pr_typelevel | PRIMARY    | 261     | enwiki.page.page_id,const     |      1 | Using where                     |
|    1 | SIMPLE      | log_search          | ref    | PRIMARY                      | PRIMARY    | 34      | const                         | 293280 | Using where; Using index        |
|    1 | SIMPLE      | logging             | eq_ref | PRIMARY                      | PRIMARY    | 4       | enwiki.log_search.ls_log_id   |      1 | Using where                     |
|    1 | SIMPLE      | comment_log_comment | eq_ref | PRIMARY                      | PRIMARY    | 8       | enwiki.logging.log_comment_id |      1 |                                 |
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+--------+---------------------------------+
5 rows in set (0.00 sec)

The query itself takes almost 2 minutes to finish:

101 rows in set (1 min 53.52 sec)

Is this a new query? Did something get release today?

db1106 also happens to be the only 10.1.38 server, could it be also a regression on MariaDB?

Discarded, the plan is the same on db1089 (10.1.37) and db1119 (10.1.36).

A new version of MediaWiki was deployed to enwiki at 20:14 UTC today, but I don't see any changes in the code related to making this query. Do we know that this worked previously, or is it possible that no one tried using this special page with enwiki's Module namespace recently?

Comparing with the same query for a different namespace:

wikiadmin@10.64.32.115(enwiki)> explain SELECT /* IndexPager::buildQueryInfo (ProtectedPagesPager) */ pr_id, page_namespace, page_title, page_len, pr_type, pr_level, pr_expiry, pr_cascade, log_timestamp, log_deleted, comment_log_comment.comment_text AS `log_comment_text`, comment_log_comment.comment_data AS `log_comment_data`, comment_log_comment.comment_id AS `log_comment_cid`, log_user, log_user_text, NULL AS `log_actor` FROM `page`, `page_restrictions` LEFT JOIN `log_search` ON (ls_field = 'pr_id' AND (ls_value = pr_id)) LEFT JOIN (`logging` JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id))) ON ((ls_log_id = log_id)) WHERE (pr_expiry > '20190214194211' OR pr_expiry IS NULL) AND (page_id=pr_page) AND (pr_type='edit') AND (page_namespace='10') ORDER BY pr_id LIMIT 101 ;
+------+-------------+---------------------+--------+------------------------------+---------+---------+----------------------------------+--------+--------------------------+
| id   | select_type | table               | type   | possible_keys                | key     | key_len | ref                              | rows   | Extra                    |
+------+-------------+---------------------+--------+------------------------------+---------+---------+----------------------------------+--------+--------------------------+
|    1 | SIMPLE      | page_restrictions   | index  | PRIMARY,pr_page,pr_typelevel | pr_id   | 4       | NULL                             |      2 | Using where              |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY,name_title           | PRIMARY | 4       | enwiki.page_restrictions.pr_page |      1 | Using where              |
|    1 | SIMPLE      | log_search          | ref    | PRIMARY                      | PRIMARY | 34      | const                            | 409388 | Using where; Using index |
|    1 | SIMPLE      | logging             | eq_ref | PRIMARY                      | PRIMARY | 4       | enwiki.log_search.ls_log_id      |      1 | Using where              |
|    1 | SIMPLE      | comment_log_comment | eq_ref | PRIMARY                      | PRIMARY | 8       | enwiki.logging.log_comment_id    |      1 |                          |
+------+-------------+---------------------+--------+------------------------------+---------+---------+----------------------------------+--------+--------------------------+

That makes a bit more sense, at least there's no filesort. And if we force that plan for the Module namespace, it runs much quicker:

wikiadmin@10.64.32.115(enwiki)> explain SELECT /* IndexPager::buildQueryInfo (ProtectedPagesPager) */ pr_id, page_namespace, page_title, page_len, pr_type, pr_level, pr_expiry, pr_cascade, log_timestamp, log_deleted, comment_log_comment.comment_text AS `log_comment_text`, comment_log_comment.comment_data AS `log_comment_data`, comment_log_comment.comment_id AS `log_comment_cid`, log_user, log_user_text, NULL AS `log_actor` FROM `page`, `page_restrictions` force index (pr_id) LEFT JOIN `log_search` ON (ls_field = 'pr_id' AND (ls_value = pr_id)) LEFT JOIN (`logging` JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id))) ON ((ls_log_id = log_id)) WHERE (pr_expiry > '20190214194211' OR pr_expiry IS NULL) AND (page_id=pr_page) AND (pr_type='edit') AND (page_namespace='828') ORDER BY pr_id LIMIT 101 ;
+------+-------------+---------------------+--------+--------------------+---------+---------+----------------------------------+--------+--------------------------+
| id   | select_type | table               | type   | possible_keys      | key     | key_len | ref                              | rows   | Extra                    |
+------+-------------+---------------------+--------+--------------------+---------+---------+----------------------------------+--------+--------------------------+
|    1 | SIMPLE      | page_restrictions   | index  | NULL               | pr_id   | 4       | NULL                             |      1 | Using where              |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY,name_title | PRIMARY | 4       | enwiki.page_restrictions.pr_page |      1 | Using where              |
|    1 | SIMPLE      | log_search          | ref    | PRIMARY            | PRIMARY | 34      | const                            | 409388 | Using where; Using index |
|    1 | SIMPLE      | logging             | eq_ref | PRIMARY            | PRIMARY | 4       | enwiki.log_search.ls_log_id      |      1 | Using where              |
|    1 | SIMPLE      | comment_log_comment | eq_ref | PRIMARY            | PRIMARY | 8       | enwiki.logging.log_comment_id    |      1 |                          |
+------+-------------+---------------------+--------+--------------------+---------+---------+----------------------------------+--------+--------------------------+

[...]

101 rows in set (13.91 sec)

Although 14 seconds still isn't all that great.


Looking back at the original query,

wikiadmin@10.64.32.115(enwiki)> show status like 'Hand%';
+----------------------------+-----------+
| Variable_name              | Value     |
+----------------------------+-----------+
| Handler_commit             | 2         |
| Handler_delete             | 0         |
| Handler_discover           | 0         |
| Handler_external_lock      | 0         |
| Handler_icp_attempts       | 0         |
| Handler_icp_match          | 0         |
| Handler_mrr_init           | 0         |
| Handler_mrr_key_refills    | 0         |
| Handler_mrr_rowid_refills  | 0         |
| Handler_prepare            | 0         |
| Handler_read_first         | 0         |
| Handler_read_key           | 13137     |
| Handler_read_last          | 0         |
| Handler_read_next          | 319549197 |
| Handler_read_prev          | 0         |
| Handler_read_retry         | 0         |
| Handler_read_rnd           | 101       |
| Handler_read_rnd_deleted   | 0         |
| Handler_read_rnd_next      | 1382      |
| Handler_rollback           | 0         |
| Handler_savepoint          | 0         |
| Handler_savepoint_rollback | 0         |
| Handler_tmp_update         | 0         |
| Handler_tmp_write          | 1379      |
| Handler_update             | 0         |
| Handler_write              | 0         |
+----------------------------+-----------+

319549197 rows for Handler_read_next is crazy, there are only 10005 rows in page that match page_namespace = 828, and only 1246 in page + page_restrictions that match the whole WHERE clause. And only 1146 rows in log_search that pass the JOIN for that table.

Hmm. The log_search line says key_len is 34, so maybe it's taking the 1246 rows from page + page_restrictions and doing a cross product with the 256453 rows from log_search with ls_field = 'pr_id', and only then filtering by ls_value = pr_id? That would be 319540438 rows.

What happens if we explicitly CAST(pr_id AS BINARY)? Will that make it use the whole index?

wikiadmin@10.64.32.115(enwiki)> explain SELECT /* IndexPager::buildQueryInfo (ProtectedPagesPager) */ pr_id, page_namespace, page_title, page_len, pr_type, pr_level, pr_expiry, pr_cascade, log_timestamp, log_deleted, comment_log_comment.comment_text AS `log_comment_text`, comment_log_comment.comment_data AS `log_comment_data`, comment_log_comment.comment_id AS `log_comment_cid`, log_user, log_user_text, NULL AS `log_actor` FROM `page`, `page_restrictions` LEFT JOIN `log_search` ON (ls_field = 'pr_id' AND (ls_value = CAST(pr_id AS binary))) LEFT JOIN (`logging` JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id))) ON ((ls_log_id = log_id)) WHERE (pr_expiry > '20190214194211' OR pr_expiry IS NULL) AND (page_id=pr_page) AND (pr_type='edit') AND (page_namespace='828') ORDER BY pr_id LIMIT 101 ;
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+-------+---------------------------------+
| id   | select_type | table               | type   | possible_keys                | key        | key_len | ref                           | rows  | Extra                           |
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+-------+---------------------------------+
|    1 | SIMPLE      | page                | ref    | PRIMARY,name_title           | name_title | 4       | const                         | 23700 | Using temporary; Using filesort |
|    1 | SIMPLE      | page_restrictions   | eq_ref | PRIMARY,pr_page,pr_typelevel | PRIMARY    | 261     | enwiki.page.page_id,const     |     1 | Using where                     |
|    1 | SIMPLE      | log_search          | ref    | PRIMARY                      | PRIMARY    | 291     | const,func                    |     1 | Using where; Using index        |
|    1 | SIMPLE      | logging             | eq_ref | PRIMARY                      | PRIMARY    | 4       | enwiki.log_search.ls_log_id   |     1 | Using where                     |
|    1 | SIMPLE      | comment_log_comment | eq_ref | PRIMARY                      | PRIMARY    | 8       | enwiki.logging.log_comment_id |     1 |                                 |
+------+-------------+---------------------+--------+------------------------------+------------+---------+-------------------------------+-------+---------------------------------+

Yes, it does. How fast does that run?

101 rows in set (0.15 sec)

Wow! I guess relying on MySQL's auto-converting the integer pr_id to compare to the varbinary(255) ls_value was really biting us there, preventing it from fully using the index for the join. We might have to review that, see if any other query using $db->buildStringCast() would break if we made it actually cast in DatabaseMysqlBase (like it does for other databases).

wikiadmin@10.64.16.13(enwiki)> show status like 'Hand%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 13138 |
| Handler_read_last          | 0     |
| Handler_read_next          | 11152 |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 101   |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 1382  |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 1379  |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

That looks a lot more like it too, even with the filesort.

So Anomie did almost all the work, I was just going to say that it is slow for the 10 namespace too, just that it takes 10 seconds so it finishes, but bad equally. This is most likely to be not new issue, offsetted by the SSD hosts + the probable rareness of the call + query killer saving us.

Marostegui lowered the priority of this task from High to Medium.Feb 15 2019, 6:49 AM

Another brilliant analysis from @Anomie :-)
(Decreasing priority as this doesn't seem to happen very often as per: https://logstash.wikimedia.org/goto/4854d6d92b272ad88d23696570c7dad6)

Anomie claimed this task.

rMW7fbd971f2d8e: Database: Have buildStringCast() actually cast for MySQL, MSSQL should go out with wmf.19 next week, and should resolve this timeout. Feel free to reopen if that turns out not to be the case.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:07 PM