Page MenuHomePhabricator

Databases overflown with connections due to slow query on Special:AllPages
Closed, ResolvedPublic

Description

Query likely responsible:

Hits 	Tmax 	Tavg 	Tsum 	Hosts 	Users 	Schemas
3389	325	135	459,737	db1051, db1055, db1066, db1072, db1073, db1079, db1080, db1083, db1086, db1089	wikiuser	enwiki, eswiki
SELECT /* SpecialAllPages::showChunk */ page_title FROM `page` WHERE page_namespace = '0' AND page_is_redirect = '0' AND (page_title < 'Entomacrodus_stellifer') ORDER BY page_title DESC LIMIT 344, 1 /* bff89ebd2be06faa4ae5c18dcc972d02 db1079 eswiki 5s */

Event Timeline

jcrespo created this task.Mar 20 2017, 2:50 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 20 2017, 2:50 PM
hashar added a subscriber: hashar.Mar 20 2017, 2:51 PM

Actions I have done after SWAT:

14:22	<hashar@tin>	Synchronized php-1.29.0-wmf.16/includes/specials/SpecialWatchlist.php: reverts commit SpecialWatchlist.php 0d675d29d925bda7b42958f99b5e3cac3f78c8a3 (duration: 00m 43s)

The change is probably harmless but got deployed about the time of the issue started. So safe action is to revert first and think later.

14:36 <hashar> Disabled Special:AllPages output from all wikis via https://gerrit.wikimedia.org/r/#/c/343647/1/includes/specials/SpecialAllPages.php

That was following discussion about SpecialAllPages::showChunk() overloading the data. The patch does not disable the api endpoint eg https://en.wikipedia.org/w/api.php?action=query&format=json&list=allpages&apprefix=E&apnamespace=14

root@db1083[enwiki]> EXPLAIN SELECT /* SpecialAllPages::showChunk */ page_title FROM `page` WHERE page_namespace = '0' AND page_is_redirect = '0' AND (page_title < 'Entomacrodus_stellifer') ORDER BY page_title DESC LIMIT 344, 1 /* bff89ebd2be06faa4ae5c18dcc972d02 db1079 eswiki 5s */\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: ref
possible_keys: name_title,page_redirect_namespace_len
          key: name_title
      key_len: 4
          ref: const
         rows: 9913258
        Extra: Using where

The query in action:

root@db1083[enwiki]> FLUSH STATUS; SELECT /* SpecialAllPages::showChunk */ page_title FROM `page` WHERE page_namespace = '0' AND page_is_redirect = '0' AND (page_title < 'Entomacrodus_stellifer') ORDER BY page_title DESC LIMIT 344, 1;
Query OK, 0 rows affected (0.00 sec)

+----------------------+
| page_title           |
+----------------------+
| Enterprise_of_Ulster |
+----------------------+
1 row in set (23.89 sec)

root@db1083[enwiki]> SHOW STATUS like 'Hand%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 0       |
| Handler_icp_attempts       | 0       |
| Handler_icp_match          | 0       |
| Handler_mrr_init           | 0       |
| Handler_mrr_key_refills    | 0       |
| Handler_mrr_rowid_refills  | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 0       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 0       |
| Handler_read_prev          | 9223884 |
| Handler_read_retry         | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_deleted   | 0       |
| Handler_read_rnd_next      | 0       |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_tmp_update         | 0       |
| Handler_tmp_write          | 0       |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+
26 rows in set (0.01 sec)

My honest opinion is that such a query, with the table's current structure, is not possible and should not be allowed (as it queries *potentially* millions of records (it could be 1 or it could be almost all). There is no an appropriate index fur such a query.

Compare instead, with the same query without the redirect check, which can use the indexes efficiently:

root@db1083[enwiki]> SELECT /* SpecialAllPages::showChunk */ page_title FROM `page` WHERE page_namespace = '0' AND (page_title > 'EFGaming') ORDER BY page_title DESC LIMIT 344, 1;
+------------+
| page_title |
+------------+
| <U+1F62A>           |
+------------+
1 row in set (0.00 sec)

I would disable the possibility of not hiding redirects until an appropriate index is added.

Reedy added a subscriber: Reedy.Mar 20 2017, 3:44 PM

Should we be disabling the redirect filtering in the API too at the same time?

https://en.wikipedia.org/w/api.php?action=help&modules=query%2Ballpages

Reedy added a comment.Mar 20 2017, 4:09 PM

T160920 filed for a way to throttle simultaneous requests to potentially expensive, non cached Special Pages

Anomie added a subscriber: Anomie.Mar 20 2017, 4:15 PM

It looks like another case where the planner is making a strange decision. It should be able to use the name_title btree index to satisfy this particular query while examining only 767 rows, but instead it really does examine 2 million:

mysql:wikiadmin@db1079 [eswiki]> select page_title from page where page_namespace='0' and page_is_redirect='0' and page_title < 'Entomacrodus_stellifer' ORDER BY page_title DESC limit 344, 1;
+-------------------+
| page_title        |
+-------------------+
| Ensifera_ensifera |
+-------------------+
1 row in set (5.83 sec)

mysql:wikiadmin@db1079 [eswiki]> SHOW STATUS like 'Hand%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 0       |
| Handler_icp_attempts       | 0       |
| Handler_icp_match          | 0       |
| Handler_mrr_init           | 0       |
| Handler_mrr_key_refills    | 0       |
| Handler_mrr_rowid_refills  | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 0       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 0       |
| Handler_read_prev          | 2007451 |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_deleted   | 0       |
| Handler_read_rnd_next      | 143     |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_tmp_update         | 0       |
| Handler_tmp_write          | 141     |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+

If we change the query slightly, it suddenly does only examine the correct number of rows:

mysql:wikiadmin@db1079 [eswiki]> select page_title from page where page_namespace='0' and page_is_redirect='0' and page_title < 'Entomacrodus_stellifer' AND page_title >= 'E' ORDER BY page_title DESC limit 344, 1;
+-------------------+
| page_title        |
+-------------------+
| Ensifera_ensifera |
+-------------------+
1 row in set (0.01 sec)

mysql:wikiadmin@db1079 [eswiki]> SHOW STATUS like 'Hand%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 766   |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 143   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 141   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+

The plan is also slightly different:

mysql:wikiadmin@db1079 [eswiki]> explain select page_title from page where page_namespace='0' and page_is_redirect='0' and page_title < 'Entomacrodus_stellifer' AND page_title >= 'E' ORDER BY page_title DESC limit 344, 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: range
possible_keys: name_title,page_redirect_namespace_len
          key: name_title
      key_len: 261
          ref: NULL
         rows: 282528
        Extra: Using where

Comparing this with the previous plan, apparently in the bad case it's deciding it needs to only use the page_namespace part of the index to fetch every row in the namespace and then filter on both page_title and page_is_redirect (thus the "key_len: 4"), while with the modified query it correctly uses the whole index ("key_len: 261") and filters only by page_is_redirect, stopping once it has found enough matching rows to meet the limit. I note that changing it to LIMIT 345 (i.e. removing the offset) or even LIMIT 1 has the same bad plan, while removing the page_is_redirect='0' condition somehow unconfuses it. Changing page_is_redirect to page_is_new also has the bad plan, so it seems unlikely to be related to the existence of the page_redirect_namespace_len index.

Also, BTW, changing the < to > and changing the ordering to ASC seems to unconfuse it for page_is_redirect but not page_is_new for some reason.

I wonder whether any of the changes described in https://mariadb.com/kb/en/mariadb/improvements-to-order-by/ might help us with some of these weird planner things we've been running into lately.

@Anomie - that is not right. The table works well in some cases, maybe most cases, but it still does not work because of the page_is_redirect. This surfaced when *all* links were scrapped.

root@db1083[enwiki]> SELECT /* SpecialAllPages::showChunk */ page_title FROM `page` USE INDEX(name_title) WHERE page_namespace = '0' AND page_is_redirect = '1' AND (page_title > 'Entomacrodus_stellifer') ORDER BY page_namespace, page_title LIMIT 344, 1;
+------------------------------+
| page_title                   |
+------------------------------+
| Entre_la_coupe_et_l'election |
+------------------------------+
1 row in set (25.59 sec)

An index containing that flag is needed if it has to be fast. Otherwise, when there are few results, it will read lots of not-returned rows. I would disable the flag until such index is added.

jcrespo added a comment.EditedMar 20 2017, 4:36 PM

The ORDER BY optimization will most certainly help for other cases, like T159319- I've mentioned those on 5.6/5.7, but I do not think it is the main cause here.

True, there is the possibility of T97797-like issues with this query in general, and we should probably remove the redirect filtering at least for Special:AllPages in miser mode.

But this specific query shouldn't have that issue, page_is_redirect=0 is not that sparse on either eswiki or enwiki for the subset of pages these queries are actually looking at.

Your query in T160914#3114917 is running into a different bug, where including page_namespace in the ORDER BY when it's constant in the WHERE makes the planner get confused and think it has to fetch all rows and filesort. Removing that makes it work better, if still not the best:

mysql:wikiadmin@db1083 [enwiki]> SELECT /* SpecialAllPages::showChunk */ page_title FROM `page` USE INDEX(name_title) WHERE page_namespace = '0' AND page_is_redirect = '1' AND (page_title > 'Entomacrodus_stellifer') ORDER BY page_title LIMIT 344, 1;
+------------------------------+
| page_title                   |
+------------------------------+
| Entre_la_coupe_et_l'election |
+------------------------------+
1 row in set (0.68 sec)

mysql:wikiadmin@db1083 [enwiki]> SHOW STATUS like 'Hand%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 0       |
| Handler_icp_attempts       | 3871743 |
| Handler_icp_match          | 608     |
| Handler_mrr_init           | 0       |
| Handler_mrr_key_refills    | 0       |
| Handler_mrr_rowid_refills  | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 0       |
| Handler_read_key           | 1       |
| Handler_read_last          | 0       |
| Handler_read_next          | 604     |
| Handler_read_prev          | 0       |
| Handler_read_retry         | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_deleted   | 0       |
| Handler_read_rnd_next      | 149     |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_tmp_update         | 0       |
| Handler_tmp_write          | 147     |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+

It's still using a bad plan, but for ascending order it seems it's willing to use ICP and stop after actually finding enough rows to fill the limit, which speeds things up a lot compared to the descending order where it's actually fetching all those rows.

Removing the USE INDEX too makes it use the good plan. I have no idea why the USE INDEX was making it choose the bad plan.

mysql:wikiadmin@db1083 [enwiki]> SELECT /* SpecialAllPages::showChunk */ page_title FROM `page` WHERE page_namespace = '0' AND page_is_redirect = '1' AND (page_title > 'Entomacrodus_stellifer') ORDER BY page_title LIMIT 344, 1;
+------------------------------+
| page_title                   |
+------------------------------+
| Entre_la_coupe_et_l'election |
+------------------------------+
1 row in set (0.00 sec)

mysql:wikiadmin@db1083 [enwiki]> SHOW STATUS like 'Hand%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 1     |
| Handler_read_last          | 0     |
| Handler_read_next          | 604   |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 149   |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 147   |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
Anomie added a comment.EditedMar 20 2017, 5:13 PM

Huh. Now, playing with it a bit more I'm seeing shades of a different bug: changing it to page_namespace = 0 (unquoted integer) makes it not filesort even with ORDER BY page_namespace, page_title. I wonder if there was a partial fix of the old bug or if I just never noticed that when investigating the old bug in the first place.

Playing with that a bit more, using SELECT pl_namespace, pl_title, pl_from, page_id, page_namespace, page_title FROM pagelinks, page WHERE pl_from = page_id AND pl_namespace='0' AND pl_title='Main_Page' ORDER BY pl_namespace, pl_title, pl_from LIMIT 11;

  • pl_namespace='0' AND pl_title='Main_Page' filesorts
  • pl_namespace=0 AND pl_title='Main_Page' filesorts
  • pl_namespace='0' AND pl_title=BINARY 'Main_Page' filesorts
  • pl_namespace=0 AND pl_title=BINARY 'Main_Page' doesn't filesort!
  • pl_namespace IN ('0', '1') AND pl_title=BINARY 'Main_Page' doesn't filesort!
  • pl_namespace=0 AND pl_title IN ('Main_Page', 'Test') doesn't filesort!
  • pl_namespace IN ('0', '1') AND pl_title IN ('Main_Page', 'Test') doesn't filesort (which I knew already)

@Anomie I think you got it what I meant earlier. Thank you! :-)

I know you are not supposed to work on this (although it probably affects api, too), but I am going off- could I ask you to move it on some of the mediawiki devs and ask for help to have some better workaround soon?

We have a -probably not so important, but a- mediawiki functionality disabled on all wikis because of this (I took that decision), so while it is not an unbreak now, it in the closest to it.

Reedy added a subscriber: Tgr.Mar 20 2017, 5:43 PM

Ah, OK, the redirect filter. Just remove it? Unindexed queries like that should not be exposed in miser mode.

Should be trivial

Current status: @Reedy submitted a patch to disable the filtering in Special:AllPages in miser mode, and to apply the "limited in miser mode" logic (i.e. make the query without the page_is_direct = 0 then filter it in PHP) to the API apfilterredir parameter. The patch is merged and backported, and will be deployed soon if it hasn't been already.

I'm not going to close this bug just yet in case we still want to discuss the bad plans, or the possibility of adding an index, or anything else.

Reedy added a comment.Mar 20 2017, 8:21 PM

Deployed now!

demon added a subscriber: demon.Mar 20 2017, 8:42 PM

Does this task need to be private anymore?

Does this task need to be private anymore?

I don't think so. There's nothing private, and the redirect filter has been disabled so there's no longer a DOS issue that someone could exploit. Being bold and making it public.

Bawolff changed the visibility from "Custom Policy" to "Public (No Login Required)".Mar 21 2017, 8:13 AM
jcrespo closed this task as Resolved.Mar 21 2017, 10:27 AM
jcrespo assigned this task to Reedy.

I'm not going to close this bug just yet in case we still want to discuss the bad plans, or the possibility of adding an index, or anything else.

We should discuss that, but on a separate ticket- the scope of this one was for the incident only. Databases are no longer overflown, and there is a proper (even if undesired) patch in place. We need to discuss a followup. I created the following followups:

I think Anomie meant only the first one, but I invite everybody to provide input on all of them (some of them may be undesired or already done).

BTW, I am nominatively assigning this to Reedy because he wrote the patch, but of course this was a group effort. Thank you to everybody that contributed.