Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki
Open, Needs TriagePublic

Description

MariaDB MARIADB db1071 dewiki > EXPLAIN SELECT /* ApiQueryRecentChanges::run */ rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>='20161024013525') AND rc_namespace IN ('0', '120') AND rc_type IN ('0', '1', '3', '6') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: index
possible_keys: rc_timestamp,rc_namespace_title,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp
          key: rc_timestamp
      key_len: 16
          ref: NULL
         rows: 1239
        Extra: Using where
1 row in set (0.01 sec)

MariaDB MARIADB db1071 dewiki > FLUSH STATUS; pager cat > /dev/null; SELECT /* ApiQueryRecentChanges::run */ rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>='20161024013525') AND rc_namespace IN ('0', '120') AND rc_type IN ('0', '1', '3', '6') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101; nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.00 sec)

PAGER set to 'cat > /dev/null'
101 rows in set (1 min 14.06 sec)

PAGER set to stdout
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_commit             | 1       |
| Handler_delete             | 0       |
| Handler_discover           | 0       |
| Handler_external_lock      | 0       |
| Handler_icp_attempts       | 0       |
| Handler_icp_match          | 0       |
| Handler_mrr_init           | 0       |
| Handler_mrr_key_refills    | 0       |
| Handler_mrr_rowid_refills  | 0       |
| Handler_prepare            | 0       |
| Handler_read_first         | 1       |
| Handler_read_key           | 0       |
| Handler_read_last          | 0       |
| Handler_read_next          | 3343340 |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_deleted   | 0       |
| Handler_read_rnd_next      | 0       |
| Handler_rollback           | 0       |
| Handler_savepoint          | 0       |
| Handler_savepoint_rollback | 0       |
| Handler_tmp_update         | 0       |
| Handler_tmp_write          | 0       |
| Handler_update             | 0       |
| Handler_write              | 0       |
+----------------------------+---------+
25 rows in set (0.00 sec)

I believe this query is impossible to optimize no matter the index used- the reason is that it has 3 ranges and an order by, which a BTREE is not good for that. A rewrite should be considered (SELECT UNION?), or if it is not possible, disabling it entirely.

jcrespo created this task.Oct 25 2016, 9:05 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 25 2016, 9:05 AM
Anomie added a subscriber: Anomie.

Hmm. Yeah, theoretically it could realize that it could do this by combining multiple ranges from an appropriate index like we'd do manually with unions, but it doesn't seem like MySQL is likely to be smart enough to do that. It's doing the best it can now by using rc_timestamp and using the rest of the conditions to filter.

For this situation, it looks like we'd actually have to add another condition to the query to make it amenable to using a decent index at all: add AND rc_patrolled IN ('0','1') or else it'll still have to make bad choices instead of being able to directly use rc_name_type_patrolled_timestamp. Then we can decompose it manually into unions, which would look something like this:

(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='0') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='0') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='1') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='1') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='3') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='3') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='6') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='0') AND (rc_type='6') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='0') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='0') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='1') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='1') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='3') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='3') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='6') AND (rc_patrolled='0') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
UNION ALL
(SELECT rc_id, rc_timestamp, rc_namespace, rc_title, rc_cur_id, rc_type, rc_deleted, rc_this_oldid, rc_last_oldid FROM `recentchanges` WHERE (rc_timestamp>=20161024013525) AND (rc_namespace='120') AND (rc_type='6') AND (rc_patrolled='1') ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101)
ORDER BY rc_timestamp ASC, rc_id ASC LIMIT 101;

The worst case on dewiki would be to wind up with 240 subqueries with 5001 rows each, if someone happened to specify 24 of 25 namespaces with all 5 types and both values for rc_patrolled. We could reduce that by adding more indexes, e.g. we could halve it by adding (rc_namespace, rc_type, rc_timestamp) for when rc_patrolled doesn't matter, and if it's common enough (rc_type, rc_timestamp) would really cut down the permutations when all namespaces are being queried. It's already the default to only query 4 of the 5 values for rc_type, so it might not be worth adding a (rc_namespace, rc_timestamp) index to target with the unionization unless it turns out people are overriding that default often enough.

Thoughts on that plan before I look at actually writing the code? In particular, should we add a (rc_namespace, rc_type, rc_timestamp) index and/or others?

Well, I would wait a bit before working on this, to check if we really want to support this kind of complex queries in the first place.

I am thinking of a more general approach (maybe?) which would be prohibit complex queries and document those limitations. "Error: query too complex- The query you asked for uses a too complex set of condition that are known to be slow- please simplify your query". This may sound stupid, but maybe something like that should be enabled for the largest wikis? I honestly do not know.

Your UNION query would take 1.28 sec cold, 0.01 second hot, much less than the 1 minute of the original one, and much more efficient.

+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| Handler_commit             | 1      |
| Handler_delete             | 0      |
| Handler_discover           | 0      |
| Handler_external_lock      | 0      |
| Handler_icp_attempts       | 522954 |
| Handler_icp_match          | 373    |
| Handler_mrr_init           | 0      |
| Handler_mrr_key_refills    | 0      |
| Handler_mrr_rowid_refills  | 0      |
| Handler_prepare            | 0      |
| Handler_read_first         | 0      |
| Handler_read_key           | 16     |
| Handler_read_last          | 0      |
| Handler_read_next          | 361    |
| Handler_read_prev          | 0      |
| Handler_read_rnd           | 101    |
| Handler_read_rnd_deleted   | 0      |
| Handler_read_rnd_next      | 365    |
| Handler_rollback           | 0      |
| Handler_savepoint          | 0      |
| Handler_savepoint_rollback | 0      |
| Handler_tmp_update         | 0      |
| Handler_tmp_write          | 364    |
| Handler_update             | 0      |
| Handler_write              | 0      |
+----------------------------+--------+
25 rows in set (0.01 sec)

The problem I see here is maintainability and simplicity (?). I would need a product manager to take informed and consistent decisions on this, I do not know myself.

Regarding indexes, this works well, but I am open for new index creation as long as it is indeed useful to speed up queries.

Well, I would wait a bit before working on this, to check if we really want to support this kind of complex queries in the first place.

I am thinking of a more general approach (maybe?) which would be prohibit complex queries and document those limitations. "Error: query too complex- The query you asked for uses a too complex set of condition that are known to be slow- please simplify your query". This may sound stupid, but maybe something like that should be enabled for the largest wikis? I honestly do not know.

One trouble there is that making something that has worked for a long time suddenly stop working is liable to break a lot of API users' code. If we can reasonably avoid doing that, it would be better to avoid it.

A second trouble is that this particular example isn't really all that complex of a query: "I want to see changes since $TIME in either of two namespaces, that aren't 'external' (i.e. from Wikidata)". And the "that aren't 'external'" bit isn't even being explicitly requested, it's the default. It's not filtering by user, tag, flags (e.g. minor, bot, anon, redirect, patrolled), etc.

The problem I see here is maintainability and simplicity (?). I would need a product manager to take informed and consistent decisions on this, I do not know myself.

There is no product manager for the action API. For informed and consistent decisions for the action API people generally ask me.

@Tgr and @Legoktm most often give me code review, we can bring them in easily enough.

Regarding indexes, this works well, but I am open for new index creation as long as it is indeed useful to speed up queries.

In this case, it would be to reduce the number of subqueries in the union. I'd think the (rc_namespace, rc_type, rc_timestamp) index would be useful here as a better target for unionizing than the existing rc_name_type_patrolled_timestamp.

I'd think the (rc_namespace, rc_type, rc_timestamp) index would be useful here as a better target for unionizing than the existing rc_name_type_patrolled_timestamp.

There was some pending open index suggestion for recentchanges. We need to identify that (if it was the same or another index) and establish a plan for applying all pending design changes on the rc table. Changes on rc "should be easy".

Tgr added a comment.Oct 25 2016, 9:37 PM

This query asks for the recent content changes. That (separating content changes from changes in supplemental namespaces) is central to change patrolling, which is one of the main roles of Wikipedia editors. I doubt breaking the functionality is an option.

I guess you could make Database::select handle unions by adding some sort of placeholder tokens - $db->select( $table, $fields, [ 'rc_namespace' => $db->uniqueToken( 'namespace' ), 'rc_type' => $db->uniqueToken( 'type' ) ], __METHOD__, [ 'UNION ALL' => [ 'namespace' => $namespaces, 'type' => $types ] ] ); which would then iterate through all namespace/type combinations, string-replace the tokens, and join the queries with UNION ALL. It's mildly disgusting but reasonably maintainable.

Then again, a simple FORCE INDEX (rc_name_type_patrolled_timestamp) brings down the execution time to nearly the same level (3s cold / 0.25s warm) so I's just go with that.

Anomie added a comment.EditedOct 26 2016, 3:42 PM

I wouldn't bother trying to make Database:select somehow magically do it, I'd just build the subqueries with Database::selectSQLText() and combine them with Database::unionQueries(). Chances are I'd only do the unionization when $wgMiserMode is active and $dbr->unionSupportsOrderAndLimit() is true (i.e. not sqlite or mssql).

FORCE INDEX (rc_name_type_patrolled_timestamp) is going to fetch all matching rows and filesort them; while that works ok-ish for this query with only around 50000 matching rows on dewiki, it tends not to scale all that well when a slightly altered query gets into millions of rows.

Change 359501 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] ApiQueryRecentChanges: Make better use of the rc_name_type_patrolled_timestamp index

https://gerrit.wikimedia.org/r/359501

Anomie claimed this task.
1978Gage2001 moved this task from Triage to In progress on the DBA board.Dec 11 2017, 9:45 AM
Marostegui moved this task from In progress to Triage on the DBA board.Dec 11 2017, 11:06 AM