Page MenuHomePhabricator

Special:Export WikiExporter::dumpPages query needs optimization
Closed, ResolvedPublic

Description

Note: I am not entirely sure where this query belongs to, so please, feel free to edit tags and subscribers as needed.

The following query is being killed by the query killer and hence never completing as it takes longer than 1 minute to execute - it takes 1 row in set (11 min 32.981 sec) for enwiki
This is the query on enwiki:

SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,temp_rev_user.revactor_actor AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model,page_restrictions,1 AS `_load_content`  FROM `page` JOIN `revision` ON ((page_id = rev_page)) JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `slots` ON ((slot_revision_id = rev_id)) JOIN `content` ON ((content_id = slot_content_id))   WHERE (page_namespace=0 AND page_title='Bitcoin') AND (rev_timestamp > '19700101000001') AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_timestamp ASC LIMIT 1 ;

And the EXPLAIN:

+------+-------------+---------------------+--------+----------------------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-------------+
| id   | select_type | table               | type   | possible_keys                                                        | key           | key_len | ref                                           | rows      | Extra       |
+------+-------------+---------------------+--------+----------------------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-------------+
|    1 | SIMPLE      | page                | const  | PRIMARY,name_title                                                   | name_title    | 261     | const,const                                   |         1 |             |
|    1 | SIMPLE      | revision            | range  | PRIMARY,rev_timestamp,page_timestamp,page_user_timestamp,rev_page_id | rev_timestamp | 16      | NULL                                          | 400072257 | Using where |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                               | PRIMARY       | 4       | enwiki.revision.rev_id                        |         1 | Using index |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                              | PRIMARY       | 8       | enwiki.temp_rev_comment.revcomment_comment_id |         1 |             |
|    1 | SIMPLE      | temp_rev_user       | ref    | PRIMARY,revactor_rev,actor_timestamp                                 | PRIMARY       | 4       | enwiki.revision.rev_id                        |         1 | Using index |
|    1 | SIMPLE      | actor_rev_user      | eq_ref | PRIMARY                                                              | PRIMARY       | 8       | enwiki.temp_rev_user.revactor_actor           |         1 |             |
|    1 | SIMPLE      | slots               | ref    | PRIMARY,slot_revision_origin_role                                    | PRIMARY       | 8       | enwiki.revision.rev_id                        |         1 | Using where |
|    1 | SIMPLE      | content             | eq_ref | PRIMARY                                                              | PRIMARY       | 8       | enwiki.slots.slot_content_id                  |         1 |             |
+------+-------------+---------------------+--------+----------------------------------------------------------------------+---------------+---------+-----------------------------------------------+-----------+-------------+
8 rows in set (0.002 sec)

The trace comes from Error 2006 from WikiExporter::dumpPages

Error 2006 from WikiExporter::dumpPages, {error} {sql1line} {db_server}
 url	   	/w/index.php?title=Special:Export
fname	   	WikiExporter::dumpPages

Event Timeline

Is that the EXPLAIN for the running query or what mariadb claims it would do? Just being careful... I would expect to see rev_page used to limit the revisions searched to those of the specified page only, and then rev_timestamp to limit the selection. What's going on?

It will be hard to get the original query as it produced by Database::select, because in WikiExporter::dumpPage there is

$revQuery = $this->revisionStore->getQueryInfo( [ 'page' ] );

which does a fair amount of work, also invoking

$this->commentStore->getJoin( 'rev_comment' )

and

$this->actorMigration->getJoin( 'rev_user' );

Maybe we can hack the function locally on deployment-prep to write out the value of $sql as returned from selectSQLText and get a look at it that way.

eprodromou subscribed.

OK, seems like an issue. We'll accept this for Clinic Duty to see how we can help.

OIC, this is not using rev_page_id as an index when it really ought to be.

In the $this->history & self::FULL conditional of WikiExporter::dumpPage, we work around this for other cases. (https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/export/WikiExporter.php#415 )

Maybe the same is needed here. We could add something to the if is_array( $this->history ) conditional and check if we have been explicitly given a page name and title, then force the index to be rev_page_id in that case. (https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/export/WikiExporter.php#397 ) Thoughts?

It's ordering by rev_timestamp, so the correct index is actually rev_page_timestamp.

@Ammarpad recently renamed that index in tables.sql, so it's not possible to force it anymore.

In any case, doing an EXPLAIN in production now shows the correct query plan, and the query is fast. Is it possible MariaDB got smarter over the last year? Did the server version change?

@Ammarpad recently renamed that index in tables.sql, so it's not possible to force it anymore.

In any case, doing an EXPLAIN in production now shows the correct query plan, and the query is fast. Is it possible MariaDB got smarter over the last year? Did the server version change?

It can definitely be the case as we are migrating to 10.4, we've seen this before unfortunately: the optimizer behaving well from one version to another.

Marostegui claimed this task.
Marostegui added a project: DBA.

This seems fixed in all 10.4 hosts (we no longer run 10.1 anywhere).
And for what is worth, 10.6 behaves fine too.