
Abstracts dumps for Commons running very slowly
Closed, Resolved · Public · 0 Estimated Story Points

Description

Last month they took 1 day; this month they have taken 3 days so far. Why is this?

In the meantime, in order to produce abstract dumps we request all pages and throw away everything not in the main namespace. We should instead query just for pages in the main namespace; that would make dumps of projects like Commons run a lot faster.
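In SQL terms, the change amounts to adding a page_namespace condition to the existing page/revision query. A minimal sketch with a trimmed column list and a placeholder page_id range (the full production query appears in the explains below):

-- Sketch only: trimmed column list, placeholder page_id range.
SELECT page_id, page_namespace, page_title, rev_id, rev_timestamp
FROM page
JOIN revision ON page_id = rev_page AND page_latest = rev_id
WHERE page_id >= 200 AND page_id < 2200
  AND page_namespace = 0          -- restrict to the main namespace up front
ORDER BY rev_page ASC, rev_id ASC
LIMIT 50000;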

Event Timeline

ArielGlenn triaged this task as High priority. Apr 15 2019, 6:56 AM
ArielGlenn created this task.
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Apr 15 2019, 9:55 AM

What we want: namespace 0 only in the SELECT. Here's the explain from that:

root@PRODUCTION s4 slave[commonswiki]> explain extended SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions,1 AS `_load_content`  FROM `page` JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))   WHERE (page_id >= 200 AND page_id < 2200) AND (page_namespace = 0) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+----------------------------------------------------+------+----------+----------------------------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key     | key_len | ref                                                | rows | filtered | Extra                                        |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+----------------------------------------------------+------+----------+----------------------------------------------+
|    1 | SIMPLE      | page                | range  | PRIMARY,name_title                                     | PRIMARY | 4       | NULL                                               | 3206 |    75.02 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | PRIMARY | 4       | commonswiki.page.page_latest                       |    1 |   100.00 | Using where                                  |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY | 4       | commonswiki.page.page_latest                       |    1 |   100.00 | Using index                                  |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY | 8       | commonswiki.temp_rev_comment.revcomment_comment_id |    1 |   100.00 |                                              |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+----------------------------------------------------+------+----------+----------------------------------------------+
4 rows in set, 1 warning (0.00 sec)

Here's the explain from the query for getting the current revision for pages rows as we have it now:

root@PRODUCTION s4 slave[commonswiki]> explain extended SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions,1 AS `_load_content`  FROM `page` JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))   WHERE (page_id >= 200 AND page_id < 2200) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+----------------------------------------------------+------+----------+----------------------------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key     | key_len | ref                                                | rows | filtered | Extra                                        |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+----------------------------------------------------+------+----------+----------------------------------------------+
|    1 | SIMPLE      | page                | range  | PRIMARY                                                | PRIMARY | 4       | NULL                                               | 3206 |   100.00 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | PRIMARY | 4       | commonswiki.page.page_latest                       |    1 |   100.00 | Using where                                  |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY | 4       | commonswiki.page.page_latest                       |    1 |   100.00 | Using index                                  |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY | 8       | commonswiki.temp_rev_comment.revcomment_comment_id |    1 |   100.00 |                                              |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+----------------------------------------------------+------+----------+----------------------------------------------+
4 rows in set, 1 warning (0.00 sec)

Frankly, I don't see that it's any worse to just slap that namespace condition in there; what do the DBAs think? Adding @Marostegui for advice.

The plans don't look any different in those explains. However, we might want to check the real plan the optimizer is actually running, as we have seen before that EXPLAIN output can differ from the plan that actually gets executed.
You can run the query and, in a different shell, identify the process with SHOW PROCESSLIST, then run SHOW EXPLAIN FOR <id>, where <id> is the one listed for that specific query in SHOW PROCESSLIST.
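For example (MariaDB syntax; 12345 is a placeholder for whatever thread id SHOW PROCESSLIST reports for the running dump query):

-- Session 1: run the long dump SELECT as usual.
-- Session 2: find its thread id, then ask the server for the plan actually in use.
SHOW PROCESSLIST;          -- note the Id of the dump query
SHOW EXPLAIN FOR 12345;    -- placeholder id: returns the live plan for that thread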

I should point out that the number of pages we ask for in practice differs from the above; typically it's going to be a range of up to 10k pages. I'll run with that and see if I can get the explain for it at the time (I have a script for that!)

One thing that occurs to me is that on wikis where the majority of the pages are not in the main namespace, the LIMIT might not be as effective as we want, and these could turn into pretty slow queries, especially on Commons where the vast majority of titles are in the File: namespace. I'm going to pursue a quick-n-dirty workaround in WikiExporter for testing first, and investigate modifying the query a little bit later.
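As a rough sanity check of that skew, something like the following could be run on a commonswiki replica (a sketch; it scans the whole page table, so it is not a cheap query itself):

SELECT page_namespace, COUNT(*) AS pages
FROM page
GROUP BY page_namespace
ORDER BY pages DESC;
-- Expectation on Commons: namespace 6 (File:) dominates by a wide margin,
-- which is why a main-namespace-only filter discards most of each page_id range.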

Change 504792 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] allow xml page content or metadata dumps to target specific namespaces

https://gerrit.wikimedia.org/r/504792

Change 504842 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] for abstract dumps, skip any processing of pages not in main namespace

https://gerrit.wikimedia.org/r/504842

The above patches have been tested locally and I have just used them on Commons to regenerate part 5 of the abstracts in record time. This does not address the underlying cause, which I suspect to be somewhere in the revision handling for MCR. Still investigating that.

@Marostegui sorry to ping you again, but we'd like your expertise: I can use this workaround of running the query as is and throwing away the revisions we don't want (LIMIT 50000 always), or we can change it to add the page_namespace condition right in the query (but the LIMIT will still be 50k). The upside of the second is that much less data will be sent; the downside is that it will take a lot longer to hit that LIMIT. What do you think about this tradeoff? Which is harder on the db servers, and is the difference appreciable either way?

Since the abstracts for Commons take (at least) 3 days without that patch, meaning the current (Apr 20th) run could not complete in time (20th through 30th), I'm running them manually out of a screen session on snapshot1008, live-patched, and I'll clean up the patches at the end of the run.

@Marostegui sorry to ping you again, but we'd like your expertise: I can use this workaround of running the query as is and throwing away the revisions we don't want (LIMIT 50000 always), or we can change it to add the page_namespace condition right in the query (but the LIMIT will still be 50k). The upside of the second is that much less data will be sent; the downside is that it will take a lot longer to hit that LIMIT. What do you think about this tradeoff? Which is harder on the db servers, and is the difference appreciable either way?

I don't think there will be a lot of difference. Theoretically, the less data you send, the faster the query will finish, but this isn't always true. My recommendation would be to test both queries on a real host and check whether there is any really appreciable difference, check the query plans to make sure the optimizer isn't doing something crazy, and then pick the faster one (if it sends lots of data but finishes fast, we don't really mind).
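A rough way to run that comparison on a replica (a sketch with a trimmed column list and a placeholder page_id range; the real dump query selects many more columns, as shown above):

-- Variant A: current query, no namespace filter (unwanted rows discarded client-side).
SELECT SQL_NO_CACHE page_id, page_namespace, rev_id
FROM page JOIN revision ON page_id = rev_page AND page_latest = rev_id
WHERE page_id >= 200 AND page_id < 10200
ORDER BY rev_page ASC, rev_id ASC LIMIT 50000;

-- Variant B: namespace condition pushed into the query.
SELECT SQL_NO_CACHE page_id, page_namespace, rev_id
FROM page JOIN revision ON page_id = rev_page AND page_latest = rev_id
WHERE page_id >= 200 AND page_id < 10200 AND page_namespace = 0
ORDER BY rev_page ASC, rev_id ASC LIMIT 50000;

-- Compare the elapsed times the client reports, and inspect each live plan with
-- SHOW EXPLAIN FOR <thread id> while the query runs, as described earlier in this task.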

Change 504792 merged by jenkins-bot:
[mediawiki/core@master] allow xml page content or metadata dumps to target specific namespaces

https://gerrit.wikimedia.org/r/504792

The above should get around to all wikis by the end of the week, if the train has no issues; then I'll be able to merge https://gerrit.wikimedia.org/r/#/c/504842/ which will at least get us some run time back, while work goes ahead on the underlying issue.

Change 504842 merged by ArielGlenn:
[operations/dumps@master] for abstract dumps, skip any processing of pages not in main namespace

https://gerrit.wikimedia.org/r/504842

There is more optimization that can be done, but between the above patchset and https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/511126/ we are in much better shape. Additional speedups can be discussed in new tasks.

ArielGlenn closed this task as Resolved. Jul 8 2019, 8:05 AM
ArielGlenn moved this task from Active to Done on the Dumps-Generation board. Jul 20 2019, 5:23 AM