Page MenuHomePhabricator

stubs dumps run much slower after move from bufferied queries to batches
Closed, ResolvedPublic

Description

In https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/459885/ streaming mode in WikiExport was removed. Stub dumps used to write 2000 to 4000 revisions per second; now they write 1000 revisions in 2 to 3 minutes. On ruwiktionary, for example, the job is so slow that no new data has been written to the gzipped stubs output files since 5 am Oct 21 (it's 7 am Oct 22 now).

The new query uses a temp table and filesort and much too small a batch size. We need to fix this up.

Event Timeline

ArielGlenn triaged this task as High priority.Oct 22 2018, 7:11 AM
ArielGlenn created this task.

Here's what the query used to look like, on wikis where we forced ordering by rev_id within each page:

SELECT /* WikiExporter::dumpFrom ariel@trouble */  
rev_id,rev_page,rev_text_id,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_comment AS `rev_comment_text`,NULL AS `rev_comment_data`,NULL AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions  
FROM `revision` 
FORCE INDEX (rev_page_id) 
INNER JOIN `page` ON ((rev_page=page_id))   
WHERE (rev_page >= 1 AND rev_page < 1)  
ORDER BY rev_page ASC,rev_id ASC

(more or less, there might be some things around rev_comment that are different)

It now looks like

SELECT /* WikiExporter::dumpPages dumpsgen@snapsh... */ 
rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions, rev_text_id  
FROM `page` 
INNER JOIN `revision` ON ((page_id = rev_page)) 
LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) 
LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))   
WHERE (page_id >= 20001 AND page_id < 40001) AND (rev_page>24539 OR (rev_page=24539 AND rev_id>4516685))  
ORDER BY rev_page ASC,rev_id ASC
LIMIT 1000

We likely need to put the FORCE INDEX back in there.
The batch size of 1000 revisions is very tiny for dumps; if we are shooting for queries with 2-3 minutes running time, then we want something between 250k and 700k revisions on average. Consider that a stubs dump for enwiki consists of 860 million revisions, all of which need to be checked... in a reasonable period of time.

NicoV added a subscriber: NicoV.Oct 22 2018, 11:21 AM

Unsure if FORCE INDEX is going to be enough. Still looking at it.

Tentatively I think that adding order by rev_id + losing FORCE_INDEX for those cases plus losing the STRAIGHT_JOIN here https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/380669/ might be the perfect storm of conditions that got us, but I want to check stub production where we had order-by-revid after 380669 was deployed and check revision metadata retrieval, if any logs from then are still available.

Regarding batch size, would you anticipate a much larger number being appropriate for all wikis (including non-Foundation-hosted ones), or is it necessary to introduce another global for controlling this setting?

I don't know yet. Until I know 100% what went wrong and account for current and past behavior, I'm not even sure about batch size changes. My expectation though, if we do change it, is that we would change it globally, expecting third party wikis to show the same behavior.

It's possible this will have affected stubs generation for the adds/changes dumps, leaving a note here so I remember to look at it.

At last I have been able to do a bunch of real-world testing. This means: on the same host that the ruwiktionary dumps are running slowly now, for a page range that is not currently running but has already been seen to be slow, and with the current MW installation and same db server.

Backing off to wmf.23, I see about 900 or so revisions a second being processed

2018-10-23 13:55:19: ruwiktionary (ID 25526) 33 pages (26.9|305.9/sec all|curr), 1100 revs (895.5|926.9/sec all|curr), ETA 2018-10-23 17:02:04 [max 10034547]

in contrast to wmf.26, with about 2 revisions a second for much of the time:

2018-10-21 10:18:06: ruwiktionary (ID 24285) 1167 pages (0.1|1.8/sec all|curr), 29000 revs (1.5|1.5/sec all|curr), ETA 2019-01-05 09:21:20 [max 10030912]

If we add back the STRAIGHT_JOIN in WikiExporter in the stanza for the stubs (line 372), we are back to 900 or so revs a second:

2018-10-23 14:53:53: ruwiktionary (ID 46070) 220 pages (22.0|204.6/sec all|curr), 6000 revs (600.3|930.1/sec all|curr), ETA 2018-10-23 19:32:18 [max 10034600]

With the STRAIGHT_JOIN added, the USE INDEX option is respected. An 'explain extended' of the new query shows:

MariaDB [ruwiktionary]> explain extended SELECT /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment\
.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions, rev_text_id  FROM `revision` FO\
RCE INDEX (rev_page_id) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `page` ON ((rev_page=page_id))   WHERE (page_id >= 20001 AND page_id < 40000) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 1000000;
+------+-------------+---------------------+--------+------------------------+-------------+---------+-----------------------------------------------------+--------+----------+------------------------------------+
| id   | select_type | table               | type   | possible_keys          | key         | key_len | ref                                                 | rows   | filtered | Extra                              |
+------+-------------+---------------------+--------+------------------------+-------------+---------+-----------------------------------------------------+--------+----------+------------------------------------+
|    1 | SIMPLE      | revision            | range  | rev_page_id            | rev_page_id | 4       | NULL                                                | 875406 |   100.00 | Using index condition; Using where |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev | PRIMARY     | 4       | ruwiktionary.revision.rev_id                        |      1 |   100.00 | Using index                        |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                | PRIMARY     | 8       | ruwiktionary.temp_rev_comment.revcomment_comment_id |      1 |   100.00 | Using where                        |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                | PRIMARY     | 4       | ruwiktionary.revision.rev_page                      |      1 |   100.00 |                                    |
+------+-------------+---------------------+--------+------------------------+-------------+---------+-----------------------------------------------------+--------+----------+------------------------------------+
4 rows in set, 1 warning (0.00 sec)

No temp tables, no filesort.

In the meantime let's try a batchsize of 50k revs; my testing shows that the query is done before I can jump over to the right window and see it in a show full processlist, and that's fast enough for me.

I'd like to get this fix into master and backported to wmf.26 as soon as possible, since it's affecting running dumps.

Change 469213 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix stubs dump query to use straight join

https://gerrit.wikimedia.org/r/469213

A note that for comparison, with the above fix the ruwiktionary run processes 20k pages in about 7 minutes. The current run without the fix has taken about 3 days to do about 90k pages. This is a 'small' wiki :-)

I see I did not post a SHOW EXPLAIN from the current run (without STRAIGHT_JOIN); here it is.

MariaDB [(none)]> show explain for 599162104;
+------+-------------+---------------------+--------+--------------------------------------------------------+-------------+---------+-----------------------------------------------------+--------+-------------------------------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key         | key_len | ref                                                 | rows   | Extra                                           |
+------+-------------+---------------------+--------+--------------------------------------------------------+-------------+---------+-----------------------------------------------------+--------+-------------------------------------------------+
|    1 | SIMPLE      | page                | range  | PRIMARY                                                | PRIMARY     | 4       | NULL                                                |  29786 | Using where; Using temporary; Using filesort    |
|    1 | SIMPLE      | revision            | range  | PRIMARY,page_timestamp,page_user_timestamp,rev_page_id | rev_page_id | 8       | NULL                                                | 674124 | Using where; Using join buffer (flat, BNL join) |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY     | 4       | ruwiktionary.revision.rev_id                        |      1 | Using index                                     |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY     | 8       | ruwiktionary.temp_rev_comment.revcomment_comment_id |      1 | Using where                                     |
+------+-------------+---------------------+--------+--------------------------------------------------------+-------------+---------+-----------------------------------------------------+--------+-------------------------------------------------+
4 rows in set, 1 warning (0.00 sec)

For more on the use of STRAIGHT_JOIN when we produce page revisions in rev_id order, see T29112. The switch to always generate in rev_id order was also part of the patch removing buffering from WikiExporter.

Change 469213 merged by jenkins-bot:
[mediawiki/core@master] fix stubs dump query to use straight join

https://gerrit.wikimedia.org/r/469213

Change 469319 had a related patch set uploaded (by BPirkle; owner: ArielGlenn):
[mediawiki/core@wmf/1.32.0-wmf.26] fix stubs dump query to use straight join

https://gerrit.wikimedia.org/r/469319

Once this is deployed (hopefully in today's evening swat slot) I'll be checking the existing stub runs, and any that are stuck running the old slow code will get shot so that they can restart (for big wikis) a bit later in the morning, or run manually (for small wikis that can finish quickly).

Change 469319 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.26] fix stubs dump query to use straight join

https://gerrit.wikimedia.org/r/469319

I see the patch does both things at the same time (index choice and batch size). I'd be interested to see a benchmark of the stub dump with different batch sizes but with the table scanning issue fixed. It should only take a couple of milliseconds to fetch 1000 rows. Having the benchmark will help us design similar software in future.

Mentioned in SAL (#wikimedia-operations) [2018-10-23T23:30:52Z] <twentyafterfour@deploy1001> Synchronized php-1.32.0-wmf.26/includes/export/WikiExporter.php: sync https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/469319/ refs T207628 (duration: 01m 39s)

Change 469349 had a related patch set uploaded (by Legoktm; owner: ArielGlenn):
[mediawiki/core@REL1_32] fix stubs dump query to use straight join

https://gerrit.wikimedia.org/r/469349

I see the patch does both things at the same time (index choice and batch size). I'd be interested to see a benchmark of the stub dump with different batch sizes but with the table scanning issue fixed. It should only take a couple of milliseconds to fetch 1000 rows. Having the benchmark will help us design similar software in future.

Here's the results for 1k, 10k, 50k, for starters:

MariaDB [ruwiktionary]> pager > /dev/null;
PAGER set to '> /dev/null'
MariaDB [ruwiktionary]> SELECT /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions, rev_text_id  FROM `revision` FORCE INDEX (rev_page_id) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `page` ON ((rev_page=page_id))   WHERE (page_id >= 20001 AND page_id < 40000) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 1000;
1000 rows in set (0.02 sec)

MariaDB [ruwiktionary]> SELECT /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions, rev_text_id  FROM `revision` FORCE INDEX (rev_page_id) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `page` ON ((rev_page=page_id))   WHERE (page_id >= 20001 AND page_id < 40000) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 10000;
10000 rows in set (0.14 sec)

MariaDB [ruwiktionary]> SELECT /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions, rev_text_id  FROM `revision` FORCE INDEX (rev_page_id) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `page` ON ((rev_page=page_id))   WHERE (page_id >= 20001 AND page_id < 40000) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
50000 rows in set (0.41 sec)

I should probably audit all queries run out of WikiExporter during dumps production one of these days.

In the meantime, I've shot the two dumpBackup jobs that were writing frwiki stubs and let the wrapper restart them. They are now writing out correct new files. Other wikis seem not to have runs in progress with the bad query.

enwiki stubs part 13 do not end with a </mediawiki> tag; they contain only about the first 1/10th of the pages. I'm removing them and running stubs for those manually in screen on snapshot1005. Once done I'll move those files into place and the enwp job will continue normally.

The missing stubs files have been put into place, and I have started the recombine job manually just for added measure. When that's done I'll kick the regular cron job back off.

Changed my mind, I've set the cron to start now, and it's going but starting with articles dumps. This means the xmlstubs recombine will run in the last catch-all phase, rather than right now. I'll check on things in a couple hours to see how they are doing.

Change 469440 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] remove orderRevs option for some functions

https://gerrit.wikimedia.org/r/469440

Change 469349 merged by jenkins-bot:
[mediawiki/core@REL1_32] fix stubs dump query to use straight join

https://gerrit.wikimedia.org/r/469349

I need to check the behavior of Special:Exports page lists, adds/changes stubs, and abstracts dumps; everything else should be ok, at least as far as all of WMF's use cases. The abstracts are probably fine since it's only the current revisions included, but i'd rather check to be sure.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Oct 25 2018, 7:59 AM

Abstracts dump queries complete pretty quickly but they still copy to a temp table. Sample query below from current abstracts dump run for enwiki:

| 1789897141 | wikiadmin            | 10.64.48.37:51362 | enwiki             | Query       |       0 | Copying to tmp table                                                        | SELECT /* WikiExporter::dumpPages dumpsgen@snapsh... */  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_id >= 5141431 AND page_id < 5151431) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000 |    0.000 |

Here's the explain extended for that query:

MariaDB [enwiki]> explain extended SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_id >= 2870716 AND page_id < 2880716) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+-------+----------+----------------------------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key     | key_len | ref                                           | rows  | filtered | Extra                                        |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+-------+----------+----------------------------------------------+
|    1 | SIMPLE      | page                | range  | PRIMARY                                                | PRIMARY | 4       | NULL                                          | 10834 |   100.00 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | PRIMARY | 4       | enwiki.page.page_latest                       |     1 |   100.00 | Using where                                  |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY | 4       | enwiki.page.page_latest                       |     1 |   100.00 | Using index                                  |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY | 8       | enwiki.temp_rev_comment.revcomment_comment_id |     1 |   100.00 | Using where                                  |
|    1 | SIMPLE      | text                | eq_ref | PRIMARY                                                | PRIMARY | 4       | enwiki.revision.rev_text_id                   |     1 |   100.00 |                                              |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+-------+----------+----------------------------------------------+
5 rows in set, 1 warning (0.00 sec)

Here's how long that query takes to actually run (10k pages, latest revision per page):

MariaDB [enwiki]> pager > /dev/null;
PAGER set to '> /dev/null'
MariaDB [enwiki]> SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_id >= 2870716 AND page_id < 2880716) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
5491 rows in set (0.14 sec)

The following query, which orders by page_id instead of rev_page, and drops ordering by rev_id (since there's only every one revision per page) fixes the temp table issue:

MariaDB [enwiki]> explain extended SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_id >= 2870716 AND page_id < 2880716) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY page_id ASC LIMIT 50000;
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+-------+----------+-------------+
| id   | select_type | table               | type   | possible_keys                                          | key     | key_len | ref                                           | rows  | filtered | Extra       |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+-------+----------+-------------+
|    1 | SIMPLE      | page                | range  | PRIMARY                                                | PRIMARY | 4       | NULL                                          | 10834 |   100.00 | Using where |
|    1 | SIMPLE      | revision            | eq_ref | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | PRIMARY | 4       | enwiki.page.page_latest                       |     1 |   100.00 | Using where |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY | 4       | enwiki.page.page_latest                       |     1 |   100.00 | Using index |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY | 8       | enwiki.temp_rev_comment.revcomment_comment_id |     1 |   100.00 | Using where |
|    1 | SIMPLE      | text                | eq_ref | PRIMARY                                                | PRIMARY | 4       | enwiki.revision.rev_text_id                   |     1 |   100.00 |             |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+-------+----------+-------------+
5 rows in set, 1 warning (0.00 sec)

Time to execute:

MariaDB [enwiki]> pager > /dev/null;
PAGER set to '> /dev/null'
MariaDB [enwiki]> SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_id >= 2870716 AND page_id < 2880716) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY page_id ASC LIMIT 50000;
5491 rows in set (0.16 sec)

No appreciable difference for this number of pages (10k). For a larger number, we might see a difference, I have not tested it out.

Change 469599 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] always order by page_id for dumps of current revisions

https://gerrit.wikimedia.org/r/469599

Here is the query from the adds-changes dump first pass ('stubs'):

MariaDB [enwiki]> explain extended SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id = rev_page)) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))   WHERE (rev_id >= 865352891 AND rev_id < 865668079) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+--------+----------+-----------------------------+
| id   | select_type | table               | type   | possible_keys                                          | key     | key_len | ref                                           | rows   | filtered | Extra                       |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+--------+----------+-----------------------------+
|    1 | SIMPLE      | revision            | range  | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | PRIMARY | 4       | NULL                                          | 669144 |   100.00 | Using where; Using filesort |
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                                                | PRIMARY | 4       | enwiki.revision.rev_page                      |      1 |   100.00 |                             |
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev                                 | PRIMARY | 4       | enwiki.revision.rev_id                        |      1 |   100.00 | Using index                 |
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                                                | PRIMARY | 8       | enwiki.temp_rev_comment.revcomment_comment_id |      1 |   100.00 | Using where                 |
+------+-------------+---------------------+--------+--------------------------------------------------------+---------+---------+-----------------------------------------------+--------+----------+-----------------------------+
4 rows in set, 1 warning (0.00 sec)

Execution time:

MariaDB [enwiki]> pager > /dev/null;
PAGER set to '> /dev/null'
MariaDB [enwiki]> SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,COALESCE( comment_rev_comment.comment_text, rev_comment ) AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,rev_text_id,rev_content_format,rev_content_model,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id = rev_page)) LEFT JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) LEFT JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))   WHERE (rev_id >= 865352891 AND rev_id < 865668079) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
50000 rows in set (1.20 sec)

We get here by invoking

/usr/bin/php7.0 /srv/mediawiki_atg/multiversion/MWScript.php dumpBackup.php --wiki=enwiki --stub --output=gzip:/mnt/dumpsdata/temp/dumpsgen/testoutput/incr/enwiki/20181025/enwiki-20181025-stubs-meta-hist-incr.xml.gz --revrange --revstart=865352891 --revend=865668079

so this is the way WikiExporter::RANGE gets used.

ArielGlenn added a comment.EditedOct 25 2018, 4:03 PM

I"m not sure we can do any better for the revs by range dumps. I'll ask the dba folks at some point when one of them has a few spare cycles, but just looking at and playing with the indices available and the ordering we require, even without batching, I don't see how we can avoid filesort without looking at a whole lot more rows and making the entire query much more expensive. Unless someone has a bright idea, I'm going to call this one 'good enough for now'.

I've checked Special:Export cases for pages on a local install. As such, the parts of the queries deling with rev comments and rev users are slightly different but shouldn't matter here. I export a list of titles; this translates into multiple queries each for a single page title.

First export is without revision history:

MariaDB [enwiki]> explain extended SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_comment AS `rev_comment_text`,NULL AS `rev_comment_data`,NULL AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_namespace=10 AND page_title='Marked') AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY page_id ASC LIMIT 50000;
+------+-------------+----------+-------+--------------------------------------------------------+------------+---------+-------------+------+----------+-------+
| id   | select_type | table    | type  | possible_keys                                          | key        | key_len | ref         | rows | filtered | Extra |
+------+-------------+----------+-------+--------------------------------------------------------+------------+---------+-------------+------+----------+-------+
|    1 | SIMPLE      | page     | const | PRIMARY,name_title                                     | name_title | 261     | const,const |    1 |   100.00 |       |
|    1 | SIMPLE      | revision | const | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | PRIMARY    | 4       | const       |    1 |   100.00 |       |
|    1 | SIMPLE      | text     | const | PRIMARY                                                | PRIMARY    | 4       | const       |    1 |   100.00 |       |
+------+-------------+----------+-------+--------------------------------------------------------+------------+---------+-------------+------+----------+-------+
3 rows in set, 1 warning (0.00 sec)

MariaDB [enwiki]> pager > /dev/null;
PAGER set to '> /dev/null'
MariaDB [enwiki]> SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_comment AS `rev_comment_text`,NULL AS `rev_comment_data`,NULL AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_namespace=10 AND page_title='Marked') AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY page_id ASC LIMIT 50000;
1 row in set (0.00 sec)

This is with https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/469599/ applied to my local install, then running the resulting query on production s1 vslow db . It looks good.

The second query is an export including all history, for the same titles. Here too each page title is requested separately in the code.

MariaDB [enwiki]> explain extended SELECT /* WikiExporter::dumpPages  */  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_comment AS `rev_comment_text`,NULL AS `rev_comment_data`,NULL AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id = rev_page)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_namespace=10 AND page_title='Marked') AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_timestamp ASC LIMIT 50000;
+------+-------------+----------+--------+--------------------------------------------------------+----------------+---------+-----------------------------+------+----------+-------------+
| id   | select_type | table    | type   | possible_keys                                          | key            | key_len | ref                         | rows | filtered | Extra       |
+------+-------------+----------+--------+--------------------------------------------------------+----------------+---------+-----------------------------+------+----------+-------------+
|    1 | SIMPLE      | page     | const  | PRIMARY,name_title                                     | name_title     | 261     | const,const                 |    1 |   100.00 |             |
|    1 | SIMPLE      | revision | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | page_timestamp | 4       | const                       |    8 |   100.00 | Using where |
|    1 | SIMPLE      | text     | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.revision.rev_text_id |    1 |   100.00 |             |
+------+-------------+----------+--------+--------------------------------------------------------+----------------+---------+-----------------------------+------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)

MariaDB [enwiki]> pager > /dev/null;
PAGER set to '> /dev/null'
MariaDB [enwiki]> SELECT /* WikiExporter::dumpPages  */  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_comment AS `rev_comment_text`,NULL AS `rev_comment_data`,NULL AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id = rev_page)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (page_namespace=10 AND page_title='Marked') AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_timestamp ASC LIMIT 50000;
8 rows in set (0.00 sec)

This orders things by rev_timestamp which is easy since we have the covering key page_timestamp on the revision table.

So things are looking pretty good; I think the only remaining case to check is Special:Export of all pages.

Special:Export for exportall, assuming a user has the right, always exports full histories, even if 'current only' is checked. Leaving aside whether that's a bug, the query that results (taken from local install, comment/actor handling will be slightly different than production) is:

MariaDB [enwiki]> explain extended SELECT rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_comment AS `rev_comment_text`,NULL AS `rev_comment_data`,NULL AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `page` INNER JOIN `revision` ON ((page_id = rev_page)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
+------+-------------+----------+--------+--------------------------------------------------------+----------------+---------+-----------------------------+----------+----------+----------------------------------------------+
| id   | select_type | table    | type   | possible_keys                                          | key            | key_len | ref                         | rows     | filtered | Extra                                        |
+------+-------------+----------+--------+--------------------------------------------------------+----------------+---------+-----------------------------+----------+----------+----------------------------------------------+
|    1 | SIMPLE      | page     | range  | PRIMARY                                                | PRIMARY        | 4       | NULL                        | 23038560 |   100.00 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | revision | ref    | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | page_timestamp | 4       | enwiki.page.page_id         |        5 |   100.00 | Using index condition; Using where           |
|    1 | SIMPLE      | text     | eq_ref | PRIMARY                                                | PRIMARY        | 4       | enwiki.revision.rev_text_id |        1 |   100.00 |                                              |
+------+-------------+----------+--------+--------------------------------------------------------+----------------+---------+-----------------------------+----------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.09 sec)

We can't rely on a straight join and use of the rev_page_id index to save us here, because we're not dumping a small range of pages:

MariaDB [enwiki]> explain extended SELECT /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,rev_comment AS `rev_comment_text`,NULL AS `rev_comment_data`,NULL AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,old_text,old_flags,page_restrictions, rev_text_id  FROM `revision` FORCE INDEX (rev_page_id) INNER JOIN `page` ON ((rev_page=page_id)) INNER JOIN `text` ON ((rev_text_id=old_id))   WHERE (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
+------+-------------+----------+--------+---------------+-------------+---------+-----------------------------+-----------+----------+------------------------------------+
| id   | select_type | table    | type   | possible_keys | key         | key_len | ref                         | rows      | filtered | Extra                              |
+------+-------------+----------+--------+---------------+-------------+---------+-----------------------------+-----------+----------+------------------------------------+
|    1 | SIMPLE      | revision | range  | rev_page_id   | rev_page_id | 8       | NULL                        | 399057886 |   100.00 | Using index condition; Using where |
|    1 | SIMPLE      | page     | eq_ref | PRIMARY       | PRIMARY     | 4       | enwiki.revision.rev_page    |         1 |   100.00 |                                    |
|    1 | SIMPLE      | text     | eq_ref | PRIMARY       | PRIMARY     | 4       | enwiki.revision.rev_text_id |         1 |   100.00 |                                    |
+------+-------------+----------+--------+---------------+-------------+---------+-----------------------------+-----------+----------+------------------------------------+
3 rows in set, 1 warning (0.01 sec)

No filesort and no temp table, but the number of rows filtered is now the size of the revision table.
The way to do proper batching of the full export of all pages for any wiki larger than 20k pages would be to do a series of these queries each for a small page range. This should go on the todo list for third party users, but it's out of the scope of this ticket.

I've been writing a tool to get me tabe structure and mariadb version info across all the wikis for given shards in a given db, so I can run some of these queries on differing setups, if there are such. (Need to decide where to put this crappy tool other than the local repo on my laptop.) Have output now for all wikis with more than 200k revisions, anything smaller we probably don't care about since in 4 batches a query is complete, even if slow. The big deals right now are: actor table not there on some servers, keys on the revision table vary on some servers; and I'm curious about the MAX_ROWS parameter I see on some dbs though it's not directly relevant to this task.

I'll check to see if any of these differences is important for us (example: we don't care about differing keys on the rc log hosts), and will record the results here.

After that I should test all queries on wikis of a few different sizes on a few different shards, and on any dbs where key or table differences matter; once that's done I'll feel good about requesting merges for patches.

I have a list of db hosts and wikis on which to run a set of queries; this will cover hosts in each shard that have different indexes on the revision table. There are no other table structure differences of interest for us. I've chosen a range of medium, large and huge wikis on which to test the list of queries: dewiki, elwiki, enwiki, frwiki, metawiki, wikidatawiki. Half of the queries will run without any offset in the revision condition (as though they are starting with the first batch of 50k revisions) and the other half will be identical but will run with an offset of 50k or more. I have identified pages with the greatest numbers of revisions on each of these wikis for this purpose.

I will run an EXPLAIN for each of these queries, which essentially correspond to the query for each of the if clauses in WikiExporter::dumpPages, with and without the various fixes applied in patches above.

Once I have those and have looked at them closely, I'll post a summary here. If all looks well I'll get dba input before requesting a merge.

Table comparison script, such as it is, is stashed here: https://github.com/apergos/misc-wmf-crap/tree/master/checktables I doubt it's of use to anyone else but there it is for the record.

And this is delayed yet again; I have a 'run queries with variable substitution' script but in fact now what I need is a script that will run the real query, grab the thread id, show explain for that thread id, then kill that query. This would ensure that the explain we get is really what runs. Working on that script right now.

I've got a few scripts written and they're gross. But they work. Here's one that generates values I need for in a format that my show explain script can read:
https://github.com/apergos/misc-wmf-crap/blob/master/runqueries/genrqsettings.py
I've run that for the following wikis: frwiki, jawiki, plwiki, commonswiki, enwiki, wikidatawik, frwiktionary. That output, with a bit of cleanup, will get fed to:
https://github.com/apergos/misc-wmf-crap/blob/master/runqueries/explain_sql_query.py
which will run the query of interest on the specified host, show explain it from a subprocess, then shoot the original query in case it happens to take time to complete.
Still in the works is a script to collect the sql query text right out of the WikiExporter class as it runs, by monkey-patching (eww); we need this so we can keep up with changes to the code.

For the moment, I'll be doing a show explain for the various WikiExporter queries as they stand, on a selection of hosts that have differences in table structure (indexes pretty much) from the en wiki master at the time I looked a few days ago. That should be good enough for a first audit.

Forgot to add, we now can check all db hosts and wikis' table structures against one presumptive master db server/wiki; this is how I collected a small list of hosts and wikis with differences. That's this script again: https://github.com/apergos/misc-wmf-crap/blob/master/checktables/check_table_structures.py Yeah I know it's awful. But for right now it gets the job done.

Change 478708 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/software@master] query checking scripts for auditing WikiExporter (dumps) queries

https://gerrit.wikimedia.org/r/478708

daniel added a subscriber: daniel.Dec 17 2018, 12:21 PM

@ArielGlenn there is a lot of stuff here, I lost track at some point. Can you tell me whether anything urgent remains to be done here? Also, do we have tickets for the things that are not urgent, but should be addressed or investigated?

I need to do a full run of my auditing scripts on the production dbs; I've onlydone partial runs until now. Once that happens I'll boil down the results, post them here, link to the patchset(s) I think are good to go, and defer anything else that's not urgent. I'll make sure there are tickets for those things; don't expect that until near the end of the week though.

Legoktm added a subscriber: Legoktm.

The main reason I had tagged this as a 1.32 blocker was because of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/469349 - which is merged/taken care of. I don't see anything else that is release blocker worthy.

Change 478708 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/software@master] db query checking scripts for auditing WikiExporter (dumps) queries

https://gerrit.wikimedia.org/r/478708

Change 469599 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] always order by page_id for dumps of current revisions

https://gerrit.wikimedia.org/r/469599

I've run the following without and with the above patch:

/usr/bin/php7.2 /srv/mediawiki_mcrtest/multiversion/MWScript.php dumpBackup.php --wiki=enwiki /srv/mediawiki_mcrtest/php-1.34.0-wmf.6 --plugin=AbstractFilter:/srv/mediawiki_mcrtest/php-1.34.0-wmf.6/extensions/ActiveAbstract/includes/AbstractFilter.php --current --report=300 --output=gzip:/mnt/dumpsdata/temp/dumpsgen/abstract-testing.xml.gz --filter=namespace:NS_MAIN --filter=noredirect --filter=abstract --skip-header --start=6500000 --skip-footer --end 6600000

Times were comparable, with the patched run being slightly faster.

I captured the actual query run by logging from selectSQLText, and ran SHOW EXPLAIN on the two versions of the query (with and without page offset), with and without the patch. With the patch, a 'Using temporary' notation disappears. See F29235451 for the details.

Nope, this is a different fix :-) We still need the one you proposed in https://gerrit.wikimedia.org/r/511126 which I have yet to test (apologies!)

P8571 (hm why did I do this as a paste and not a file this time? oh well) is the same 'order by page_id vs order by rev_page' but on a smaller wiki, eowiki this time. One of the queries shows no results because it finished fat enough the scripted 'connect to the db, and quick do a show explain' doesn't get there before the to-be-explained query is already completed.

Change 469440 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] remove orderRevs param from dump scripts

https://gerrit.wikimedia.org/r/469440

Change 513568 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] remove orderrev config option, no longer needed

https://gerrit.wikimedia.org/r/513568

Change 513570 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] remove orderrevs param from dumps manifests

https://gerrit.wikimedia.org/r/513570

Change 513568 merged by ArielGlenn:
[operations/dumps@master] remove orderrev config option, no longer needed

https://gerrit.wikimedia.org/r/513568

Change 513570 merged by ArielGlenn:
[operations/puppet@production] remove orderrevs param from dumps manifests

https://gerrit.wikimedia.org/r/513570

Change 469599 merged by jenkins-bot:
[mediawiki/core@master] always order by page_id for dumps of current revisions

https://gerrit.wikimedia.org/r/469599

F29332559 has the latest and greatest test runs. In each case I run order by page_id first, then rev_page,rev_id second. I then repeat the test to see how the timings are different. In all cases but dewiki/eswiki first runs, the page_id is faster; this includes enwiki which I did on a lark at the end.

One thing I learned is that batches need to be smaller for ruwiki, but that's a side topic.

ArielGlenn closed this task as Resolved.Jul 3 2019, 6:00 AM
ArielGlenn claimed this task.

There's more that can be done but specific tasks can be opened as necessary. The larger issue is resolved.

ArielGlenn moved this task from Active to Done on the Dumps-Generation board.Jul 20 2019, 5:23 AM