
Xml stubs dumps are running 5 to 15x slower than previously
Closed, Resolved · Public · 0 Story Points

Description

We get progress reports of how many revisions/second are processed for xml stubs. I looked at some samples: in the Jan 2019 and Mar 20 2019 runs these numbers ranged from 7000 to 22000 revs/sec. During this run (before encountering exceptions which broke the run) the numbers ranged from 450 to 500 revs/sec.

Excerpts:

Jan 1 2019 run, enwiki, samples from earlier-created (not new) pages for stubs:
2019-01-01 12:47:10: enwiki (ID 255257) 127 pages (108.4|1509.3/sec all|curr), 5000 revs (4266.8|11884.4/sec all|curr), ETA 2019-01-03 21:50:04 [max 876291835]
2019-01-01 12:47:10: enwiki (ID 255258) 236 pages (201.9|2823.8/sec all|curr), 5000 revs (4278.0|11965.1/sec all|curr), ETA 2019-01-03 21:41:04 [max 876291835]
2019-01-01 12:47:11: enwiki (ID 255248) 422 pages (325.8|9631.6/sec all|curr), 7000 revs (5404.5|22823.8/sec all|curr), ETA 2019-01-03 09:49:31 [max 876291835]
2019-01-01 12:47:11: enwiki (ID 255249) 427 pages (331.4|5539.3/sec all|curr), 8000 revs (6208.9|12972.7/sec all|curr), ETA 2019-01-03 03:59:25 [max 876291835]
2019-01-01 12:47:11: enwiki (ID 255251) 179 pages (132.3|3987.0/sec all|curr), 7000 revs (5174.3|22273.6/sec all|curr), ETA 2019-01-03 11:49:45 [max 876291835]

Mar 20 2019, similar samples:
2019-03-20 15:12:20: enwiki (ID 265013) 7878 pages (597.9|139556.8/sec all|curr), 67000 revs (5085.4|17714.8/sec all|curr), ETA 2019-03-22 15:44:33 [max 888653661]
2019-03-20 15:12:20: enwiki (ID 265030) 6679 pages (537.9|52339.6/sec all|curr), 55000 revs (4429.2|7836.4/sec all|curr), ETA 2019-03-22 22:56:01 [max 888653664]
2019-03-20 15:12:20: enwiki (ID 264996) 11100 pages (840.6|174888.7/sec all|curr), 96000 revs (7269.9|15755.7/sec all|curr), ETA 2019-03-22 01:09:24 [max 888653661]
2019-03-20 15:12:20: enwiki (ID 264969) 9437 pages (706.8|114008.0/sec all|curr), 100000 revs (7489.5|12081.0/sec all|curr), ETA 2019-03-22 00:09:39 [max 888653661]
2019-03-20 15:12:20: enwiki (ID 264948) 11055 pages (825.4|179958.2/sec all|curr), 127000 revs (9481.8|16278.4/sec all|curr), ETA 2019-03-21 17:14:08 [max 888653661]

Apr 1 2019, similar samples:
2019-04-01 12:49:10: enwiki (ID 91823) 118 pages (0.7|55.4/sec all|curr), 56000 revs (309.3|469.8/sec all|curr), ETA 2019-05-04 20:32:01 [max 890452225]
2019-04-01 12:49:10: enwiki (ID 91825) 333 pages (1.8|167.8/sec all|curr), 56000 revs (309.3|504.0/sec all|curr), ETA 2019-05-04 20:35:51 [max 890452225]
2019-04-01 12:49:11: enwiki (ID 91801) 240 pages (1.3|113.9/sec all|curr), 58000 revs (319.3|474.4/sec all|curr), ETA 2019-05-03 19:29:28 [max 890452225]
2019-04-01 12:49:11: enwiki (ID 91791) 544 pages (3.0|283.3/sec all|curr), 58000 revs (319.1|520.8/sec all|curr), ETA 2019-05-03 19:57:21 [max 890452225]
2019-04-01 12:49:12: enwiki (ID 91784) 33 pages (0.2|15.1/sec all|curr), 56000 revs (307.1|458.3/sec all|curr), ETA 2019-05-05 02:09:33 [max 890452225]

I have checked (by logging all queries) to see if we inadvertently load revision content, and at least normally we don't, so that's ruled out as a cause.

Event Timeline

ArielGlenn triaged this task as High priority.Apr 9 2019, 11:24 AM
ArielGlenn created this task.

Here's an explain of a sample query we run to get revision info for stubs:

root@PRODUCTION s4 slave[commonswiki]> explain extended SELECT  /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,rev_user,rev_user_text,NULL AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,page_restrictions  FROM `revision` FORCE INDEX (rev_page_id) JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `page` ON ((rev_page=page_id))   WHERE (page_id >= 36814248 AND page_id < 36834248) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000;
+------+-------------+---------------------+--------+------------------------+-------------+---------+----------------------------------------------------+--------+----------+-------------
| id   | select_type | table               | type   | possible_keys          | key         | key_len | ref                                                | rows   | filtered | Extra       
+------+-------------+---------------------+--------+------------------------+-------------+---------+----------------------------------------------------+--------+----------+-------------
|    1 | SIMPLE      | revision            | range  | rev_page_id            | rev_page_id | 4       | NULL                                               | 154792 |   100.00 | Using index 
|    1 | SIMPLE      | temp_rev_comment    | ref    | PRIMARY,revcomment_rev | PRIMARY     | 4       | commonswiki.revision.rev_id                        |      1 |   100.00 | Using index 
|    1 | SIMPLE      | comment_rev_comment | eq_ref | PRIMARY                | PRIMARY     | 8       | commonswiki.temp_rev_comment.revcomment_comment_id |      1 |   100.00 |             
|    1 | SIMPLE      | page                | eq_ref | PRIMARY                | PRIMARY     | 4       | commonswiki.revision.rev_page                      |      1 |   100.00 |             
+------+-------------+---------------------+--------+------------------------+-------------+---------+----------------------------------------------------+--------+----------+-------------
4 rows in set, 1 warning (0.00 sec)

For each page we request information on every revision from the slots table one at a time; I believe this is new (though I can't be sure without testing a run with the Mar 20 config and the MW version deployed then):

query: SELECT  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = '139743823'
query: SELECT  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = '218527275'
query: SELECT  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = '139743825'

and so on.

These may be perfectly fine, in which case we'll have to look for the cause of the slowdown elsewhere in the bowels of MW.
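For illustration only (tiny made-up table contents, and sqlite3 standing in for MariaDB), this sketch contrasts the per-revision pattern above with a single batched IN (...) query; the row values and column subset are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE slots (slot_revision_id INTEGER, slot_content_id INTEGER);
CREATE TABLE content (content_id INTEGER PRIMARY KEY, content_size INTEGER);
INSERT INTO slots VALUES (139743823, 1), (218527275, 2), (139743825, 3);
INSERT INTO content VALUES (1, 100), (2, 200), (3, 300);
""")

rev_ids = [139743823, 218527275, 139743825]

# N+1 pattern: one round trip per revision (what the dumps do above).
one_at_a_time = [
    conn.execute(
        "SELECT slot_revision_id, content_size FROM slots "
        "JOIN content ON slot_content_id = content_id "
        "WHERE slot_revision_id = ?", (rid,)
    ).fetchone()
    for rid in rev_ids
]

# Batched: a single query with IN (...) covers the whole batch.
placeholders = ",".join("?" * len(rev_ids))
batched = conn.execute(
    "SELECT slot_revision_id, content_size FROM slots "
    "JOIN content ON slot_content_id = content_id "
    f"WHERE slot_revision_id IN ({placeholders})", rev_ids
).fetchall()

# Same rows come back either way; only the number of round trips differs.
assert sorted(one_at_a_time) == sorted(batched)
```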

I've checked the so-called adds-changes dumps to see if the numbers pan out. These take much less time and can be rerun with less fuss, so they will be good for testing later. Here's what I have: total time to generate stubs for these dumps, on the following dates:

start date: 20190310 end date: 20190410
date:  20190310 runtime:  5 minutes
date:  20190311 runtime:  7 minutes
date:  20190312 runtime:  7 minutes
date:  20190313 runtime:  6 minutes
date:  20190314 runtime:  6 minutes
date:  20190315 runtime:  6 minutes
date:  20190316 runtime:  6 minutes
date:  20190317 runtime:  6 minutes
date:  20190318 runtime:  7 minutes
date:  20190319 runtime:  6 minutes
date:  20190320 runtime:  7 minutes
date:  20190321 runtime:  6 minutes
date:  20190322 runtime:  6 minutes
date:  20190323 runtime:  5 minutes
date:  20190324 runtime:  5 minutes
date:  20190325 runtime:  5 minutes <---
date:  20190326 runtime:  6 minutes <---
date:  20190327 runtime:  30 minutes <---
date:  20190328 runtime:  49 minutes <---
date:  20190329 runtime:  45 minutes
date:  20190330 runtime:  45 minutes
date:  20190331 runtime:  52 minutes
date:  20190401 runtime:  44 minutes
date:  20190402 runtime:  49 minutes
date:  20190403 runtime:  38 minutes
date:  20190404 runtime:  40 minutes
date:  20190405 runtime:  46 minutes
date:  20190406 runtime:  50 minutes
date:  20190407 runtime:  54 minutes
date:  20190408 runtime:  48 minutes
date:  20190409 runtime:  43 minutes
date:  20190410 runtime:  39 minutes

Most of the time is taken by wikidata, with 500,000 to 700,000 new revisions each day.
The big jump happens on Mar 27-28, and on Mar 27 the only remotely related thing we have in SAL is:

19:18 dduvall@deploy1001: Synchronized php: group1 wikis to 1.33.0-wmf.23 (duration: 01m 45s)

Guess I'll start profiling the wmf.22 and wmf.23 code and see where the difference is. Note that page content dump times didn't change all that much.

ArielGlenn renamed this task from Xml stubs dumps are running 20x slower than previously to Xml stubs dumps are running 5 to 15x slower than previously.Apr 12 2019, 3:27 PM
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.Apr 15 2019, 9:54 AM

I have done the following test: with a copy of 1.34.0-wmf.4 I ran a dump of a range of revisions and timed it. The first two runs were with unmodified code; the second two were after modifying Revision/RevisionStore::getQueryInfo so that it always includes rev_text_id regardless of the MCR schema setting.

time /usr/bin/php7.2 /srv/mediawiki_mcrtest/multiversion/MWScript.php dumpBackup.php --wiki=wikidatawiki --stub --quiet --output=gzip:/mnt/dumpsdata/temp/dumpsgen/testoutput/wikidata-incr-stubs1.xml.gz --revrange --revstart=928911884 --revend=929992568
real    32m6.836s
user    15m25.644s
sys     3m28.492s

rerun:
real    39m49.189s
user    16m0.368s
sys     3m26.640s

after editing Revision/RevisionStore::getQueryInfo:

real    2m58.664s
user    2m39.100s
sys     0m19.688s

rerun:

real    2m53.009s
user    2m31.952s
sys     0m19.468s

The difference is that with the MCR schema and without rev_text_id set, XmlDumpWriter's writeRevision invokes newRevisionFromRow, which fetches slot and content info for the revision, and we do a separate query for every revision. For the number of revisions in a simple adds/changes dump this amounts to between 500k and 900k revisions, or an additional 500k to 900k queries to the database. This accounts for a several-fold slowdown.
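A quick back-of-envelope check that per-revision queries alone can account for minutes of extra runtime. The 0.5 ms per-query round-trip figure is an assumption for illustration, not a measured value:

```python
# Rough cost of the extra per-revision slot queries.
revisions = 700_000        # mid-range of the 500k-900k quoted above
round_trip_ms = 0.5        # ASSUMED per-query round trip; not measured
extra_minutes = revisions * round_trip_ms / 1000 / 60
print(f"~{extra_minutes:.1f} extra minutes of pure query overhead")
```

At higher (and more realistic, under load) round-trip times the overhead scales linearly, which is consistent with the tens of minutes of slowdown seen above.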

The query in question looks similar to

Query	SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecords  */  slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `slots` INNER JOIN `content` ON ((slot_content_id = content_id))   WHERE slot_revision_id = '805'

We used to have rev_len and rev_text_id right in the row that we relied on for output; now we must include content_size and content_address from the content row corresponding to the slot entry for the given revision. Eventually we'll need the sha1 and other things from the content row too.

We could try to get these with extra joins when we get revision batches (in the case of adds/changes dumps) or revisions for a page (in the case of stubs dumps). I'm not sure how much speed we can get back from that approach, nor am I sure what indices or other fancy tricks we might need or be able to get away with on the vslow/dumps hosts.

Adding @Marostegui per IRC conversation, we'll hopefully chat Thursday about possible approaches.

Without taking a deep look, I would like to comment on this specific approach:

nor am I sure what indices or other fancy tricks we might need or be able to get away with on the vslow/dumps hosts.

If we end up having to add indexes, we should add them to all the hosts, not only the vslow/dump ones, as otherwise we'd end up with even more drift (right now we have special slaves (recentchanges, watchlist, ...) and normal slaves). Those are already a pain to maintain, as they have different PKs, indexes and partitions for certain tables, so having even more snowflakes would not be ideal.
Again, I haven't taken a deep look at this, just commenting on that specific point.

I had a chat with @ArielGlenn about this and about the approach for the slow queries mentioned earlier on the task (T220493#5175963).
It looks like we have concluded that we want to explore three approaches here; some of them may be doable and some may not, but they are worth describing.

1- Ideally the first one would be to tune the existing queries, to see if the information can be retrieved in some other (faster) way. We would like to see if @daniel or @Anomie could help here and see if there is some way to retrieve the information needed from all the involved tables (revision, slots, content, etc.) differently that would make it faster. And further, whether there is more normalization still to be applied to the revision table, in order to avoid changing things now that will need to be changed again :-)
Can you think of a better way to obtain this data with the current set of tables (especially knowing the future of all this refactoring)?

2- Another approach we have thought of (that still requires testing) would be to avoid doing JOINs at the MySQL level and do them from code instead: i.e., grab the revision IDs into a list, use those to query the slots and content tables, and merge all the obtained information, doing that in small batches (batch size to be defined).

3- Lastly, we could try to add some indexes, after testing which ones would be suitable and which would speed the queries up.
This approach is listed last because it would take quite a while to deploy once the right indexes are found (it means deploying the index to all the hosts, on all the schemas; of course dump hosts would be prioritized, but in the end it needs to arrive everywhere, to avoid snowflakes).

Anomie added a subscriber: BPirkle.May 17 2019, 2:35 PM

The queries quoted in this task seem to be as optimized as they can be, and the needed indexes seem to already be in place. #2 is probably the answer: fetch a batch of revisions, then in PHP collect the revision IDs from the batch and do SELECT slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model FROM `slots` JOIN `content` ON ((slot_content_id = content_id)) WHERE slot_revision_id IN (...) to fetch the slot data all at once and merge that in somehow. Then probably collect the content addresses and use ExternalStore::batchFetchFromURLs() to fetch the actual contents for all those slots in a batch as well.

As for how to make that sanely doable in the context of RevisionStore and related classes (rather than raw SQL queries), I'd suggest some sort of $revisionStore->batchFetchSlotData( RevisionRecord[] $revisions ) method that would do the batch queries as described and populate the slot data into the passed RevisionRecord objects. But I'll leave the decision and details to @daniel and @BPirkle, as they're the ones currently working on MCR.

@Anomie while you're here, do you know if there are further plans to move fields out of the revision table in the short/near term or otherwise to refactor its schema?

Change 511126 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] WIP: batch slot row access

https://gerrit.wikimedia.org/r/511126

Please have a look at the WIP patch linked above. My approach there is to join in the slots and content tables directly instead of doing a second query for each batch to fetch them. The code would not have to change much to adopt that approach, though.

Which one is likely to be better? And does the rest of the code make sense to you?

Just been waiting for the go-ahead before having a look. I'll try a few different things and come back to you in a day or two!

@Anomie while you're here, do you know if there are further plans to move fields out of the revision table in the short/near term or otherwise to refactor its schema?

I'm aware of three kinds of things being removed from revision:

  • content related stuff (rev_content_model, rev_content_format)
  • actor related stuff (rev_user_text)
  • comment related stuff (rev_comment)

I'm not aware of further changes being in the pipeline.

@daniel Any new information that we'll have for slots/content that we need to include, besides length, content model/stuff, and sha1?

daniel added a comment.EditedMay 20 2019, 4:12 PM

@daniel Any new information that we'll have for slots/content that we need to include, besides length, content model/stuff, and sha1?

The slot role.

No need to join against the slot_roles table though, the lookup of the id can and should be done programmatically.

Ah right. OK, that means we shouldn't have to refactor any queries once we get through the actor and comment migration and make sure things aren't too slow.

daniel claimed this task.May 21 2019, 12:45 PM

Hm, my experimental patch makes CI tests time out. I must have broken something, but I really don't see what. Running the "relevant" tests locally, everything seems to be doing fine.

Preliminary testing shows that the above patch is slower than the existing code. It could be there's something wrong with my testing. Investigating.

Preliminary testing shows that the above patch is slower than the existing code. It could be there's something wrong with my testing. Investigating.

Since it's not really clear what's causing the slowness, it's quite possible that I made it worse - I minimized the number of queries, but added two joins and increased the amount of data to be transferred. I'll follow up with a patch that does batching.

ArielGlenn added a comment.EditedMay 29 2019, 8:07 AM

I'm going to be looking at the run-time query plan of the queries executed with the new patch, and that might shed some light on things too.

I don't see the expected joins on slots and content, so that already raises a red flag. Let me check further in depth.

OK, got a known good test. Timings without logging:

dumpsgen@snapshot1005:/srv/mediawiki$ time /usr/bin/php7.2 /srv/mediawiki_mcrtest/multiversion/MWScript.php dumpBackup.php --wiki=wikidatawiki --stub --quiet --output=gzip:/mnt/dumpsdata/temp/dumpsgen/testoutput/wikidata-incr-stubs-mcr.xml.gz --revrange --revstart=928911884 --revend=929992568
real    41m2.240s
user    20m7.668s
sys     2m15.840s
dumpsgen@snapshot1005:/srv/mediawiki$ time /usr/bin/php7.2 /srv/mediawiki_mcrtest/multiversion/MWScript.php dumpBackup.php --wiki=wikidatawiki --stub --quiet --output=gzip:/mnt/dumpsdata/temp/dumpsgen/testoutput/wikidata-incr-stubs-mcr.xml.gz --revrange --revstart=928911884 --revend=929992568
real    31m33.165s
user    19m8.560s
sys     2m7.560s
dumpsgen@snapshot1005:/srv/mediawiki$ time /usr/bin/php7.2 /srv/mediawiki_mcrtest/multiversion/MWScript.php dumpBackup.php --wiki=wikidatawiki --stub --quiet --output=gzip:/mnt/dumpsdata/temp/dumpsgen/testoutput/wikidata-incr-stubs-mcr.xml.gz --revrange --revstart=928911884 --revend=929992568
real    34m41.707s
user    19m41.676s
sys     2m11.244s

So these are slower than the original run, where the queries now executed are

query: SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,temp_rev_user.revactor_actor AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model,page_restrictions  FROM `page` JOIN `revision` ON ((page_id = rev_page)) JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `slots` ON ((slot_revision_id = rev_id)) JOIN `content` ON ((content_id = slot_content_id))   WHERE (rev_id >= 928911884 AND rev_id < 929992568) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000  

query: SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,temp_rev_user.revactor_actor AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model,page_restrictions  FROM `page` JOIN `revision` ON ((page_id = rev_page)) JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `slots` ON ((slot_revision_id = rev_id)) JOIN `content` ON ((content_id = slot_content_id))   WHERE (rev_id >= 928911884 AND rev_id < 929992568) AND (rev_page>1473035 OR (rev_page=1473035 AND rev_id>929047004))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000

I guess we should try new patches next with the slot/content batching broken out into a second query and see what that looks like.

So my testing with the separate slot query looks like it's pretty slow too. I'm going to dig in some with profiling on a local install and see whether there's any insight to be had.

daniel added a comment.Jun 4 2019, 6:48 PM

So my testing with the separate slot query looks like it's pretty slow too. I'm going to dig in some with profiling on a local install and see whether there's any insight to be had.

Ooohhh... I think I know what the problem may be. Even with batched queries for slots and content, we are still generating one query per revision to retrieve a row from the text table before querying ExternalStore. This would be fixed by T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production), but that's too far out.

Perhaps a batch (pre)load interface for BlobStore would be in order...
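One possible shape for such an interface, sketched in Python (class and method names are invented, not the actual MediaWiki BlobStore API): preload a batch of blob addresses in one backend round trip, then serve individual lookups from the cache:

```python
class BatchingBlobStore:
    """Sketch of a blob store with a batch (pre)load interface."""

    def __init__(self, fetch_batch):
        # fetch_batch: callable mapping a list of addresses to {addr: blob}
        self._fetch_batch = fetch_batch
        self._cache = {}

    def preload(self, addresses):
        missing = [a for a in addresses if a not in self._cache]
        if missing:
            self._cache.update(self._fetch_batch(missing))

    def get_blob(self, address):
        if address not in self._cache:
            self.preload([address])  # fall back to a single-item batch
        return self._cache[address]

# Fake backend that records how many round trips were made.
calls = []
def fake_backend(addrs):
    calls.append(list(addrs))
    return {a: f"blob:{a}" for a in addrs}

store = BatchingBlobStore(fake_backend)
store.preload(["tt:805", "tt:806"])
assert store.get_blob("tt:805") == "blob:tt:805"
assert store.get_blob("tt:806") == "blob:tt:806"
assert calls == [["tt:805", "tt:806"]]  # one round trip, not two
```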

Good catch, that might be possible. I'm not ready to give profiling results, been updating a large wiki to the latest branch on a local install and it's taking quite some time. Tomorrow it should be ready for fresh stats though!

daniel added a comment.Jun 4 2019, 8:23 PM

If my suspicion is correct, adding batching to BlobStore should get its own subtask. Just create a stub and drop it into the Core Platform Team inbox, I'll flesh it out and prioritize it.

How urgent would you say this is?

If my suspicion is correct, adding batching to BlobStore should get its own subtask. Just create a stub and drop it into the Core Platform Team inbox, I'll flesh it out and prioritize it.
How urgent would you say this is?

Getting the speed issue fixed is something that we need very soon; we're already at the time limit on wikidata for runs now because of slow stubs/abstracts. But I'd hold off on more patches for a couple of days until profiling gives some clear information one way or another.

ArielGlenn added a comment.EditedJun 5 2019, 2:24 PM

I've found a few things; see diff between patchsets 10 and 11.

We set $rev already once in XmlDumpWriter::writeRevision, taking care to provide the slots to the constructor, so we don't want to initialize $rev again later; I've commented out two such instances.
The new $tables setting was overridden in the stubs stanza, breaking stubs dumps, so I've commented that out as well.

The remaining big spender seems to be calls to LinkCache::fetchPageRow (which eventually do a select for page_id,page_len,page_is_redirect,page_latest,page_restrictions,page_content_model FROM page) and these originate with Title::getArticleID. MediaWiki\Revision\RevisionStore::newRevisionFromRowAndSlots winds its way down to getArticleID eventually, and it shouldn't. We've already got rev_page in the rev row, so somehow that ought to be made available to methods further down the line. Any thoughts?

After that fix, I'd like to reprofile everything and do timing tests to see where we are at.

daniel added a comment.Jun 6 2019, 9:26 AM

@ArielGlenn I updated the patch based on your comments. I found the issue with Title::getArticleID as well. And I rebased the follow-up patch, so you can compare the two as well.

Batching blob access will probably provide a further boost, if you think it would be worthwhile, please create a ticket.

Thanks! Gotcha about the blob batching, I'll open a task if needed.

The Title::getArticleID issue is elsewhere in the code (though the one you found is a good catch). It's:

XmlDumpWriter::writeRevision ->
MediaWiki\Revision\RevisionStore::newRevisionFromRowAndSlots ->
MediaWiki\Revision\RevisionStoreRecord::__construct ->
MediaWiki\Revision\RevisionRecord::__construct ->
Title::getArticleID

separate from the openPage workflow. Any ideas what the right fix is there?

The Title::getArticleID issue is elsewhere in the code (though the one you found is a good catch).

Unless I messed up, the code path you describe will still hit Title::getArticleID, but that Title object will already have the ID set and will not hit the database, because it has been constructed via newFromRow instead of makeTitle. I'm about to leave the house and won't be back properly until Tuesday, so I can't double check myself. Let me know if I got it wrong.

I got the above via profiling, and noticed that Title::getArticleID -> LinkCache::addLinkObj -> LinkCache::fetchPageRow -> Wikimedia\Rdbms\Database::selectRow and then there we are. I can verify this in a test directory on one of the snapshot hosts and drop a stack trace here if that will help.

daniel added a comment.EditedJun 6 2019, 6:14 PM

Are you sure that still happens with the latest version of the code that uses newFromRow() in openPage()? With the old code in openPage() using makeTitle(), I confirmed the problem you describe. But it should be gone now.

getArticleID() only calls LinkCache::addLinkObj() if $this->mArticleID == -1, and that should no longer be the case, thanks to the page_id being present in the row passed to newFromRow(). Stepping through the code (launched via backup_PageTest) seems to confirm that LinkCache::addLinkObj is no longer triggered from writeRevision().
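The makeTitle()/newFromRow() distinction can be sketched like this (illustrative Python, not the real Title/LinkCache code; the sentinel value and counter are stand-ins): an object built from a row already carries its page ID, so the lazy lookup never fires:

```python
class TitleSketch:
    """Sketch of Title's lazy article-ID lookup (-1 means 'not loaded')."""

    db_lookups = 0  # class-wide counter standing in for LinkCache queries

    def __init__(self, name, article_id=-1):
        self.name = name
        self._article_id = article_id

    @classmethod
    def make_title(cls, name):
        return cls(name)  # ID unknown: first access hits the "database"

    @classmethod
    def new_from_row(cls, row):
        return cls(row["page_title"], row["page_id"])  # ID already known

    def get_article_id(self):
        if self._article_id == -1:
            TitleSketch.db_lookups += 1  # simulated LinkCache::addLinkObj hit
            self._article_id = 42        # pretend the DB returned 42
        return self._article_id

lazy = TitleSketch.make_title("Foo")
eager = TitleSketch.new_from_row({"page_title": "Foo", "page_id": 42})
assert eager.get_article_id() == 42 and TitleSketch.db_lookups == 0
assert lazy.get_article_id() == 42 and TitleSketch.db_lookups == 1
```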

Change 513073 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Query slots in a separate query when dumping XML

https://gerrit.wikimedia.org/r/513073

Hmm, fewer of those but rather a lot of
SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecords */ slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model FROM slots JOIN content ON ((slot_content_id = content_id)) WHERE slot_revision_id =...

I need to look at this again tomorrow when I'm a little more fresh though. Will update when I have verified info (and maybe another fix).

daniel added a comment.EditedJun 13 2019, 6:24 PM

Hmm, fewer of those but rather a lot of
SELECT /* MediaWiki\Revision\RevisionStore::loadSlotRecords */ slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model FROM slots JOIN content ON ((slot_content_id = content_id)) WHERE slot_revision_id =...
I need to look at this again tomorrow when I'm a little more fresh though. Will update when I have verified info (and maybe another fix).

That's odd, do you have a stack trace for that? Running backup_PageTest.php with the patch applied, I'm not hitting RevisionStore::loadSlotRecords even once.

I've been a bit sandbagged this week which is why you've seen neither a stack trace nor confirmation that I didn't screw up the testing. This weekend, or Monday at the earliest.

I've been a bit sandbagged this week which is why you've seen neither a stack trace nor confirmation that I didn't screw up the testing. This weekend, or Monday at the earliest.

No rush from my side ;)

Note that I'll be at the CPT offsite next week, though.

Cherry picking https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/511126/ and running with the setting

$wgMultiContentRevisionSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD;

netted me an exception. Here's the details (with some formatting by me):

Wikimedia\Rdbms\DBQueryError from line 1586 of /home/system/www/html/elwikt/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading?
Query:
SELECT  rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,
        comment_rev_comment.comment_text AS `rev_comment_text`,
        comment_rev_comment.comment_data AS `rev_comment_data`,
        comment_rev_comment.comment_id AS `rev_comment_cid`,
        actor_rev_user.actor_user AS `rev_user`,
        actor_rev_user.actor_name AS `rev_user_text`,
        temp_rev_user.revactor_actor AS `rev_actor`,
        rev_text_id,rev_content_format,rev_content_model,
        page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,
        slots.rev_id AS `slot_revision_id`,
        NULL AS `slot_content_id`,
        slots.rev_id AS `slot_origin`,
        'main' AS `role_name`,
        slots.rev_len AS `content_size`,
        slots.rev_sha1 AS `content_sha1`,
        CONCAT('tt:',slots.rev_text_id) AS `content_address`,
        slots.rev_text_id AS `rev_text_id`,
        slots.rev_content_model AS `model_name`,
        page_restrictions,1 AS `_load_content`
FROM `page`
JOIN `revision` ON ((page_id=rev_page AND page_latest=rev_id))
JOIN `revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id))
JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id))
JOIN `revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id))
JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor))

WHERE (page_id >= 1) AND (rev_page>0 OR (rev_page=0 AND rev_id>0))  ORDER BY page_id ASC LIMIT 50000

Function: WikiExporter::dumpPages
Error: 1054 Unknown column 'slots.rev_id' in 'field list' (localhost)
#0 /home/system/www/html/elwikt/includes/libs/rdbms/database/Database.php(1554): Wikimedia\Rdbms\Database->getQueryExceptionAndLog('Unknown column ...', 1054, 'SELECT  rev_id,...', 'WikiExporter:0:d...')
#1 /home/system/www/html/elwikt/includes/libs/rdbms/database/Database.php(1274): Wikimedia\Rdbms\Database->reportQueryError('Unknown column ...', 1054, 'SELECT  rev_id,...', 'WikiExporter::d...', false)
#2 /home/system/www/html/elwikt/includes/libs/rdbms/database/Database.php(1782): Wikimedia\Rdbms\Database->query('SELECT  rev_id,...', 'WikiExporter::d...')
#3 /home/system/www/html/elwikt/includes/export/WikiExporter.php(458): Wikimedia\Rdbms\Database->select(Array, Array, Array, 'WikiExporter::d...', Array, Array)
#4 /home/system/www/html/elwikt/includes/export/WikiExporter.php(275): WikiExporter->dumpPages('page_id >= 1', false)
#5 /home/system/www/html/elwikt/includes/export/WikiExporter.php(160): WikiExporter->dumpFrom('page_id >= 1', false)
#6 /home/system/www/html/elwikt/maintenance/includes/BackupDumper.php(289): WikiExporter->pagesByRange(1, 0, false)
#7 /home/system/www/html/elwikt/maintenance/dumpBackup.php(84): BackupDumper->dump(2, 0)
#8 /home/system/www/html/elwikt/maintenance/doMaintenance.php(99): DumpBackup->execute()
#9 /home/system/www/html/elwikt/maintenance/dumpBackup.php(144): require_once('/home/system/ww...')
#10 {main}

I see this for any of abstracts, stubs with full history, or revision range dumps on a local install.

Testing with 1.34.0-wmf8 + cherry-pick.

Note that with READ_NEW I was able to test for speed improvements and everything looks good now, as mentioned on the patchset.

Change 511126 merged by jenkins-bot:
[mediawiki/core@master] Join slot and content tables when dumping XML

https://gerrit.wikimedia.org/r/511126

Change 519953 had a related patch set uploaded (by ArielGlenn; owner: Daniel Kinzler):
[mediawiki/core@wmf/1.34.0-wmf.11] Join slot and content tables when dumping XML

https://gerrit.wikimedia.org/r/519953

Change 519953 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.11] Join slot and content tables when dumping XML

https://gerrit.wikimedia.org/r/519953

Mentioned in SAL (#wikimedia-operations) [2019-07-01T11:24:23Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.11/includes/: SWAT: [[:gerrit:519953|Join slot and content tables when dumping XML]] (T220493) (duration: 01m 14s)

Mentioned in SAL (#wikimedia-operations) [2019-07-01T11:27:19Z] <apergos> urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/includes/: SWAT: [[:gerrit:519953|Join slot and content tables when dumping XML]] (T220493) (duration: 01m 14s)

There may be more optimizations that can be done in the future (text row batching, determine whether getting slot batches separately from the main query is faster), but the gains from the merged patchset are already big, as seen on this month's run. We started 17 hours late but stubs are done for small/big wikis and are finishing up for enwiki and wikidatawiki. Closing this ticket. Thanks, everybody!

ArielGlenn closed this task as Resolved.Jul 3 2019, 6:01 AM
ArielGlenn moved this task from Active to Done on the Dumps-Generation board.