Page MenuHomePhabricator

Replace the WikiExporter backup dump streaming mode with batched queries
Closed, ResolvedPublic

Description

dumpBackup.php uses WikiExporter's streaming mode, which is to say, it sets unbuffered mode on a database connection and attempts to get all the data it needs in a single enormous SQL query. I propose breaking up this enormous query into batches and using the buffered mode.

Using the unbuffered mode is fragile and overly complicated from a development perspective. It is presumably not portable across all DBMSs. It makes the code difficult to test due to the way it interacts with PHPUnit database setup. Recently, MediaWikiTestCase::prepareConnectionForTesting() was added, solely to support testing of the backup code. I think that method is a barrier to refactoring of MediaWikiTestCase's database handling.

Using unbuffered mode may improve performance slightly. After the script consumes a row, the DB server is presumably free to do I/O work to produce the next row while PHP is busy processing the row. In production, this advantage only applies to stub dumps. Most of the time to generate dumps is spent in dumpTextPass.php which does not use unbuffered mode. So at best, unbuffered mode provides only a small improvement, while other proposed changes to the dump infrastructure, such as parallel or incremental dump generation, could improve the speed by orders of magnitude.

Unbuffered mode is an outdated style. In early versions of MediaWiki, we used it often, but the implementation complexities led to it being removed everywhere except WikiExporter and a MediaWiki 1.5 upgrade script called convertLinks.php. With this change, we will be very close to being able to remove it completely.

My proposal in detail:

  • Refactor dumpFrom(), split log dumps out to a separate function
  • In the page dump section, be consistent about rev_page/rev_id ordering
  • The start conditions for each batch can be: rev_page>$page or (rev_page=$page AND rev_id>$id)
  • Keep track of the number of revisions seen. Respect the $this->history['limit'] parameter by adjusting the batch size downwards if necessary.
  • The ModifyExportQuery hook will apparently work fine with batched queries

Related Objects

StatusSubtypeAssignedTask
Declineddchen
OpenNone
OpenNone
DuplicateNone
OpenFeatureNone
OpenBUG REPORTNone
OpenNone
StalledNone
OpenFeatureNone
DuplicateNone
ResolvedNone
OpenNone
OpenNone
OpenFeatureNone
OpenNone
ResolvedNone
ResolvedNone
OpenFeatureNone
OpenNone
OpenFeatureNone
StalledNone
OpenNone
InvalidNone
StalledNone
ResolvedAnomie
Resolveddaniel
ResolvedAnomie
ResolvedAnomie
OpenNone
Resolved Bstorm
ResolvedNone
ResolvedZabe
Resolveddaniel
Resolveddaniel
Resolved Clarakosi
ResolvedNone
ResolvedNone
ResolvedAnomie
ResolvedNone
ResolvedCCicalese_WMF
Resolved Pchelolo
Resolved Pchelolo
Resolved Pchelolo
ResolvedCCicalese_WMF
Resolveddaniel
Resolved Pchelolo
ResolvedNone
ResolvedNone
Resolveddaniel
Resolveddaniel
ResolvedBPirkle

Event Timeline

I notice that WikiExporter::STREAM is also used in SpecialExport.php, for the Special:Export page, which provides similar functionality via the UI as dumpBackup.php does via the command line. Removing unbuffered mode from WikiExporter.php will necessarily (and happily) remove this as well, with only a trivial change to SpecialExport.php.

The dumpFrom() function (which will be split up per the task description) potentially sorts the db results by rev_timestamp in DESC order. Using this option would require callers to pass an array instead of a constant as the $history parameter to the WikiExporter constructor, which is inconsistent.

This sorting option complicates the batch conditions. While I could make it work, I don't see anywhere that functionality is used, either in core code or by any code in the extensions repository. In all cases I could find, $history is passed as a constant rather than as an array. If this functionality is indeed unused, can I take this opportunity to simplify WikiExporter by removing that entirely and recognizing only constants as valid values for $history?

FWIW, the code relevant to this sorting (search for "rev_timestamp DESC" in WikiExporter.php) has been around since the initial commit in 2015. So it wasn't recently added in support of new incoming functionality or anything of that nature.

Nevermind, I spoke too soon. While I didn't see a way to invoke the sorting code via the UI, I found https://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export, which describes using curl to invoke it manually or via script. So that seems to be an officially supported usage.

I guess I do need to work out the batching in combination with the sorting.

Change 459885 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Replace WikiExporter streaming (unbuffered) mode with batched queries

https://gerrit.wikimedia.org/r/459885

Change 460349 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[operations/mediawiki-config@master] Add export logging channel

https://gerrit.wikimedia.org/r/460349

Change 460351 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/Flow@master] Minor cleanup to match changes to WikiExporter

https://gerrit.wikimedia.org/r/460351

Change 461647 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Logging related to WikiExport cleanup and changes in T203424

https://gerrit.wikimedia.org/r/461647

Change 460349 merged by jenkins-bot:
[operations/mediawiki-config@master] Add export logging channel

https://gerrit.wikimedia.org/r/460349

Change 461647 merged by jenkins-bot:
[mediawiki/core@master] Logging related to WikiExport cleanup and changes in T203424

https://gerrit.wikimedia.org/r/461647

Change 462844 had a related patch set uploaded (by Tim Starling; owner: BPirkle):
[mediawiki/core@wmf/1.32.0-wmf.23] [1.32.0-wmf.23] Logging related to WikiExport cleanup and changes in T203424

https://gerrit.wikimedia.org/r/462844

Change 462844 merged by jenkins-bot:
[mediawiki/core@wmf/1.32.0-wmf.23] [1.32.0-wmf.23] Logging related to WikiExport cleanup and changes in T203424

https://gerrit.wikimedia.org/r/462844

Regarding possible cleanup of the $history parameter, logging shows that in the (roughly) 18 hours preceding the posting of this comment, Special:Export has received 57,986 POST requests across all logged wikis (including enwiki). Of those, 17 have included the sorting option (specifically, a 'desc' value for the "dir" value of $history).

Change 459885 merged by jenkins-bot:
[mediawiki/core@master] Replace WikiExporter streaming (unbuffered) mode with batched queries

https://gerrit.wikimedia.org/r/459885

Change 460351 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Minor cleanup to match changes to WikiExporter

https://gerrit.wikimedia.org/r/460351

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/459885/ seems to have broken Flow dumps, which use streams:

2018-10-03 11:23:53 failed history content of flow pages in xml format
[48f79972c1cea21e104aa6a5] [no req] Error from line 82 of /srv/mediawiki/php-1.32.0-wmf.24/extensions/Flow/maintenance/dumpBackup.php: Undefined class constant 'STREAM' 
Backtrace: 
#0 /srv/mediawiki/php-1.32.0-wmf.24/extensions/Flow/maintenance/dumpBackup.php(62): FlowDumpBackup->dump(integer) 
#1 /srv/mediawiki/php-1.32.0-wmf.24/maintenance/doMaintenance.php(94): FlowDumpBackup->execute() 
#2 /srv/mediawiki/php-1.32.0-wmf.24/extensions/Flow/maintenance/dumpBackup.php(144): require_once(string) 
#3 /srv/mediawiki/multiversion/MWScript.php(100): require_once(string) 
#4 {main}

I guess it needs https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Flow/+/460351/ to go out to wmf/1.32.0-wmf.24 as well.

Change 464167 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/extensions/Flow@wmf/1.32.0-wmf.24] Minor cleanup to match changes to WikiExporter

https://gerrit.wikimedia.org/r/464167

Change 464167 merged by jenkins-bot:
[mediawiki/extensions/Flow@wmf/1.32.0-wmf.24] Minor cleanup to match changes to WikiExporter

https://gerrit.wikimedia.org/r/464167

Mentioned in SAL (#wikimedia-operations) [2018-10-03T16:24:16Z] <reedy@deploy1001> Synchronized php-1.32.0-wmf.24/extensions/Flow/: fixup flow exporting T203424 (duration: 01m 03s)

That fixed the problem, thanks!