
Flow: Handle missing content when dumping
Closed, Resolved (Public)

Description

Due to T95580 (Flow data missing on Wikimedia production wikis), content can be missing (mainly or solely on test wikis). This needs to be handled when dumping.

ArielGlenn had this happen on testwiki (T89398: Add Flow to database dumps):

command /usr/bin/php5 -q /srv/mediawiki/multiversion/MWScript.php extensions/Flow/maintenance/dumpBackup.php --wiki=testwiki --current --report=1000 --output=bzip2:/mnt/data/xmldatadumps/public/testwiki/20160601/testwiki-20160601-flow.xml.bz2 (21805) started...
[effb0c25c66162949d5acb27] [no req]   Flow\Exception\InvalidDataException from line 366 of /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Model/AbstractRevision.php: Failed to load the content
Backtrace:
#0 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(395): Flow\Model\AbstractRevision->getContent(string)
#1 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(360): Flow\Dump\Exporter->formatRevision(Flow\Model\PostRevision)
#2 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(255): Flow\Dump\Exporter->formatRevisions(Flow\Model\PostRevision)
#3 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(213): Flow\Dump\Exporter->formatPost(Flow\Model\PostRevision)
#4 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(196): Flow\Dump\Exporter->formatTopic(Flow\Model\PostRevision)
#5 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(174): Flow\Dump\Exporter->formatWorkflow(Flow\Model\Workflow, Flow\Search\Iterators\HeaderIterator, Flow\Search\Iterators\TopicIterator)
#6 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/maintenance/dumpBackup.php(85): Flow\Dump\Exporter->dump(BatchRowIterator)
#7 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/maintenance/dumpBackup.php(58): FlowDumpBackup->dump(integer)
#8 /srv/mediawiki/php-1.28.0-wmf.7/maintenance/doMaintenance.php(103): FlowDumpBackup->execute()
#9 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/maintenance/dumpBackup.php(124): require_once(string)
#10 /srv/mediawiki/multiversion/MWScript.php(97): require_once(string)
#11 {main}

Event Timeline

Mattflaschen-WMF renamed this task from Handle missing content when dumping to Flow: Handle missing content when dumping. Jul 8 2016, 8:22 PM
Mattflaschen-WMF updated the task description.
Hydriz triaged this task as High priority. Aug 16 2016, 11:21 AM
Hydriz added a subscriber: Hydriz.

Is there any progress on this? The past few dumps of testwiki have been failing since dumping of Flow data was introduced.

mysql> SELECT wiki,dumpdate,progress FROM dumps WHERE wiki="testwiki";
+----------+------------+----------+
| wiki     | dumpdate   | progress |
+----------+------------+----------+
| testwiki | 2014-08-13 | done     |
| testwiki | 2014-09-02 | done     |
| testwiki | 2014-09-26 | done     |
| testwiki | 2014-10-24 | done     |
| testwiki | 2014-11-21 | done     |
| testwiki | 2014-12-20 | done     |
| testwiki | 2015-02-16 | done     |
| testwiki | 2015-03-12 | done     |
| testwiki | 2015-04-05 | done     |
| testwiki | 2015-04-30 | done     |
| testwiki | 2015-05-16 | error    |
| testwiki | 2015-06-02 | done     |
| testwiki | 2015-07-02 | done     |
| testwiki | 2015-08-05 | done     |
| testwiki | 2015-08-26 | done     |
| testwiki | 2015-09-01 | done     |
| testwiki | 2015-10-02 | done     |
| testwiki | 2015-10-20 | done     |
| testwiki | 2015-11-02 | done     |
| testwiki | 2015-11-23 | done     |
| testwiki | 2015-12-01 | done     |
| testwiki | 2015-12-26 | done     |
| testwiki | 2016-01-11 | done     |
| testwiki | 2016-02-03 | done     |
| testwiki | 2016-03-05 | done     |
| testwiki | 2016-04-07 | done     |
| testwiki | 2016-05-01 | done     |
| testwiki | 2016-06-01 | error    |
| testwiki | 2016-07-01 | error    |
| testwiki | 2016-07-20 | error    |
| testwiki | 2016-08-01 | error    |
+----------+------------+----------+
31 rows in set (0.00 sec)

No, but only the Flow dumps are failing. Everything else is still being dumped as normal.

Change 376478 had a related patch set uploaded (by Mattflaschen; owner: Mattflaschen):
[mediawiki/extensions/Flow@master] Move handling for missing post content to lower level

https://gerrit.wikimedia.org/r/376478

^ That is not a complete solution. Although it should not trigger an error, the dump XML files should include whether this case occurred (probably in an attribute), so consumers of the dumps can know.

In revision content dumps for non-flow pages, we have the attribute "deleted" which is given for revision content that is gone after the stub is produced. Perhaps something like that could be used here. See https://www.mediawiki.org/xml/export-0.10.xsd for how it's used.
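
To make the suggestion above concrete, here is a minimal sketch in Python (not Flow's actual PHP exporter) of flagging missing content with an attribute instead of aborting the dump. Everything in it is an assumption for illustration: `MissingContentError` stands in for Flow's `InvalidDataException`, the `board` and `revision` element names are made up, and reusing `deleted="deleted"` simply mirrors the flag from the core export schema; the real Flow dump format may call for something different.

```python
import xml.etree.ElementTree as ET


class MissingContentError(Exception):
    """Hypothetical stand-in for Flow's InvalidDataException."""


def format_revision(rev_id, load_content):
    """Build a <revision> element; flag missing content instead of raising."""
    elem = ET.Element("revision", {"id": str(rev_id)})
    try:
        elem.text = load_content(rev_id)
    except MissingContentError:
        # Assumption: borrow the "deleted" flag idea from the core export schema
        # so dump consumers can tell that the content was unavailable.
        elem.set("deleted", "deleted")
    return elem


def dump(rev_ids, load_content):
    """Serialize all revisions; one missing revision no longer aborts the dump."""
    root = ET.Element("board")  # hypothetical root element name
    for rev_id in rev_ids:
        root.append(format_revision(rev_id, load_content))
    return ET.tostring(root, encoding="unicode")
```

With this shape, a consumer can distinguish "content missing" from "content empty" by checking for the attribute, which is the information the merged patch alone does not expose.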

Change 376564 had a related patch set uploaded (by Chad; owner: Mattflaschen):
[mediawiki/extensions/Flow@wmf/1.30.0-wmf.17] Move handling for missing post content to lower level

https://gerrit.wikimedia.org/r/376564

Change 376478 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Move handling for missing post content to lower level

https://gerrit.wikimedia.org/r/376478

Change 376564 abandoned by Catrope:
Move handling for missing post content to lower level

Reason:
wmf.17 has been superseded

https://gerrit.wikimedia.org/r/376564

So. @Catrope has generously volunteered to whack away at this. Thanks! I have tried to unbitrot my stubs patches. They should be OK; running locally, they DTRT, except of course that they retrieve full revision content when they shouldn't. If you run into trouble, please holler.

To test, here is the 'stubs' (metadata-only dump) command you would run (adjust for your own paths, versions, and so on); no Python dump scripts are needed:
/usr/bin/php5 /srv/mediawiki/multiversion/MWScript.php extensions/Flow/maintenance/dumpBackup.php --wiki=mediawikiwiki --report=1000 --output=gzip:/home/catrope/mediawikiwik-flowstubshistory.xml.gz --stub --full

Matt's patch for content vs metadata retrieval that was merged is here: https://gerrit.wikimedia.org/r/#/c/376478/ but that only eliminates some of the preloading cases.
Once stub dumps can run without preloading content, we should check the performance impact of the new patch before it is merged.

OK, that was on the wrong flow ticket. Moving it to the right one now.

Pppery added a subscriber: Pppery.

Patch was merged. Can this be closed as resolved?

Flow dumps have been running for the past year without incident so let's call it good.