
Add Flow to database dumps
Closed, Resolved · Public

Description

We need to include Flow in the database dumps (see architecture notes at T93925: T5. Determine database dump architecture and file format (spike)).

In the current state, I get an exception when doing a full-wiki import that contains Flow pages:

PHP Notice:  LoadBalancer::rollbackMasterChanges: Flushing an explicit transaction, getting out of sync! [Called from DatabaseBase::rollback in /vagrant/mediawiki/includes/db/Database.php at line 3739] in /vagrant/mediawiki/includes/debug/MWDebug.php on line 300
PHP Stack trace:
PHP   1. MWExceptionHandler::handleException() /vagrant/mediawiki/includes/exception/MWExceptionHandler.php:0
PHP   2. MWExceptionHandler::rollbackMasterChangesAndLog() /vagrant/mediawiki/includes/exception/MWExceptionHandler.php:156
PHP   3. LBFactory->rollbackMasterChanges() /vagrant/mediawiki/includes/exception/MWExceptionHandler.php:135
PHP   4. LBFactory->forEachLBCallMethod() /vagrant/mediawiki/includes/db/LBFactory.php:192
PHP   5. LBFactorySimple->forEachLB() /vagrant/mediawiki/includes/db/LBFactory.php:177
PHP   6. call_user_func_array() /vagrant/mediawiki/includes/db/LBFactory.php:321
PHP   7. LBFactory->{closure:/vagrant/mediawiki/includes/db/LBFactory.php:175-177}() /vagrant/mediawiki/includes/db/LBFactory.php:321
PHP   8. call_user_func_array() /vagrant/mediawiki/includes/db/LBFactory.php:176
PHP   9. LoadBalancer->rollbackMasterChanges() /vagrant/mediawiki/includes/db/LBFactory.php:176
PHP  10. DatabaseBase->rollback() /vagrant/mediawiki/includes/db/LoadBalancer.php:1003
PHP  11. wfWarn() /vagrant/mediawiki/includes/db/Database.php:3739
PHP  12. MWDebug::warning() /vagrant/mediawiki/includes/GlobalFunctions.php:1231
PHP  13. MWDebug::sendMessage() /vagrant/mediawiki/includes/debug/MWDebug.php:155
PHP  14. trigger_error() /vagrant/mediawiki/includes/debug/MWDebug.php:300

Notice: LoadBalancer::rollbackMasterChanges: Flushing an explicit transaction, getting out of sync! [Called from DatabaseBase::rollback in /vagrant/mediawiki/includes/db/Database.php at line 3739] in /vagrant/mediawiki/includes/debug/MWDebug.php on line 300

Call Stack:
 2039.4592   41394384   1. MWExceptionHandler::handleException() /vagrant/mediawiki/includes/exception/MWExceptionHandler.php:0
 2039.4592   41394480   2. MWExceptionHandler::rollbackMasterChangesAndLog() /vagrant/mediawiki/includes/exception/MWExceptionHandler.php:156
 2039.6848   41395592   3. LBFactory->rollbackMasterChanges() /vagrant/mediawiki/includes/exception/MWExceptionHandler.php:135
 2039.6848   41395712   4. LBFactory->forEachLBCallMethod() /vagrant/mediawiki/includes/db/LBFactory.php:192
 2039.6848   41396600   5. LBFactorySimple->forEachLB() /vagrant/mediawiki/includes/db/LBFactory.php:177
 2039.6849   41397128   6. call_user_func_array() /vagrant/mediawiki/includes/db/LBFactory.php:321
 2039.6849   41397200   7. LBFactory->{closure:/vagrant/mediawiki/includes/db/LBFactory.php:175-177}() /vagrant/mediawiki/includes/db/LBFactory.php:321
 2039.6849   41397640   8. call_user_func_array() /vagrant/mediawiki/includes/db/LBFactory.php:176
 2039.6849   41397904   9. LoadBalancer->rollbackMasterChanges() /vagrant/mediawiki/includes/db/LBFactory.php:176
 2039.6849   41398400  10. DatabaseBase->rollback() /vagrant/mediawiki/includes/db/LoadBalancer.php:1003
 2039.6850   41398592  11. wfWarn() /vagrant/mediawiki/includes/db/Database.php:3739
 2039.6850   41398872  12. MWDebug::warning() /vagrant/mediawiki/includes/GlobalFunctions.php:1231
 2039.6851   41400920  13. MWDebug::sendMessage() /vagrant/mediawiki/includes/debug/MWDebug.php:155
 2039.6851   41401224  14. trigger_error() /vagrant/mediawiki/includes/debug/MWDebug.php:300

[0d44a313] [no req]   Flow\Exception\UnknownWorkflowIdException from line 119 of /vagrant/mediawiki/extensions/Flow/includes/WorkflowLoaderFactory.php: Invalid workflow requested by id
Backtrace:
#0 /vagrant/mediawiki/extensions/Flow/includes/WorkflowLoaderFactory.php(74): Flow\WorkflowLoaderFactory->loadWorkflowById(Title, Flow\Model\UUID)
#1 /vagrant/mediawiki/extensions/Flow/includes/Content/BoardContent.php(196): Flow\WorkflowLoaderFactory->createWorkflowLoader(Title, Flow\Model\UUID)
#2 /vagrant/mediawiki/includes/page/WikiPage.php(2129): Flow\Content\BoardContent->getParserOutput(Title, integer, ParserOptions)
#3 /vagrant/mediawiki/includes/page/WikiPage.php(2180): WikiPage->prepareContentForEdit(Flow\Content\BoardContent, Revision, User)
#4 /vagrant/mediawiki/includes/Import.php(1611): WikiPage->doEditUpdates(Revision, User, array)
#5 [internal function]: WikiRevision->importOldRevision()
#6 /vagrant/mediawiki/includes/db/Database.php(3340): call_user_func_array(array, array)
#7 /vagrant/mediawiki/includes/Import.php(325): DatabaseBase->deadlockLoop(array)
#8 [internal function]: WikiImporter->importRevision(WikiRevision)
#9 /vagrant/mediawiki/maintenance/importDump.php(173): call_user_func(array, WikiRevision)
#10 [internal function]: BackupReader->handleRevision(WikiRevision, WikiImporter)
#11 /vagrant/mediawiki/includes/Import.php(457): call_user_func_array(array, array)
#12 /vagrant/mediawiki/includes/Import.php(830): WikiImporter->revisionCallback(WikiRevision)
#13 /vagrant/mediawiki/includes/Import.php(777): WikiImporter->processRevision(array, array)
#14 /vagrant/mediawiki/includes/Import.php(726): WikiImporter->handleRevision(array)
#15 /vagrant/mediawiki/includes/Import.php(550): WikiImporter->handlePage()
#16 /vagrant/mediawiki/maintenance/importDump.php(299): WikiImporter->doImport()
#17 /vagrant/mediawiki/maintenance/importDump.php(257): BackupReader->importFromHandle(resource)
#18 /vagrant/mediawiki/maintenance/importDump.php(102): BackupReader->importFromFile(string)
#19 /vagrant/mediawiki/maintenance/doMaintenance.php(103): BackupReader->execute()
#20 /vagrant/mediawiki/maintenance/importDump.php(304): require_once(string)
#21 {main}

It seems the import gets confused because the workflow referenced in the JSON content does not exist.
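For context, the serialized board content is a small JSON blob that references the board's workflow by UUID, roughly of the form {"flow-workflow": "<uuid>"} (the key name here is illustrative, not taken from this import). If that UUID has no corresponding workflow row on the importing wiki, loadWorkflowById() throws the UnknownWorkflowIdException shown above.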

Event Timeline


Stalled until I make more progress on T114703, where I'll figure out exactly what data I need to include in the dump in order to be able to import it.

Tested with the following. The main thing I didn't cover is the from/to IDs, but I'm sure @Etonkovidova can spot what I didn't think of.

full-wiki

  • Destroy
  • Rebuild
  • Set up wikis. Make sure there are:
    • Multiple pages
      • Enabled board 1 (S:EF/Special:EnableFlow)
      • Wiki:Enabled board (S:EF)
      • User talk:Admin (Beta)
      • Flow test talk:Flow namespace 1
    • Headers
      • Flow namespace 1 without a header
    • Multiple topics
    • A few topic summaries
    • IP post
  • Some history
    • Edited posts
    • Edited summary
    • Edited header
    • Edited topic title
    • Delete whole board and record title and Flow board ID (Flow test talk:Deleting_whole_board and ...)
    • Delete a couple posts
    • Delete a couple topics
    • Suppress a couple posts
    • Suppress a couple topics
    • Restore one deleted post
    • Restore one deleted topic
    • Restore one suppressed post
    • Restore one suppressed topic
    • Delete, suppress, unsuppress post (unsuppress apparently goes directly to visible)
  • Export full wiki with all history
    • cd /vagrant/mediawiki/extensions/Flow; mwscript extensions/Flow/maintenance/dumpBackup.php --full 2>&1 | tee /vagrant/Flow-dump-full-wiki-all-history-$(git rev-parse --short HEAD).xml
  • Destroy
  • Rebuild
  • Full wiki import
    • cd /vagrant/mediawiki/extensions/Flow; mwscript maintenance/importDump.php < /vagrant/Flow-dump-full-wiki-all-history-$(git rev-parse --short HEAD).xml 2>&1 | tee "/vagrant/Flow-dump-full-wiki-import-$(date).log"
    • Check debug log file (https://www.mediawiki.org/wiki/Manual:$wgDebugLogFile)
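For the debug-log check in the last step, the log destination is configured in LocalSettings.php; a minimal sketch (the path is only an example):

// LocalSettings.php -- write MediaWiki debug output to a file so import
// warnings can be inspected after the run (example path only)
$wgDebugLogFile = '/vagrant/logs/mw-debug.log';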

pages with current

  • Add custom user (who won't exist at destination)
  • Post as that user (done, to Flow_test_talk:Flow_namespace_1)
  • Export specific boards with current (User talk:Admin, Wiki:Enabled board, Flow_test_talk:Flow_namespace_1)
    • cd /vagrant/mediawiki/extensions/Flow; mwscript extensions/Flow/maintenance/dumpBackup.php --current --pagelist=/vagrant/pagelist-2015-11-24.txt 2>&1 | tee /vagrant/Flow-dump-pagelist-current-$(git rev-parse --short HEAD).xml
  • Destroy
  • Rebuild
  • Create other boards
  • Import those boards
    • cd /vagrant/mediawiki/extensions/Flow; mwscript maintenance/importDump.php < /vagrant/Flow-dump-pagelist-current-$(git rev-parse --short HEAD).xml 2>&1 | tee "/vagrant/Flow-dump-pagelist-current-import-$(date).log"

Ah, I see, reversed dependencies.

I've got a gerrit changeset which needs testing (https://gerrit.wikimedia.org/r/#/c/282883/).

I expect to be able to pass --output:bz2:/mnt/data/xmldatadumps/..../mediawikiwiki-MMDDYY_flow.xml.bz to Flow/maintenance/dumpBackup.php, and indeed I can, but it doesn't actually write to the specified output file the way other dump jobs do. Can someone have a look?

Change 283117 had a related patch set uploaded (by Mattflaschen):
Fix sink-handling

https://gerrit.wikimedia.org/r/283117

I expect to be able to pass --output:bz2:/mnt/data/xmldatadumps/..../mediawikiwiki-MMDDYY_flow.xml.bz to Flow/maintenance/dumpBackup.php, and indeed I can, but it doesn't actually write to the specified output file the way other dump jobs do. Can someone have a look?

There was a bug on our side, fixed in https://gerrit.wikimedia.org/r/283117 .

For those testing, it's:

--output=bzip2:/mnt/data/xmldatadumps/..../mediawikiwiki-MMDDYY_flow.xml.bz

Note the equals sign and bzip2, not bz2. It is correct in that Gerrit change, but I haven't reviewed it otherwise.

Yep, sorry. I'm sure I got the option right when doing my manual tests; I should have copy-pasted it here.

datasets@snapshot1005:~$ php /srv/mediawiki/multiversion/MWScript.php extensions/Flow/maintenance/dumpBackup.php --wiki mediawikiwiki --current --output=bzip2:/home/datasets/testing/blot.bz2

Currently producing output as expected in the specified file in bz2 format; just waiting to make sure it completes properly before I give the total thumbs up. Looks good, though.

Finished and looks good.

Since both current and full dumps can be produced for Flow pages (according to the options for the script, at any rate), do you have a way for full dumps to be produced by reading the old fulls and polling the db only for new/changed revisions? Or would we have to go to the db for each revision every time?

We do this for the regular XML dumps, but it requires a two-stage process: dump the metadata for each page/revision first, then use that to decide which revisions to grab from the old fulls, in case some pages/revisions have been deleted or oversighted in the meantime. I don't know if it is possible to delete or oversight Flow pages; can you give me a bit of an overview or point me to something to read? Thanks.
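To illustrate the per-revision decision in that two-stage approach, here is a rough sketch (this is not the actual dumps code; the function name and array shapes are invented for the example):

// Illustrative sketch only -- not the real dumps code.
// $newMetadata: revisions from the fresh metadata-only pass,
//               each ['id' => int, 'sha1' => string]
// $oldFull:     revision id => ['sha1' => string, 'text' => string],
//               parsed from the previous full dump
// $fetchFromDb: callback that loads revision text from the database
function buildFullDump( array $newMetadata, array $oldFull, callable $fetchFromDb ): array {
    $out = [];
    foreach ( $newMetadata as $rev ) {
        $old = $oldFull[$rev['id']] ?? null;
        if ( $old !== null && $old['sha1'] === $rev['sha1'] ) {
            // Unchanged since the last run: reuse the text from the old full.
            $rev['text'] = $old['text'];
        } else {
            // New or changed revision: go back to the database. Revisions
            // deleted or oversighted since the last run simply no longer
            // appear in the fresh metadata, so they are dropped.
            $rev['text'] = $fetchFromDb( $rev['id'] );
        }
        $out[] = $rev;
    }
    return $out;
}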

Change 283117 merged by jenkins-bot:
Fix sink-handling

https://gerrit.wikimedia.org/r/283117

@ArielGlenn: Where does that 2-stage process happen? I took a quick look at core's dumpBackup.php but that one didn't seem to take an existing dump file to work with.

It is possible to delete/oversight Flow pages: https://www.mediawiki.org/wiki/Extension:Flow/Moderation

Do you think being able to dump a range of revisions is useful at all?
If yes, how should that work? The current code just checks whether the last change in a topic is within that range and, if so, includes the entire topic.
But we should probably only include the specific revisions in that range, which means that some history (earlier edits of a post/topic) and context (a post may have been in that range even though its topic was created earlier) may be missing.
If we were to include only specific revisions, an import would behave differently because of that missing context/history. Should we:

  • assume that said history/context is already in place (e.g. through earlier imports) and just let it fail if it isn't
  • produce dumps as if there were no earlier history, with additional context where needed (so these partial dumps can be imported standalone)
  • just get rid of revrange because it doesn't make much sense in Flow's context as it is :)

These dumps should make it possible for someone to set up a mirror (right to fork). That means that all data available to the public should be provided. IMHO:

"Current" ought to dump topics currently visible on a page.
"Full" ought to dump the history (whatever that means), i.e. all topics that were ever on the page, that are not deleted/made private.
"Revision range" sounds pretty useless for your model.

Does that make sense so far?

"Current" ought to dump topics currently visible on a page.

It does. Also, it's the latest version of everything (topic titles, posts, board-level descriptions, and topic-level summaries can all be edited and are versioned).

"Full" ought to dump the history (whatever that means), i.e. all topics that were ever on the page, that are not deleted/made private.

Yes, this has all the versions (every rename of a topic title, every edit of a post, summary, header, etc.) of the versioned content. Topics cannot be disassociated from the board, so otherwise this is mostly the same as current.

In both cases (current and full), private data (defined as "not visible to an anonymous user") is removed. In the case of "full", that requires reconstructing the history to make it importable (it makes some internal changes like updating the previous revision pointer).
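As a minimal sketch of that relinking step (illustrative only, not Flow's actual exporter code), assuming a flat, chronologically ordered list of revisions:

// Illustrative only. Each revision: ['id' => string, 'prev' => ?string, 'public' => bool].
function filterAndRelink( array $revisions ): array {
    $out = [];
    $lastPublicId = null;
    foreach ( $revisions as $rev ) {
        if ( !$rev['public'] ) {
            // Deleted/suppressed content is omitted from the dump entirely.
            continue;
        }
        // Re-point the previous-revision pointer at the nearest surviving
        // ancestor so the exported history stays a consistent, importable chain.
        $rev['prev'] = $lastPublicId;
        $out[] = $rev;
        $lastPublicId = $rev['id'];
    }
    return $out;
}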

"Revision range" sounds pretty useless for your model.

In theory it could be useful to someone, but right now it's implemented but broken (and beyond the superficial breakage there are some more complicated issues, as Matthias raised). So if it doesn't seem useful, I think I'll just rip it out for now.

Change 283328 had a related patch set uploaded (by Mattflaschen):
Remove revision range

https://gerrit.wikimedia.org/r/283328

Change 283328 merged by jenkins-bot:
Remove revision range

https://gerrit.wikimedia.org/r/283328

When will https://gerrit.wikimedia.org/r/283117 land in a deployed branch? So far it seems only to be available in master.

When will https://gerrit.wikimedia.org/r/283117 land in a deployed branch? So far it seems only to be available in master.

It should be in 1.27.0-wmf.22 this Tue., April 26th. The branch cuts were paused a week due to the data center switchover: https://www.mediawiki.org/wiki/MediaWiki_1.27/Roadmap .

Ok, I see from https://wikitech.wikimedia.org/wiki/Deployments that barring any issues it will be deployed to all wikis Thursday evening (my Friday morning). I'm off on Friday so that means we probably won't make the May run for these dumps, but I can certainly test and commit the code so it's ready for the following month.

Change 282883 had a related patch set uploaded (by ArielGlenn):
add dumps for flow pages for those wikis which have Flow enabled

https://gerrit.wikimedia.org/r/282883

Change 282883 merged by ArielGlenn:
add dumps for flow pages for those wikis which have Flow enabled

https://gerrit.wikimedia.org/r/282883

This has been tested in a production environment and is ready to go. It will be deployed before the next run. I'll keep this ticket open until then as a reminder.

The new dump run shows a lot of failures for flow; these are on dbs that don't in fact have flow active. The code currently tries to run over wikis that are private or closed, insofar as they are not explicitly excluded in the nonflowdb list. That's an error; T137360 will clear that up.

Checking a couple of wikis where flow is in fact enabled, officewiki (private) and mediawikiwiki (regular wiki) both show xml dump files produced for flow pages successfully, as part of the month's dump run.

Checking a couple of wikis where flow is in fact enabled, officewiki (private) and mediawikiwiki (regular wiki) both show xml dump files produced for flow pages successfully, as part of the month's dump run.

When can this task be closed? After all wikis with a Flow page have had their first Flow dump generated?

I can close it after the current run completes, I push out a fix to get the right list of flow-enabled wikis, and I run a noop on all wikis that claim failure (they simply don't have flow enabled). That should be in about 5-6 days.

T119511: Publish recurring Flow dumps at http://dumps.wikimedia.org/ is also now a blocker for this, so they should be set to auto-recur before these two tasks are closed.

There's nothing needed for autorecurrence; the flow job is part of the regular dump runs.

This looks good, except for two things:

  • There are only current dumps. We need full too (--full).
  • Small issues with which wikis, below.

I can close it after the current run completes, I push out a fix to get the right list of flow-enabled wikis, and I run a noop on all wikis that claim failure (they simply don't have flow enabled). That should be in about 5-6 days.

As Roan mentioned at T137360: Handle 'dynamic' dblists, you can use flow.dblist, which is now not dynamic.

After that, let's check that Wikidata, etc. are present, and there are no errors for the non-Flow wikis, then we can move this to QA.

Thanks.

Non-flow wikis now do not get dumped (see the current https://dumps.wikimedia.org/backup-index.html page).

Wikidatawiki flow pages have just been dumped.

I have yet to add a job for --full. (Is there code in MW to support that?)

Change 295587 had a related patch set uploaded (by ArielGlenn):
add job that dumps history of flow pages

https://gerrit.wikimedia.org/r/295587

I'll test the above changeset manually in the next couple of days and see if we get results that differ from dumps of the "current" flow content.

Now testing on mediawiki via manual run.

So far it looks ok. I am now running flow history dumps across all wikis with flow enabled.

All completed except for mediawiki, which is taking rather a long time (days). Anyone have a clue why it should be so slow? Is there something different about the tables, or are there that many more flow entries?

Whoops, missed ptwikibooks too, due to a bad grep on my part.

I looked at the number of entries in the flow_workflow tables for mediawiki and for ptwikibooks: 31k vs 26k. I'm running the flow history dumps for ptwikibooks now manually and we'll see how long it takes. There's no reason for so few rows to take days to dump; can retrieval and formatting of the revisions really take that long? Some investigation is needed. I'll report the length of time needed for ptwikibooks; that may shed some light on things.

Change 295587 merged by ArielGlenn:
add job that dumps history of flow pages

https://gerrit.wikimedia.org/r/295587

Another thing for the Flow extension folks to look at is the failures on testwiki. I get this exception:

command /usr/bin/php5 -q /srv/mediawiki/multiversion/MWScript.php extensions/Flow/maintenance/dumpBackup.php --wiki=testwiki --current --report=1000 --output=bzip2:/mnt/data/xmldatadumps/public/testwiki/20160601/testwiki-20160601-flow.xml.bz2 (21805) started...
[effb0c25c66162949d5acb27] [no req]   Flow\Exception\InvalidDataException from line 366 of /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Model/AbstractRevision.php: Failed to load the content
Backtrace:
#0 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(395): Flow\Model\AbstractRevision->getContent(string)
#1 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(360): Flow\Dump\Exporter->formatRevision(Flow\Model\PostRevision)
#2 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(255): Flow\Dump\Exporter->formatRevisions(Flow\Model\PostRevision)
#3 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(213): Flow\Dump\Exporter->formatPost(Flow\Model\PostRevision)
#4 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(196): Flow\Dump\Exporter->formatTopic(Flow\Model\PostRevision)
#5 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/includes/Dump/Exporter.php(174): Flow\Dump\Exporter->formatWorkflow(Flow\Model\Workflow, Flow\Search\Iterators\HeaderIterator, Flow\Search\Iterators\TopicIterator)
#6 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/maintenance/dumpBackup.php(85): Flow\Dump\Exporter->dump(BatchRowIterator)
#7 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/maintenance/dumpBackup.php(58): FlowDumpBackup->dump(integer)
#8 /srv/mediawiki/php-1.28.0-wmf.7/maintenance/doMaintenance.php(103): FlowDumpBackup->execute()
#9 /srv/mediawiki/php-1.28.0-wmf.7/extensions/Flow/maintenance/dumpBackup.php(124): require_once(string)
#10 /srv/mediawiki/multiversion/MWScript.php(97): require_once(string)
#11 {main}

The ptwikibooks flow history dump took 62 minutes. Compare this to the mediawiki flow history dump which took 2.5 days. I would guess there's a query in there someplace that needs to be optimized, probably querying against the whole revision table. Can anyone check?

Interesting, thanks for the heads up.

I've tried dumping --full on my machine (with enwiki dataset) and on tin, and consistently get around 70-80 seconds.

mlitn@tin:~$ time mwscript extensions/Flow/maintenance/dumpBackup.php --wiki=enwiki --full --output=bzip2:/tmp/full.xml.bz2

real	1m14.007s
user	0m53.187s
sys	0m3.157s

What server are you running the dump on, and with which arguments? Is there any way we can profile it there?

Enwiki isn't a problem, mediawikiwiki is. Can you try that one?

Oh god, I misread. Yes, I'll try that one!

This is now waiting on the completion of this month's run, or at least on the flow history dumps finishing for some small wikis, at which point I will consider it done. The mediawiki issues can be moved to a new ticket at that time.

Another thing for the Flow extension folks to look at is the failures on testwiki. I get this exception:

This is probably because of T95580: Flow data missing on Wikimedia production wikis. I've filed a follow-up task (T139791: Flow: Handle missing content when dumping).

May I close this task as done, then, with the follow-up work continuing on the two tasks linked above?