Incomplete conversion of Flow revisions after disabling Flow breaks stubs dumps
Open, High, Public, 0 Story Points

Description

Seen on urwikibooks and dewikiversity

dumpsgen@snapshot1008:/srv/deployment/dumps/dumps/xmldumps-backup$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=urwikibooks --full --stub --report=1 --output=file:/mnt/dumpsdata/temp/dumpsgen/stubs-history.xml.urwikibooks  --start=2673 --end=2674
MWUnknownContentModelException from line 267 of /srv/mediawiki/php-1.34.0-wmf.15/includes/content/ContentHandler.php: The content model 'flow-board' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php(446): ContentHandler::getForModelID('flow-board')
#1 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php(380): XmlDumpWriter->writeSlot(Object(MediaWiki\Revision\SlotRecord), 1)
#2 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(531): XmlDumpWriter->writeRevision(Object(stdClass), Array)
#3 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(474): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\ResultWrapper), NULL)
#4 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(288): WikiExporter->dumpPages('page_id >= 2673...', false)
#5 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(173): WikiExporter->dumpFrom('page_id >= 2673...', false)
#6 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/includes/BackupDumper.php(289): WikiExporter->pagesByRange(2673, 2674, false)
#7 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/dumpBackup.php(82): BackupDumper->dump(1, 1)
#8 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/doMaintenance.php(99): DumpBackup->execute()
#9 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/dumpBackup.php(144): require_once('/srv/mediawiki/...')
#10 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#11 {main}

and

dumpsgen@snapshot1008:/srv/deployment/dumps/dumps/xmldumps-backup$ /usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=dewikiversity --full --stub --report=1 --output=file:/mnt/dumpsdata/temp/dumpsgen/stubs-history.xml.dewikiversity  --start=47278 --end=47280
2019-07-24 18:34:19: dewikiversity (ID 132611) 0 pages (0.0|0.0/sec all|curr), 1 revs (19.3|19.3/sec all|curr), ETA 2019-07-25 03:07:34 [max 595591]
MWUnknownContentModelException from line 267 of /srv/mediawiki/php-1.34.0-wmf.15/includes/content/ContentHandler.php: The content model 'flow-board' is not registered on this wiki.
See https://www.mediawiki.org/wiki/Content_handlers to find out which extensions handle this content model.
#0 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php(446): ContentHandler::getForModelID('flow-board')
#1 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/XmlDumpWriter.php(380): XmlDumpWriter->writeSlot(Object(MediaWiki\Revision\SlotRecord), 1)
#2 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(531): XmlDumpWriter->writeRevision(Object(stdClass), Array)
#3 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(474): WikiExporter->outputPageStreamBatch(Object(Wikimedia\Rdbms\ResultWrapper), Object(stdClass))
#4 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(288): WikiExporter->dumpPages('page_id >= 4727...', false)
#5 /srv/mediawiki/php-1.34.0-wmf.15/includes/export/WikiExporter.php(173): WikiExporter->dumpFrom('page_id >= 4727...', false)
#6 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/includes/BackupDumper.php(289): WikiExporter->pagesByRange(47278, 47280, false)
#7 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/dumpBackup.php(82): BackupDumper->dump(1, 1)
#8 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/doMaintenance.php(99): DumpBackup->execute()
#9 /srv/mediawiki/php-1.34.0-wmf.15/maintenance/dumpBackup.php(144): require_once('/srv/mediawiki/...')
#10 /srv/mediawiki/multiversion/MWScript.php(101): require_once('/srv/mediawiki/...')
#11 {main}

Event Timeline

ArielGlenn triaged this task as High priority. Jul 24 2019, 6:40 PM
ArielGlenn created this task.

It's still a side effect of https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/464768/. We could add one more exception to invokeLenient() in XmlDumpWriter.php, I guess, so that the stubs for these last wikis can run (and the rest of the dump steps for those wikis too). But the underlying issue should be fixed as well. This Flow issue was reported earlier at T220793.
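To make the "one more exception" idea concrete, here is a minimal standalone sketch of a lenient wrapper. It is not the actual invokeLenient() code from XmlDumpWriter.php; the exception class, the function name and the warnings array are hypothetical stand-ins, so the snippet runs without MediaWiki.

```php
<?php
// Hypothetical stand-in for MWUnknownContentModelException, so the
// sketch runs outside MediaWiki.
class SketchUnknownContentModelException extends RuntimeException {
}

/**
 * Run $callback. If it fails because a content model is not registered,
 * record a warning and return null instead of aborting the whole dump.
 * Any other exception still propagates.
 */
function invokeLenientSketch( callable $callback, string $warning, array &$warnings ) {
	try {
		return $callback();
	} catch ( SketchUnknownContentModelException $ex ) {
		$warnings[] = $warning . ': ' . $ex->getMessage();
		return null;
	}
}

$warnings = [];

// A revision whose stored model is unknown: the dump carries on.
$failed = invokeLenientSketch(
	function () {
		throw new SketchUnknownContentModelException(
			"The content model 'flow-board' is not registered on this wiki."
		);
	},
	'Failed to get content handler',
	$warnings
);

// A revision whose model is known: the value passes straight through.
$ok = invokeLenientSketch(
	function () {
		return 'text/x-wiki';
	},
	'Failed to get content handler',
	$warnings
);
```

With a wrapper like this the stub for the affected revision is written with whatever the caller substitutes for the missing value, which is why the underlying data still needs fixing.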

Looking at where this occurs, I'd really prefer to have the Flow revisions set to the right content model for these two wikis, if it's something that can be done in the next 2-3 days. See T220594#5100772 for one of these revisions where the model name is wrong.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Jul 24 2019, 7:36 PM
daniel added a subscriber: daniel. Edited Jul 24 2019, 7:48 PM

Introducing a dummy ContentHandler as a fallback for this kind of thing would be pretty trivial (see also T205921#5234775). Maybe that should be on some roadmap somewhere...

Change 525352 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] workaround unknown content model for xml dump output

https://gerrit.wikimedia.org/r/525352

Whoops, I codged the above together before seeing your comment, @daniel. Feel free to reject, replace, whatever is appropriate.

Previous stubs had this entry for the example revision in dewikiversity:

<page>
  <title>Was wir hören und sehen</title>
  <ns>2600</ns>
  <id>47279</id>
  <revision>
    <id>274772</id>
    <timestamp>2011-08-07T10:46:42Z</timestamp>
    <contributor>
      <username>MartinKurz</username>
      <id>11256</id>
    </contributor>
    <comment>Automatische Zusammenfassung: Die Seite wurde neu angelegt.</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text id="269201" bytes="352" />
    <sha1>o9mhxk86c2bxcvymcwsiirc2vyczgnp</sha1>
  </revision>
</page>

The code used to do

ContentHandler::getDefaultModelFor( $this->currentTitle );

so maybe that's a reasonable fallback for now.
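A rough standalone sketch of that fallback, assuming the lookup for the stored model throws when the model is not registered. All names here (the `sketch…` functions, the exception class, the table of registered models) are hypothetical stand-ins rather than MediaWiki's API, and the title-based default is hard-coded to wikitext for illustration.

```php
<?php
// Hypothetical stand-in for MWUnknownContentModelException.
class SketchUnknownModelException extends RuntimeException {
}

// Models registered on this imaginary wiki, with their default format.
const SKETCH_REGISTERED_FORMATS = [
	'wikitext' => 'text/x-wiki',
	'json' => 'application/json',
];

// Look up the default format for a model, or throw if it is unknown.
function sketchFormatForModel( string $model ): string {
	if ( !array_key_exists( $model, SKETCH_REGISTERED_FORMATS ) ) {
		throw new SketchUnknownModelException(
			"The content model '$model' is not registered on this wiki."
		);
	}
	return SKETCH_REGISTERED_FORMATS[$model];
}

// Stand-in for ContentHandler::getDefaultModelFor( $title ).
// Assumption: every title on this imaginary wiki defaults to wikitext.
function sketchDefaultModelForTitle( string $title ): string {
	return 'wikitext';
}

// Use the stored model if it is registered; otherwise fall back to the
// default model for the title, as the older code effectively did.
function sketchModelAndFormat( string $storedModel, string $title ): array {
	try {
		return [ $storedModel, sketchFormatForModel( $storedModel ) ];
	} catch ( SketchUnknownModelException $ex ) {
		$fallback = sketchDefaultModelForTitle( $title );
		return [ $fallback, sketchFormatForModel( $fallback ) ];
	}
}

list( $model, $format ) = sketchModelAndFormat( 'flow-board', 'Was wir hören und sehen' );
```

Under these assumptions the stub for the example revision would again carry `wikitext` and `text/x-wiki`, matching the earlier dump output quoted above.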

Note that the actual content of these revisions is in fact wikitext; when Flow was disabled the content model was not altered properly. See T220594#5101026 for extracted text for one of them.

Copying here the comments from the patchset since it's getting a bit long.

Daniel Kinzler
12:02 AM

If that ContentHandler expects the content to conform to some specific syntax, it will die when trying to parse the content...

But if we don't actually try that, this hack could work :)

Is the ContentHandler actually used for anything but the format? If not, just set that to "unknown/unknown", and be done. No ContentHandler needed.
ArielGlenn
12:20 AM

The previous behavior is to use the default handler for the title; unknown/unknown would be new in the dumps.

As to actually using the ContentHandler, for stubs we don't care at all; for content it's dumpTextPass (other code); for ActiveAbstracts it's other code.

SpecialExport might be unhappy, I'd have to test.
Daniel Kinzler
12:33 AM

    The previous behavior is to use the default handler for the title; 

I'd consider that a bug. A consumer that relies on the mime type for parsing would fail, if that mime type was something like application/xml, but the content isn't.
Daniel Kinzler
12:35 AM

Conceptually, stubs shouldn't really have a format at all. The format tells the consumer the mime type of the serialized data between the <text> tags. If there is no data there, the format is meaningless.
ArielGlenn
12:46 AM

They are in the stubs because they are useful metadata for dump users/consumers, not because the stub user will use them for content conversion. Ideally the stubs should have absolutely all data the revision content dumps have, except for the content itself.
Daniel Kinzler
1:08 AM

But the serialization format is *not* metadata about anything except the serialization itself. I don't see how it could be useful for anything except decoding the content blob. Using it for anything else can really only be based on a misunderstanding. Which is also why rev_content_format is removed without replacement - the format is not a property of the revision or the content, it's a property of a blob.

I'm wary of guessing what users might be doing with the data because it's usually something I didn't plan for or expect, even if that use might be wrong from where I sit.

It might be based on a misunderstanding but it's been that way since 2012 (https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/a827739fb5e494ad657719d58ca5158247ed632f). That patch falls back to guessing the model based on the title, and getting the default handler for that.

I really don't like the idea of just removing data without planning and announcement, so if we were e.g. to move the content format into the text tag as an attribute, as proposed elsewhere, that should be done via a schema change (and I'm keeping a list of these proposals so we can actually evaluate and do them once 11 goes out). Introducing a new value such as 'unknown/unknown', while something we could do in the future, isn't something I think we should do now in the context of fixing a regression.

For the moment, I'd rather restore the previous content format output behavior, relying on the principle of least surprise for downstream users.

I have run dewikiversity and urwikibooks stubs manually with the patched XmlDumpWriter to unblock the current dump run.

UnknownContentHandler now exists; it can be used like this:

$wgContentHandlers['xyzzy'] = 'UnknownContentHandler';
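
For the wikis in this task, the same setting could presumably be pointed at the leftover model name. This is only a sketch: mapping 'flow-board' this way, and doing it in site configuration such as LocalSettings.php, are assumptions rather than something decided in this task.

```php
<?php
// Assumption: this lives in site configuration (e.g. LocalSettings.php)
// on a wiki where Flow is disabled but 'flow-board' revisions remain.
// The null-coalescing default only keeps the sketch runnable on its own.
$wgContentHandlers = $wgContentHandlers ?? [];

// Map the unregistered model to the fallback handler so that lookups
// for 'flow-board' no longer throw MWUnknownContentModelException.
$wgContentHandlers['flow-board'] = 'UnknownContentHandler';
```

This would only stop the lookup from failing; the affected revisions would still carry the wrong model name until they are converted.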