To represent all page content in exports/dumps, support for extra slots needs to be added to BackupDumper, WikiExporter and XmlDumpWriter.
See https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps
To represent all page content in exports/dumps, support for extra slots needs to be added to BackupDumper, WikiExporter and XmlDumpWriter.
See https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | daniel | T174031 MCR: Include all slots in XML dumps | |||
Resolved | daniel | T198706 Make BackupDumper, WikiExporter and XmlDumpWriter compliant with the MCR revision retrival mechanism (main slot only) | |||
Resolved | BPirkle | T203424 Replace the WikiExporter backup dump streaming mode with batched queries | |||
Resolved | ArielGlenn | T199121 RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps | |||
Resolved | daniel | T238959 Make TextPassDumperTest work with 0.11 dump schema | |||
Resolved | CCicalese_WMF | T220525 MCR: Import all slots from XML dumps |
Clarifying question: what slot is Flow content considered to be in, a 'main', or a 'secondary'? All that content has such a different schema in the db tables.
@ArielGlenn Flow is in the main slot, but the content it has there isn't really its content. It's just a reference to where the actual content is stored, which is completely separate from the normal storage mechanisms.
Basically: MCR doesn't change the way in which Flow (ab)uses page content storage.
Do we currently have Flow content in dumps at all? I mean, the actual content, not the references?
...
Do we currently have Flow content in dumps at all? I mean, the actual content, not the references?
Yes, we certainly do. There's a maintenance script in the Flow extension for that, which generates both 'current' and 'history' Flow dumps.
Change 464768 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] WIP Add optional support for xml dump schema 0.11
Change 479711 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Validate the output of the dump scripts.
Change 479712 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Make the XML dump schema version configurable.
Change 480244 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add schema version param for API export.
Change 464768 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add support for xml dump schema 0.11
Change 479711 merged by jenkins-bot:
[mediawiki/core@master] Validate the output of the dump scripts.
Change 480244 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add schema version param for API export.
Change 479712 merged by jenkins-bot:
[mediawiki/core@master] Make the XML dump schema version configurable.
Current testing on beta for abstracts, stus or page logs dumps gives:
command /usr/bin/python3 xmllogs.py --config /etc/dumps/confs/wikidump.conf.labs.py3 --wiki wikidatawiki --outfile /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20190324/wikidatawiki-20190324-pages-logging.xml.gz.inprog (19215) started... Fatal error: Uncaught Error: Class 'WikiExporter' not found in /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php:107 Stack trace: #0 /srv/mediawiki/php-master/maintenance/dumpBackup.php(33): BackupDumper->__construct() #1 /srv/mediawiki/php-master/maintenance/doMaintenance.php(48): DumpBackup->__construct() #2 /srv/mediawiki/php-master/maintenance/dumpBackup.php(138): require_once('/srv/mediawiki/...') #3 /srv/mediawiki/multiversion/MWScript.php(100): require_once('/srv/mediawiki/...') #4 {main} thrown in /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php on line 107
Change 498716 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] make xml abstracts, stubs and page log dumps work again
Change 498716 merged by jenkins-bot:
[mediawiki/core@master] make xml abstracts, stubs and page log dumps work again
Page content dumps are broken on beta; all new page content (anything not prefetched out of the previosu dump file, but requested from the db) fails. After some digging, it turns out that this change broke it:
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/479712/8/maintenance/dumpTextPass.php around line 843. The address/id that comes back from fetchText.php has a newline at the end of it. In the case where it's already in the tt;nnnn format, it won't be converted and the comparison right below of the old and new address will fail.
Change 498941 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix up new text address handling in page content dumps
Change 498941 merged by jenkins-bot:
[mediawiki/core@master] fix up new text address handling in page content dumps
Change 464768 merged by jenkins-bot:
[mediawiki/core@master] Add support for xml dump schema 0.11
Change 480244 merged by jenkins-bot:
[mediawiki/core@master] Add schema version param for API export.
So the patches implementing support for MCR in XML dumps where merged, introducing a new version of the XML dump schema (version 0.11). This is however not yet enabled per default. Once this is live, we can test the new export dump format via the API, by specifying schema-0.11, on a Commons page that has a MediaInfo slot.
If that goes well, we can next experiment with creating dumps in the 0.11 format.
If that also goes well, we could announce the switch to the new format on wikitech-l, and enable it per default. This is technically a breaking change, though the new format is designed to be "mostly" backwards compatible.
@daniel It looks to me (without testing it) that the changeset above, https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/464768/ will handle empty comment text differently than we currently do, writing out a comment tag, maybe with empty content. Current practice is to omit the tag. Have you been able to check if this is true with your change?
An example of a revision with an empty comment field: on enwiki, page id 1791080 (Talk:Communist Party Historians Group), for revision 17438553, current and older stub files contain no comment tag at all, and if one looks at the revision row:
wikiadmin@10.64.16.101(enwiki)> select * from revision where rev_id = 17438553; +----------+----------+-------------+-------------+----------+------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+ | rev_id | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text | rev_timestamp | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1 | rev_content_model | rev_content_format | +----------+----------+-------------+-------------+----------+------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+ | 17438553 | 1791080 | 17438553 | | 12978 | Charles Matthews | 20050425215404 | 1 | 0 | 233 | 12810595 | ntmnkm276glud6v0z0fmqbib489ycud | NULL | NULL | +----------+----------+-------------+-------------+----------+------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
rev_comment is empty. If you track that down through the rev comment temp into the comment table, you see:
wikiadmin@10.64.16.101(enwiki)> select * from revision_comment_temp where revcomment_rev = 17438553; +----------------+-----------------------+ | revcomment_rev | revcomment_comment_id | +----------------+-----------------------+ | 17438553 | 10 | +----------------+-----------------------+
leading to
wikiadmin@10.64.16.101(enwiki)> select * from comment where comment_id = 10; +------------+--------------+--------------+--------------+ | comment_id | comment_hash | comment_text | comment_data | +------------+--------------+--------------+--------------+ | 10 | 0 | | NULL | +------------+--------------+--------------+--------------+
I should study all these edge cases and figure out unit tests for them (sigh, later ticket).
Verified that with the patch we now see an empty comment tag where before there was none. Ran in deployment prep on snapshot01 as dumpsgen user:
/usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=enwiki --full --stub --report=1 --output=file:///tmp/some-stubsxml.txt --start=30 --end=35
An inspection of the file shows
<comment />
where previous runs had no comment tag for this revision. So that needs to be fixed (did hey cut the branch today yet??)
Does it? Why is this a problem? Seems more consistent, we could even consider to make it non-optional in 0.11.
Just like the uid issue, there will be clients that rely on this behavior. Let's not break it unintentionally.
@ArielGlenn Hey Ariel! Can you make a patch? Then I'll merge it. It's a one line fix, line 339 in XmlDumpWriter. If I make the patch, it may take a while to find someone to merge it.
I don't have strong feelings about this case in particular, and we should go with your preference - at least in my mind, you are the de-facto product owner of the dumps. But I'd like to share a some thoughts about compatibility in general:
I think we should generally feel free to break consumers that make unwarranted assumptions, at least in cases where a clear specification exists that explicitly defines the guarantees. A contract that gives guarantees to callers/consumers also defines the freedoms we have in changing behavior and output. Making use of that freedom preserves it. Working around brittle consumers causes complexity and inflexibility on our side.
It also seems to me that making output elements optional unnecessarily makes life harder for everyone. Every option is another code path to implement, another default to get right, for the consumers and the producer. We should consider making fewer things optional in new versions of the dump format.
Change 520385 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] Restore previous export behavior with respect to empty comment text
Ha, I made the patch without seeing your comment (but it's broken yet, give me a few minutes :-P)
We can tighten things up in a future schema, and I think we should. But people have been using who knows what sort of clients with who knows what sort of behavior for years, and we need to give them a bit of warning before doing so.
Change 520385 merged by jenkins-bot:
[mediawiki/core@master] Restore previous export behavior with respect to empty comment text
This is done. Here is a link to the API:Sandbox for exporting a file description page with a mediainfo slot:
Change 552565 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[operations/mediawiki-config@master] Ping XML dump schema version at 0.10 for now.
Change 552565 merged by jenkins-bot:
[operations/mediawiki-config@master] Pin XML dump schema version at 0.10 for now.
Mentioned in SAL (#wikimedia-operations) [2019-12-03T22:03:07Z] <jforrester@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Set wgXmlDumpSchemaVersion to 0.1.0 everywhere T238921 T174031 (duration: 01m 03s)
@daniel: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via Add Action... → Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!