Page MenuHomePhabricator

MCR: Include all slots in XML dumps
Closed, ResolvedPublic

Description

To represent all page content in exports/dumps, support for extra slots needs to be added to BackupDumper, WikiExporter and XmlDumpWriter.

See https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
daniel moved this task from Inbox to Epic on the Multi-Content-Revisions board.Nov 20 2017, 7:40 PM
thiemowmde triaged this task as Normal priority.Dec 5 2017, 6:45 PM
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.Dec 18 2017, 3:12 PM
daniel renamed this task from Support MCR in XML dump format to Include all slots in XML dumps.Jul 9 2018, 3:55 PM
daniel updated the task description. (Show Details)
daniel renamed this task from Include all slots in XML dumps to MCR: Include all slots in XML dumps.Jul 9 2018, 3:57 PM

Not part of the MVP for SDC.

daniel moved this task from Inbox to Next on the Multi-Content-Revisions board.Sep 17 2018, 1:13 PM

Clarifying question: what slot is Flow content considered to be in, a 'main', or a 'secondary'? All that content has such a different schema in the db tables.

daniel added a comment.EditedOct 4 2018, 11:05 AM

@ArielGlenn Flow is in the main slot, but the content it has there isn't really its content. It's just a reference to where the actual content is stored, which is completely separate from the normal storage mechanisms.

Basically: MCR doesn't change the way in which Flow (ab)uses page content storage.

Do we currently have Flow content in dumps at all? I mean, the actual content, not the references?

...

Do we currently have Flow content in dumps at all? I mean, the actual content, not the references?

Yes, we certainly do. There's a maintenance script in the Flow extension for that, which generates both 'current' and 'history' Flow dumps.

Yes, we certainly do. There's a maintenance script in the Flow extension for that, which generates both 'current' and 'history' Flow dumps.

Ah, ok, that's a completely separate mechanism. Makes sense.

Change 464768 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] WIP Add optional support for xml dump schema 0.11

https://gerrit.wikimedia.org/r/464768

Change 479711 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Validate the output of the dump scripts.

https://gerrit.wikimedia.org/r/479711

Change 479712 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Make the XML dump schema version configurable.

https://gerrit.wikimedia.org/r/479712

Change 480244 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add schema version param for API export.

https://gerrit.wikimedia.org/r/480244

greg added a project: Multimedia.Mar 7 2019, 10:59 PM
Restricted Application added a project: Multimedia. · View Herald TranscriptMar 15 2019, 5:53 PM

Change 464768 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add support for xml dump schema 0.11

https://gerrit.wikimedia.org/r/464768

Change 479711 merged by jenkins-bot:
[mediawiki/core@master] Validate the output of the dump scripts.

https://gerrit.wikimedia.org/r/479711

Change 480244 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Add schema version param for API export.

https://gerrit.wikimedia.org/r/480244

Change 479712 merged by jenkins-bot:
[mediawiki/core@master] Make the XML dump schema version configurable.

https://gerrit.wikimedia.org/r/479712

Current testing on beta for abstracts, stus or page logs dumps gives:

command /usr/bin/python3 xmllogs.py --config /etc/dumps/confs/wikidump.conf.labs.py3 --wiki wikidatawiki --outfile /mnt/dumpsdata/xmldatadumps/public/wikidatawiki/20190324/wikidatawiki-20190324-pages-logging.xml.gz.inprog (19215) started... 
Fatal error: Uncaught Error: Class 'WikiExporter' not found in /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php:107
Stack trace:
#0 /srv/mediawiki/php-master/maintenance/dumpBackup.php(33): BackupDumper->__construct()
#1 /srv/mediawiki/php-master/maintenance/doMaintenance.php(48): DumpBackup->__construct()
#2 /srv/mediawiki/php-master/maintenance/dumpBackup.php(138): require_once('/srv/mediawiki/...')
#3 /srv/mediawiki/multiversion/MWScript.php(100): require_once('/srv/mediawiki/...')
#4 {main}
  thrown in /srv/mediawiki/php-master/maintenance/includes/BackupDumper.php on line 107

Change 498716 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] make xml abstracts, stubs and page log dumps work again

https://gerrit.wikimedia.org/r/498716

Change 498716 merged by jenkins-bot:
[mediawiki/core@master] make xml abstracts, stubs and page log dumps work again

https://gerrit.wikimedia.org/r/498716

Page content dumps are broken on beta; all new page content (anything not prefetched out of the previosu dump file, but requested from the db) fails. After some digging, it turns out that this change broke it:

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/479712/8/maintenance/dumpTextPass.php around line 843. The address/id that comes back from fetchText.php has a newline at the end of it. In the case where it's already in the tt;nnnn format, it won't be converted and the comparison right below of the old and new address will fail.

Change 498941 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] fix up new text address handling in page content dumps

https://gerrit.wikimedia.org/r/498941

Change 498941 merged by jenkins-bot:
[mediawiki/core@master] fix up new text address handling in page content dumps

https://gerrit.wikimedia.org/r/498941

Change 464768 merged by jenkins-bot:
[mediawiki/core@master] Add support for xml dump schema 0.11

https://gerrit.wikimedia.org/r/464768

Change 480244 merged by jenkins-bot:
[mediawiki/core@master] Add schema version param for API export.

https://gerrit.wikimedia.org/r/480244

So the patches implementing support for MCR in XML dumps where merged, introducing a new version of the XML dump schema (version 0.11). This is however not yet enabled per default. Once this is live, we can test the new export dump format via the API, by specifying schema-0.11, on a Commons page that has a MediaInfo slot.

If that goes well, we can next experiment with creating dumps in the 0.11 format.

If that also goes well, we could announce the switch to the new format on wikitech-l, and enable it per default. This is technically a breaking change, though the new format is designed to be "mostly" backwards compatible.

@daniel It looks to me (without testing it) that the changeset above, https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/464768/ will handle empty comment text differently than we currently do, writing out a comment tag, maybe with empty content. Current practice is to omit the tag. Have you been able to check if this is true with your change?

An example of a revision with an empty comment field: on enwiki, page id 1791080 (Talk:Communist Party Historians Group), for revision 17438553, current and older stub files contain no comment tag at all, and if one looks at the revision row:

wikiadmin@10.64.16.101(enwiki)> select * from revision where rev_id = 17438553;
+----------+----------+-------------+-------------+----------+------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| rev_id   | rev_page | rev_text_id | rev_comment | rev_user | rev_user_text    | rev_timestamp  | rev_minor_edit | rev_deleted | rev_len | rev_parent_id | rev_sha1                        | rev_content_model | rev_content_format |
+----------+----------+-------------+-------------+----------+------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+
| 17438553 |  1791080 |    17438553 |             |    12978 | Charles Matthews | 20050425215404 |              1 |           0 |     233 |      12810595 | ntmnkm276glud6v0z0fmqbib489ycud | NULL              | NULL               |
+----------+----------+-------------+-------------+----------+------------------+----------------+----------------+-------------+---------+---------------+---------------------------------+-------------------+--------------------+

rev_comment is empty. If you track that down through the rev comment temp into the comment table, you see:

wikiadmin@10.64.16.101(enwiki)> select * from revision_comment_temp where revcomment_rev = 17438553;
+----------------+-----------------------+
| revcomment_rev | revcomment_comment_id |
+----------------+-----------------------+
|       17438553 |                    10 |
+----------------+-----------------------+

leading to

wikiadmin@10.64.16.101(enwiki)> select * from comment where comment_id = 10;
+------------+--------------+--------------+--------------+
| comment_id | comment_hash | comment_text | comment_data |
+------------+--------------+--------------+--------------+
|         10 |            0 |              | NULL         |
+------------+--------------+--------------+--------------+

I should study all these edge cases and figure out unit tests for them (sigh, later ticket).

Verified that with the patch we now see an empty comment tag where before there was none. Ran in deployment prep on snapshot01 as dumpsgen user:

/usr/bin/php7.2 /srv/mediawiki/multiversion/MWScript.php dumpBackup.php --wiki=enwiki --full  --stub --report=1 --output=file:///tmp/some-stubsxml.txt --start=30 --end=35

An inspection of the file shows

<comment />

where previous runs had no comment tag for this revision. So that needs to be fixed (did hey cut the branch today yet??)

daniel added a comment.Jul 2 2019, 1:43 PM

where previous runs had no comment tag for this revision. So that needs to be fixed (did hey cut the branch today yet??)

Does it? Why is this a problem? Seems more consistent, we could even consider to make it non-optional in 0.11.

Just like the uid issue, there will be clients that rely on this behavior. Let's not break it unintentionally.

BTW I apologize for getting to this late, other testing and reviews got in the way.

daniel added a comment.Jul 3 2019, 8:34 AM

@ArielGlenn Hey Ariel! Can you make a patch? Then I'll merge it. It's a one line fix, line 339 in XmlDumpWriter. If I make the patch, it may take a while to find someone to merge it.

I don't have strong feelings about this case in particular, and we should go with your preference - at least in my mind, you are the de-facto product owner of the dumps. But I'd like to share a some thoughts about compatibility in general:

I think we should generally feel free to break consumers that make unwarranted assumptions, at least in cases where a clear specification exists that explicitly defines the guarantees. A contract that gives guarantees to callers/consumers also defines the freedoms we have in changing behavior and output. Making use of that freedom preserves it. Working around brittle consumers causes complexity and inflexibility on our side.

It also seems to me that making output elements optional unnecessarily makes life harder for everyone. Every option is another code path to implement, another default to get right, for the consumers and the producer. We should consider making fewer things optional in new versions of the dump format.

Change 520385 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] Restore previous export behavior with respect to empty comment text

https://gerrit.wikimedia.org/r/520385

Ha, I made the patch without seeing your comment (but it's broken yet, give me a few minutes :-P)

We can tighten things up in a future schema, and I think we should. But people have been using who knows what sort of clients with who knows what sort of behavior for years, and we need to give them a bit of warning before doing so.

Change 520385 merged by jenkins-bot:
[mediawiki/core@master] Restore previous export behavior with respect to empty comment text

https://gerrit.wikimedia.org/r/520385

Thanks for the quick merge! Output looks good.