Page MenuHomePhabricator

RFC: Spec for representing multiple content objects per revision (MCR) in XML dumps
Closed, ResolvedPublic

Description

This is a placeholder ticket for an RFC about how the content of multiple slots should be represented in XML dumps.

Draft in progress (proposal section still empty) at: https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Introducing a <content> tag that wraps a <text> tag and additional tags for meta-data looks nice and clean, but it's also completely incompatible with existing consumers.

Just generating multiple <text> tags, and addition any new info as attributes on them, would allow existing consumer code that follows the "grab the content of the (first) text tag" to continue functioning.

Tgr added a comment.Aug 10 2018, 9:56 AM

Or just drop the idea of a transitional format and keep the main slot one level higher forever (or at least until the next change that really needs to be a BC break). Old clients will not break, MCR-aware clients will maybe need very slightly less elegant parsing code, which does not seem like a bad trade-off.

CR-aware clients will maybe need very slightly less elegant parsing code, which does not seem like a bad trade-off.

It's an option, but it feels like we'd be shifting the paint to consumers, who have to maintain that duplicate logic forever. And I don't see how it's better than the multiple-text-tags-with-attributes approach.

Tgr added a comment.Aug 10 2018, 11:35 AM

And I don't see how it's better than the multiple-text-tags-with-attributes approach.

How safe is the assumption that an XML parsing library that expects a tag to be unique will just ignore extra occurrences and return the first tag (as opposed to, say, returning the last one or raising an error)? I imagine DOM parsers would typically take the first and SAX parsers would typically take the last and get confused, and multi-gigabyte documents are more likely to be parsed in a SAX fashion.

SAX parsers would typically take the last and get confused

True, that's a risk - they would then be using the wrong slot.

brion added a comment.Aug 13 2018, 8:46 AM

My concern with the two-step transition idea is that some consumers may not update on a reliable schedule, or may not be able to do so easily. For instance, if people are using Special:Export on one wiki and Special:Import'ing those pages on another that's *not* a Wikimedia-hosted site, it's more likely to be an older version of MediaWiki.

Should test Special:Import on the proposed transitional schema to make sure what happens. :)

brion added a comment.Aug 13 2018, 9:00 AM

Ok, proposed transitional schema looks like it imports cleanly via importDump (which uses same code path as Special:Import). The proposed final schema, however, imports a revision with empty text (and throws a notice on Undefined index: text in /vagrant/mediawiki/includes/import/WikiImporter.php on line 886).

I know it feels less elegant, but I would recommend sticking with the back-compatible 'transitional' schema and not doing a second transition to minimize impact on interop issues.

brion added a comment.Aug 13 2018, 9:05 AM

As for role ids -- perhaps we should primarily use the names, not the numbers, in the <role> bit. It's analogous to a page's <title> reference (a primary identifier) not to its <ns> or <id> (which are provided informatively if you want to repro the database exactly, but can be freely discarded when doing imports and such).

Thanks for the input so far!

I'm just as happy not to expose content model ids. Relying on slot role names exclusively while hiding the ids makes me a little nervous, only because I fear those names may be changed in the future from time to time. Dumps processors would see such a change without realizing that the slot id is in fact the same (and so the change is only cosmetic). Cn someone say a little bit more about these names?

My main reason for the two stage format was that I hate the mixed format approach (main text is formatted this way but the rest is formatted differently). If folks think avoiding this not worth the headache of having two format changes, I can drop the final format.

I'm still thinking about whether multiple text tags for the remaining slots, as per @daniel, is desirable or not.

I've updated the draft RFC to remove the 'final' schema, leaving the 'transitional' schema as the new schema proposal; I've munged the 'header changes' section leaving my question about possible changes to slot role names in there for comment. Still thinking about @daniel's proposal (a series of text tags with attributes).

question about possible changes to slot role names

Slot role names are public identifiers, they cannot change, ever. Same for model names.

I've updated the draft RFC to include solt role name instead of slot role id. I'm still thinking about the multiple text element with attributes, though leaning against it.

Question: what is address="tt:12345" in the alternative proposal here https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps ?

Question: what is address="tt:12345" in the alternative proposal here https://www.mediawiki.org/wiki/Multi-Content_Revisions/Dumps ?

That's the replacement for the text id, for use with stub dumps. We now have a blob store that resolves url-style addresses. The idea is the same as for text id, and the "tt" prefix means "text table", so "tt:1234" is the same as text id 1234. We may in the future however have different storage mechanism. The most obvious would be to link to externalstore directly and bypass the text table.

That was very helpful, thanks.

Okay, here's my take. At some point in the future (unknown when), we might lose the text table; we'd have to have someplace for third-party installations to store their revision text, and whatever that mechanism is (not necessarily an external store), would be supported along with the external store. We also need to support installations that do not enable MCR. What that all looks like should be dealt with then. Until then, it's ok to continue to use the text ids as is.

Because there's some doubt about which element a given parser may grab if it expects the element to be unique (see T199121#4494050), I'd rather avoid having multiple text elements and use the safer, if less satisfying, mixed format for now, of an unwrapped text element tag formatted the same old way, and new content tags which will be unlooked for and hopefully ignored by existing tools.This also avoids duplicating revision information into the text attributes for the main slot for the sake of backwards compatibility. I agree that when we come up to another breaking change in the future, we should revisit the mixed format.

It's true that most information we provide about the content element really will be attributes of the content in a given slot and not pieces of data somehow bundled up as part of the content, so it would more properly be represented as attributes rather an as revision child elements. We could move the attributes into the content tag thus:

<content origin="308722098" role="wd_entity" model="metadata" format="text/json" text_id="305112983" sha1="..." bytes="xxx" />

for first-pass ('stub') dumps, or like this:

<content xml:space="preserve" origin="308722098" role="wd_entity" model="metadata" format="text/json" sha1="...">stuff goes here...
... 
</content>

for second-pass (revision content) dumps.
In the case that the revision has been suppressed/deleted, we could produce the following:

<content role="wd_entity" model="metadata" format="text/json" deleted="deleted" />

for either pass of the dumps. We would do this for each slot (except 'main', which would be formatted as a plain old boring text element in the way currently done), since suppression/deletion is at the revision level only, not per slot.

What do folks think about the above?

daniel added a comment.EditedAug 27 2018, 4:41 PM

we'd have to have someplace for third-party installations to store their revision text, and whatever that mechanism is (not necessarily an external store), would be supported along with the external store. We also need to support installations that do not enable MCR. What that all looks like should be dealt with then. Until then, it's ok to continue to use the text ids as is.

Not enabling MCR is not a thing - it's not necessarily use MCR, but 1.32 will use the MCR storage schema, with content addresses instead of text ID, for all storage. The old schema will be supported for B/C, but I'd want it to be dropped in 1.33. rev_text_id will then no longer exist (and if it exists, it will be 0). The content address is an integral part of the new storage schema as deployed, the live sites will start relying on it soon, though rev_text_id will still be written for B/C until nothing uses it anymore.

So, for the new schema, I'd highly recommend to remove any mention of a numeric text ID, in favor of an opaque content address. See https://gerrit.wikimedia.org/r/c/mediawiki/core/+/443608/14 for the patch that makes the dump infrastructure use addresses instead of text IDs. Review would be appreciated, since that patch blocks the SDC project. (the patch currently fails CI, but that failure is inherited from the parent patch - working on it).

I'm down with the rest of your proposal: keeping <text> and adding <content> isn't pretty, but the most pragmatic solution.

Oh, one more comment:

We would do this for each slot (except 'main', which would be formatted as a plain old boring text element in the way currently done), since suppression/deletion is at the revision level only, not per slot.

Per-slot suppression isn't possible at the moment, but it's one of the things that I can very well imagine people will want soon. So we should at least make it easy to add that to the schema without too much trouble.

Aklapper changed the edit policy from "Custom Policy" to "All Users".Sep 17 2018, 5:54 PM
Aklapper changed Risk Rating from N/A to default.
Addshore moved this task from incoming to monitoring on the Wikidata board.Sep 19 2018, 7:09 AM

I'd really like to move this forward. Ideally, we'd get the new dump format into the 1.32 release. @ArielGlenn is there anything holding this back? Can we have an IRC discussion on this soon?

daniel moved this task from Backlog to Inbox on the TechCom-RFC board.Sep 19 2018, 5:33 PM

Sorry, I've been trying to get a couple other things off my plate. Will be coming back to this shortly.

I just realized that the proposed dump format is still using numeric text IDs. That cannot be guaranteed to work, text blobs are now identified by URL-like blob addresses: "tt:12345" is the address of text row 12345, and we may start using "ext:DB:..." for ExternalStore soon.

So, instead of <text id="305112983" bytes="143" /> we need to use <text id="tt:305112983" bytes="143" />. The numeric form could still be supported for backwards compatibility, with the prefix "tt:" being assumed if none is given.

I wish I'd had a pointer to the changes in Storage/Blobstore.php and SqlBlobStore.php earlier, didn't realize this new form of text addressing was baked in already. Anyways, I've revised the draft accordingly, sorry for the long delay. Main slot still only gets an id number for text id (this is required for backwards compat); text id in content element gets schema:location format. Is that pattern regex in the schema ok?

Per slot suppression should still be ok even for 'main'; if the main slot is suppressed then we want to output a text element <text deleted="deleted" /> just as we do now if the entire revision is suppressed.

Hopefully that addresses the last of the concerns. Please let me know if you see anything amiss.

daniel added a comment.EditedSep 26 2018, 12:52 PM

Hi Ariel! Thanks for updating. Sorry for the confusion about the address format, I thought you knew this was already in. This is what the content_address field contains. I tried to explain the idea in August, see T199121#4529353.

Main slot still only gets an id number for text id (this is required for backwards compat)

This would break as soon as we start addressing external store entries directly - at that point, there will be no corresponding rows in the text table any more.

Perhaps this can be solved by using a new "address" attribute for the full address, and provide the old id attribute on the main slot's text tag (or even all text tags) if possible (that is, if the address is a "tt:" address). The id should then be marked deprecated and optional (or should become optional in the next version of the spec - but then we'd have to fail the export when encountering a non-tt address).

Leaving the id attribute as int, and introducing a new address attribute for the full address seems cleaner in any case. Using different formats for the id attribute of the text tag depending on where the text tag is used seems very confusing.

Is that pattern regex in the schema ok?

It's a bit too restrictive. One thing that should definitely work there are addresses pointing to external store, which requires flags to be embedded in the address, resulting in something like this: ex:DB://cluster5/7633454/89?utf8&zlib. To provide maximum flexibility, I'd just go for xsd:anyURI.

I wasn't happy with the different formats for the same attribute name either, using a different name for the content text id is grand! Using anyURI makes for a simpler schema too.

I've added the location attr as optional for the main slot, with the intent that it gets used when we have some other schema than 'tt' in play. The next version would have the id attr as optional + deprecated for the main slot, only permitted to be written for blobs with the 'tt' schema, and the location attr as mandatory. How does that sound?

daniel added a comment.EditedSep 26 2018, 3:48 PM

I've added the location attr as optional for the main slot, with the intent that it gets used when we have some other schema than 'tt' in play. The next version would have the id attr as optional + deprecated for the main slot, only permitted to be written for blobs with the 'tt' schema, and the location attr as mandatory. How does that sound?

Sounds mostly good, except that we have to bump the XML schema when we want to do T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production). If we made id optional & deprecated right away, we wouldn't have to do that. Perhaps this is one of the questions to answer during the RFC discussion.

Also, I don't see why location should be optional. Why not just always output it? Afterall, it's what we want people to use in the future, why not make it easy for them to use it now? Though I'm not sure who actually consumes these attributes, and for what purpose (other than generating a full dump from a stub dump). The values are purely internal.

...

Sounds mostly good, except that we have to bump the XML schema when we want to do T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production). If we made id optional & deprecated right away, we wouldn't have to do that. Perhaps this is one of the questions to answer during the RFC discussion.

Let's talk about it there, indeed. I'd prefer to be explicit about when the 'id' attribute is going away, rather than marking it optional indefinitely.

Also, I don't see why location should be optional. Why not just always output it? Afterall, it's what we want people to use in the future, why not make it easy for them to use it now? Though I'm not sure who actually consumes these attributes, and for what purpose (other than generating a full dump from a stub dump). The values are purely internal.

I'm reluctant to change anything about the formatting of the main slot this round. There are tools that convert xml dumps to sql suitable for import, and the text id may be used by some of these as a convenience for constructing the text table entry.

I'm reluctant to change anything about the formatting of the main slot this round. There are tools that convert xml dumps to sql suitable for import, and the text id may be used by some of these as a convenience for constructing the text table entry.

Such tools will have to be rewritten to support the new MCR schema anyway...

This will be discussed at the TechCom meeting Wednesday, October 3rd at 2pm PST(21:00 UTC, 23:00 CET). The announcement was sent to Wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2018-September/090881.html

I have also sent email about it to the xmldatadumps-l mailing list.

daniel moved this task from Inbox to Request IRC meeting on the TechCom-RFC board.Oct 3 2018, 7:55 PM
Tgr added a comment.Oct 4 2018, 7:59 AM

Some issues that we did not have time to fully discuss during the meeting:

  • sha1 B/C. There are two candidates for the old sha1 field: the sha1 of the main slot and the sha1 of the full revision (which is computed as taking the base36 sha1 of slot 1, concatenating the raw value of slot 2, taking the sha1 of that, concatenating slot 3, taking the sha1 of that etc). When there is just the main slot, the two coincide, so no B/C issue with that. When there are multiple slots, which is the more B/C friendly option? sha1 is mainly used for revert detection, but how would an application that is only aware of the main slot define reverts?
  • Backfilling the id. Legacy clients expect a numerical text id for the main slot; when we switch to external storage, we'll have an URL instead (which will be exposed as a different attribute). Can we provide some id still so old clients don't break? The actual value of the id is internal detail and not relevant to clients; what's relevant is that it should either be unique or (preferably) hash-like (ie. only main slots with the same text have the same id). In theory we could use some kind of content hash (first N digits of the base10 sha1, for example) but it will not fit into a 32 bit integer (current max text ID is about 900 million; fake IDs should avoid any range that can be, or could be within reasonable time, a text ID; max value for a signed int32 is around 2 billion) - is that likely enough to break clients to make the whole idea moot?
  • in theory a lot of resources could be saved if identical slot contents are only written out once (they will be a very frequent occurrence due to reverts) - not so much in the actual dump files, as the compression there takes care of duplicate content anyway, but it would mean less text to process, both in the dump infrastructure and for reusers. Text IDs / blob URLs can be assumed to be deduplicated, but they also cannot be published without checking that they correspond to a (visible) revision. That seems like a hard problem; can we do something about it?
daniel added a comment.EditedOct 4 2018, 8:16 AM

Quick addendum to @Tgr's last point:

in theory a lot of resources could be saved if identical slot contents are only written out once (they will be a very frequent occurrence due to reverts)

Reverts are not a new problem, and not the largest problem. Inherited slots are: If a page has three slots, but only one of the is frequently edited, the content of the other two will be copied into the output over and over, for every revision. In the database, this is deduplicated using the "inheritance" mechanism, but in the dump, no such mechanism exists yet.

not so much in the actual dump files, as the compression there takes care of duplicate content anyway, but it would mean less text to process, both in the dump infrastructure and for reusers. Text IDs / blob URLs can be assumed to be deduplicated, but they also cannot be published without checking that they correspond to a (visible) revision. That seems like a hard problem; can we do something about it?

One approach I could image ins:

  • always output the location attribute (instead of only using it in stub mode and omitting it in full mode)
  • when generating output for a given page, keep a ledger of the content location already emitted for that page.
  • when encountering a content location that was already emitted, only emit a sub version of the <text> tag, instead of the full content.

For a single page, the total number of different locations should be small enough to not cause a problem with the size of the ledger. But if we want to make sure, the ledger can be made "leaky", allowing for false negatives. This way, deduplication may not be perfect, but duplicate output would still be rare, while enforcing a cap on the ledger size. This approach is used by Wikibase when emitting RDF dumps, it has an implementation of such a "leaky ledger" in HashDedupeBag, using truncated hashes as keys.

Deduplication however is a major breaking change, and should probably not be in the first XML schema that supports MCR. I'd tag that for version 12.

Following up on the deduplication issue raised above:

The main concern about bloat with the new schema, as I understand it, is that it may be common for only one slot's content to change, in which case we don't want to write out the content for the other slots. This is easy enough to deal with, by just keeping the information for the last revision's slots and texts around when writing out the metadata (stubs) or the full dump in the case of Special:Export; if the slot origin of the new content is the slot revision id we read/request and write the content, otherwise we know it's a duplicate and we write an empty text tag which contains, say, <text duplicate="duplicate" />, akin to the tag we generate in stubs for texts which have deleted content, i.e. <text deleted="deleted" />. There's no need to do full deduplication.

This does mean we have entries for content, origin (required so that the dumps consumer knows which revision contains the full copy of the content), role, model, format, sha1 for every slot's content regardless, or about 250 extra bytes per entry. For 52 million pages, as we currently have on wikidatawiki, that adds up to 13G or so uncompressed, which is pretty reasonable.

I also agree that this should be version 12, but with an eye on growth in size (and slowness) of dumps production; if we have a bunch of bots inserting structured data on Commons and there's a lot of bloat all of a sudden, version 12 might need to be moved up earlier than we plan.

I'll make the changes agreed upon in last night's meeting to the RFC a bit later today and will note here when they are done.

daniel added a comment.Oct 4 2018, 9:59 AM

Side note re the id attribute becoming optional: hasn't it always be (formally) optional, because it was only emitted for stubs?

ArielGlenn added a comment.EditedOct 4 2018, 12:50 PM

https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps#Schema This has now been updated.

Side note re the id attribute becoming optional: hasn't it always be (formally) optional, because it was only emitted for stubs?

Note that text location and id attrs might be omitted in the case where the text is deleted (in practice, this is what the code does), and yet id has never been marked as optional, so I am not marking location as optional either. It might not pay to look too closely at the innards of these things, rather like sausages...

Next up is dealing with the revision vs main slot sha1. How is the revision sha1 computed in the case where there is content in multiple slots, just for my edification?
Answering my own question, it seems, looking at Storage::RevisionStore::computeSha1 that we do sha1(concat(sha1(concat(sha1(slot1 contents), slot2 contents), slot3 contents)... Next question, is this stored/updated in the revision table after any edit?

daniel added a comment.Oct 4 2018, 2:02 PM

Note that text location and id attrs might be omitted in the case where the text is deleted (in practice, this is what the code does), and yet id has never been marked as optional, so I am not marking location as optional either.

I'd rather argue that the spec should be fix to reflect reality. I see no reason to continue spreading a lie ;)

(Sorry for the near-stream-of-consciousness updates here, just trying to Get **It Done.) To get the sha1 discussion started:

I can't imagine any script is going to care about the revision sha1 as separate from the sha1 of the content of each slot separately; if you want to know if an edit was reverted, you can look at the sha1 of the individual slots, and I imagine that much analysis will focus on slot-level activity rather than the set of slots for a given revision. Until we have content in multiple slots, the value will be the same, so it, like everything else, will become an issue next month when Structured Data on Commons goes live. For that reason I'd put the main slot content sha1 in the revision sha1 tag and not even bother with the computed revision sha1 described above.

Please poke holes in this so that we can reach agreement as soon as possible :-)

daniel added a comment.EditedOct 4 2018, 5:23 PM

I can't imagine any script is going to care about the revision sha1 as separate from the sha1 of the content of each slot separately; if you want to know if an edit was reverted, you can look at the sha1 of the individual slots, and I imagine that much analysis will focus on slot-level activity rather than the set of slots for a given revision.

Anything using the existing revision level sha1 for revert detection will miss-detect a revert (or a null-edit) for *all* revisions that did not affect the main slot. While analysis on the slot level may be useful, existing analysis is on the revision level (by definition - slots are new). So it seems reasonable to keep revision-level semantics intact.

Whatever we do, we should definitely include both hashes (main slot and revision), to make the distinction obvious, and the path forward clear, and give consumers the option to change their code to consistently do the thing they want. Only if both hashes are available can we support both kinds of consumers - those that focus on the revision as a whole, and those that focus on individual slots.

A more philosophical reason to maintain the revision level logic: the "content of the revision" is the combined content of all slots. The intended semantics is not "slots are optional miscellany bits and pieces". The intended semantics is "the revision's content consists of multiple slots".

So, in essense: we would break the assumption that two revisions that have the same hash have the same content. We'd also break consistency with revision hashes reported the API.

daniel added a comment.Oct 4 2018, 8:50 PM

Re the ìd`attribute being optional or not: turns out, it's optional already: The "use" attribute of the <attribute> element in an xml schema is indeed optional, and its default value is "optional" :) The fact that this is declared for all other attributes but omitted for id is confusing, though, and should be fixed.

...

Anything using the existing revision level sha1 for revert detection will miss-detect a revert (or a null-edit) for *all* revisions that did not affect the main slot. While analysis on the slot level may be useful, existing analysis is on the revision level (by definition - slots are new). So it seems reasonable to keep revision-level semantics intact.
Whatever we do, we should definitely include both hashes (main slot and revision), to make the distinction obvious, and the path forward clear, and give consumers the option to change their code to consistently do the thing they want. Only if both hashes are available can we support both kinds of consumers - those that focus on the revision as a whole, and those that focus on individual slots.

Yes, I think we do need and want both sha1 values in the output.

A more philosophical reason to maintain the revision level logic: the "content of the revision" is the combined content of all slots. The intended semantics is not "slots are optional miscellany bits and pieces". The intended semantics is "the revision's content consists of multiple slots".
So, in essense: we would break the assumption that two revisions that have the same hash have the same content. We'd also break consistency with revision hashes reported the API.

I understand and agree with the idea that a reivision consists of the content of all of its slots.
What I'm getting at is that folks have until now been studying article or other page content. Sure, there hasn't been other content available for them to examine, but I imagine that a vast majority of folks will still be interested primarily in article content and how it changes over time, as opposed to , say, considering reverts also of various structured data entries for media files. And folks looking at article reverts and expecting to just pick those up will get a bunch of extra entries if they rely on the rev sha1 once other slots have content in them. It's not unreasonable to ask folks who are interested in the content *also* of other slots to check those sha1s specifically, or to look at a new entry which contains *specifically* the rev sha1.

The problem is that there's no nested level where we can put another sha1 tag. We could introduce a new element revsha1, though I don't like it much. We could move both sha1s into attributes of their respective tags (text and revision), which I like even less.

What I'm getting at is that folks have until now been studying article or other page content. Sure, there hasn't been other content available for them to examine, but I imagine that a vast majority of folks will still be interested primarily in article content and how it changes over time, as opposed to , say, considering reverts also of various structured data entries for media files. And folks looking at article reverts and expecting to just pick those up will get a bunch of extra entries if they rely on the rev sha1 once other slots have content in them.

In my experience, dump analysis that is interested in reverts typically doesn't care about the content at all. It analyzes how often reverts happen, how they happen, who does them, how long content that is later reverted (and thus assumed to be "bad") remained visible to the public. All this is on the revision level and would break if we changed the semantics of the <sha1> tag. But we shouldn't guess how people use the hash, we should ask them... the trouble with that is: it takes time.

But a quick search on google scholar turns up a few familiar names, like Denny Vrandecic, Aaron Halfaker, Luca de Alfaro. Denny in particular seems like a good candidate to provide insights, as an author of Revisiting reverts: accurate revert detection in wikipedia.

@Denny @vrandezo @Halfak, what's your take on this? Should the <sha1> tag in dumps continue to match the main slot's content, or continue to match the revision's entire content? We will have to break one of these two assumptions...

As to where to put the hash: In my opinion, the content hash that is a sha1 of the serialized content should be an attribute, just like the byte size. It relates to the serialized blob, and should thus be attached to the <text> tag.

Halfak added a subscriber: FaFlo.Oct 9 2018, 2:34 PM

Seems to me that each slot should have it's own sha1. There's a huge amount of research of non-article content in Wikipedia. I imagine that analysts will be interested in identity reverts (the most common type that are detectable with sha1 histories) in main and any other slots. Having one sha1 represent some concatenation of all the slots isn't as useful as having a sha1 per slot. If I wanted to look for identity reverts across all slots, then I'd simply concatenate the sha1s myself during my analysis.

As for where the sha1's exist in the XML, I'm not sure I have a strong opinion there. It's hard to work out what is being proposed from this enormous Phab task. But from the wiki page, I see <sha1> tags in each <content> "slot" and that makes sense to me. I'm not very worried about having the <sha1> tag because there's already a substantial change to the content structure being proposed.

This is maybe besides the point, but I have many issues with the claims of Flock et al's Revisiting Reverts paper which seems to be driving the research practice away from the use of checksums. In my expert opinion, checksum-based revert detection will be an important measurement strategy for a long future while fine-grained content persistence approaches like those developed by myself and Flock et al, will grow in parallel and not supersede the use of sha1s. I think that @FaFlo, @Denny, and I could have a lot more words about this so I'd encourage taking that off task if y'all are interested in discussing the future of revert detection generally. For the purposes of this task, I think we can agree on sha1s having lasting value for analytics and research broadly.

daniel added a comment.EditedOct 9 2018, 3:28 PM

@Halfak having per-slot hashes is not controversial, the question is what to do with the <sha1> tag that currently exists on the revision level. If we make this the has of the main slot, we will break the assumption that two revisions that have the same hash there have the same content, and thus constitute a revert. Any consumer making that assumption would then miss-detect reverts.

To me, it seems better to use the existing <sha1> for the combined revision-level hash, and have a new hash attribute in each of the <text> tags, because this would avoid braking the "same hash, same content, must be a revert" assumption. But it would break the "this hash is a hash of the bytes the text text tag" assumption. But I don't know of any consumer that actually relies on this.

Do you agree that maintaining the "this hash can be used to decide whether one revision is a revert to another" would eb good? Or do you think it does not matter?

Halfak added a comment.Oct 9 2018, 3:30 PM

Oh. I guess my sense is to drop the existing <sha1> in favor of the <sha1>'s related to individual content slots. It doesn't make sense anymore, and we're breaking the schema anyway to add new content slots.

daniel added a comment.Oct 9 2018, 3:32 PM

Well, we are trying to keep compatibility for most clients that just ignore stuff they don't know. They would still be able to process the dumps as before. Removing the <sha1> tag would be cleaner and safer, but it would also be a hard B/C break. So, IF we keep it, should we keep it with the this-is-the-revision-hash semantics? or with the this-is-the-text-tag-hash semantics?

FaFlo added a comment.Oct 9 2018, 4:57 PM

Of course checksums make lot of of sense for countless use cases, including many in research (mentioned paper was never intended to make a sweeping point to the contrary, but yes, discussion for another time).
And I think MCR is awesome, JFTR.

Regarding the question at hand, I would feel that "this-is-the-text-tag-hash semantics" would provide more value (from my research-bubble perspective), even if it breaks convention. Then, every single part has an individual sha (just as before with the single text part, now there are just more parts). I subscribe to the we-are-breaking-it-anyway view.

daniel added a comment.Oct 9 2018, 5:46 PM

Ok, so if we use the <sha1> tag on the revision level for the main tag hash, where do we put the revision hash? Or do we just ignore it? Note that the revision has is in the database, and is exposed via the API. It's not made up for the purpose of the dumps.

Tgr added a comment.Oct 9 2018, 6:11 PM

Just add a <revision-sha1> tag?

Halfak added a comment.Oct 9 2018, 6:54 PM

where do we put the revision hash?

What is the "revision hash"?

daniel added a comment.EditedOct 9 2018, 7:46 PM

What is the "revision hash"?

A combined hash the identifies the content of the slot across all revisions. It's stored in rev_sha1 in the database, and used by stuff on labs to detect manual reverts. It's also available from the API as rvprop=sha1.

It's currently(*) defined as sha1( slot1 ° sha1( slot2 ° sha1( slot3 ° ... )))). So if there is only one slot, it's just sha1( slot1 ). ° meaning concat.

(*) currently, because this isn't ideal, see T185793.

Re the ìd`attribute being optional or not: turns out, it's optional already: The "use" attribute of the <attribute> element in an xml schema is indeed optional, and its default value is "optional" :) The fact that this is declared for all other attributes but omitted for id is confusing, though, and should be fixed.

On the topic of 'optional' vs 'required' attributes, we have 13 attributes in the 0.10 schema, 2 of which are required, 7 of which are marked as optional explicitly and 4 of which are unmarked and so optional anyways. If we are going to clean things up we should either mark them all or remove 'use="optional"' from all. The mixture we currently have is subject to misinterpretation, as you point out.

Halfak added a comment.EditedOct 10 2018, 10:20 PM

re. the "revision hash", it seems that this has already been determined so I'm not sure what other insights I might give. But FWIW, the combined rev_sha1 seems very crazy :)

If there is combined rev_sha1 that is built by any strategy (crazy or not) in the database, then it should probably also exist in the XML.

Tgr added a comment.Oct 11 2018, 12:50 AM

@Halfak the questions is: given that XML readers are somewhat flexible, so a script that was written some time ago and has no knowledge of MCR will still be able to read MCR dumps and see all the fields it expects, and it will quietly ignore non-main slots as unknown fields, what should such a script see at the location where it looks for the sha1 of the revision?

We could put the revision sha1 there, in which case it would be an accurate flag for reverts but would not match the actual sha1 of the "revision content" (as far as such a legacy script understands, "revision content" is just the main slot) when there are additional slots. Also diffs that the script expects to be non-empty (since the sha1 is different) might in fact be empty (since there is no change in the main slot).

Or we could put the main slot sha1 there, in which case it would continue to be the sha1 of the "revision content" (as a legacy script understands it) and the sha1 would change if and only if the "revision content" (main slot) changes. But there could be two successive revisions with the same sha1 (which wasn't possible before) and in general there would be lots and lots of revisions with the same sha1 (when someone edits a slot but not the main revision) so the sha1 would be rather useless for revert detection.

The question is, which is less likely to break or confuse old clients? Now that I wrote this up it seems pretty obvious that the first option (using the combined sha1 in the place of the old sha1) is superior.

Halfak added a subscriber: mako.Oct 12 2018, 5:02 PM

Aha! I don't think anyone is generally comparing the content text to a specific checksum for any reason (except some old studies by @mako to check to see if the checksums were consistent historically (they aren't). So, I'm a fan of the combined checksum showing up in place of the main <sha1> -- especially if this is already how it is implemented in the DB with rev_sha1.

mako added a comment.EditedOct 12 2018, 5:55 PM

I think it would be more surprising to have the same SHA1 for two different (in any way) revisions than it would to have a SHA1 that reflects something other than the SHA1 of the primary/historical content field. So I guess I like the revision-level
<sha1> reflecting the hash of all parts of the revision.

As @Halfak suggests, checksums in the dumps currently aren't always reproducible from the text of all dumps (admittedly, these exceptions are rare and this fact is not widely known) so it would have been unwise to have been counting on this in the past.

TechCom is placing this on Last Call ending Wednesday 24 October 2pm PST (21:00 UTC, 23:00 CET)

@Halfak to clarify - you originally said the top level <sha1> tag should be the hash of the main slot's content, but you now let yourself be convinced that it's more useful to have the combined revision hash there?

As the above conversation seems to converge on having the combined hash in the revision level <sha1> tag, this raises the question where to put the main slot's content hash. I personally prefer to use an attribute on the <text> tag, but we could also go for <content-sha1> or <content-hash algorithm="sha1" format="base36">.

For clarity, I was originally advocating that we didn't combine any hashes and that instead we provided a <sha1> tag in each of the <content> slots. I now see that we're going to make a mess in favor of backwards compatibility. So there will continue to be a <sha1> tag and <text> at the top of the <revision> tree anyway. (Note that this now differs from the mwapi structure that is forcing everything into "content" and throwing loud warnings if you expect text to appear where it used to be). In that case, I'd like the structure of the mediawiki database to be reflected here. E.g. <revision>'s <sha1> should reflect the rev_sha1 field.

As for the sha1 of the main slot (<revision>'s <text>), I don't have a strong opinion. Attributes and tags require the same code complexity to deal with when processing the XML dumps. If you decide on a tag, I do think that "content-sha1" is needlessly ambiguous. It could be "text-sha1" to reflect the name of the <text> tag or maybe "main-sha1" to reflect the concept of the "main slot".

OK, it looks like everyone's weighed in. so I'll suggest:

<revision>
  <id>308722154</id>
  ...
  <format>text/x-wiki</format>
  <text location="tt:305112983" sha1="xxxx" bytes="143" />   <-- contains sha1 of content in main slot
  <sha1>a9kdtqq3buy5tribez2u0ad4b6fdxq2</sha1>               <-- revision sha1

I'm not excited about the attribute but I like it better than an extra tag.

Looks good to me!

TechCom has moved Last Call to end Wednesday 31 October 2pm PST (21:00 UTC, 22:00 CET). This is due to no TechCom meeting next week to vote on final approval.

I've added the Sha1 as an attribute of the Text element in all slots; this means removing it as an element from the Content element. I think the new markup reflects these changes correctly, please double-check though.

As to 'optional' or 'required', I'd like to remove all notations of 'optional' and add a comment near the top of the schema reminding folks that attributes by default are optional.

I've updated the rfc proposal removing all 'use=optional' markup as described above. This should probably get brought up during last call at this week's TechCom to make sure everyone's ok with it.

Approved as amended, per today's TechCom meeting

daniel closed this task as Resolved.Nov 5 2018, 3:49 PM
greg added a project: Multimedia.Mar 7 2019, 10:59 PM
ArielGlenn moved this task from Active to Done on the Dumps-Generation board.Apr 1 2019, 1:59 PM