Currently, the SHA1 hash in the revision table is based solely on the serialized content of the revision. The new code in RevisionSlots behaves the same as the old code in Revision in that respect, except that it combines the hashes of all slots.
However, in order to provide the intended semantics ("revisions with the same hash have the same content"), this hash should differ for revisions that have the same serialized content in different slots, or in the same slot with a different content model. For example, the wikitext [[5]] is not the same content as the JSON [[5]], even though both serialize to the same bytes.
Re-defining the hash function would be simple, but that would break compatibility with all existing hashes in the database. Only including the model and slot role if there is more than one slot would be an option, but would be confusing.
In order to safely re-define the hash, we should introduce a prefix that allows us to identify the hash function from the hash string itself, similar to the way the prefix works for password hashes. However, in order to have room for such a prefix in the existing 32-character rev_sha1 field, we'd have to change the encoding from base36 to base64 first.
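The length constraint can be checked with a bit of arithmetic. The following Python sketch is illustrative only and not MediaWiki code; the function name is made up for this example. It computes how many characters a 160-bit SHA-1 digest needs in hex, base36, and base64.

```python
import base64
import hashlib


def sha1_encoded_lengths():
    """Return the number of characters a SHA-1 digest needs per encoding."""
    digest = hashlib.sha1(b"example").digest()  # 20 raw bytes = 160 bits

    # Smallest number of base36 digits that can represent any 160-bit value.
    base36_digits = 0
    while 36 ** base36_digits < 2 ** 160:
        base36_digits += 1

    return {
        "hex": len(digest.hex()),
        "base36": base36_digits,
        "base64": len(base64.b64encode(digest)),
    }


print(sha1_encoded_lengths())
```

A base36 hash can need all but one character of the 32-character field, which leaves no room for a three-character prefix, whereas the base64 form leaves four characters spare.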
Proposal:
- Change RevisionSlots::calculateSha1 to return a hash constructed as follows:
  - Sort all slots by slot role name.
  - For each slot, construct a string consisting of the slot role, the content model, and the slot's content hash, separated by pipe characters (|).
  - Concatenate all these strings, separated by \n (LF).
  - Calculate the SHA1 hash of the result, and encode it using base64_encode( hex2bin( $hash ) ). This gives a 28-character string.
  - Prefix the result with "31:" (referring to MW 1.31). This results in a 31-character string.
  - Return that 31-character string as the hash.
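The steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the actual RevisionSlots implementation: the function name and the (role, model, content hash) tuple representation of a slot are hypothetical, and the result is only meaningful if the per-slot content hashes are supplied in one agreed-upon encoding.

```python
import base64
import hashlib


def calculate_revision_hash(slots):
    """Combine per-slot hashes into a prefixed, base64-encoded revision hash.

    `slots` is an iterable of (role, content_model, content_hash) tuples;
    this representation is an assumption made for the sketch.
    """
    # Sort by slot role name so the result does not depend on input order.
    ordered = sorted(slots, key=lambda slot: slot[0])

    # One "role|model|hash" line per slot, joined with LF.
    combined = "\n".join("|".join(slot) for slot in ordered)

    # Raw 20-byte digest; base64 of the raw bytes matches
    # base64_encode( hex2bin( $hash ) ) applied to the hex digest.
    digest = hashlib.sha1(combined.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")  # 28 characters

    return "31:" + encoded  # 31 characters in total


# Same serialized content hash, different content model: hashes differ.
as_wikitext = calculate_revision_hash([("main", "wikitext", "abc")])
as_json = calculate_revision_hash([("main", "json", "abc")])
print(as_wikitext != as_json)
```

Because the role and model are part of the hashed string, the same content hash in a different slot or under a different content model produces a different revision hash.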
These new-style hashes are incompatible with the hashes we previously generated, but they are easily distinguishable based on the presence of the ':' separator.
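A consumer could tell the two styles apart roughly as follows. This is a minimal Python sketch; the function name and the "legacy-base36" label are hypothetical and only serve the illustration. It relies on the fact that base36 strings contain only digits and lowercase letters, so they never contain a colon.

```python
def hash_style(value):
    """Return the hash-function prefix, or "legacy-base36" if there is none."""
    prefix, separator, _rest = value.partition(":")
    if separator:
        # New-style hash: the part before the first ':' names the scheme.
        return prefix
    # No ':' present, so this must be an old unprefixed base36 hash.
    return "legacy-base36"


print(hash_style("31:AAAAAAAAAAAAAAAAAAAAAAAAAAA="))
print(hash_style("phoiac9h4m842xq45sp7s6u21eteeq1"))
```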
Note that with the new MCR schema, the old base36 sha1 of each slot is still available in the cnt_sha1 field of the new content table. The hashes in the rev_sha1 field can therefore be re-generated with the new algorithm from the information in cnt_sha1.
Also note that rev_sha1 is not used for anything in MediaWiki itself. It exists mainly for diagnostic purposes, and for the benefit of tools on Toolforge.
This is nevertheless a breaking change for anything that exposes the revision hash, including the respective API modules and the XML dump format.