MediaWiki supports a revision deletion feature to deal with copyright violations or the publication of private data. This feature lets users with specific rights restrict the visibility of user-supplied data in specific revisions. In some cases, the entire revision is suppressed, while in other cases the edit comment, user name or revision content is suppressed selectively.
RESTBase stores content beyond a relatively short cache timeout, so we need to make sure that it enforces the same per-revision restrictions as MediaWiki. We could implement this in two phases:
Option 1: Check with the MediaWiki API on each request (slow but simple)
In this variant, we'll send a request checking protection status to the MediaWiki API in parallel with each revision request. We should choose a cheap API entry point for this, since all information we need here is 'user with this cookie (if any) can read revision X'. A similar userCan entry point will be useful for the authentication service project, so we could consider investing some time into creating a dedicated entry point if no good current entry point exists. When the MediaWiki API returns a negative answer, we deny the content to the user even if we have it in storage.
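The parallel check described above could look roughly like the following sketch. The two coroutines are hypothetical stand-ins for the storage lookup and the MediaWiki permission request; their names, the cookie value and the returned status codes are assumptions made for illustration, not part of any existing API.

```python
import asyncio
from typing import Optional, Tuple

# Hypothetical stand-in for the RESTBase storage lookup.
async def fetch_from_storage(rev_id: int) -> Optional[str]:
    stored = {1: "<p>public</p>", 2: "<p>suppressed</p>"}
    return stored.get(rev_id)

# Hypothetical stand-in for a 'user with this cookie can read revision X' check.
# Here revision 2 is hidden from everyone except a privileged cookie.
async def user_can_read(rev_id: int, cookie: Optional[str]) -> bool:
    return rev_id != 2 or cookie == "oversighter"

async def get_revision(rev_id: int, cookie: Optional[str] = None) -> Tuple[int, Optional[str]]:
    # Issue both requests concurrently, so the permission check adds
    # little latency beyond the slower of the two calls.
    content, allowed = await asyncio.gather(
        fetch_from_storage(rev_id),
        user_can_read(rev_id, cookie),
    )
    if not allowed:
        # Deny even though the content may be present in storage.
        return 403, None
    if content is None:
        return 404, None
    return 200, content
```

The permission result is checked before the storage result, so a restricted revision is denied regardless of whether RESTBase still holds a copy.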
Alternatively, we can simply retrieve revision information for each request. The API endpoint exposes userhidden, sha1hidden and commenthidden flags. Since this is easy to implement, it makes a safe fallback: we could try to implement option 2 first and fall back to option 1 if we run out of time.
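As a rough illustration of reading those flags, the sketch below extracts them from an already decoded revision-information response. The response shape (pages keyed by page id, a flag signalled by the mere presence of its key) is an assumption about the JSON output and should be checked against the actual API; the function name is hypothetical, and only the three flags named above are covered.

```python
# Flags named in the text; a real implementation may need to cover more.
HIDDEN_FLAGS = ("userhidden", "commenthidden", "sha1hidden")

def restrictions_from_api(response: dict, rev_id: int) -> set:
    """Return the set of hidden flags reported for one revision."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for rev in page.get("revisions", []):
            if rev.get("revid") == rev_id:
                # Assumption: a flag is set when its key is present at all.
                return {flag for flag in HIDDEN_FLAGS if flag in rev}
    raise KeyError(rev_id)

# Hand-written sample in the assumed shape, not real API output.
sample = {
    "query": {
        "pages": {
            "42": {
                "title": "Foo",
                "revisions": [
                    {"revid": 10, "userhidden": "", "commenthidden": ""},
                    {"revid": 11, "user": "Alice", "comment": "typo fix"},
                ],
            }
        }
    }
}
```

An empty result means the revision carries none of the three flags; any non-empty result tells RESTBase which fields to withhold.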
Option 2: Track changes in MediaWiki & store block state in RESTBase
The Parsoid PHP extension already tracks revision deletions, so we could use this information to update restriction information in RESTBase. Once we are reasonably satisfied with the stability of this tracking, we can stop performing requests to the PHP API when no restrictions are indicated in the revision table.
One option is to add a set<string> (or JSON?) attribute to the revision table maintained in each pagecontent bucket to track these restrictions. Doing this on the revision table means that we'll have to resolve a timeuuid to a revision on each request. Timeuuids of individual properties don't necessarily match a revision's timeuuid, so we can't just use a secondary index on the revision table.
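To make the per-request lookup concrete, here is a minimal sketch of resolving a property's time to a revision and its restriction set. Plain integers stand in for timeuuid timestamps, the in-memory list stands in for the revision table, and the rule that a property belongs to the latest revision at or before its time is an assumption made for illustration.

```python
from bisect import bisect_right

# (timestamp, revision id, restrictions), sorted by timestamp.
# Integers stand in for timeuuid times; the data is made up.
REVISIONS = [
    (100, 1, set()),
    (200, 2, {"userhidden"}),
    (300, 3, set()),
]
_TIMES = [t for t, _, _ in REVISIONS]

def restrictions_for(tid_time: int):
    """Find the latest revision at or before tid_time and its restrictions."""
    idx = bisect_right(_TIMES, tid_time) - 1
    if idx < 0:
        raise LookupError("no revision at or before this time")
    _, rev_id, restrictions = REVISIONS[idx]
    return rev_id, restrictions
```

A property rendered at time 250 resolves to revision 2 here and inherits its userhidden restriction, even though its own time matches no row in the table.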
An alternative would be to represent revision deletions separately from the revision table, for example as a table keyed by page name × time range. Time ranges can also easily represent blocks that reach into the future (for revisions that don't exist yet). This could allow a much more compact representation of these restrictions. It might be possible to encode all restrictions for a bucket in an in-memory Bloom filter, which would avoid database requests for blocks in the common case where no restriction applies.
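The following sketch combines both ideas under stated assumptions: a dictionary stands in for the restriction table, an end of None marks a range that reaches into the future, and the Bloom filter is keyed by page name. The keying, the filter parameters and all names are illustrative choices rather than a settled design.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

# page name -> list of (start, end, flags); end=None means open-ended.
RESTRICTIONS = {
    "Foo": [(100, 200, {"commenthidden"}), (500, None, {"userhidden"})],
}

pages_with_restrictions = BloomFilter()
for name in RESTRICTIONS:
    pages_with_restrictions.add(name)

def flags_at(page: str, time: int) -> set:
    """Return the restriction flags that apply to a page at a given time."""
    if not pages_with_restrictions.might_contain(page):
        return set()  # definite miss: skip the table lookup entirely
    flags = set()
    for start, end, range_flags in RESTRICTIONS.get(page, []):
        if start <= time and (end is None or time <= end):
            flags |= range_flags
    return flags
```

Because a Bloom filter never reports a false negative, a miss safely skips the table; a false positive only costs one extra lookup that finds no matching range.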
It might also make sense to handle user name blocks separately, as they would typically be global & can be represented by a relatively short list.