Page MenuHomePhabricator

Handle revision deletion and suppression in RESTBase
Closed, ResolvedPublic

Description

MediaWiki supports a revision deletion feature to deal with copyright violations or the publication of private data. This feature lets users with specific rights restrict the visibility of user-supplied data in specific revisions. In some cases, the entire revision is suppressed, while in other cases the edit comment, user name or revision content are suppressed selectively.

RESTBase will store content beyond a relatively short cache timeout, so we need to make sure that we enforce the same per-revision restrictions. We could implement this in two phases:

Option 1: Check with the MediaWiki API on each request (slow but simple)

In this variant, we'll send a request checking protection status to the MediaWiki API in parallel with each revision request. We should choose a cheap API entry point for this, since all information we need here is 'user with this cookie (if any) can read revision X'. A similar userCan entry point will be useful for the authentication service project, so we could consider investing some time into creating a dedicated entry point if no good current entry point exists. When the MediaWiki API returns a negative answer, we deny the content to the user even if we have it in storage.

Alternatively, we can simply retrieve revision information for each request. The API end point exposes userhidden, sha1hidden and commenthidden flags. This is easy to implement, so we could try to implement option 2 first, and if we run out of time, just fall back to option 1.

Option 2: Track changes in MediaWiki & store block state in restbase

The Parsoid PHP extension also already implements tracking of revision deletions, so we could use this information to update restriction information in RESTBase. Once we are reasonably satisfied with the stability of this tracking, we can stop performing requests to the PHP API when no restrictions are indicated in the revision table.

One option is to add a set<string> (or JSON?) attribute to the revision table maintained in each pagecontent bucket to track these restrictions. Doing this on the revision table means that we'll have to resolve a timeuuid to a revision on each request. Timeuuids of individual properties don't necessarily match a revision's timeuuid, so we can't just use a secondary index on the revision table.

An alternative would be to represent revision deletions separately from the revision table, say as a table of page name x time range. Time ranges can also easily represent blocks that reach into the future (for revisions that don't exist yet). This could allow a much more compact representation of these restrictions. It might be possible to encode all restrictions for a bucket in an in-memory bloom filter, which would save database requests for blocks in the common case.

It might also make sense to handle user name blocks separately, as they would typically be global & can be represented by a relatively short list.

Event Timeline

GWicke renamed this task from Track revision deletion and suppression in RESTBase to Handle revision deletion and suppression in RESTBase.
GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke changed Security from none to None.
GWicke added a subscriber: aaron.
GWicke subscribed.

For v1, shall we press forward with phase 1 as described above?

@Jdouglas: If the revision info requests already give us the information we need about suppression / deletion status, then we could just be lame & do a revision info request to the PHP API on each restbase request. That would be easy to implement on our end, so I don't think we need to do so right now (after verifying that this would work).

Re updates: As a first step, we could fork the Parsoid PHP extension & make it pass the right information about revision deletion in its requests to RESTBase. We might want to nominate a specific end point for this, which then does the dependent internal requests.

In the longer term we probably want to use a more general & reliable pub/sub system like Kafka for event dissemination. Ticket for that forthcoming.

Should T1228 be blocked by this task? It seems to me the proper resolution of this task depends on T84962 and/or T84923 (for which we have agreed not to tackle before v1)

T1228 already depends on this task

@GWicke I know, that's what I am questioning :P From what I can tell, unless we do a hack or something, this task is not going to be completed before the v1 rol-out. Or is T1228 =! v1 ?

@mobrovac, T87520 is the quick & dirty solution. It involves not much more than forking the parsoid extension, doing some search & replace for the names, and a bit of minor request parameter tweaking to reflect the new URL layout and update headers. In other words, I think it's very doable before the release.

Going with option 1b:

  1. check in our storage in parallel if we have info for a given revision (addressed in T88652), which will also make the appropriate API calls if the info is missing
  2. if obtaining the revision info does not throw an error, proceed; otherwise bail out

The revision API will throw an error if:

  • the revision ID is an invalid one
  • the revision does not exist
  • the revision has any type of restrictions applied to it (for simplicity reasons)