Oversighted Revisions on HTML Exports
Closed, ResolvedPublic

Description

Following up from T257480.

@ArielGlenn made a good point about oversighted revisions not being handled currently in our system.

In the event handler code, I don't see anything that deals with oversighted or hidden revisions. I'm not sure if it's guaranteed that a page will be re-rendered after oversight or hiding of the current revision. Maybe we need to have a closer look at this. Typically revisions are hidden or oversighted because of egregious copyright violations or really offensive material relating to living persons, see https://en.wikipedia.org/wiki/Wikipedia:Revision_deletion#Criteria_for_redaction so we definitely want to catch those. I hope that these are visible via the recent changes event stream; maybe someone who knows better can weigh in on that.

Wanted to start a thread to find the best way to get this stream into our current system. Ideally this would be part of EventStreams. Is that possible, @Ottomata? Is there a way to surface just the fact that a revision was oversighted, so that we can handle it, since it is currently omitted?

FYI - creating a new ticket just to start to formalize our phab use here. Hope this is the right way to do this :p

Event Timeline

It should be possible! The steps would be

  1. Create an event schema in https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/primary/
  2. Add code in the EventBus extension to emit the event. Hopefully there is a MW hook for when these actions happen. If not, we'd have to make one in MW.
  3. Add stream config for the new stream(s) in wgEventStreams in the operations/mediawiki-config repo.
  4. Add the new streams to the list of streams EventStreams is allowed to expose (we can do this part).

The instructions for instrumentation events are kind of EventLogging specific, but the parts about creating a schema and stream config are the same. (I should make a how to doc for MediaWiki model events in EventBus too...)

Core Platform and @Pchelolo can probably help you more. Also, you should probably discuss this with legal; sometimes even the fact that a revision is hidden is private info.

Awesome! Thanks @Ottomata -- checking with legal on this.

Create an event schema in https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/primary/
Add code in the EventBus extension to emit the event. Hopefully there is a MW hook for when these actions happen. If not, we'd have to make one in MW.
Add stream config for the new stream(s) in wgEventStreams in the operations/mediawiki-config repo.

All these steps are already done. We have had revision-visibility-change event since the beginning.

Oh, did not realize that 'oversighted' == visibility change. Ok! Then yeah you just need to figure out if we are allowed to expose it publicly. Here's the schema.

I'd expect legal to say we are not allowed to expose this publicly.

I'd expect legal to say we are not allowed to expose this publicly.

I'm not a lawyer, but I would expect the opposite - we do not actually include any of the suppressed data in the event, and the event is emitted after the data was already suppressed, so we're not really exposing much here. But again, I do not know.

I think that sometimes just the fact that a revision was suppressed is private.

Talked with Tony and he's double checking, but since it's just exposing an event that happened - it might be a bigger liability to have these "bad revisions" live in our end exports (dumps) without noting they were suppressed. He's checking with a few folks and I'll sync back here with his findings.

Thanks all btw, @Ottomata and @Pchelolo - this is super helpful for us. Really appreciate your help on this

:begintroll: Or you could move the html dump generation inside production and consume from Kafka cough cough :endtroll:

:p

it might be a bigger liability to have these "bad revisions" live in our end exports (dumps) without noting they were suppressed.

Adding @JFishback_WMF for a privacy perspective. Given that os'd revisions can feature many types of sensitive content up to and including PII and specific threats, I would assume we would want to limit their exposure and likely any record of their occurrence as much as possible. I believe this is the assumption of many within the community and we've certainly dealt with any number of security-related bugs where os logic has been neglected or ignored.

So we are really focused on the "best last revision" of articles across the wikis and not adding historical revisions into the exports (dumps). Thus, whatever version of an article we hold should never include a sensitive revision. If we were producing historical dumps, which as of now we aren't, I think a record would make sense; for now, consumers just download a "non-sensitive" view and come back later to do it again. Some of the past exports could live on machines though, which could potentially have something that was oversighted after we compiled the dump...

Not entirely sure if that's what you're saying, just wanted to give some more context.

To have more clarity, the schema of the event we're talking about can be viewed here: https://gerrit.wikimedia.org/r/plugins/gitiles/schemas/event/primary/+/refs/heads/master/jsonschema/mediawiki/revision/visibility-change/1.0.0.yaml The revision deletions are present in the MW log table and can be viewed in Special:Log. I think we are not including more info in the event than is already present in the public log table.

Not entirely sure if that's what you're saying, just wanted to give some more context.

The latest revision of the page can not be suppressed, only a historical revision, so if you had the bad luck of picking up a bad revision while 'compiling' the dump, it will be suppressed later and a newer revision will be created. Having a stream of revision suppressions will be helpful, because if a revision-visibility-change event arrives while the dump is being processed, you need to re-fetch the bad revision if it ended up being included in the dump.
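The re-fetch check described above could look roughly like the sketch below. It is only an illustration: `page_to_refetch` and `dumped_rev_ids` are hypothetical names, and the event fields used (`rev_id`, `page_id`, and a `visibility` object with `text`, `user` and `comment` flags) are assumptions about the revision-visibility-change payload that should be checked against the schema linked in this thread.

```python
from typing import Any, Dict, Optional, Set


def page_to_refetch(event: Dict[str, Any], dumped_rev_ids: Set[int]) -> Optional[int]:
    """Return the page_id to re-fetch, or None if the dump is unaffected.

    `event` is a decoded revision-visibility-change event (field names assumed).
    `dumped_rev_ids` holds the revision ids already written to the in-progress dump.
    """
    visibility = event.get("visibility", {})
    # A missing flag is treated as visible; only an explicit False counts as hidden.
    anything_hidden = any(
        visibility.get(part, True) is False for part in ("text", "user", "comment")
    )
    if anything_hidden and event.get("rev_id") in dumped_rev_ids:
        return event.get("page_id")
    return None


# Example: revision 101 of page 7 was hidden and is part of the dump.
example_event = {
    "rev_id": 101,
    "page_id": 7,
    "visibility": {"text": False, "user": True, "comment": True},
}
print(page_to_refetch(example_event, {100, 101}))
```

The function only decides whether a page needs another fetch; actually consuming the stream and re-rendering the page are left out on purpose.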

For historical dumps however the situation gets trickier. I do not know enough about how you intend to store those, but in the perfect world you'd want to go back and drop it from the historic dump as well. But again, I'm not qualified to make that call.

Ok heard back from Legal on this - response below from Tony S:

I am comfortable with the exports including a record of redactions/oversights, provided (of course) that the exports don't include the redacted/oversighted content itself.

Neither of us are aware of how records of oversights are recorded/categorized at the Mediawiki level. The Alexander Weibel talk page is an example of how the oversight records appear on wiki; if you scroll down, you can see edit entries with struck-through links and the edits' abstract removed. Disclosing this via the dumps is fine. The exports should not include any metadata which could theoretically be used to reconstruct redacted content.

My understanding is that the only exposure we are providing here is the notification that something was redacted/oversighted, so we should be good to go. Our system will handle the rest of this by scrubbing that revision (in the case that it's present) from our database and removing it from prior exports.
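As a rough sketch of that scrubbing step, assuming the database can be modelled as a mapping from revision id to stored HTML and each prior export as a set of the revision ids it contains: `scrub_revision`, `store` and `exports` are hypothetical names for illustration, not part of any existing system.

```python
from typing import Dict, List, Set


def scrub_revision(
    rev_id: int, store: Dict[int, str], exports: Dict[str, Set[int]]
) -> List[str]:
    """Drop rev_id from the store and return the names of exports that held it."""
    # Remove the stored content if present; a missing revision is not an error.
    store.pop(rev_id, None)
    affected = sorted(name for name, rev_ids in exports.items() if rev_id in rev_ids)
    for name in affected:
        # Bookkeeping only: a real export file would also need to be regenerated.
        exports[name].discard(rev_id)
    return affected


store = {1: "<html>old</html>", 2: "<html>new</html>"}
exports = {"export-a": {1, 2}, "export-b": {2}}
print(scrub_revision(1, store, exports))
```

Returning the affected export names keeps the decision about how to rebuild or withdraw those exports separate from the scrubbing itself.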

With that, we should be good to go to add the notification to the event stream (assuming everything I said above is correct :)).

Change 630262 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] Eventstreams: expose revision-visibility-change events.

https://gerrit.wikimedia.org/r/630262

Change 630262 merged by jenkins-bot:
[operations/deployment-charts@master] Eventstreams: expose revision-visibility-change events.

https://gerrit.wikimedia.org/r/630262

So we are really focused on the "best last revision" of articles across the wikis and not adding historical revisions into the exports (dumps).

The latest revision of the page can not be suppressed, only a historical revision

Hm, if this is true, why do you need the revision-visibility-change events? Won't the latest revision-creates ensure that you have an unsuppressed revision at the time of the dump?