
Capture rev_is_revert event data in a stream different than mediawiki.revision-create
Closed, Resolved, Public

Description

rev_is_revert will be removed from mediawiki.revision-create to fix T215001: Revisions missing from mediawiki_revision_create. We should try to provide this info in another event stream.

From some discussions, it seems we may be able to do this by:

  1. Tagging revisions with some kind of is_revert tag, perhaps with the rev_revert_details info as well?
  2. Making the mediawiki.revision-tags-change stream public in EventStreams.

Does revision-tags-change have PII? Can we even make it public? I'd expect revision tags to already be public info, so I think so? Tagging Privacy-Engineering for help.

Alternatively, we could create a new mediawiki.revision-revert stream.

Event Timeline

JFishback_WMF moved this task from Incoming to Backlog on the Privacy Engineering board.

Some potential concerns:

  1. There's a property within the revision-tags-change schema that may raise issues similar to those in T241410.
  2. For the actual data, we need to be sensitive to MediaWiki-suppressed titles and usernames, and have some means of redacting such data from the various streams if it accidentally leaks (typically due to delayed suppression actions).
fdans moved this task from Incoming to Event Platform on the Analytics board.

Interesting.

For 1., is the only potential issue the chronology_id? I can't totally remember what happened to resolve T241410, but was it to stop sending chronology_id in the event?

For 2., this is the case for all public streams and datasets, no? Do we need to fix this for this one stream in order to make it public? I think the revision-create stream itself could benefit more greatly from solving this problem.

For 1., is the only potential issue the chronology_id? I can't totally remember what happened to resolve T241410, but was it to stop sending chronology_id in the event?

Yes, that one. Since that task is still protected (and likely always will be), it probably makes sense to review the issue there. This is a pretty good summary/real world example on the task: T241410#5764673. Again, I'm not certain that issue exists within the context of the data being discussed here, though if chronology_id behaves the same way, then I'd imagine it would.

For 2., this is the case for all public streams and datasets, no? Do we need to fix this for this one stream in order to make it public? I think the revision-create stream itself could benefit more greatly from solving this problem.

I think the Security-Team (though I can't speak specifically for Privacy Engineering) would likely rate this as higher risk if this or other streams could not accommodate this mediawiki data permissions model, since it would be introducing a Vuln-Infoleak.
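To make the concern concrete, here is a minimal sketch of how a stream publisher or consumer might redact a suppressed revision's title, comment, and username from an already-captured event. The function name `redact_suppressed`, the `suppressed_rev_ids` set, the placeholder string, and the field names (`rev_id`, `page_title`, `comment`, `performer.user_text`) are assumptions for illustration, not a description of how EventStreams behaves.

```python
import copy

# Hypothetical placeholder; any sentinel value would do.
REDACTED = "[suppressed]"


def redact_suppressed(event, suppressed_rev_ids):
    """Return a copy of `event` with sensitive fields blanked if its
    revision is in `suppressed_rev_ids`; otherwise return it unchanged.

    Field names are assumed, not taken from a confirmed schema.
    """
    if event.get("rev_id") not in suppressed_rev_ids:
        return event

    redacted = copy.deepcopy(event)  # never mutate the caller's event
    for field in ("page_title", "comment"):
        if field in redacted:
            redacted[field] = REDACTED

    performer = redacted.get("performer")
    if isinstance(performer, dict) and "user_text" in performer:
        performer["user_text"] = REDACTED
    return redacted


example_event = {
    "rev_id": 42,
    "page_title": "Example",
    "comment": "some edit summary",
    "performer": {"user_text": "ExampleUser"},
    "tags": ["mw-undo"],
}
cleaned = redact_suppressed(example_event, {42})
```

This only covers data that is still under the publisher's control. It does nothing for events that subscribers consumed before the suppression happened, which is the delayed-suppression case raised above.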

if chronology_id behaves the same way, then I'd imagine it would.

Just looking at both revision-create and revision-tags-change, I don't see the chronology_id field being set. I don't see a linked code change on T241410, but I think we must have stopped emitting it.

likely rate this as higher risk if this or other streams could not accommodate this mediawiki data permissions model, since it would be introducing a Vuln-Infoleak.

Right, I guess the question is: how much worse is the potential for e.g. suppressed revisions (without any content, only page titles and comments) being consumable for 7 days, than the risk of having those suppressed revisions readable by current subscribers or readers, or in the xml dumps?

Just looking at both revision-create and revision-tags-change, I don't see the chronology_id field being set. I don't see a linked code change on T241410, but I think we must have stopped emitting it.

The revert and prod branch deploy for that issue are described here: T241410#5765879.

Right, I guess the question is: how much worse is the potential for e.g. suppressed revisions (without any content, only page titles and comments) being consumable for 7 days, than the risk of having those suppressed revisions readable by current subscribers or readers, or in the xml dumps?

Yes, that will need to be assessed by @JFishback_WMF and/or @sguebo_WMF as part of a (formal or informal) privacy review. I just wanted to call this out to set expectations as soon as possible.

chronology_id was initially added by Stas for the Wikidata query service. Given that the search team is actively working on using Flink for the query service updater, I'm not sure they will be using the chronology_id field. @dcausse can you shed some light on whether you need the chronology_id field in the events?

If search team doesn't need the field, we can just drop it everywhere.

Revision-tags-change exposes pretty much the same info as revision-create, plus tags. AFAIK, neither the set of tags on a given revision nor the tags themselves can be suppressed.

chronology_id was initially added by Stas for the Wikidata query service. Given that the search team is actively working on using Flink for the query service updater, I'm not sure they will be using the chronology_id field. @dcausse can you shed some light on whether you need the chronology_id field in the events?

We stopped relying on this and dropped all the references from the codebase. To solve the problems this field was trying to solve we'll rely on simpler heuristics (e.g. T279698).

If search team doesn't need the field, we can just drop it everywhere.

Fine by me; you can drop it completely.

@Pchelolo putting aside the privacy question for the moment, could we add an is_revert revision tag? How is that even done, via the job queue? Is it possible to add the extra info captured in the rev_revert_details field too? Is doing so a good idea?

@Pchelolo putting aside the privacy question for the moment, could we add an is_revert revision tag? How is that even done, via the job queue? Is it possible to add the extra info captured in the rev_revert_details field too? Is doing so a good idea?

I will do some research and get back to you. It's complicated.

I'm just here to mention that "2. Making the mediawiki.revision-tags-change stream public in EventStreams." is something that was also brought up in T266375#6608351.

There is now a revision tag to identify reverts (T254074), so the information is captured in the revision-tags-change stream. The work to make that stream public is covered by T294391.
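For consumers who previously relied on rev_is_revert, a rough sketch of deriving the same signal from a revision-tags-change event could look like the following. The tag names in `REVERT_TAGS`, the event fields `tags` and `prior_state.tags`, and the sample payload are assumptions for illustration; check them against the wiki's Special:Tags page and the published schema before relying on them.

```python
import json

# Assumed names of the tags that mark a revision as a revert.
REVERT_TAGS = frozenset({"mw-rollback", "mw-undo", "mw-manual-revert"})


def newly_added_tags(event):
    """Tags present on the revision now but absent from its prior state."""
    current = set(event.get("tags", []))
    prior = set((event.get("prior_state") or {}).get("tags", []))
    return current - prior


def is_revert(event):
    """True if the revision currently carries any assumed revert tag."""
    return bool(set(event.get("tags", [])) & REVERT_TAGS)


# Hypothetical event payload, shaped like one line of the stream.
raw = '''{
  "rev_id": 12345,
  "page_title": "Example",
  "tags": ["mobile edit", "mw-undo"],
  "prior_state": {"tags": ["mobile edit"]}
}'''
sample_event = json.loads(raw)

revert_flag = is_revert(sample_event)
added = newly_added_tags(sample_event)
```

Unlike rev_is_revert, this does not carry the rev_revert_details information, so a consumer that needs the reverted revision IDs would have to obtain them some other way.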