
Set up the foundation for the ReviewStream feed
Closed, Declined · Public

Description

Publish a new topic ReviewStream containing everything from mediawiki.revision-create and mediawiki.log-events (T155804), plus the ORES scores for the models available in the current wiki.

The current idea is to do this as (mediawiki.revision-create + ORES, etc.) => mediawiki.review-stream-revision-create.

mediawiki.review-stream-revision-create + mediawiki.log-events => ReviewStream

Related Objects

Event Timeline

mobrovac added a subscriber: mobrovac.

PR #108 for ChangeProp separates the ORES rules into a separate topic. This will allow us to use the result in order to populate the new topic.

Next step would be to create the topic in question. Is the idea to have all new revisions in there with their scores or only bad or good edits? If we produce all new revisions, should we also include an extra field, signalling if the revision is to be considered a good/bad edit based on a threshold?

Such a field wouldn't really make sense. Flagging is use-case dependent.

Generally, we recommend users consume the "model_info" endpoint in order to know what ranges of scores they are probably interested in. See https://ores.wikimedia.org/v2/scores/enwiki/damaging/?model_info for an example.

The following represents the most inclusive threshold

"filter_rate_at_recall(min_recall=0.9)": {
 "filter_rate": 0.753,
 "recall": 0.902,
 "threshold": 0.173
},

The following is a stricter threshold

"filter_rate_at_recall(min_recall=0.75)": {
 "filter_rate": 0.869,
 "recall": 0.752,
 "threshold": 0.492
},

Finally, this is the most-strict:

"recall_at_fpr(max_fpr=0.1)": {
 "fpr": 0,
 "recall": 0.072,
 "threshold": 0.959
},
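The statistics quoted above can drive a simple consumer-side filter. The following is a minimal sketch, assuming the model_info response shape shown in this thread; the dict copies the enwiki "damaging" statistics verbatim, and the function names are illustrative, not part of any ORES client library.

```python
# Sketch: choosing a score threshold from ORES model_info statistics.
# MODEL_STATS copies the enwiki "damaging" figures quoted above.
MODEL_STATS = {
    "filter_rate_at_recall(min_recall=0.9)": {
        "filter_rate": 0.753, "recall": 0.902, "threshold": 0.173,
    },
    "filter_rate_at_recall(min_recall=0.75)": {
        "filter_rate": 0.869, "recall": 0.752, "threshold": 0.492,
    },
    "recall_at_fpr(max_fpr=0.1)": {
        "fpr": 0, "recall": 0.072, "threshold": 0.959,
    },
}

def threshold_for(stats, statistic):
    """Return the score threshold for a named test statistic."""
    return stats[statistic]["threshold"]

def is_flagged(score, stats, statistic):
    """True if a damaging-probability score crosses the chosen threshold."""
    return score >= threshold_for(stats, statistic)
```

A tool wanting high recall would pass the most inclusive key (`min_recall=0.9`); one wanting few false positives would pass `recall_at_fpr(max_fpr=0.1)`.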

@SBisson edited the task.

Thanks. Just so everyone is clear, there is a product description for the eventual feed that this is the foundation for. (I will transfer this to mediawiki as soon as I can.)

Just so everyone is clear, there is a product description for the eventual feed that this is the foundation for. (I will transfer this to mediawiki as soon as I can.)

When do you plan to add that to MediaWiki's pages about ERI?

The spec for this feed is that it would include everything that was in RCStream. @SBisson changed that to mediawiki.revision-create, which I presume is the same but more accurate?

Just to be doubly clear, the spec for this initial state of the ReviewFeed is for it to include the following properties, which were thought to be the properties that "come with" RCStream (these will be augmented by the properties listed in T145164):

Metadata about edits

  • Whether the user marked their edit minor
  • Whether the user marked their edit as a bot edit.
  • Whether it is a page creation
  • Date and time of edit
  • User who made the edit
  • Size of edit (in bytes -- can be derived)
  • Edit summary

Metadata about pages

  • Namespace
  • Title (in two parts: namespace:pagetitle)
  • Length (Size of page)
  • Whether the page is a redirect

Metadata about users

  • Registered/Anonymous
  • Registration date
  • What user-groups the user is in (e.g. sysop/administrator, patroller, researcher, rollbacker, autoconfirmed).

Since we're going to base it on the revision-create event, here's the schema of the event: https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/1.yaml

Here's what it already has out of your list of metadata:

Metadata about edits

  • Whether the user marked their edit minor
  • Whether the user marked their edit as a bot edit - don't have this one, but we know whether the user is bot
  • Whether it is a page creation - can derive from rev_parent_id being null
  • Date and time of edit
  • User who made the edit
  • Size of edit (in bytes -- can be derived)
  • Edit summary

Metadata about pages

  • Namespace
  • Title (in two parts: namespace:pagetitle)
  • Length (Size of page)
  • Whether the page is a redirect

Metadata about users

  • Registered/Anonymous - can be derived from user_id being null
  • Registration date
  • What user-groups the user is in (e.g. sysop/administrator, patroller, researcher, rollbacker, autoconfirmed).

So, we're missing 3 properties:

  • Whether the user marked their edit as a bot edit
  • Size of edit
  • User registration date

Should I create a followup task to add those fields to the Event-Platform infrastructure?
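The derivations mentioned in the list above (page creation from `rev_parent_id`, anonymity from `user_id`, edit size from revision lengths) can be sketched as below. Field names approximate the mediawiki/revision/create schema linked earlier; treat them as illustrative rather than authoritative.

```python
# Sketch of the field derivations discussed above, over a revision-create
# event represented as a plain dict. Field names are assumptions based on
# the schema linked in this thread.

def is_page_creation(event):
    # Page creation: the new revision has no parent revision.
    return event.get("rev_parent_id") is None

def is_anonymous(event):
    # Registered/anonymous: anonymous editors carry no user_id.
    performer = event.get("performer", {})
    return performer.get("user_id") is None

def edit_size_delta(event, parent_len):
    # Size of edit in bytes, derived from new and parent revision lengths.
    return event.get("rev_len", 0) - parent_len
```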

The spec for this feed is that it would include everything that was in RCStream. @SBisson changed that to mediawiki.revision-create, which I presume is the same but more accurate?

From the spec document:

General Implementation Strategy—Public Events Feeds
The working theory is that we will build the new feed within the architecture being created for Public Events Streams. This enables us to take advantage of the infrastructure and features that project will supply.

In the new infrastructure, RCStream will be deprecated and the topic that's the closest to what we need is mediawiki.revision-create, as Petr pointed out.

So, we're missing 3 properties:

  • Whether the user marked their edit as a bot edit
  • Size of edit
  • User registration date

Should I create a followup task to add those fields to the Event-Platform infrastructure?

We're missing a few more, as listed in T145164: Add fields needed by ERI to mediawiki.revision-create

jmatazzoni renamed this task from Set up augmented changes feed to Set up the foundation for the ReviewStream feed. Sep 23 2016, 12:16 AM

Change 313850 had a related patch set uploaded (by Sbisson):
Produce ReviewStream from ORES extension

https://gerrit.wikimedia.org/r/313850

@SBisson why have you removed the projects from this task? And why have you uploaded a patch to produce events from ORES??? This is not the way to go about this problem. I thought we had already discussed it...

@mobrovac Thanks for your suggestion but we're still exploring different ways to go about this.

One of the problems with using only changeprop is that some of the data we need that's missing from mediawiki.revision-create cannot be reliably retrieved through API as it was at the time of the edit.

@mobrovac Thanks for your suggestion but we're still exploring different ways to go about this.

One of the problems with using only changeprop is that some of the data we need that's missing from mediawiki.revision-create cannot be reliably retrieved through API as it was at the time of the edit.

Hm, interesting. What data are you referring to specifically? Also, what do you mean by cannot be reliably retrieved through API as it was at the time of the edit ?

User edit count and page edit restrictions are two examples. They are not versioned by revId or timestamp. You can fetch them through the API afterward but the data you get is what's considered "current" and is affected by what has happened since and which replica db you happen to hit. I don't know how to track how often this data would be right or wrong. This is based on the data we want now. We're only just starting this project and we'd prefer not to hit a glass ceiling at every turn.

I honestly don't think that producing the feed from ORES is the best architecture. I would prefer it as a simple, config-driven aggregation of data reliably available in kafka. All the debate about what's in or out of the mediawiki.revision-create schema is justified when it defines what Kasocki exposes to the world but not what mediawiki sends to kafka. That those two are currently the same is unfortunate and seriously limiting the potential of the data platform.

User edit count and page edit restrictions are two examples. They are not versioned by revId or timestamp. You can fetch them through the API afterward but the data you get is what's considered "current" and is affected by what has happened since and which replica db you happen to hit. I don't know how to track how often this data would be right or wrong. This is based on the data we want now. We're only just starting this project and we'd prefer not to hit a glass ceiling at every turn.

The delay between the actual edit and the CP rule execution is about 10-20ms, so the data you would get would still be pretty consistent. And even with fetching it from the hook you technically don't get 100% consistency, since the hooks are run after the transaction is complete, so there's still room for inconsistency.

I honestly don't think that producing the feed from ORES is the best architecture. I would prefer it as a simple, config-driven aggregation of data reliably available in kafka. All the debate about what's in or out of the mediawiki.revision-create schema is justified when it defines what Kasocki exposes to the world but not what mediawiki sends to kafka. That those two are currently the same is unfortunate and seriously limiting the potential of the data platform.

In my opinion if it's impossible to get the data afterwards it should be added to the event. Kasocki is a separate problem and we can delete it later if we want, but going for the worse general architecture because of a dispute about a couple of additional fields in the schema doesn't sound like a good idea to me.

I want to comment on what might be a misconception. @Pchelolo counts up what's available and what's listed above as desirable for this feed and concludes:

So, we're missing 3 properties:

  • Whether the user marked their edit as a bot edit
  • Size of edit
  • User registration date

I'll stipulate that I don't understand the technical issues under discussion. But I wanted to make sure everyone knows that the properties listed above are just the "foundation," as the task title puts it, of what we're building in ReviewStream. This current task was meant to define a starting point only, by listing what were thought to be a collection of easily available data. From there, we move forward with the other tasks on the ReviewStream board (all of which are slated to be completed this quarter).

I encourage anyone who wants to understand the scope and intent of the product to read the ReviewStream product description on Mediawiki.

I encourage anyone who wants to understand the scope and intent of the product to read the ReviewStream product description on Mediawiki.

From a quick review of that page, it seems that a lot of the properties listed there are a) generally useful for revision-related information, and b) already included in the revision-create feed. Some (like, perhaps, the user's edit count) might be more suitable to the enriched feed discussed here, but there is no limit on the number of properties that can be added to such derived events.

Earlier in this thread @SBisson brought up concerns about the consistency of derived events. Strictly speaking, unless all data is gathered as part of the primary update transaction, basically none of the secondary information is guaranteed to be consistent with the exact time the edit was made. Running relatively expensive queries like edit counts on the master, as part of the edit transaction, would probably be unwise. Realistic hooks run after the edit was saved, so will pick up edits that happened in the split second since the commit, and likely using a slave db. To wait for ORES processing, you'd probably create a job, which then emits the event. This introduces further delays on the order of seconds.

With EventBus and ChangeProp, the processing delay is typically lower, on the order of < 100ms. To me, it seems that minor inconsistencies caused by processing delays of significantly less than 1s are unlikely to matter in practice.

Overall, a big downside I see in creating your own, custom event is that you won't benefit from any of the work that is being done to improve & maintain the general revision update events. Instead, you would duplicate a lot of that work, and would add another overlapping source of revision information that is subtly incompatible with the regular edit event stream. The other way around, regular revision creation events would potentially miss out on some properties that would be more generally useful.

So, I hope that we can find a way to minimize the bikeshedding & duplication, and move forward together.

So, there has already been an attempt to add the fields to the revision-create schema (cf. T145164). I think we should continue working in a similar direction. I have added a status summary and an outline of how in T145164#2698884, but the TL;DR is to use ChangeProp to react to revision-create events and augment them with the needed data by issuing requests to ORES and the MW API.

+1 to what @GWicke and @mobrovac are saying here. Many of the fields that were proposed to be added to revision events are useful and easy to get; it's just that some aren't. We should add the easy and useful ones to revision-create, and augment a new stream with the more difficult and less general ones.

@jmatazzoni @SBisson, can we set up a short meeting to sync up about this?

Pchelolo edited projects, added Services (watching); removed Services.

Change 313850 abandoned by Sbisson:
[PoC] Produce ReviewStream from ORES extension

https://gerrit.wikimedia.org/r/313850

In the new infrastructure, RCStream will be deprecated and the topic that's the closest to what we need is mediawiki.revision-create, as Petr pointed out.

But per recent discussion, we also want logging, which is useful and part of RCStream. At the Dev Summit we discussed doing this as follows:

  • revision-create - Anything about revisions that MediaWiki has at edit time, except derivative data (see below). Per the dev summit meeting, we would take a very broad (firehose) view of what revision-related information is allowed in this topic, and avoid having to look stuff up later.
  • log-events - Similar to revision-create, but for logging (page moves, page deletion, upload, etc.). T155804: log-events topic emitted in EventBus
  • review-stream-revision-create - revision-create augmented with ORES, and anything else that is ReviewStream-specific. This also includes derived data. For example, revision-create would include registration date and edit count. But the derived data (e.g. > 100 edits, > 1 month => learner) can be done here.

ReviewStream = review-stream-revision-create + log-events

(There might or might not also be: EventBusWikiChangeEventsNewInfra = revision-create + log-events, but that is out of scope of this task).

[...]

  • review-stream-revision-create - revision-create augmented with ORES, and anything else that is ReviewStream-specific. This also includes derived data. For example, revision-create would include registration date and edit count. But the derived data (e.g. > 100 edits, > 1 month => learner) can be done here.

The user experience level is currently computed in User.php and it depends on 4 config variables that are expected to be tweaked per-wiki. Are you suggesting we duplicate this logic and config into changeprop?
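For concreteness, the derived experience-level logic under discussion might look like the sketch below. The real logic lives in MediaWiki's User.php and depends on per-wiki config variables; the thresholds here are only the illustrative "> 100 edits, > 1 month => learner" example from this thread, not production values.

```python
from datetime import datetime, timedelta

# Illustrative thresholds from this thread; the real values are per-wiki
# config consumed by User.php, not these constants.
LEARNER_EDITS = 100
LEARNER_AGE = timedelta(days=30)

def experience_level(edit_count, registration, now=None):
    """Derive a coarse experience level from edit count and account age."""
    now = now or datetime.utcnow()
    if edit_count > LEARNER_EDITS and (now - registration) > LEARNER_AGE:
        return "learner"
    return "newcomer"
```

The question stands either way: duplicating this into changeprop means duplicating the per-wiki config as well.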

Today we had a ReviewStream meeting. We had originally planned to talk about how the 'review-stream-revision-create' content would actually look, but instead spent the meeting hearing from @Halfak and discussing use cases and alternate ways to present this data. Meeting notes are here: https://etherpad.wikimedia.org/p/ReviewStream. I'll bring over my summary from that meeting. Please add to or clarify if anybody remembers things differently.

ReviewStream Meeting 2017-01-24 Summary

Aaron presented and in the subsequent discussion made 2 main points:

  1. A queryable API of changes to review is more useful to edit review tool developers than a stream.
  2. Streams are useful too, but delaying them so events (e.g. revision-create + ORES scoring) can be merged isn't worth it.

1. API vs Stream

On load, a useful edit review interface will likely need to have the recent edits that need to be reviewed. Especially on small wikis, you don't want to have to wait for new edits to roll in before your interface will present them. Once loaded, new edits for review could be added to the interface by consuming them from a stream, or also by making another API query. Edit review tool developers should be asked if this makes sense to them.

2. Don't merge (and delay) different types of events

An edit review tool will have to respond to many different 'delayed' events about a revision: log events, patrol events, reverts, ORES scores, etc. Merging different event types, like revision-create and ORES scoring, into a single event is awkward and not that useful, especially since you will need more event information as it happens in the future anyway. It's also not very future-proof, as a merged topic needs to be created and maintained, and it is difficult to change.

Proposal (as Otto understands it):

ReviewStream should be an EventStreams endpoint, e.g. stream.wikimedia.org/v2/stream/editreview (name still TBD), and it should union relevant event topics together. Any relevant event topic that does not yet exist should be created (e.g. log events, revision-score) and be included in this union stream.

Additionally, Aaron thinks an API that exposes current state about recent edits to review would be very useful. Much of this is already available via the RecentChanges API. More information (like ORES scores) could be added to this API, either stored in MW DBs or in a separate state store. And/or a new API could be developed that better fits the Edit Review use case. This edit review state could be updated by the event streams (by something like change-prop, or whatever), OR it could be updated by Mediawiki at event time (where this is possible). Work on this API would likely be a different project than ReviewStream.

Proposal (as Otto understands it):

ReviewStream should be an EventStreams endpoint, e.g. stream.wikimedia.org/v2/stream/editreview (name still TBD), and it should union relevant event topics together. Any relevant event topic that does not yet exist should be created (e.g. log events, revision-score) and be included in this union stream.

Relevant note @Ottomata raised at the Dev Summit. This is an issue with any union stream, with or without a delay: If log-events and e.g. revision-create are different topics, later union-ed together, there is no guarantee the time ordering will be strictly preserved across the different topics. Rather, it will be mostly in order.
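The "mostly in order" behavior can be sketched as a timestamp merge. This is a toy illustration of the caveat, assuming events as plain dicts with a `dt` field, not real Kafka consumption: within each topic order is preserved, but cross-topic order is only as reliable as the timestamps themselves.

```python
import heapq

def union_streams(*topics):
    """Merge per-topic event iterables into one stream ordered by 'dt'.

    Each input iterable must itself be sorted by 'dt'; the union is then
    globally ordered by timestamp, which is the best a reader of a
    union-ed stream can expect.
    """
    return heapq.merge(*topics, key=lambda ev: ev["dt"])
```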

Additionally, Aaron thinks an API that exposes current state about recent edits to review would be very useful. Much of this is already available via the RecentChanges API. More information (like ORES scores) could be added to this API, either stored in MW DBs or in a separate state store.

ORES scores are already in MW extension DBs, and in the API (both the ability to include the scores in output with 'oresscores' property, and the ability to "Filter out non-damaging and unscored edits." and "Filter out damaging edits." with oresreview/!oresreview) (you can use one or both features at once).
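A query against those API features might be built like this. The parameter spellings (`oresscores` in `rcprop`, `oresreview` in `rcshow`) are taken from the comment above; double-check them against the live api.php help before relying on them.

```python
from urllib.parse import urlencode

def recentchanges_url(wiki_api, damaging_only=False):
    """Build a recentchanges query that includes ORES scores.

    Parameter names follow the ORES-extension additions described in
    this thread and are assumptions, not a verified API contract.
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|oresscores",
        "format": "json",
    }
    if damaging_only:
        # Per the thread: filters out non-damaging and unscored edits.
        params["rcshow"] = "oresreview"
    return wiki_api + "?" + urlencode(params)
```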

[...]

  • review-stream-revision-create - revision-create augmented with ORES, and anything else that is ReviewStream-specific. This also includes derived data. For example, revision-create would include registration date and edit count. But the derived data (e.g. > 100 edits, > 1 month => learner) can be done here.

The user experience level is currently computed in User.php and it depends on 4 config variables that are expected to be tweaked per-wiki. Are you suggesting we duplicate this logic and config into changeprop?

You're right, I forgot it was per-wiki. With the Dev Summit architecture, it probably makes more sense to also do that in revision-create.

Hi yall, just curious. What's the word? :)

FYI, there is also interest in attaching ORES WP10 deltas to new revisions, e.g. T145829: Trending API should consult ORES. Perhaps this could also be part of ReviewStream?

Just following up from T145829:

In addition to wp10, the other revision models would also be useful here:
damaging, good faith, and reverted

The algorithm looks at edits and calculates trending articles based on the flow and quality of the edits, so knowing whether the edit is potentially damaging or likely to be reverted are useful signals.

Just one quick note. It's essentially free to apply multiple models at the same time from ORES' point of view. E.g. scoring 123456789 using the "wp10" model should take about 1 second. Scoring 123456789 using the "wp10", "goodfaith", and "damaging" models should also take about 1 second -- so long as the request is made to score all of these models at the same time. E.g.

https://ores.wikimedia.org/v2/scores/enwiki/?models=wp10|damaging|goodfaith&revids=123456789

This is the behavior that ChangeProp is configured to do right now.
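Building that batched request is just URL construction; a small sketch, using the v2 scores endpoint shown above (pipe-separated model and revision lists):

```python
from urllib.parse import urlencode

ORES = "https://ores.wikimedia.org/v2/scores"

def scores_url(wiki, rev_ids, models):
    """Build one ORES request scoring several models for several revisions.

    Batching models in a single request is essentially free on the ORES
    side, per the comment above.
    """
    query = urlencode({
        "models": "|".join(models),
        "revids": "|".join(str(r) for r in rev_ids),
    })
    return "{}/{}/?{}".format(ORES, wiki, query)
```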

In addition to the Trending API, @mobrovac noted that this would also be useful for helping to add wp10 data to the RESTBase summary endpoint: T157132: Add ORES articlequality data to summaries?

No movement since 2016. Closing.