Add fields needed by ERI to mediawiki.revision-create
Closed, Declined; Public

Description

The following data types that are not currently included in mediawiki.revision-create were identified by the team as being useful for ERI purposes. They should be added to ReviewStream:

Metadata about edits

  • Change tags*
  • Size of edit

Metadata about pages

  • Page views [someone said lego was working on an extension to make this easier] [citation needed]
  • Edit protection level (e.g. full protection, semi-protection/autoconfirmed users only, no protection)
  • Whether the page is a disambiguation page.

Metadata about users

  • Number of edits
  • Registration date

*Regarding change tags, there are about 120 in en.wiki. But I believe tags vary in different wikis. Is there a way to include them all, or do they have to be added wiki by wiki? There is a related task about adding tags to RCStream.

Event Timeline

I propose to add the missing data to mediawiki.revision-create to make it available to ReviewStream.

The alternative (querying the API to get those) adds latency and potential API changes.

SBisson renamed this task from "Add metadata not currently in RCStream to ERI feed." to "Add fields needed by ERI to mediawiki.revision-create". Sep 22 2016, 10:36 AM

IMHO the pageview data doesn't belong in the mediawiki.revision-create schema at all.

We could add this data later in ChangeProp same as we would add the ORES data.

> IMHO the pageview data doesn't belong in the mediawiki.revision-create schema at all.
>
> We could add this data later in ChangeProp same as we would add the ORES data.

I agree, and there's already a good API to get this data.

> IMHO the pageview data doesn't belong in the mediawiki.revision-create schema at all.

I don't understand this issue. The goal of including this data was for reviewers to be able to, for example, review more popular pages first. What is the impact of not including it? Does that mean downstream applications would have to look it up themselves? Our goal was to make that process easier for them.

Meanwhile, I got an answer back from Lego. I asked him if he was working on something in this area:

Yeah, that's the WikimediaPageViewInfo extension
(https://phabricator.wikimedia.org/T125917). It's kind of stalled on
getting onto beta cluster...let me poke that ticket again :/

I'm not sure how much my work will help though, the stuff Anomie/Tgr are
doing with adding the information to the API would be more useful I
think: https://phabricator.wikimedia.org/T144865.

> I don't understand this issue. The goal of including this data was for reviewers to be able to, for example, review more popular pages first. What is the impact of not including it? Does that mean downstream applications would have to look it up themselves? Our goal was to make that process easier for them.

As I understand it, we're going to set up a separate topic/schema for the review events (name for the topic TBD). Here we're talking about modifying the original mediawiki.revision-create event, emitted by the Event-Platform mediawiki extension and used by us for a lot of other purposes. ChangeProp would follow the revision-create events, add additional data like ORES and pageview data, and post a new augmented review event to a separate topic, so the user would be able to get all the data.

So the question here is about which data to include in the original revision-create and which to add in ChangeProp, and I'm saying that pageviews data doesn't belong in the original event.

Change 312274 had a related patch set uploaded (by Sbisson):
Adding fields needed by the ERI project

https://gerrit.wikimedia.org/r/312274

Change 312277 had a related patch set uploaded (by Sbisson):
Adding data needed by ERI

https://gerrit.wikimedia.org/r/312277

https://gerrit.wikimedia.org/r/312274 is for the schemas describing the events. They have to be updated since the eventlogging service validates that events conform to their associated schema. It rejects them if they don't.

https://gerrit.wikimedia.org/r/312277 is for producing the new information in the EventBus extension.

These patches cover the following fields:

  • user_edit_count
  • user_registration
  • rev_bytes_changes
  • page_edit_protection
  • page_is_disambiguation
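
To illustrate the validation step mentioned above, here is a minimal sketch of how an event carrying these five fields might be checked before being accepted. This is not the eventlogging service's actual code or schema: the field types, the assumption that all five fields are required, and the `validate` helper are all hypothetical, since the thread only names the fields.

```python
# Hypothetical types for the five proposed fields. The thread names the
# fields but does not specify their types, formats, or whether they are
# required; everything below is an assumption for illustration.
PROPOSED_FIELDS = {
    "user_edit_count": int,
    "user_registration": str,      # assumed: timestamp string
    "rev_bytes_changes": int,      # assumed: signed size delta in bytes
    "page_edit_protection": str,   # assumed: e.g. "sysop", "autoconfirmed", ""
    "page_is_disambiguation": bool,
}


def validate(event):
    """Return a list of problems; an empty list means the event is accepted."""
    problems = []
    for name, expected in PROPOSED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
            continue
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly
        # where an integer is expected.
        if expected is int and isinstance(value, bool):
            problems.append(f"wrong type for {name}: expected int")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems
```

An event that yields a non-empty list here would correspond to one the service rejects.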

The following fields have been omitted:

  • page views: Page view data can be aggregated into ReviewStream in the changeprop service, using the pageview API. It can be part of this ticket but cannot be done until ReviewStream is set up.
  • change tags: There are long discussions in T24509 and https://gerrit.wikimedia.org/r/#/c/194458/, spanning almost 2 years, and the patch is still not merged. I would suggest splitting this out into another task so this one can move forward.
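
As a rough sketch of what a pageview lookup from a service could look like, the snippet below only builds a request URL. It assumes that "the pageview API" refers to the Wikimedia REST per-article pageviews endpoint; the `pageviews_url` helper, its defaults, and the YYYYMMDD date format are illustrative assumptions, not something specified in this task.

```python
from urllib.parse import quote

# Assumed endpoint: the Wikimedia REST per-article pageviews route.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, title, start, end,
                  access="all-access", agent="all-agents",
                  granularity="daily"):
    """Build the request URL for one article's pageviews (hypothetical helper)."""
    # Titles use underscores instead of spaces; percent-encode the rest
    # (including "/") so the title stays a single path segment.
    article = quote(title.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}"
```

For example, `pageviews_url("en.wikipedia", "Albert Einstein", "20160901", "20160922")` builds the URL for daily counts over that range; actually fetching and summing the results is left out here.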

There's a discussion in gerrit about including user_edit_count in mediawiki.revision-create.

Edit count is a concept that's been there for a long time in mediawiki. It includes all the edits, regardless of namespace and regardless of whether they end up being reverted or not. Because of that definition, it tells very little about the user's journey and their actual experience level.

Should we include it, maybe as user_total_edit_count, to better represent what it is? Or should we maybe clarify our definition of "newcomer" to figure out how many and what kind of contributions we need to count?

@jmatazzoni Which events do we want to include in ReviewStream? I was about to include only edits, but I'd prefer to confirm, as it does influence some of the technical decisions.

These are the events that are readily available and can be included:

  • edit
  • page move
  • page deletion
  • page undeletion
  • change revision visibility (this happens when an edit is considered so damaging that it has to be not only reverted but hidden from history)

Change 312274 abandoned by Sbisson:
Adding fields needed by the ERI project

https://gerrit.wikimedia.org/r/312274

Change 312277 abandoned by Sbisson:
Adding data needed by ERI

https://gerrit.wikimedia.org/r/312277

There will be no fields added to mediawiki.revision-create

I see @SBisson's note on declining this task that

> There will be no fields added to mediawiki.revision-create

It sounds like there are questions and comments about various data types originally requested:

  • Number of edits (is the total edits good enough?)
  • Page views (suggestion to open a separate ticket)
  • Change Tags (suggestion to open a separate ticket)
  • Redirect (deleted--why?)
  • User group (deleted--why?)

Meanwhile, Stephane says there are patches for the following, but it looks like the patches were abandoned:

  • user_edit_count
  • user_registration
  • rev_bytes_changes
  • page_edit_protection
  • page_is_disambiguation

Recall that the original task title just called for us to include this data in the "ERI feed," regardless of the method used (e.g., whether we include the data in mediawiki.revision-create). If the data isn't in the feed yet, isn't it premature to close this task (without, say, opening others)?

Stephane, what should we do here? I'm reopening the task so we don't lose sight of these objectives.

mobrovac subscribed.

Ok, I think we should continue work here, but re-evaluate the way we are to do it. The problem I see with the abandoned patches is that, in order to add the aforementioned fields to revision-create, they were also added to other schemas in the name of consistency. I could be persuaded that adding page restrictions to page* events makes sense, but I really feel that adding user_edit_count to all schemas (because all of them contain information about the user) is not only superfluous conceptually, but dangerous performance-wise as we would need to be querying the master DB for every event, which on WMF's scale means it could seriously impact the performance of the system as a whole (MW is a monolith, after all).

Ultimately, we will have a review-stream event (name subject to bike-shed, ofc), which will combine information from the revision-create event and ORES' scores. When a new revision-create event comes in, requests to both MW and ORES could be sent in parallel to obtain the extra information and then merged into review-stream. The execution time would be effectively dominated by ORES (as it takes approximately 2.5 seconds for it to respond).
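
The parallel-fetch-and-merge idea above could be sketched roughly as follows. The two fetch functions are stand-ins that sleep instead of calling MW or ORES, and the field names (`rev_id`, `user_edit_count`, `ores`) and latencies are hypothetical; the point is only that the total time is close to the slower request rather than the sum of both.

```python
import asyncio
import time


async def fetch_mw_metadata(rev_id):
    # Stand-in for a MediaWiki API request (hypothetical fields and latency).
    await asyncio.sleep(0.1)
    return {"user_edit_count": 42, "page_is_disambiguation": False}


async def fetch_ores_scores(rev_id):
    # Stand-in for an ORES request; deliberately the slower of the two.
    await asyncio.sleep(0.3)
    return {"ores": {"damaging": 0.1}}


async def build_review_event(revision_create):
    """Run both lookups concurrently and merge them into one review event."""
    rev_id = revision_create["rev_id"]
    mw, ores = await asyncio.gather(
        fetch_mw_metadata(rev_id),
        fetch_ores_scores(rev_id),
    )
    # Later dicts win on key collisions; none are expected in this sketch.
    return {**revision_create, **mw, **ores}


def timed_build(revision_create):
    """Return the merged event and the wall-clock seconds it took."""
    start = time.perf_counter()
    event = asyncio.run(build_review_event(revision_create))
    return event, time.perf_counter() - start
```

With these simulated latencies, `timed_build({"rev_id": 1})` should take roughly as long as the slower ORES stand-in, which mirrors the claim that ORES dominates the execution time.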

@SBisson, you voiced concerns about the consistency of data if it is not obtained directly inside the hook. However, the average time from revision creation to its event consumption is way below one second (click on page_edit in the graph to see it), which makes it highly unlikely for the same human user to perform another action in the meantime. As to the DB replication lag, we have been using the MW API from all of our services and haven't seen any major inconsistencies thus far. Even if the user's latest edit isn't counted at the time of the API's response, we are talking about being off by a maximum of one edit, which doesn't really change whether the user is a newcomer or not.

No movement since 2016. Closing.