
Provide a performer array in the RC stream
Open, Medium, Public

Description

Hello. For apps such as SWViewer, and probably Huggle, it would be very useful if the RC stream provided a performer array, that is, data about the user who made the edit:

  • user groups
  • edit count
  • registration date
  • tags

This would greatly reduce the number of API calls and speed up the apps for end users.

There is a revision-create stream, but it does not contain the patrolled, old rev length, and tags properties. Maybe somebody could make one perfect stream from these two streams? :)

Event Timeline

@Ottomata Hello, what do you think, can this be done?
At least patrolled and old rev length (without tags): just copy the source code from the recentchange stream. It shouldn't be difficult, but it would greatly ease tool development and speed up tool performance.

Sorry for my English.

Ottomata added a subscriber: Pchelolo.

Hi! Sorry I didn't see this before; I don't often look at the EventStreams tag.

Hm, so either we add performer to recentchange, or add the info you need to revision-create. recentchange is a bit of a grab bag, so it'd probably be less effort to add performer there; on the other hand, we don't really like recentchange because its schema is old and not well enforced.

I'd prefer if we did the latter and added the info you need into revision-create, but I'll have to check with some teams, and look at the MediaWiki EventBus extension to see how complicated it would be.

Are patrolled, old_rev_length and tags all you need?

Hm, actually, we do have a revision-tags-change stream internally, we could probably expose this stream publicly too. Then you could join revision-create with revision-tags-change to get the tags.
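A minimal sketch of the join suggested above, assuming both event types carry `database` and `rev_id` fields and that the tags-change event carries a `tags` list. The field names and the `join_tags` helper are illustrative assumptions, not a confirmed schema.

```python
def join_tags(revision_events, tag_events):
    """Attach the most recently seen tags to each revision-create event.

    Assumption: events are dicts with "database" and "rev_id" keys, and
    tags-change events also have a "tags" list. Later tag events for the
    same revision overwrite earlier ones.
    """
    latest_tags = {}
    for ev in tag_events:
        latest_tags[(ev["database"], ev["rev_id"])] = ev.get("tags", [])

    joined = []
    for ev in revision_events:
        merged = dict(ev)  # copy so the input event is not mutated
        merged["tags"] = latest_tags.get((ev["database"], ev["rev_id"]), [])
        joined.append(merged)
    return joined
```

Keying on the (database, rev_id) pair rather than rev_id alone avoids collisions between wikis, since revision IDs are only unique per wiki.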

@Pchelolo any thoughts on this? I'd guess if this info is already in recentchange, there aren't any privacy implications to exposing them.

Hm, so for patrolled - technically this would be possible to add to revision-create, but it will require quite some coding to pass the info around. Also, the patrolled status can change after the revision is created, so I think this might be better represented by a separate, joinable stream.

Same for tags. We already have a tags change event; I'm not sure, however, whether it's emitted for initial revision tags.

old_rev_length - possible to add. It will require querying the previous revision from the database, but I don't think that's a blocker since it is done in a deferred update.

Adding performer to RC - it should have a 'user' property, which we could technically unwrap into a full 'performer'. However, so far we have just been using the default MW RC formatter, so the RC event emitted via EventBus is compatible with what the rest of the RC event emitters do; making special cases for EventBus would not be great.

I'd vote for trying to add required fields to rev_create, and if the fields are changeable (patrolled) - creating an additional stream for patrols to accompany revision-create stream.

Several combined streams are bad for mobile (higher traffic, mobile device load) and may sometimes get out of sync. To work with them, we would need to create a "proxy" where the info about edits is combined before being sent to the user's device. But if it is impossible to make one stream, ok.

Are patrolled, old_rev_length and tags all you need?

If it's not difficult, then gender too.
Maybe also user_is_anon, for a more correct schema (currently we detect anons via a regex, or just by checking that user_id does not exist). This is not necessary, just perfectionism :).
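For illustration, the client-side workaround described above could look roughly like this. It is a sketch under assumptions: the `performer` dict and its `user_id` / `user_text` keys are hypothetical names for this example, and the standard `ipaddress` module stands in for the regex.

```python
import ipaddress


def is_anonymous(performer):
    """Guess whether an edit was made by an anonymous (IP) user.

    Assumption: `performer` is a dict that may contain "user_id" and
    "user_text". If "user_id" is present, a falsy value means anonymous.
    Otherwise, fall back to checking whether the name parses as an IP.
    """
    if "user_id" in performer:
        return not performer["user_id"]
    try:
        ipaddress.ip_address(performer.get("user_text", ""))
    except ValueError:
        return False
    return True
```

An explicit boolean field in the event would make this guesswork unnecessary, which is the point of the request.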

P.s: Wishlist 2021

Several combined streams are bad for mobile (higher traffic, mobile device load)

You can subscribe to multiple streams in the same HTTP connection, e.g. https://stream.wikimedia.org/v2/stream/page-create,page-delete . There will be duplicate data fields across some of these streams, but at least you don't have to manage more connections.
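As a rough sketch of the client side, the helpers below build a combined URL in the form shown above and decode server-sent-event lines into JSON payloads. The helper names are made up for this example, and the `meta.stream` field used to tell events apart is an assumption about the payload. No network access happens here; `parse_sse` accepts any iterable of text lines.

```python
import json

BASE_URL = "https://stream.wikimedia.org/v2/stream/"


def build_stream_url(stream_names):
    """Join several stream names into one comma-separated subscription URL."""
    return BASE_URL + ",".join(stream_names)


def parse_sse(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines.

    Only "data:" fields are collected; a blank line ends one event.
    Other fields (event:, id:) and comment lines are ignored.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space
            data_parts.append(value)
        elif line == "" and data_parts:
            yield json.loads("\n".join(data_parts))
            data_parts = []
```

A consumer could then route each decoded event by its stream name, for example `event.get("meta", {}).get("stream")`, assuming that field is present.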

Also, it isn't going to be possible to put all data in one stream. revision-tags-change is a good example; tags change after a revision is created, so we can't capture that in revision-create. recentchange itself is basically a bunch of distinct streams jammed into one. Depending on the use case, recentchange might be more unnecessary data than joining just the streams you need, since you are getting every type of change.

For some useful stuff, we do try to include it in one stream when we can (that's why there's so much redundant data across the streams). So, let me summarize your request and you can confirm.

  • expose revision-tags-change publicly
  • create and expose a page-patrolled-change (am I correct in assuming that it is pages that are patrolled, not specific revisions?).
  • To revision-create, add:
    • patrolled - a boolean, true if the revision's page(?) was patrolled at the time of revision create.
    • If possible, add an initial revision tags field.
    • Add old_rev_length (probably we'll call this parent_rev_length).
    • Maybe add a performer.user_is_anonymous field
    • Maybe add a performer.gender field (do we actually have this information? This one seems a little sketchy)

Is this a correct summary of your request?

Yes, I confirm this summary.

am I correct in assuming that it is pages that are patrolled, not specific revisions?

As I understand it (from the RCFeed / recentchanges schema), it is revisions that are patrolled: those not already patrolled automatically (autopatrol).

do we actually have this information? This one seems a little sketchy

Hmm. API: usprop=gender.
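For context, this is roughly the extra per-user lookup that a client makes today. The sketch only builds the request URL for the MediaWiki `list=users` query; the `build_users_query` helper name and the example endpoint are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_users_query(api_url, usernames):
    """Build a MediaWiki API URL that asks for the user data discussed here."""
    params = {
        "action": "query",
        "list": "users",
        "ususers": "|".join(usernames),
        "usprop": "groups|editcount|registration|gender",
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


# Example endpoint chosen for illustration only.
url = build_users_query("https://en.wikipedia.org/w/api.php", ["Example"])
```

Batching several names into one `ususers` value reduces the number of calls, but it is still an extra round trip that a performer array in the stream would avoid.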

fdans triaged this task as Medium priority. Jan 28 2021, 5:43 PM
fdans moved this task from Incoming to Event Platform on the Analytics board.

If recentchange has exactly the same data for edits as revision-create apart from these changes, I would be OK with it not having performer array, otherwise I would also prefer having performer array there. I can’t check this right now myself, but not having old revision length would’ve been a definite issue. (Not as in matching data keys, but as in all major data being present in both streams.)

Hey, can we get an update on this please? @Ottomata @Pchelolo

There isn't a lot of priority around this, but there is talk about making revision-tags-change public in T280538: Capture rev_is_revert event data in a stream different than mediawiki.revision-create.

I'm not sure, but https://gerrit.wikimedia.org/r/c/679353 may make it not possible to get the patrolled information in revision-create.

If anyone wants to try to augment revision-create with this information, we'd certainly accept some patches! The code that creates this event is in the EventBus extension (but note that the linked function is about to lose the EditResult parameter).