RecentChanges in Kafka
Closed, Resolved · Public · 8 Story Points

Description

We need MediaWiki to send RecentChanges data to Kafka. This will happen either through the EventBus service, or the data will be sent directly to Kafka via the RC MediaWiki extension. We need to consult with MediaWiki devs about the best way to do this.

Ottomata created this task. Nov 30 2016, 7:37 PM
Restricted Application added a subscriber: Aklapper. Nov 30 2016, 7:37 PM

It'll probably involve creating a class that implements the RCFeedEngine interface from MW core, registering that in wgRCEngines, then adding to our config's wgRCFeeds

Yep. And this would either be a KafkaRCFeedEngine or an HttpRCFeedEngine. KafkaRCFeedEngine would use kafka-php (already in mediawiki/vendor for EventRelayerKafka). HttpRCFeedEngine would post to the EventBus proxy.
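To make the plan above concrete, here is a minimal Python analogue of the registration pattern (the real classes would be PHP inside MediaWiki). It assumes engines are looked up by the URI scheme of each configured feed and that an engine receives the feed config plus one already-formatted line. The `RecordingEngine` helper, the `example` URI, and all function names are hypothetical, and the engines only record what they would send.

```python
from urllib.parse import urlparse


class RCFeedEngine:
    """Python stand-in for the engine interface: deliver one formatted line."""

    def send(self, feed: dict, line: str) -> bool:
        raise NotImplementedError


class RecordingEngine(RCFeedEngine):
    """Hypothetical base for both sketches: records lines instead of doing I/O."""

    def __init__(self) -> None:
        self.sent = []

    def send(self, feed: dict, line: str) -> bool:
        self.sent.append((feed["uri"], line))
        return True


class KafkaRCFeedEngine(RecordingEngine):
    """Would produce directly to Kafka (not implemented here)."""


class HttpRCFeedEngine(RecordingEngine):
    """Would POST to the EventBus proxy (not implemented here)."""


# Analogue of wgRCEngines: URI scheme -> engine class (assumed lookup key).
RC_ENGINES = {"kafka": KafkaRCFeedEngine, "http": HttpRCFeedEngine}

# Analogue of wgRCFeeds: named feeds, each with a destination URI (made up).
RC_FEEDS = {"eventbus": {"uri": "http://eventbus.example/v1/events"}}


def engine_for(feed: dict) -> RCFeedEngine:
    """Instantiate the engine class registered for the feed URI's scheme."""
    scheme = urlparse(feed["uri"]).scheme
    return RC_ENGINES[scheme]()


def notify(line: str) -> list:
    """Send one formatted recent-change line to every configured feed."""
    engines = []
    for feed in RC_FEEDS.values():
        engine = engine_for(feed)
        engine.send(feed, line)
        engines.append(engine)
    return engines
```

Switching between the two options is then only a config change: pointing the feed URI at a `kafka:` destination would select the other engine without touching the caller.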

I'm not sure which way we should go. Http would have the benefit of not having to configure Kafka addresses inside MediaWiki and would centralise our effort on one Kafka producer implementation (the one of EventBus, which would have high delivery satisfaction and good retry logic).

On the other hand, it's a SPOF: one more layer that can fail, and when it does it's quite hard to recover. Either way, it should probably be consistent with the EventBus extension. Also, what happened to EventRelayerKafka? (EventRelayer is how MediaWiki can broadcast CDN purges and Memcached deletes.)

Nuria moved this task from Incoming to Q4 (April 2017) on the Analytics board. Dec 5 2016, 4:46 PM

what happened to EventRelayerKafka

Guess it hasn't been totally deployed?
https://phabricator.wikimedia.org/T134535

Ottomata added a comment. Edited · Dec 5 2016, 6:45 PM

HttpRCFeedEngine would post to the EventBus proxy.

Just looked into this a bit. One problem is that we need to augment the event before it is posted to the EventBus service. The EventBus service uses the meta.topic field to determine which Kafka topic the event should be produced to (and which schema it should use to validate the event). By the time the recentchanges event is handed to the FeedEngine subclass, it has already been passed through the configured formatter, so all we get is a string, instead of an object. I could re-parse the JSON string back into an object, but that seems dirty.

The EventBus extension sends all its other events via hooks. I could hook into the RecentChange_save hook and get the actual recentchanges object, and let the EventBus extension do the formatting and JSON string parsing like all other events there.

Thoughts?

Oh, I suppose I could make a special formatter for use with the EventBusFeedEngine that would do the parsing and augmentation of the object before it is serialized. Not sure which is better.
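The two options discussed above can be sketched in Python (the real code would be PHP in the EventBus extension). This is only an illustration: the topic name and all function names are hypothetical, and the only detail taken from the thread is that the service routes on `meta.topic`.

```python
import json

# Hypothetical topic name; the thread does not say which topic is used.
TOPIC = "mediawiki.recentchange"


def add_meta(event: dict, topic: str) -> dict:
    """Return a copy of the event with meta.topic set for routing."""
    augmented = dict(event)
    meta = dict(augmented.get("meta", {}))
    meta["topic"] = topic
    augmented["meta"] = meta
    return augmented


def augment_in_engine(line: str, topic: str = TOPIC) -> str:
    """Option 1: the engine only sees a string, so it must re-parse it."""
    return json.dumps(add_meta(json.loads(line), topic))


def format_for_eventbus(change: dict, topic: str = TOPIC) -> str:
    """Option 2: a dedicated formatter augments the object, then serializes once."""
    return json.dumps(add_meta(change, topic))
```

Both paths yield the same payload. The difference is that the formatter route avoids the extra parse and serialize round trip, which is the part described as dirty above.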

GWicke added a subscriber: GWicke. Dec 5 2016, 7:11 PM

Is there any information on how recentchanges information relates to the events that are already available in eventbus? At the very least, there seems to be a lot of overlap, with eventbus topics probably providing more details on the kinds of events it offers. On the other hand, I *believe* there are some kinds of events in RecentChanges that are not yet otherwise present in EventBus.

Long term, it seems that it would make sense to minimize duplication between these topics, and reuse as many of the detailed EventBus events as possible for the generation of the RecentChanges topic. What would it take to do so? Which issues do you see in curating a RecentChanges feed from the primary / detailed events?

Ottomata added a comment. Edited · Dec 5 2016, 7:19 PM

We need recentchanges in Kafka for easy backwards compatibility for Public EventStreams. We plan to deprecate and turn off RCStream in a quarter or two, and we need something to take its place. I agree that it would be better for folks to use the new events we have created, but we can't ask everyone to switch to a new service if it doesn't have what they are used to using. If in the future we can convince everyone to not use recentchanges, and can verify in EventStreams that it isn't being used, we could remove this data and stream then.

What would it take to do so? Which issues do you see in curating a RecentChanges feed from the primary / detailed events?

We already have the primary source of this stream and the ability to generate it in MediaWiki. I don't really see the benefit in cobbling together a new stream based on the new events, especially since we are only emitting recentchanges for backwards compatibility reasons.

GWicke added a comment. Dec 5 2016, 7:42 PM

We already have the primary source of this stream and the ability to generate it in MediaWiki. I don't really see the benefit in cobbling together a new stream based on the new events, especially since we are only emitting recentchanges for backwards compatibility reasons.

Is this the long-term plan, however? Is RecentChanges going to be removed again, and what would be its replacement?

There are efforts under way to add more information to RecentChanges, including things like ORES scores. So far it looks like at least some of these will require async computations, so there is already a need to compose a stream from several events & async-generated data. To me it seems natural to do this as part of the EventBus / ChangeProp infrastructure, and avoid a duplication of code paths for RecentChanges.

ORES probably should be converted to use revision-create, and then we could use change prop to generate an augmented revision-create stream with ORES scores. ORES is a service that WMF builds and maintains, so we should be able to help guide that in the short term. For community uses, it will be harder to get folks to switch quickly to a new schema.

GWicke added a comment. Dec 5 2016, 8:10 PM

@Ottomata, ORES pre-generation already uses revision-create. The returned information is not yet integrated into the RC feed though.

We really need to become clearer on what the use cases are, how RecentChanges addresses those, and what is missing / could be improved. I would expect current RC to already be fairly close to what most consumers need. I would not treat it just as a legacy format, but as a starting point for a gradual evolution of probably the most important feed provided publicly in EventStream.

The returned information is not yet integrated into the RC feed though.

Why would you integrate it into the RC feed, if ORES only cares about revisions? ORES doesn't need log events, does it? Wouldn't you just generate an augmented revision-create stream?

I would expect current RC to already be fairly close to what most consumers need.

Why did we create all the new events then? Should we have just used RecentChanges from the start, and not bothered with all of the detailed EventBus based events?

I would not treat it just as a legacy format, but as a starting point for a gradual evolution of probably the most important feed provided publicly in EventStream.

Are you arguing that RecentChanges format is the future, not the event schemas we have spent time bikeshedding and improving over the last year +?

Ottomata edited projects, added Analytics-Kanban; removed Analytics. Dec 6 2016, 3:17 PM
Ottomata set the point value for this task to 8.
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 325585 had a related patch set uploaded (by Ottomata):
Add mediawiki/recentchange event schema

https://gerrit.wikimedia.org/r/325585

Change 325588 had a related patch set uploaded (by Ottomata):
Move static helper functions from EventBus.hooks.php to EventBus.php

https://gerrit.wikimedia.org/r/325588

Change 325589 had a related patch set uploaded (by Ottomata):
Add EventBus RCFeed classes

https://gerrit.wikimedia.org/r/325589

Change 325588 merged by Ottomata:
Move static helper functions from EventBus.hooks.php to EventBus.php

https://gerrit.wikimedia.org/r/325588

The returned information is not yet integrated into the RC feed though.

Why would you integrate it into the RC feed, if ORES only cares about revisions? ORES doesn't need log events, does it? Wouldn't you just generate an augmented revision-create stream?

Revision-create is lower level than what most of our users want. For example, a move might create one or two revisions. The user might prefer to receive one "move" event instead.

I would expect current RC to already be fairly close to what most consumers need.

Why did we create all the new events then? Should we have just used RecentChanges from the start, and not bothered with all of the detailed EventBus based events?

I would not treat it just as a legacy format, but as a starting point for a gradual evolution of probably the most important feed provided publicly in EventStream.

Are you arguing that RecentChanges format is the future, not the event schemas we have spent time bikeshedding and improving over the last year +?

The target audiences / use cases for those streams overlap, but are not completely the same. The internal schemas are mostly lower level, provide a lot more detail, and can also include sensitive information (such as revision suppressions / deletions). The public RCStream feed merges a lot of these events into a single stream, omits sensitive information, and might add other information (ex: ORES scores) for the benefit of editors.

@GWicke, let's move this discussion over to T149736. This ticket is about getting the Mediawiki Recent Change feed into Kafka.

Change 325585 merged by Ottomata:
Add mediawiki/recentchange event schema

https://gerrit.wikimedia.org/r/325585

Change 325589 merged by Ottomata:
Add EventBus RCFeed classes

https://gerrit.wikimedia.org/r/325589

Change 332807 had a related patch set uploaded (by Ottomata):
Configure RCFeeds to use EventBus extension to send recentchange events

https://gerrit.wikimedia.org/r/332807

Change 332807 merged by Ottomata:
Configure RCFeeds to use EventBus extension in beta to send recentchange events

https://gerrit.wikimedia.org/r/332807

Change 334389 had a related patch set uploaded (by Ottomata):
Enable eventbus RCFeed in production and deployment-prep beta

https://gerrit.wikimedia.org/r/334389

Change 334389 merged by Ottomata:
Enable eventbus RCFeed in production and deployment-prep beta

https://gerrit.wikimedia.org/r/334389

Mentioned in SAL (#wikimedia-operations) [2017-01-31T18:46:18Z] <ottomata> recentchange events now flowing into Kafka via EventBus T152030

@Krinkle reports that page creation and deletion events aren't making it through. Need to investigate...

Ottomata moved this task from Done to In Progress on the Analytics-Kanban board. Jan 31 2017, 9:35 PM
Pchelolo added a subscriber: Pchelolo. Edited · Jan 31 2017, 9:49 PM

There are also a bunch of 400 errors in the eventlogging-service logs, mostly something like

(MainThread) Failed processing event: Failed validating <Event fc094341-e7fe-11e6-bade-90b11c278532 of schema (u'mediawiki/recentchange', 1)>. None is not of type 'integer'

or

Failed processing event: Failed validating <Event 6176cb84-e7ff-11e6-b611-90b11c278a30 of schema (u'mediawiki/recentchange', 1)>. [162204055, 161455282, u'20170131214827'] is not of type 'object'

Makes sense, likely the schema we made isn't comprehensive enough. Will have a few minutes to look soon.

Change 335365 had a related patch set uploaded (by Ottomata):
Be more flexible with recentchange schema

https://gerrit.wikimedia.org/r/335365

Change 335365 merged by Ottomata:
Be more flexible with recentchange schema

https://gerrit.wikimedia.org/r/335365

Change 335368 had a related patch set uploaded (by Ottomata):
recentchange id needs to be nullable too

https://gerrit.wikimedia.org/r/335368

Change 335368 merged by Ottomata:
recentchange id needs to be nullable too

https://gerrit.wikimedia.org/r/335368
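For illustration, the two validation failures quoted earlier and the effect of loosening the schema can be reproduced with a small stdlib-only check that mimics the JSON Schema `type` keyword. This is a sketch, not the actual patch: only the nullable `id` is named in the thread, while `example_params` and the choice to also accept arrays for it are assumptions made here.

```python
# Map JSON Schema type names to Python types (only the ones needed here).
JSON_TYPES = {
    "integer": int,
    "string": str,
    "object": dict,
    "array": list,
    "null": type(None),
}


def type_errors(event: dict, properties: dict) -> list:
    """Return one message per present field whose value matches no allowed type."""
    errors = []
    for name, spec in properties.items():
        if name not in event:
            continue
        allowed = spec["type"]
        if isinstance(allowed, str):
            allowed = [allowed]
        value = event[name]
        # bool is a subclass of int in Python, but it is not a JSON integer.
        ok = any(
            isinstance(value, JSON_TYPES[t])
            and not (t == "integer" and isinstance(value, bool))
            for t in allowed
        )
        if not ok:
            errors.append(f"{name}: {value!r} is not of type {allowed}")
    return errors


# Strict variant: every field has exactly one type.
STRICT = {"id": {"type": "integer"}, "example_params": {"type": "object"}}

# Flexible variant: id may be null; example_params (hypothetical field name)
# may be an object or an array.
FLEXIBLE = {
    "id": {"type": ["integer", "null"]},
    "example_params": {"type": ["object", "array"]},
}

# Values taken from the two log lines quoted in the thread.
event = {"id": None, "example_params": [162204055, 161455282, "20170131214827"]}
```

Calling `type_errors(event, STRICT)` reports both fields, while `type_errors(event, FLEXIBLE)` returns an empty list.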

Ottomata moved this task from In Progress to Done on the Analytics-Kanban board. Jan 31 2017, 10:32 PM
Nuria closed this task as Resolved. Feb 1 2017, 5:28 PM