Bikeshed what events should be exposed in public EventStreams API
Closed, Resolved. Public. 0 Story Points.

Description

EventStreams is moving along, and we need to figure out which streams of events should be exposed in the public API (other than recentchanges, which will be exposed for sure).

I had previously just considered exposing as much as we can, but there may be reasons not to do so (redundancy of data API endpoints is one of them).

In Kafka, we currently have available:

  • revision-create
  • revision-visibility-change
  • page-move
  • page-delete
  • page-undelete
  • page-properties-change
  • resource-change
  • user-blocks-change

There are more besides these. The schemas for these events are defined at https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki. Should we include all or some of these? Should we somehow compose these (via change-prop) into different event streams with different schemas altogether (e.g. an edit stream)?

Event Timeline

Ottomata created this task. Nov 1 2016, 7:50 PM

@Ottomata Is recentchanges not yet included?

Not yet, haven't had time.

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board. Nov 2 2016, 3:13 PM
Glaisher removed a subscriber: Glaisher. Nov 2 2016, 3:43 PM
Ottomata moved this task from In Progress to Paused on the Analytics-Kanban board. Nov 8 2016, 4:02 PM

Bringing some comments from @GWicke over from T152030:

Revision-create is lower level than what most of our users want. For example, a move might create one or two revisions. The user might prefer to receive one "move" event instead.

I would not treat it just as a legacy format, but as a starting point for a gradual evolution of probably the most important feed provided publicly in EventStreams.

Are you arguing that RecentChanges format is the future, not the event schemas we have spent time bikeshedding and improving over the last year +?

The target audiences / use cases for those streams overlap, but are not completely the same. The internal schemas are mostly lower level, provide a lot more detail, and can also include sensitive information (such as revision suppressions / deletions). The public RCStream feed merges a lot of these events into a single stream, omits sensitive information, and might add other information (ex: ORES scores) for the benefit of editors.

In that ticket, Gabriel and I were mostly arguing about whether we should expose RecentChanges as is, or make a new stream that is composed of the newer more detailed mediawiki event-schemas events that flow through eventbus. There is some confusion, at least from me, because I had expected all along that we would eventually expose the eventbus events as they are.

I think this is a healthy conflict between the desire to have clean (and not redundant) APIs, and the Analytics team's mission to make as much data available to the public as is safe. The eventbus events are cleaner and more detailed than RecentChanges, and have been designed to not contain private data. The eventbus events may not cover all the same events yet, but they could. I had been looking at RecentChanges as the legacy format of presenting Mediawiki change streams, and the eventbus events as the newer and more desirable format. I had thought others felt the same, but perhaps not!

I guess we need to find out. An Analytics Q3 goal is to launch EventStreams and announce the RCStream deprecation. In order to do this, we need to have the RecentChange stream as it is for backwards compatibility. But EventStreams can and should expose more than just RecentChanges. If it shouldn't expose revision-create and other eventbus events, then what should it expose?

@Nuria and I chatted a bit about this the other day. We want to move forward with the EventStreams launch, but without the eventbus events exposed (or at least without announcing them; perhaps we would expose revision-create if it is useful for ORES, need to talk with @Halfak). In the announcement, we'd then solicit opinions about new data streams that would be useful for the community. Questions like: do you want all change events in the same stream? Are there mediawiki event-schemas that exist as is that you want now? Are there new ones we should make, or combine? Etc.

BTW, we can make EventStreams consume from multiple topics at once, so if we want to present a unified stream (of different schemas), that is possible now.
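To make the "unified stream of different schemas" idea concrete, here is a minimal sketch in plain Python. It is not EventStreams code: the function name `unified_stream`, the `dt` and `topic` fields, and the sample events are all assumptions for illustration. It assumes each topic is already ordered by its own timestamp and that timestamps are ISO 8601 strings in one format, so they sort lexicographically.

```python
import heapq
from typing import Dict, Iterable, Iterator


def unified_stream(topics: Dict[str, Iterable[dict]]) -> Iterator[dict]:
    """Merge per-topic event iterables into one stream ordered by timestamp.

    Hypothetical sketch: each event is a dict with a "dt" timestamp, and
    each per-topic iterable is assumed to be ordered by "dt" already.
    """
    def tagged(topic: str, events: Iterable[dict]) -> Iterator[dict]:
        # Tag every event with its source topic so consumers can tell
        # the different schemas apart in the merged stream.
        for event in events:
            yield {**event, "topic": topic}

    streams = [tagged(name, events) for name, events in topics.items()]
    # heapq.merge lazily interleaves already-sorted inputs.
    return heapq.merge(*streams, key=lambda event: event["dt"])


if __name__ == "__main__":
    sample = {
        "revision-create": [
            {"dt": "2016-11-01T00:00:01Z", "rev_id": 1},
            {"dt": "2016-11-01T00:00:03Z", "rev_id": 2},
        ],
        "page-move": [
            {"dt": "2016-11-01T00:00:02Z", "page_id": 7},
        ],
    }
    for event in unified_stream(sample):
        print(event["topic"], event["dt"])
```

A consumer of such a stream would have to branch on the topic tag, since each topic keeps its own schema; that is the trade-off against composing a single new schema.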

Gabriel raised the question of deletions and suppressions. Currently, all revisions are exposed by both existing public RecentChanges feeds. Future deletes can't erase the fact that people will have already consumed those revisions (without content, of course). Even so, Gabriel brings up the fact that EventStreams makes historical data consumable as far back as we keep data in Kafka (right now: 7 days). There are some ideas around allowing arbitrary timestamp-based consumption by storing data indefinitely, but that might not be possible if we want to prevent consumption of deleted revisions, at least after they are deleted. Anyway, we won't be doing this anytime soon, but if we wanted to, we would need to do more research (e.g. does Kafka log compaction help us?).
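As a rough illustration of the constraint discussed here, the sketch below replays events from a requested timestamp, clamps the start to the retention window, and skips revisions that have since been deleted. It is plain Python, not Kafka or EventStreams code; `replay`, `deleted_rev_ids`, and the `dt` and `rev_id` fields are hypothetical names, and only the 7-day figure comes from the comment above.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, Set

# From the comment above: Kafka currently keeps 7 days of data.
RETENTION = timedelta(days=7)


def replay(
    events: Iterable[dict],
    since: datetime,
    now: datetime,
    deleted_rev_ids: Set[int],
) -> Iterator[dict]:
    """Yield retained events at or after `since`, skipping deleted revisions.

    Hypothetical sketch: events are dicts with a datetime "dt" and an
    optional integer "rev_id".
    """
    # A consumer cannot go further back than the retention window.
    start = max(since, now - RETENTION)
    for event in events:
        if event["dt"] < start:
            continue
        # Drop revisions that were deleted after the event was produced.
        if event.get("rev_id") in deleted_rev_ids:
            continue
        yield event


if __name__ == "__main__":
    now = datetime(2016, 12, 8, tzinfo=timezone.utc)
    history = [
        {"dt": now - timedelta(days=10), "rev_id": 1},
        {"dt": now - timedelta(days=3), "rev_id": 2},
        {"dt": now - timedelta(days=2), "rev_id": 3},
    ]
    for event in replay(history, now - timedelta(days=30), now, {3}):
        print(event["rev_id"])
```

Filtering at read time like this only works if the set of deleted revisions is available to the serving layer, which is part of the research that indefinite storage would require.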

Ottomata moved this task from Paused to In Progress on the Analytics-Kanban board. Dec 8 2016, 4:05 PM
Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. Dec 20 2016, 8:20 PM

Per our Dev Summit meeting and plan (ReviewStream = review-stream-revision-create + log-events), we will also need T155804: log-events topic emitted in EventBus.

In theory, neither log-events nor review-stream-revision-create needs to be public for Collaboration team purposes.

But I think @Ottomata is considering making everything that can be public actually public, which means log-events and review-stream-revision-create would be new for this list.

The reason we may want EventBusWikiChangeEventsNewInfra (a hypothetical replacement for RCStream on the new infrastructure, with no ORES) is that we think there may be a use case for tools that want to get revision data without waiting for ORES.

For example, imagine a tool like ClueBot, that does its own separate AI. I don't know if they actually use RCStream, but the use case of not wanting to wait for ORES seems valid.

Nuria added a comment. Jan 20 2017, 3:57 PM

+1 to @Mattflaschen-WMF's comment. There is also an operational argument. It seems that RCStream is a tier-1 "service" (or rather, data feed). Without it, several bots cannot function properly. However, I am not sure that ORES is supported at a tier-1 level, so we should definitely have a feed that does not rely on ORES data.

I'd use this for precaching scores in our experimental deployment of ORES. We currently use production ChangeProp to keep production ORES up to date. I'd like to see production-like ChangeProp in labs. I'll be happy with just revision-create for use in ORES for this use-case right now.

I think it certainly makes sense to have a separate event for revision-scored or something like that.

Nuria edited projects, added Analytics; removed Analytics-Kanban. May 25 2017, 4:07 PM
Nuria moved this task from Incoming to Radar on the Analytics board. May 25 2017, 4:14 PM
Krinkle moved this task from Inbox to Backlog on the Wikimedia-Stream board. Jun 22 2017, 8:34 PM
Ottomata closed this task as Resolved. Mar 28 2018, 8:56 PM

Open too long, we mostly good here. :)