
EventBus MVP
Closed, ResolvedPublic

Description

Over in T88459, and in a few recent meetings, we've fleshed out a sketch for getting standardized messages into Kafka for later consumption. We've coalesced on a way to move forward and on an MVP. This task will track the creation of the EventBus MVP.

Initial use cases

  1. Provide edit related events (ex: edit, creation, deletion, revision deletion, rename). Initially, these events will be consumed by RESTBase / a change propagation service (T102476, T111819), as well as analytics / research. Potential uses include a purge service, RCStream, and push notifications.
  2. EventLogging: Decode, validate and enqueue JSON events for EL.

See also: T84923.

Architecture Decisions

  • We will standardize on JSON Schema as our canonical schema spec, but do so in such a way that Avro can be used in Analytics type systems. Equivalent Avro Schemas may be generated as part of CI.
  • For the MVP, JSON data will be produced to Kafka. We may consider Avro binary later.
  • There will be a Kafka Topic -> Schema mapping, and only that schema can be produced to a topic.
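As an illustration only (the topic and schema names here are hypothetical, not settled conventions), the topic -> schema mapping might look like:

```javascript
// Hypothetical topic -> schema mapping of the kind the service would
// load from config on startup. Names are illustrative only.
const topicSchemaMap = {
  'mediawiki.revision-create': 'revision-create.json',
  'mediawiki.page-delete': 'page-delete.json'
};

// Only the mapped schema may be produced to a topic; produces to
// unmapped topics are rejected before they reach Kafka.
function schemaForTopic(topic) {
  const schemaPath = topicSchemaMap[topic];
  if (!schemaPath) {
    throw new Error('No schema mapped for topic: ' + topic);
  }
  return schemaPath;
}
```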

MVP Description

The MVP will consist of:

  • REST Service that validates JSON data against a schema and produces to Kafka.
  • Schema Repository Layout and Topic -> Schema mapping config that Service loads on startup.
  • A TBD use case implemented on top of this system.
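In rough outline, the service's validate-and-produce flow could be sketched as below. This is a sketch, not the actual service: the produce step is stubbed out, and the validator only checks a schema's `required` list where a real service would use a full JSON Schema validator and a Kafka client library.

```javascript
// Sketch of the MVP's request handling: validate an event against the
// topic's schema, then produce to Kafka or reject with a 400.
// The validator here only checks `required` fields; the produce step
// is a stand-in for a real Kafka producer.
function validateEvent(schema, event) {
  const missing = (schema.required || []).filter(f => !(f in event));
  return missing.map(f => 'missing required field: ' + f);
}

const produced = [];  // stand-in for a Kafka producer
function produce(topic, event) {
  produced.push({ topic, event });
}

function handleEvent(topic, schema, event) {
  const errors = validateEvent(schema, event);
  if (errors.length > 0) {
    return { status: 400, errors };
  }
  produce(topic, event);
  return { status: 201 };
}
```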

Things we could consider after the MVP:

  • Schema review and CI processes:
    • schema evolution rules
    • Auto Avro schema generation
    • Auto Avro java class generation
  • Schema metadata conventions (fields common to all schemas?)
  • Schema listing and discussion UI
    • Integrate with on-wiki schema storage for EventLogging?
    • MediaWiki Extension?

Other ideas

  • Schema lookup service

Event Timeline

ori added a comment.Oct 16 2015, 5:55 PM

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

Could you explain how you arrived at the figure of 50k requests per second, which you project for this service?

GWicke added a comment.EditedOct 16 2015, 6:16 PM

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

Could you explain how you arrived at the figure of 50k requests per second, which you project for this service?

This is @Ottomata's projection for analytics use cases. For core events, throughput should be of a lesser concern as rates will likely be in the low hundreds of messages per second.

  1. Already leverages a (really slick) JSON schema registry

Optionally fetching schemas from a URL isn't that hard really. Example code:

if (/^https?:\/\//.test(schema)) {
  // schema is a URL: fetch it over HTTP(S)
  return preq.get(schema);
} else {
  // otherwise treat it as a local file path
  return readFromFile(schema);
}

This lets us support files for core events, and fetching schemas from meta for EL. Schema validation is a call to a library.

The main reason I listed this as a benefit is that I don't understand why we need to distinguish between classes of events in this way (at the architectural level). Since EL already has an answer for a schema registry, it seemed like an advantage.

However, if we assume that we need an additional class of in-tree schemas, then the inverse is also true: it would be just as trivial to implement reading from the filesystem.

  1. Provides a pluggable, composable, architecture with support for a wide range of readers/writers

How would this be an advantage for the EventBus portion? Many third-party users will actually only want a minimal event bus, and EL doesn't seem to help with this from what I have seen.

For starters, it means that we have alternatives for environments where Kafka is overkill (small third-party installations, dev environments, mw-vagrant, etc). Using, for example, sqlite instead of Kafka is already supported.

There is also a tremendous amount of flexibility here, and even if we assume that we need none of that now, it's impossible to assume we never will. Having the ability to compose arbitrary event stream topologies, from/to a wide variety of sources/sinks, multiplex, and add in-line processing, sounds like a great set of capabilities to base such a project on.

  • schema registry availability

There are more concerns here than just availability (although that's important, too).
Third party users won't necessarily want to give their service access to the internet in order to fetch schemas. We need to provide a way to retrieve a full set of core schemas, and a git repository is an easy way to achieve this.

Third parties could use our schema registry, or use the same extension we do, to host one of their own. Or, (as mentioned elsewhere), we could export snapshots of the relevant schemas via CI to ship alongside the code (this seems safe, as a revision is immutable).

We also need proper code review and versioning for core schemas, and wikis don't really support code review. We could consider storing pointers to schemas (URLs) instead of the actual schemas in git, but this adds complexity without much apparent benefit:

I would say that both versioning and review are well covered here. I get your point that it's not as specialized as code review tooling might be, but wikis are an established means for collaboration.

Workflow with schemas in git:

  1. create a patch with a schema change
  2. code review

Workflow with pointers to schemas (URLs) in git:

  1. save a new schema on meta; note revision id
  2. create a patch with a schema URL change
  3. code review

That doesn't seem too onerous to me.

For performance, it needs to be Good Enough(tm), where Good Enough should be something we can quantify based on factors like latency, throughput, and capacity costs that aren't prohibitively expensive when weighed against other factors (e.g. engineering effort).

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

I always find these things difficult to quantify. There are so many variables. If hypothetically speaking, it only saved us a week, what is that worth? What could we do with another week (lost opportunity costs)?

Also, how do you quantify the value of using a piece of software that other teams are already using? Where you have a wider set of active developers, and more eyes on it? Where ops is already familiar with it?

I don't pretend to know the answers to these.

This comment was removed by ori.
This comment was removed by ori.
mobrovac removed a subscriber: gerritbot.
GWicke added a comment.EditedOct 16 2015, 8:26 PM

For starters, it means that we have alternatives for environments where Kafka is overkill (small third-party installations, dev environments, mw-vagrant, etc). Using, for example, sqlite instead of Kafka is already supported.

As far as I can see, there is no support for using any database as a queue / log in a way that would give us a light-weight alternative to Kafka. I see no support for streaming from a database in EventLogging, and separate tables are created whenever a schema is changed.

So, we'll have to implement this either way. We do have fairly nice async table abstractions for sqlite and cassandra that we could reuse for this in node. Both already implement retention policies. Python has sqlalchemy, which is a pretty nice way to interface with dbs. Retention policies would have to be implemented manually.

Another consideration is that the EventLogging Python code is synchronous, while the node code is async. Efficiently supporting many concurrent streaming clients will likely be difficult using the EL code.

A PR adding remote schema support to the nodejs frontend is now available at https://github.com/wikimedia/restevent/pull/1. This means that we can now choose to use local or remote schemas per-topic in the configuration.

Hey yalls,

I've had requests that we postpone the RFC for this one more week, until Oct 28th. I'd like for one opsen and @ori to be able to attend, and the relevant opsens are all traveling, and Ori can't make this one either.

So, we need to be really careful here. This MVP as yet has zero buy-in from anyone in ops. In addition, both @ori and @Eevans point out that EventLogging already does everything that this MVP encompasses, minus the HTTP service part. Now it is time for me to chime in too, woowee!

Could you explain how you arrived at the figure of 50k requests per second, which you project for this service?

This is just an arbitrary goal, some number we came up with. I'd like to be able to encourage developers to use EventBus for everything they can think of.

We've scaled EventLogging to about 10k / second by using Kafka, but that is only on a single node. EventLogging is horizontally scalable. Need more throughput? Add more partitions and processors.

In addition, the EventLogging processors are doing more than just validating JSON messages. They are parsing the JSON data out of encoded query strings via regexes, wrapping the incoming event data with generic metadata, anonymizing IP addresses using a shared rotating salt key from etcd, sending invalid events off to Kafka as EventErrors, etc. etc.
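As a toy illustration of that kind of processor step (simplified: the real code parses via regexes, anonymizes IPs with a rotating salt from etcd, and routes invalid events to an EventError topic; the field names below are illustrative):

```javascript
// Toy EventLogging-style processor step: extract the JSON event from a
// percent-encoded query string and wrap it with generic metadata.
function processRawLine(rawQueryString) {
  const event = JSON.parse(decodeURIComponent(rawQueryString));
  return {
    event: event,
    recvFrom: 'varnish',        // illustrative metadata fields
    dt: new Date().toISOString()
  };
}
```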

We also need proper code review and versioning for core schemas, and wikis don't really support code review. We could consider storing pointers to schemas (URLs) instead of the actual schemas in git, but this adds complexity without much apparent benefit:

I think this is true, especially for the 'production' use case of EventBus. EventLogging was originally designed for analytics use cases, some of which are short-lived one-offs (A/B testing, whatever). Making quick changes via a wiki is awesome for this. Having more control over changes to production schemas sounds like a good idea.

However, if we assume that we need an additional class of in-tree schemas, then the inverse is also true; It would be just as trivial to implement reading from the filesystem.

Agreed, it would be trivial to add filesystem-based schemas to EventLogging. In fact, this is sort of already done, via the cached schema system. Schemas needed for unit testing are hardcoded into the source and manually inserted into the in-memory schema cache. We could do the same thing with a filesystem tree of schemas: preload them into the in-memory cache. When asked to validate against a schema from the filesystem, EventLogging wouldn't even bother trying to reach out to meta, since it would already be in the in-memory cache. See: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/schema.py#L64
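In outline, that preloading idea looks something like this (a node sketch for brevity; the real implementation is the Python code linked above, and the names here are illustrative):

```javascript
// Sketch of the preloaded schema cache: file-based schemas are inserted
// into the in-memory cache at startup, so lookups for them never reach
// out to meta. Only a cache miss would trigger a remote fetch.
const schemaCache = new Map();

function preloadLocalSchemas(schemas) {
  for (const [name, schema] of Object.entries(schemas)) {
    schemaCache.set(name, schema);
  }
}

function getSchema(name) {
  if (schemaCache.has(name)) {
    return schemaCache.get(name);  // cache hit: no network round trip
  }
  // Cache miss: a real implementation would fetch from meta here.
  throw new Error('would fetch ' + name + ' from meta');
}
```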

Provides a pluggable, composable, architecture with support for a wide range of readers/writers

How would this be an advantage for the EventBus portion? Many third-party users will actually only want a minimal event bus, and EL doesn't seem to help with this from what I have seen.

It does, no? EventLogging is already a usable extension in third-party MediaWiki installations. Kafka isn't needed to use EventLogging at all.

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

As mentioned here, here, here, here, and here, comparing the performance now is interesting, but provides little insight as to how this system will perform in the real world with more features. EventLogging is doing more than just accepting a JSON message and validating it against a schema. In any case, this system will need to be horizontally scalable. As noted, the production use case will be much lower volume than the analytics one. The performance of all the solutions we evaluated in T88459 is suitable for the production use case, especially since they are all horizontally scalable.

JanZerebecki moved this task from incoming to hold on the Wikidata board.Oct 21 2015, 12:48 PM
GWicke added a comment.EditedOct 21 2015, 10:05 PM

We are having a hangout meeting tomorrow (Thursday, 22nd) between 11am and 12pm SF time. Please let us know if you'd like to join.

Task: T116247: Define edit related events for change propagation

Agenda:

The EventBus MVP [1] is moving along and we can now validate and enqueue messages [2]. The next step is to define the shape of the schemas against which the messages ought to be validated. Aaron's event definitions seem to be a great starting point for the discussion.

[1] https://phabricator.wikimedia.org/T114443
[2] https://github.com/wikimedia/restevent
[3] https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_public_event_datasource

@GWicke I would be interested to participate. I'll be in the office, could you add me to the invite?

@GWicke I would be interested to participate. I'll be in the office, could you add me to the invite?

Done.

brion added a subscriber: brion.Oct 28 2015, 8:29 PM

Today's EventBus RFC discussion ended with the general consensus that we will implement this project in EventLogging.

This is a large project that is meant to be useful to many teams. Whatever the final implementation, there will need to be a path forward for deprecating existing use cases (EventLogging, RCStream, etc.) in favor of this system.

Given this project's cross-team generality, the need to port existing use cases, and barring any practical or technical reasons not to, we will adapt EventLogging to include an HTTP REST Service.

We may need to conform some of our schema designs to EventLogging, and there may be other unknowns that we will discover as we work on the implementation.

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
GWicke added a comment.EditedNov 3 2015, 2:25 AM

@Ottomata: In my recollection of the discussion & the log you linked to, the question of which REST producer proxy to use was left open. Our priority is to get basic events into Kafka before the end of this month, so that we can start building on top of this for change propagation. We still haven't finalized the event definitions & still need to tackle the MediaWiki integration, so there isn't really a lot of time left. We have a simple node service that does what we need & integrates with our node infrastructure, but if you have something based on EventLogging soon then we can consider using that too. Let's just make sure that the APIs are compatible & make sense in the longer term.

FWIW, one does not exclude the other: the EL-based service can be used in production, while the node-based REST proxy may be used for development and/or small installs.

In my recollection of the discussion & the log you linked to, the question of which REST producer proxy to use was left open.

I think you may be referring to the first link of the meetbot notes, which ended before we stopped discussing. Starting at [19:17:26] <robla> in the chat logs, it seems clear to me that the consensus is that unless there are good reasons to ditch something that already does most of what this project is about, then we should adapt what we are already using. If I'm mistaken, please correct me.

if you have something based on EventLogging soon then we can consider using that too. Let's just make sure that the APIs are compatible & make sense in the longer term.

Getting closer here, need some help on the API big time. Will also need to revisit some meta schema design things over on T116247 to make things easier for EventLogging.

Our priority is to get basic events into Kafka before the end of this month, so that we can start building on top of this for change propagation

@GWicke, I think this may be a problem. From my perspective, the goal of this project is a generalized event service with well designed and standardized schemas for all of WMF. For this MVP, we have chosen to model change events because that is what you are interested in. This is an 'MVP', and will likely require iteration after the first deployment. I don't think having a live services production goal based on this is realistic.

FWIW, one does not exclude the other: the EL-based service can be used in production, while the node-based REST proxy may be used for development and/or small installs.

That is one of the reasons for sticking with EventLogging. It is already useable by small installs without Kafka.

Ok, still various TODOs around the code, but this is ready for review.
https://gerrit.wikimedia.org/r/#/c/235671

There are concepts that it'll be good to do close review with folks familiar with EventLogging (probably @Nuria or @Milimetric), and other that I'd really like services folks to look at @mobrovac and/or @Eevans?

@GWicke, I think this may be a problem. From my perspective, the goal of this project is a generalized event service with well designed and standardized schemas for all of WMF. For this MVP, we have chosen to model change events because that is what you are interested in. This is an 'MVP', and will likely require iteration after the first deployment. I don't think having a live services production goal based on this is realistic.

I don't see these two as being mutually-exclusive. In order to meet the end goal of a generalised event service we are starting with the Services' use case. The MVP is part of one of our quarterly goals. We have almost finalised the events and almost settled on the hardware, so from our point of view we are ready to start building our change propagation system which relies on the basic edit events.

And, as you state, the MVP is going to be a first stab at this (a prototype of sorts) which will be improved upon as other services/systems are ported to it. Which points make you think having an MVP up and running this quarter is unrealistic?

Nuria added a comment.Nov 4 2015, 4:20 PM

I don't see these two as being mutually-exclusive. In order to meet the end goal of a generalised event service we are starting with the Services' use case. The MVP is part of one of our quarterly goals. We have almost finalised the events and almost settled on the hardware, so from our point of view we are ready to start building our change propagation system which relies on the basic edit events.

I sure hope we are not thinking of having a node REST endpoint and another one based on EventLogging at the same time. More than for technical reasons, it really makes me think that we cannot collaborate. After our IRC meeting last Friday it was clear that the majority of attendees favor a solution based on EL, and I was under the impression this is what we were going for.

I sure hope we are not thinking of having a node REST endpoint and another one based on EventLogging at the same time. More than for technical reasons, it really makes me think that we cannot collaborate.

I was talking about the consumer side of things in my previous post, not about the producer side. Our QG is about the whole pipeline: (a) producer; (b) kafka cluster; and (c) change propagation system as a consumer. The way I see it, we have discussed / worked on (a) and (b), so now we'd like to get started on (c) in order to meet our goal.

After our irc meeting of last Friday It was clear that the majority of attendees favor a solution based on EL and i was under the impression this is what we were going for.

As for the python vs node REST proxy discussion, they are supposed to be functionally equal, i.e. they should be interchangeable. Given that the node proxy is ready to use, I don't see harm in using it, allowing us to complete this quarter's goal.

That makes me think - how would the python service be deployed? Does that need some extra (puppet) work? In the case of the node proxy, that's a matter of writing 5 lines in ops/puppet.

Joe added a comment.Nov 4 2015, 4:43 PM

@mobrovac so let me get this straight, we discussed something that was already overridden by an existing implementation?

As far as deploying the python app, who is working on it? I think I can help with deployment/development of the server glue.

@mobrovac so let me get this straight, we discussed something that was already overridden by an existing implementation?

That's right. There's a node implementation which is ready to be used, and there's a WIP python effort. They both strive to fulfil the MVP's requirements from this task's description. The former was created by the Services team as a quick-start solution (not implying this entails lesser quality). The latter is envisioned to be part of the EL codebase, even though, from what I can tell, it does not interact with EL directly (please correct me if I'm wrong).

So, yeah, we have two things doing the same thing because of the open question: should we reuse (parts of) EL for the event bus system?

faidon added a comment.Nov 4 2015, 5:38 PM

We have a simple node service that does what we need & integrates with our node infrastructure, but if you have something based on EventLogging soon then we can consider using that too.

So either someone else should make it for you (soon) or you'll just use your own thing? No, it doesn't work like that. The entire point of the RFC meeting was so that we could all agree to what we want out of this and find acceptable, make our compromises and make a decision about the direction that we'll go forward to.

This happened, and we first and foremost widely agreed of this being aimed as a single product that will "unify the set of partial and divergent implementations that currently exist". I don't think you've proposed (yet?) a plan for replacing EventLogging for all of its existing use cases — if you do so, we can have that conversation based on those merits. Until then, I don't see why we are even discussing this "node implementation".

Or in other words: the flip side of what you wrote is "we have a complicated piece of infrastructure that has been worked on for years, is battle tested and is actively used for a number of different use cases already — but if you can make restevent reach feature parity with that system soon then we can consider using that too".

So either someone else should make it for you (soon) or you'll just use your own thing? No, it doesn't work like that. The entire point of the RFC meeting was so that we could all agree to what we want out of this and find acceptable, make our compromises and make a decision about the direction that we'll go forward to.
This happened, and we first and foremost widely agreed of this being aimed as a single product that will "unify the set of partial and divergent implementations that currently exist". I don't think you've proposed (yet?) a plan for replacing EventLogging for all of its existing use cases — if you do so, we can have that conversation based on those merits. Until then, I don't see why we are even discussing this "node implementation".
Or in other words: the flip side of what you wrote is "we have a complicated piece of infrastructure that has been worked on for years, is battle tested and is actively used for a number of different use cases already — but if you can make restevent reach feature parity with that system soon then we can consider using that too".

Soooo, I think there's a mix of short-term needs and long-term requirements which do not go hand in hand and we seem to be juggling mostly around them.

Here's the deal the way I see it. Yes, sure, +1k for:

we first and foremost widely agreed of this being aimed as a single product that will "unify the set of partial and divergent implementations that currently exist"

That's the long-term plan. As we agreed in the meeting, not everything can be converted now or soon. What we (=== Services team) have committed to doing this quarter is creating the change propagation system which aims at replacing the (hacky) RestbaseUpdateJobs extension. And that is only a first use case that is to be based on the EventBus MVP outlined in this task. Since the node REST proxy is ready to use, we feel we should use that in the interim so that we can continue work on our goal. To be explicit: I'm not saying we're dismissing the RFC discussions and don't want to collaborate with others. Our ultimate goal is exactly what you described - a unified event bus system for the whole organisation - and only an org-wide consensus will bring us home. But we have to make (small-ish) compromises in the short term in order to meet our QG.

GWicke added a comment.EditedNov 5 2015, 2:23 AM

@faidon: Until very recently (last days), there wasn't actually an EventBus-like REST proxy with schema validation in the EventLogging repository. @Ottomata now has a patch implementing such a service, and @mobrovac has left comments on it today. So, it looks like we'll have the option of choosing between two new services implementing the same API. I don't see having two implementations of a simple service as a bad thing. As mentioned, we might want to use a single node process exposing parsoid, restbase & eventbus for small (third party) installs, but might as well use the new EventLogging service in production.

There are still loose ends to be tied up in the API and event schema definitions, and I think that should be our focus. The implementation deserves attention too, but it's easy to swap, and each is only a few hundred lines.

Replacing all of EventLogging is pretty much out of scope for EventBus. The focus is on queuing and event validation, and not on other EventLogging features like Varnish log decoding, analytics databases etc. If desired, we could fairly easily add HTTP event production in EL, which would write to EventBus instead of directly to Kafka. However, I personally think it's fine to let trusted producers write directly to Kafka, especially for internal applications. The current EL instance is producing to a separate (analytics) Kafka cluster in any case, so there is no potential for conflicts with non-analytics use cases.

Until very recently (last days), there wasn't actually an EventBus-like REST proxy with schema validation in the EventLogging repository.

Not quite true, this was started Sept 3.
https://phabricator.wikimedia.org/T88459#1601022

faidon added a comment.Nov 5 2015, 4:20 PM

More importantly, I don't understand why this is something Andrew has to do (and "soon") and not the services team "or else".

Why is it a given that the Services team is going to exclusively work on their choice of tech and, if consolidation is required, someone else must adapt their world to yours (and make it "a joint effort") to achieve that?

Nuria added a comment.Nov 5 2015, 4:27 PM

As mentioned, we might want to use a single node process exposing parsoid, restbase & eventbus for small (third party) installs, but might as well use the new EventLogging service in production.

To date we do not have a small third-party install use case, but rather an internal production one (the edit stream), so let's focus on that, and thus on adapting the EL change.

I don't see having two implementations of a simple service as a bad thing.

I certainly disagree. I could see how we could have (for testing) a mock node REST endpoint service for, for example, a vagrant role, but I cannot see two full-fledged systems doing the same thing as a positive outcome; rather, it signals to me "duplication of effort".

What we (=== Services team) have committed on doing this quarter is creating the change propagation system which aims at replacing the (hacky) RestbaseUpdateJobs extension. A

A quarterly goal is not good in itself; it is a means to provide value to the organization. In this case duplicating efforts is producing technical debt and a lot of friction.

Hi all, I talked to @GWicke a little bit more about this last Thursday. He impressed upon me a couple of good points I hadn't fully taken in before, and I want to recognize them.

Simplicity - Services is concerned that the REST service needs to be very reliable and not buggy. restevent is simple, and if it never needs to do anything beyond this MVP, it will not need many changes or deploys throughout its lifetime. Conversely, EventLogging is a codebase that is often worked on and improved. Using this established and more featureful codebase provides a lot of benefit, but brings with it risk of instability due to changes.

EventLogging deprecation - A big concern expressed at the RFC was the proliferation of systems and the effort needed to port old systems over to EventBus if we were to use restevent. EventBus is about 2 things: standardizing WMF events, and getting valid events into a pub/sub for many consumers to use. The scope of this MVP does not include consumption of events, and much of the EventLogging codebase is about consuming, not producing. Using restevent does not mean that EventLogging will be deprecated. EventLogging does much more than restevent ever would. There will be other systems that will need to be ported to EventBus, but this is true independent of the EventLogging vs restevent discussion.

That said, I still think we should move forward with EventLogging as we have been and as the general consensus in the RFC indicated. I have discussed these points with some folks, and even though they are valid concerns, I don't think that they outweigh the pros of using and improving an established working system. The risk of instability due to active development of EventLogging can be addressed with common release management practices. I.e. we can version well and deploy only stable releases to the HTTP service. And even though we wouldn't need to port EventLogging to use restevent, there is duplication of effort here, as EventLogging is built to do most of what this MVP is about.

Also, I'd like to note that building an HTTP produce service that fits this MVP in EventLogging is more work than is in restevent. This is mostly due to the schema constraints (i.e. EventCapsule) that EventLogging was originally built with. The work we are doing to make EventLogging work with more generic metadata is valuable beyond just this MVP, so I think it is worth it.

To be explicit: I'm not saying we're dismissing the RFC discussions and don't want to collaborate with others. Our ultimate goal is exactly what you described - a unified event bus system for the whole organisation - and only an org-wide consensus will bring us home. But we have to make (small-ish) compromises in the short term in order to meet our QG.

I think this is one of the sources of conflict. I originally proposed a very generic solution to an org-wide problem, and Analytics has committed to work on the initial infrastructure MVP that solves the generic problem this quarter. Services wants to use the solution that solves the org-wide problem, but has additionally committed to a goal that depends on the generic solution. Services wants something that works for them now, and will make it work for others later. Analytics is interested in the more generic problems first.

The Services goal plus the fact that EventLogging is more work is worrisome for Services. I believe that we can get EventLogging ready in time to meet Services' needs, but I'm not excited about promising it. As a back-up, and with Ops' go-ahead, I think it would be OK to use restevent as a stand-in until EventLogging is ready, especially for the month of December, so that Services can continue to work on their goals.

@GWicke, @mobrovac, @Eevans, and @Nuria, I propose we set up a twice-a-week-EventBus-standup-sync-up-party-meeting to help us better collaborate and stay in sync. How about Mondays and Thursdays?

Hi all, I talked to @GWicke a little bit more about this last Thursday. He impressed upon me a couple of good points I hadn't fully taken in before, and I want to recognize them.
Simplicity - Services is concerned that the REST service needs to be very reliable and not buggy. restevent is simple, and if it never needs to do anything beyond this MVP, it will not need many changes or deploys throughout its lifetime. Conversely, EventLogging is a codebase that is often worked on and improved. Using this established and more featureful codebase provides a lot of benefit, but brings with it risk of instability due to changes.
EventLogging deprecation - A big concern expressed at the RFC was the proliferation of systems and the effort needed to port old systems over to EventBus if we were to use restevent. EventBus is about two things: standardizing WMF events, and getting valid events into a pub/sub for many consumers to use. The scope of this MVP does not include consumption of events, and much of the EventLogging codebase is about consuming, not producing. Using restevent does not mean that EventLogging will be deprecated; EventLogging does much more than restevent ever would. There will be other systems that need to be ported to EventBus, but this is true independent of the EventLogging vs. restevent discussion.

I'd also add to this list the out-of-the-box support for:

  • worker monitoring / automatic restarting
  • easy configuration
  • logging and metrics support
  • easy and quick deployment in production

I think this is one of the sources of conflict. I originally proposed a very generic solution to an org-wide problem, and Analytics has committed to building the initial infrastructure MVP that solves that generic problem this quarter. Services wants to use that solution too, but has additionally committed to a goal that depends on it. Services wants something that works for them now, and will make it work for others later; Analytics is interested in solving the more generic problem first.

We are too, but we need the change-propagation system not only because it's our QG, but also because it allows us to continue our work in the services segment (most notably, pre-generation for back-end services).

@GWicke, @mobrovac, @Eevans, and @Nuria, I propose we set up a twice-a-week-EventBus-standup-sync-up-party-meeting to help us better collaborate and stay in sync. How about Mondays and Thursdays?

Having weekly meetings seems like a good idea. How about we start once per week and take it from there?

Sounds good. Shall I just find a time and set one up?

Just made a calendar event for Tuesday at 10:30 PST. Happy to move it if some other time is better.

Deskana moved this task from Needs triage to Tracking on the Discovery board.Dec 3 2015, 7:20 PM

Will the MVP include being publicly accessible, i.e. anyone on the Internet can run a consumer?

Will the MVP include being publicly accessible, i.e. anyone on the Internet can run a consumer?

I suspect not, although I do hope the architecture is designed in such a way that websocket proxies (or even straight Kafka proxies, with authentication!) are easy to set up :)

No, consumption is not part of the MVP.

There may be future work to make consumption from Kafka via websockets easy to set up, but we will not make any events public by default. We will have to set up dedicated endpoints for approved event streams.
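The "approved event streams" idea above could be as simple as an allowlist that any future public relay consults before exposing a Kafka topic. A hypothetical sketch (no such endpoint or stream names exist in the MVP):

```python
# Purely illustrative: gate public consumption behind an explicit allowlist.
# Stream names here are hypothetical examples, not real configuration.
APPROVED_STREAMS = {"mediawiki.recentchange"}

def events_for_public(stream, kafka_messages):
    """Yield messages only for streams explicitly approved for public access."""
    if stream not in APPROVED_STREAMS:
        raise PermissionError("stream %r is not public" % stream)
    yield from kafka_messages

# An approved stream passes messages through unchanged.
msgs = list(events_for_public("mediawiki.recentchange", [b'{"type":"edit"}']))
assert msgs == [b'{"type":"edit"}']
```

The default-deny shape matters here: a stream is private unless someone deliberately adds it to the allowlist.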

Ottomata moved this task from Backlog to In Progress on the EventBus board.Feb 1 2016, 4:58 PM
Milimetric moved this task from Analytics Query Service to Radar on the Analytics board.

We will resolve this after T120212 is closed, and after we have the first consumer (change propagation) in production.

@mobrovac - I'm confused, why don't you think T120212 is a blocker for this?

@mobrovac - I'm confused, why don't you think T120212 is a blocker for this?

It is, but it's an indirect one: it is blocking T116786: Integrate eventbus-based event production into MediaWiki, which is a blocker for this task and whose resolution depends solely on T120212.

I realize that the blocking relationship is transitive, but given Otto's comment (T114443#2072426), it seems clearer to make the blocking relationship explicit rather than obscuring it in a hierarchy. Would you mind if I put T120212 as a direct blocker for this task?

I realize that the blocking relationship is transitive, but given Otto's comment (T114443#2072426), it seems clearer to make the blocking relationship explicit rather than obscuring it in a hierarchy. Would you mind if I put T120212 as a direct blocker for this task?

Nope. {{done}}

mobrovac closed this task as Resolved.Apr 27 2016, 7:12 PM

And we're done here!

Krinkle edited projects, added TechCom-RFC (TechCom-Approved); removed TechCom-RFC.
Krinkle moved this task from Untriaged to Implemented on the TechCom-RFC (TechCom-Approved) board.