
EventBus MVP
Closed, ResolvedPublic

Description

Over in T88459, and in a few recent meetings, we've fleshed out a sketch for getting standardized messages into Kafka for later consumption. We've coalesced on a way to move forward and on an MVP. This task will track the creation of the EventBus MVP.

Initial use cases

  1. Provide edit related events (ex: edit, creation, deletion, revision deletion, rename). Initially, these events will be consumed by RESTBase / a change propagation service (T102476, T111819), as well as analytics / research. Potential uses include a purge service, RCStream, and push notifications.
  2. EventLogging: Decode, validate and enqueue JSON events for EL.

See also: T84923.

Architecture Decisions

  • We will standardize on JSON Schema as our canonical schema spec, but do so in such a way that Avro can be used in Analytics type systems. Equivalent Avro Schemas may be generated as part of CI.
  • For the MVP, JSON data will be produced to Kafka. We may consider Avro binary later.
  • There will be a Kafka Topic -> Schema mapping, and only that schema can be produced to a topic.
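As an illustration only (the topic and schema names here are hypothetical, not settled conventions), the topic -> schema mapping might look like:

```javascript
// Hypothetical topic -> schema mapping of the kind the service would
// load from config on startup. Names are illustrative only.
const topicSchemaMap = {
  'mediawiki.revision-create': 'revision-create.json',
  'mediawiki.page-delete': 'page-delete.json'
};

// Only the mapped schema may be produced to a topic; produces to
// unmapped topics are rejected before they reach Kafka.
function schemaForTopic(topic) {
  const schemaPath = topicSchemaMap[topic];
  if (!schemaPath) {
    throw new Error('No schema mapped for topic: ' + topic);
  }
  return schemaPath;
}
```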

MVP Description

The MVP will consist of:

  • REST Service that validates JSON data against a schema and produces to Kafka.
  • Schema Repository Layout and Topic -> Schema mapping config that Service loads on startup.
  • A TBD use case implemented on top of this system.
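In rough outline, the service's validate-and-produce flow could be sketched as below. This is a sketch, not the actual service: the produce step is stubbed out, and the validator only checks a schema's `required` list where a real service would use a full JSON Schema validator and a Kafka client library.

```javascript
// Sketch of the MVP's request handling: validate an event against the
// topic's schema, then produce to Kafka or reject with a 400.
// The validator here only checks `required` fields; the produce step
// is a stand-in for a real Kafka producer.
function validateEvent(schema, event) {
  const missing = (schema.required || []).filter(f => !(f in event));
  return missing.map(f => 'missing required field: ' + f);
}

const produced = [];  // stand-in for a Kafka producer
function produce(topic, event) {
  produced.push({ topic, event });
}

function handleEvent(topic, schema, event) {
  const errors = validateEvent(schema, event);
  if (errors.length > 0) {
    return { status: 400, errors };
  }
  produce(topic, event);
  return { status: 201 };
}
```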

Things we could consider after the MVP:

  • Schema review and CI processes:
    • schema evolution rules
    • Auto Avro schema generation
    • Auto Avro java class generation
  • Schema metadata conventions (fields common to all schemas?)
  • Schema listing and discussion UI
    • Integrate with on-wiki schema storage for EventLogging?
    • MediaWiki Extension?

Other ideas

  • Schema lookup service

Event Timeline

ori added a comment.Oct 16 2015, 5:55 PM

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

Could you explain how you arrived at the figure of 50k requests per second, which you project for this service?

GWicke added a comment.EditedOct 16 2015, 6:16 PM

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

Could you explain how you arrived at the figure of 50k requests per second, which you project for this service?

This is @Ottomata's projection for analytics use cases. For core events, throughput should be of a lesser concern as rates will likely be in the low hundreds of messages per second.

  1. Already leverages a (really slick) JSON schema registry

Optionally fetching schemas from a URL isn't that hard really. Example code:

if (/^https?:\/\//.test(schema)) {
  // schema is a URL: fetch it over HTTP(S)
  return preq.get(schema);
} else {
  // otherwise treat it as a local file path
  return readFromFile(schema);
}

This lets us support files for core events, and fetching schemas from meta for EL. Schema validation is a call to a library.

The main reason I listed this as a benefit is that I don't understand why we need to distinguish between classes of events in this way (at the architectural level). Since EL already has an answer for a schema registry, it seemed like an advantage.

However, if we assume that we need an additional class of in-tree schemas, then the inverse is also true: it would be just as trivial to implement reading from the filesystem.

  1. Provides a pluggable, composable, architecture with support for a wide range of readers/writers

How would this be an advantage for the EventBus portion? Many third-party users will actually only want a minimal event bus, and EL doesn't seem to help with this from what I have seen.

For starters, it means that we have alternatives for environments where Kafka is overkill (small third-party installations, dev environments, mw-vagrant, etc). Using, for example, sqlite instead of Kafka is already supported.

There is also a tremendous amount of flexibility here, and even if we assume that we need none of that now, it's impossible to assume we never will. Having the ability to compose arbitrary event stream topologies, from/to a wide variety of sources/sinks, multiplex, and add in-line processing, sounds like a great set of capabilities to base such a project on.

  • schema registry availability

There are more concerns here than just availability (although that's important, too).
Third party users won't necessarily want to give their service access to the internet in order to fetch schemas. We need to provide a way to retrieve a full set of core schemas, and a git repository is an easy way to achieve this.

Third parties could use our schema registry, or use the same extension we do, to host one of their own. Or, (as mentioned elsewhere), we could export snapshots of the relevant schemas via CI to ship alongside the code (this seems safe, as a revision is immutable).

We also need proper code review and versioning for core schemas, and wikis don't really support code review. We could consider storing pointers to schemas (URLs) instead of the actual schemas in git, but this adds complexity without much apparent benefit:

I would say that both versioning and review are well covered here. I get your point that it's not as specialized as code review tooling might be, but wikis are an established means for collaboration.

Workflow with schemas in git:

  1. create a patch with a schema change
  2. code review

Workflow with pointers to schemas (URLs) in git:

  1. save a new schema on meta; note revision id
  2. create a patch with a schema URL change
  3. code review

That doesn't seem too onerous to me.

For performance, it needs to be Good Enough(tm), where Good Enough should be something we can quantify based on factors like latency, throughput, and capacity costs that aren't prohibitively expensive when weighed against other factors (e.g. engineering effort).

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

I always find these things difficult to quantify. There are so many variables. If hypothetically speaking, it only saved us a week, what is that worth? What could we do with another week (lost opportunity costs)?

Also, how do you quantify the value of using a piece of software that other teams are already using? Where you have a wider set of active developers, and more eyes on it? Where ops is already familiar with it?

I don't pretend to know the answers to these.

This comment was removed by ori.
This comment was removed by ori.
mobrovac removed a subscriber: gerritbot.
GWicke added a comment.EditedOct 16 2015, 8:26 PM

For starters, it means that we have alternatives for environments where Kafka is overkill (small third-party installations, dev environments, mw-vagrant, etc). Using, for example, sqlite instead of Kafka is already supported.

As far as I can see, there is no support for using any database as a queue / log in a way that would give us a light-weight alternative to Kafka. I see no support for streaming from a database in EventLogging, and separate tables are created whenever a schema is changed.

So, we'll have to implement this either way. We do have fairly nice async table abstractions for sqlite and cassandra that we could reuse for this in node. Both already implement retention policies. Python has sqlalchemy, which is a pretty nice way to interface with dbs. Retention policies would have to be implemented manually.

Another consideration is that the EventLogging Python code is synchronous, while the node code is async. Efficiently supporting many concurrent streaming clients will likely be difficult using the EL code.

A PR adding remote schema support to the nodejs frontend is now available at https://github.com/wikimedia/restevent/pull/1. This means that we can now choose to use local or remote schemas per-topic in the configuration.

Hey yalls,

I've had requests that we postpone the RFC for this one more week, until Oct 28th. I'd like for one opsen and @ori to be able to attend, and the relevant opsens are all traveling, and Ori can't make this one either.

So, we need to be really careful here. This MVP as yet has zero buy-in from anyone in ops. In addition, both @ori and @Eevans point out that EventLogging already does everything that this MVP encompasses, minus the HTTP service part. Now it is time for me to chime in too, woowee!

Could you explain how you arrived at the figure of 50k requests per second, which you project for this service?

This is just an arbitrary goal, some number we came up with. I'd like to be able to encourage developers to use EventBus for everything they can think of.

We've scaled EventLogging to about 10k / second by using Kafka, but that is only on a single node. EventLogging is horizontally scalable. Need more throughput? Add more partitions and processors.

In addition, the EventLogging processors are doing more than just validating JSON messages. They are parsing the JSON data out of encoded query strings via regexes, wrapping the incoming event data with generic metadata, anonymizing IP addresses using a shared rotating salt key from etcd, sending invalid events off to Kafka as EventErrors, etc. etc.
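As a toy illustration of that kind of processor step (simplified: the real code parses via regexes, anonymizes IPs with a rotating salt from etcd, and routes invalid events to an EventError topic; the field names below are illustrative):

```javascript
// Toy EventLogging-style processor step: extract the JSON event from a
// percent-encoded query string and wrap it with generic metadata.
function processRawLine(rawQueryString) {
  const event = JSON.parse(decodeURIComponent(rawQueryString));
  return {
    event: event,
    recvFrom: 'varnish',        // illustrative metadata fields
    dt: new Date().toISOString()
  };
}
```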

We also need proper code review and versioning for core schemas, and wikis don't really support code review. We could consider storing pointers to schemas (URLs) instead of the actual schemas in git, but this adds complexity without much apparent benefit:

I think this is true, especially for the 'production' use case of EventBus. EventLogging was originally designed for analytics use cases, some of which are short-lived one-offs (A/B testing, whatever). Making quick changes via a wiki is awesome for this. Having more control over changes to production schemas sounds like a good idea.

However, if we assume that we need an additional class of in-tree schemas, then the inverse is also true; It would be just as trivial to implement reading from the filesystem.

Agreed, it would be trivial to add filesystem-based schemas to EventLogging. In fact, this is sort of already done, via the cached schema system. Schemas needed for unit testing are hardcoded into the source and manually inserted into the in-memory schema cache. We could do the same thing with a filesystem tree of schemas: preload them into the in-memory cache. When asked to validate against a schema from the filesystem, EventLogging wouldn't even bother trying to reach out to meta, since it would already be in the in-memory cache. See: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/server/eventlogging/schema.py#L64
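In outline, that preloading idea looks something like this (a node sketch for brevity; the real implementation is the Python code linked above, and the names here are illustrative):

```javascript
// Sketch of the preloaded schema cache: file-based schemas are inserted
// into the in-memory cache at startup, so lookups for them never reach
// out to meta. Only a cache miss would trigger a remote fetch.
const schemaCache = new Map();

function preloadLocalSchemas(schemas) {
  for (const [name, schema] of Object.entries(schemas)) {
    schemaCache.set(name, schema);
  }
}

function getSchema(name) {
  if (schemaCache.has(name)) {
    return schemaCache.get(name);  // cache hit: no network round trip
  }
  // Cache miss: a real implementation would fetch from meta here.
  throw new Error('would fetch ' + name + ' from meta');
}
```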

Provides a pluggable, composable, architecture with support for a wide range of readers/writers

How would this be an advantage for the EventBus portion? Many third-party users will actually only want a minimal event bus, and EL doesn't seem to help with this from what I have seen.

It does, no? EventLogging is already a usable extension in third-party MediaWiki installations. Kafka isn't needed to use EventLogging at all.

See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweighs the cost of additional hardware.

As mentioned here, here, here, here, and here, comparing the performance now is interesting, but provides little insight as to how this system will perform in the real world with more features. EventLogging is doing more than just accepting a JSON message and validating it against a schema. In any case, this system will need to be horizontally scalable. As noted, the production use case will be much lower volume than the analytics one. The performance of all the solutions we evaluated in T88459 is suitable for the production use case, especially since they are all horizontally scalable.

JanZerebecki moved this task from incoming to hold on the Wikidata board.Oct 21 2015, 12:48 PM
GWicke added a comment.EditedOct 21 2015, 10:05 PM

We are having a hangout meeting tomorrow (Thursday, 22nd) between 11am and 12pm SF time. Please let us know if you'd like to join.

Task: T116247: Define edit related events for change propagation

Agenda:

The EventBus MVP [1] is moving along and we can now validate and enqueue messages [2]. The next step is to define the shape of the schemas against which the messages ought to be validated. Aaron's event definitions seem to be a great starting point for the discussion.

[1] https://phabricator.wikimedia.org/T114443
[2] https://github.com/wikimedia/restevent
[3] https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_public_event_datasource

@GWicke I would be interested to participate. I'll be in the office, could you add me to the invite?

@GWicke I would be interested to participate. I'll be in the office, could you add me to the invite?

Done.

brion added a subscriber: brion.Oct 28 2015, 8:29 PM

Today's EventBus RFC discussion ended with the general consensus that we will implement this project in EventLogging.

This is a large project that is meant to be useful to many teams. Whatever the final implementation, there will need to be a path forward for deprecating existing use cases (EventLogging, RCStream, etc.) in favor of this system.

Given this project's cross-team generality, the need to port existing use cases, and barring any practical or technical reasons not to, we will adapt EventLogging to include an HTTP REST Service.

We may need to conform some of our schema designs to EventLogging, and there may be other unknowns that we will discover as we work on the implementation.

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
GWicke added a comment.EditedNov 3 2015, 2:25 AM

@Ottomata: In my recollection of the discussion & the log you linked to, the question of which REST producer proxy to use was left open. Our priority is to get basic events into Kafka before the end of this month, so that we can start building on top of this for change propagation. We still haven't finalized the event definitions & still need to tackle the MediaWiki integration, so there isn't really a lot of time left. We have a simple node service that does what we need & integrates with our node infrastructure, but if you have something based on EventLogging soon then we can consider using that too. Let's just make sure that the APIs are compatible & make sense in the longer term.

FWIW, one does not exclude the other: the EL-based service can be used in production, while the node-based REST proxy may be used for development and/or small installs.

In my recollection of the discussion & the log you linked to, the question of which REST producer proxy to use was left open.

I think you may be referring to the first link of the meetbot notes, which ended before we stopped discussing. Starting at [19:17:26] <robla> in the chat logs, it seems clear to me that the consensus is that unless there are good reasons to ditch something that already does most of what this project is about, then we should adapt what we are already using. If I'm mistaken, please correct me.

if you have something based on EventLogging soon then we can consider using that too. Let's just make sure that the APIs are compatible & make sense in the longer term.

Getting closer here, need some help on the API big time. Will also need to revisit some meta schema design things over on T116247 to make things easier for EventLogging.

Our priority is to get basic events into Kafka before the end of this month, so that we can start building on top of this for change propagation

@GWicke, I think this may be a problem. From my perspective, the goal of this project is a generalized event service with well designed and standardized schemas for all of WMF. For this MVP, we have chosen to model change events because that is what you are interested in. This is an 'MVP', and will likely require iteration after the first deployment. I don't think having a live services production goal based on this is realistic.

FWIW, one does not exclude the other: the EL-based service can be used in production, while the node-based REST proxy may be used for development and/or small installs.

That is one of the reasons for sticking with EventLogging. It is already useable by small installs without Kafka.

Ok, still various TODOs around the code, but this is ready for review.
https://gerrit.wikimedia.org/r/#/c/235671

There are concepts that it'll be good to do close review with folks familiar with EventLogging (probably @Nuria or @Milimetric), and other that I'd really like services folks to look at @mobrovac and/or @Eevans?

@GWicke, I think this may be a problem. From my perspective, the goal of this project is a generalized event service with well designed and standardized schemas for all of WMF. For this MVP, we have chosen to model change events because that is what you are interested in. This is an 'MVP', and will likely require iteration after the first deployment. I don't think having a live services production goal based on this is realistic.

I don't see these two as being mutually-exclusive. In order to meet the end goal of a generalised event service we are starting with the Services' use case. The MVP is part of one of our quarterly goals. We have almost finalised the events and almost settled on the hardware, so from our point of view we are ready to start building our change propagation system which relies on the basic edit events.

And, as you state, the MVP is going to be a first stab at this (a prototype of sorts) which will be improved upon as other services/systems are ported to it. Which points make you think having an MVP up and running this quarter is unrealistic?

Nuria added a comment.Nov 4 2015, 4:20 PM

I don't see these two as being mutually-exclusive. In order to meet the end goal of a generalised event service we are starting with the Services' use case. The MVP is part of one of our quarterly goals. We have almost finalised the events and almost settled on the hardware, so from our point of view we are ready to start building our change propagation system which relies on the basic edit events.

I sure hope we are not thinking of having a node REST endpoint and another one based on EventLogging at the same time. More than for technical reasons, it really makes me think that we cannot collaborate. After our IRC meeting last Friday it was clear that the majority of attendees favor a solution based on EL, and I was under the impression this is what we were going for.

I sure hope we are not thinking of having a node REST endpoint and another one based on EventLogging at the same time. More than for technical reasons, it really makes me think that we cannot collaborate.

I was talking about the consumer side of things in my previous post, not about the producer side. Our QG is about the whole pipeline: (a) producer; (b) kafka cluster; and (c) change propagation system as a consumer. The way I see it, we have discussed / worked on (a) and (b), so now we'd like to get started on (c) in order to meet our goal.

After our irc meeting of last Friday It was clear that the majority of attendees favor a solution based on EL and i was under the impression this is what we were going for.

As for the python vs node REST proxy discussion, they are supposed to be functionally equal, i.e. they should be interchangeable. Given that the node proxy is ready to use, I don't see harm in using it, allowing us to complete this quarter's goal.

That makes me think - how would the python service be deployed? Does that need some extra (puppet) work? In the case of the node proxy, that's a matter of writing 5 lines in ops/puppet.

Joe added a comment.Nov 4 2015, 4:43 PM

@mobrovac so let me get this straight, we discussed something that was already overridden by an existing implementation?

As far as deploying the python app, who is working on it? I think I can help with deployment/development of the server glue.

@mobrovac so let me get this straight, we discussed something that was already overridden by an existing implementation?

That's right. There's a node implementation which is ready to be used, and there's a WIP python effort. They both strive to fulfil the MVP's requirements from this task's description. The former was created by the Services team as a quick-start solution (not implying this entails lesser quality). The latter is envisioned to be part of the EL codebase, even though, from what I can tell, it does not interact with EL directly (please correct me if I'm wrong).

So, yeah, we have two things doing the same thing because of the open question: should we reuse (parts of) EL for the event bus system?

faidon added a comment.Nov 4 2015, 5:38 PM

We have a simple node service that does what we need & integrates with our node infrastructure, but if you have something based on EventLogging soon then we can consider using that too.

So either someone else should make it for you (soon) or you'll just use your own thing? No, it doesn't work like that. The entire point of the RFC meeting was so that we could all agree to what we want out of this and find acceptable, make our compromises and make a decision about the direction that we'll go forward to.

This happened, and we first and foremost widely agreed of this being aimed as a single product that will "unify the set of partial and divergent implementations that currently exist". I don't think you've proposed (yet?) a plan for replacing EventLogging for all of its existing use cases — if you do so, we can have that conversation based on those merits. Until then, I don't see why we are even discussing this "node implementation".

Or in other words: the flip side of what you wrote is "we have a complicated piece of infrastructure that has been worked on for years, is battle tested and is actively used for a number of different use cases already — but if you can make restevent reach feature parity with that system soon then we can consider using that too".

So either someone else should make it for you (soon) or you'll just use your own thing? No, it doesn't work like that. The entire point of the RFC meeting was so that we could all agree to what we want out of this and find acceptable, make our compromises and make a decision about the direction that we'll go forward to.
This happened, and we first and foremost widely agreed of this being aimed as a single product that will "unify the set of partial and divergent implementations that currently exist". I don't think you've proposed (yet?) a plan for replacing EventLogging for all of its existing use cases — if you do so, we can have that conversation based on those merits. Until then, I don't see why we are even discussing this "node implementation".
Or in other words: the flip side of what you wrote is "we have a complicated piece of infrastructure that has been worked on for years, is battle tested and is actively used for a number of different use cases already — but if you can make restevent reach feature parity with that system soon then we can consider using that too".

Soooo, I think there's a mix of short-term needs and long-term requirements which do not go hand in hand and we seem to be juggling mostly around them.

Here's the deal the way I see it. Yes, sure, +1k for:

we first and foremost widely agreed of this being aimed as a single product that will "unify the set of partial and divergent implementations that currently exist"

That's the long-term plan. As we agreed in the meeting, not everything can be converted now or soon. What we (=== Services team) have committed to doing this quarter is creating the change propagation system which aims at replacing the (hacky) RestbaseUpdateJobs extension. And that is only a first use case that is to be based on the EventBus MVP outlined in this task. Since the node REST proxy is ready to use, we feel we should use that in the interim so that we can continue work on our goal. To be explicit: I'm not saying we're dismissing the RFC discussions and don't want to collaborate with others. Our ultimate goal is exactly what you described - a unified event bus system for the whole organisation - and only an org-wide consensus will bring us home. But we have to make (small-ish) compromises in the short term in order to meet our QG.

GWicke added a comment.EditedNov 5 2015, 2:23 AM

@faidon: Until very recently (last days), there wasn't actually an EventBus-like REST proxy with schema validation in the EventLogging repository. @Ottomata now has a patch implementing such a service, and @mobrovac has left comments on it today. So, it looks like we'll have the option of choosing between two new services implementing the same API. I don't see having two implementations of a simple service as a bad thing. As mentioned, we might want to use a single node process exposing parsoid, restbase & eventbus for small (third party) installs, but might as well use the new EventLogging service in production.

There are still loose ends to be tied up in the API and event schema definitions, and I think that should be our focus. The implementation deserves attention too, but it's easy to swap, and each is only a few hundred lines.

Replacing all of EventLogging is pretty much out of scope for EventBus. The focus is on queuing and event validation, and not on other EventLogging features like Varnish log decoding, analytics databases etc. If desired, we could fairly easily add HTTP event production in EL, which would write to EventBus instead of directly to Kafka. However, I personally think it's fine to let trusted producers write directly to Kafka, especially for internal applications. The current EL instance is producing to a separate (analytics) Kafka cluster in any case, so there is no potential for conflicts with non-analytics use cases.

Until very recently (last days), there wasn't actually an EventBus-like REST proxy with schema validation in the EventLogging repository.

Not quite true, this was started Sept 3.
https://phabricator.wikimedia.org/T88459#1601022

faidon added a comment.Nov 5 2015, 4:20 PM

More importantly, I don't understand why this is something Andrew has to do (and "soon") and not the services team "or else".

Why is it a given that the Services team is going to exclusively work on their choice of tech and, if consolidation is required, someone else must adapt their world to yours (and make it "a joint effort") to achieve that?

Nuria added a comment.Nov 5 2015, 4:27 PM

As mentioned, we might want to use a single node process exposing parsoid, restbase & eventbus for small (third party) installs, but might as well use the new EventLogging service in production.

To date we do not have a small third-party install use case, but rather an internal production one (the edit stream), so let's focus on that, and thus on adapting the EL change.

I don't see having two implementations of a simple service as a bad thing.

I certainly disagree. I could see how we could have (for testing) a mock node REST endpoint service for, for example, a vagrant role, but I cannot see two full-fledged systems doing the same thing as a positive outcome; rather, it signals to me "duplication of effort".

What we (=== Services team) have committed on doing this quarter is creating the change propagation system which aims at replacing the (hacky) RestbaseUpdateJobs extension. A

A quarterly goal is not good in itself; it is a means to provide value to the organization. In this case duplicating efforts is producing technical debt and a lot of friction.

Hi all, I talked to @GWicke a little bit more about this last Thursday. He impressed upon me a couple of good points I hadn't fully taken in before, and I want to recognize them.

Simplicity - Services is concerned that the REST service needs to be very reliable and not buggy. restevent is simple, and if it never needs to do anything beyond this MVP, it will not need many changes or deploys throughout its lifetime. Conversely, EventLogging is a codebase that is often worked on and improved. Using this established and more featureful codebase provides a lot of benefit, but brings with it risk of instability due to changes.

EventLogging deprecation - A big concern expressed at the RFC was the proliferation of systems and the effort needed to port old systems over to EventBus if we were to use restevent. EventBus is about 2 things: standardizing WMF events, and getting valid events into a pub/sub for many consumers to use. The scope of this MVP does not include consumption of events, and much of the EventLogging codebase is about consuming, not producing. Using restevent does not mean that EventLogging will be deprecated. EventLogging does much more than restevent ever would. There will be other systems that will need to be ported to EventBus, but this is true independent of the EventLogging vs restevent discussion.

That said, I still think we should move forward with EventLogging as we have been and as the general consensus in the RFC indicated. I have discussed these points with some folks, and even though they are valid concerns, I don't think that they outweigh the pros of using and improving an established working system. The risk of instability due to active development of EventLogging can be addressed with common release management practices. I.e. we can version well and deploy only stable releases to the HTTP service. And even though we wouldn't need to port EventLogging to use restevent, there is duplication of effort here, as EventLogging is built to do most of what this MVP is about.

Also, I'd like to note that building an HTTP produce service that fits this MVP in EventLogging is more work than is in restevent. This is mostly due to the schema constraints (i.e. EventCapsule) that EventLogging was originally built with. The work we are doing to make EventLogging work with more generic metadata is valuable beyond just this MVP, so I think it is worth it.

To be explicit: I'm not saying we're dismissing the RFC discussions and don't want to collaborate with others. Our ultimate goal is exactly what you described - a unified event bus system for the whole organisation - and only an org-wide consensus will bring us home. But we have to make (small-ish) compromises in the short term in order to meet our QG.

I think this is one of the sources of conflict. I originally proposed a very generic solution to an org-wide problem, and Analytics has committed to work on the initial infrastructure MVP that solves the generic problem this quarter. Services wants to use the solution that solves the org-wide problem, but has additionally committed to a goal that depends on the generic solution. Services wants something that works for them now, and will make it work for others later. Analytics is interested in the more generic problems first.

The Services goal plus the fact that EventLogging is more work is worrisome for Services. I believe that we can get EventLogging ready in time to meet Services' needs, but I'm not excited about promising it. As a back-up, and with Ops' go-ahead, I think it would be OK to use restevent as a stand-in until EventLogging is ready, especially for the month of December, so that Services can continue to work on their goals.

@GWicke, @mobrovac, @Eevans, and @Nuria, I propose we set up a twice-a-week-EventBus-standup-sync-up-party-meeting to help us better collaborate and stay in sync. How about Mondays and Thursdays?

Hi all, I talked to @GWicke a little bit more about this last Thursday. He impressed upon me a couple of good points I hadn't fully taken in before, and I want to recognize them.
Simplicity - Services is concerned that the REST service needs to be very reliable and not buggy. restevent is simple, and if it never needs to do anything beyond this MVP, it will not need many changes or deploys throughout its lifetime. Conversely, EventLogging is a codebase that is often worked on and improved. Using this established and more featureful codebase provides a lot of benefit, but brings with it risk of instability due to changes.
EventLogging deprecation - A big concern expressed at the RFC was the proliferation of systems and the effort needed to port old systems over to EventBus if we were to use restevent. EventBus is about two things: standardizing WMF events, and getting valid events into a pub/sub for many consumers to use. The scope of this MVP does not include consumption of events, and much of the EventLogging codebase is about consuming, not producing. Using restevent does not mean that EventLogging will be deprecated; EventLogging does much more than restevent ever would. There will be other systems that need to be ported to EventBus, but this is true independent of the EventLogging vs. restevent discussion.

I'd also add to this list the out-of-the-box support for:

  • worker monitoring / automatic restarting
  • easy configuration
  • logging and metrics support
  • easy and quick deployment in production

I think this is one of the sources of conflict. I originally proposed a very generic solution to an org-wide problem, and Analytics has committed to building the initial infrastructure MVP that solves that generic problem this quarter. Services wants to use that solution too, but has additionally committed to a goal that depends on it. Services wants something that works for them now, and will make it work for others later; Analytics is interested in solving the more generic problem first.

We are too, but we need the change-propagation system not only because it's our QG, but also because it allows us to continue our work in the services segment (most notably, pre-generation for back-end services).

@GWicke, @mobrovac, @Eevans, and @Nuria, I propose we set up a twice-a-week-EventBus-standup-sync-up-party-meeting to help us better collaborate and stay in sync. How about Mondays and Thursdays?

Having weekly meetings seems like a good idea. How about we start once per week and take it from there?

Sounds good. Shall I just find a time and set one up?

Just made a calendar event for Tuesday at 10:30 PST. Happy to move it if some other time is better.

Deskana moved this task from Needs triage to Tracking on the Discovery board.Dec 3 2015, 7:20 PM

Will the MVP include being publicly accessible, i.e. anyone on the Internet can run a consumer?

Will the MVP include being publicly accessible, i.e. anyone on the Internet can run a consumer?

I suspect not, although I do hope the architecture is designed in such a way that websocket proxies (or even straight Kafka proxies, with authentication!) are easy to set up :)

No, consumption is not part of the MVP.

There may be future work to make consumption from Kafka via websockets easy to set up, but we will not make any events public by default. We will have to set up dedicated endpoints for approved event streams.
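The "approved event streams" idea above could be as simple as an allowlist that any future public relay consults before exposing a Kafka topic. A hypothetical sketch (no such endpoint or stream names exist in the MVP):

```python
# Purely illustrative: gate public consumption behind an explicit allowlist.
# Stream names here are hypothetical examples, not real configuration.
APPROVED_STREAMS = {"mediawiki.recentchange"}

def events_for_public(stream, kafka_messages):
    """Yield messages only for streams explicitly approved for public access."""
    if stream not in APPROVED_STREAMS:
        raise PermissionError("stream %r is not public" % stream)
    yield from kafka_messages

# An approved stream passes messages through unchanged.
msgs = list(events_for_public("mediawiki.recentchange", [b'{"type":"edit"}']))
assert msgs == [b'{"type":"edit"}']
```

The default-deny shape matters here: a stream is private unless someone deliberately adds it to the allowlist.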

Ottomata moved this task from Backlog to In Progress on the EventBus board.Feb 1 2016, 4:58 PM
Milimetric moved this task from Analytics Query Service to Radar on the Analytics board.

We will resolve this after T120212 is closed, and after we have the first consumer (change propagation) in production.

@mobrovac - I'm confused, why don't you think T120212 is a blocker for this?

@mobrovac - I'm confused, why don't you think T120212 is a blocker for this?

It is, but it's an indirect one: it is blocking T116786: Integrate eventbus-based event production into MediaWiki, which is a blocker for this task and whose resolution depends solely on T120212.

I realize that the blocking relationship is transitive, but given Otto's comment (T114443#2072426), it seems clearer to make the blocking relationship explicit rather than obscuring it in a hierarchy. Would you mind if I put T120212 as a direct blocker for this task?

I realize that the blocking relationship is transitive, but given Otto's comment (T114443#2072426), it seems clearer to make the blocking relationship explicit rather than obscuring it in a hierarchy. Would you mind if I put T120212 as a direct blocker for this task?

Nope. {{done}}

mobrovac closed this task as Resolved.Apr 27 2016, 7:12 PM

And we're done here!

Krinkle edited projects, added TechCom-RFC (TechCom-Approved); removed TechCom-RFC.
Krinkle moved this task from Untriaged to Implemented on the TechCom-RFC (TechCom-Approved) board.