
Modern Event Platform: Stream Intake Service
Closed, Resolved · Public · 0 Estimated Story Points

Description

As part of Modern Event Platform, we plan to build a scalable event stream intake system that will allow us to accept validated events from both internal and external clients. The working name of this component will be Stream Intake Service (subject to change).

This system needs to be reliable and horizontally scalable. It should accept events via HTTP POST, both synchronously (returning validation responses, etc.) and asynchronously. This component also includes any client libraries we might build that interact with the Stream Intake Service.
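
As a rough illustration of the response-checked path, a client could POST a JSON event and inspect the result; the endpoint, payload, and response shape below are placeholders (a sketch, not a committed API):

```typescript
// Hypothetical client call: endpoint path and response fields are placeholders.
interface IntakeResponse {
  accepted: boolean;
  errors?: string[];
}

async function produceEvent(event: object): Promise<IntakeResponse> {
  const res = await fetch('https://intake.example.org/v1/events', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event),
  });
  // A non-2xx status would indicate the event was rejected, e.g. failed schema validation.
  return (await res.json()) as IntakeResponse;
}
```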

See also T225237: Better Use of Data

User Stories [draft]

  • As an engineer, I want to produce valid events from internal networks so production features can use event systems.
  • As an engineer, I want to produce valid events from external clients so we can do comprehensive analysis based on events.
  • As an engineer, I want to produce events and get an appropriate response if my event is produced so I can build production backend features.
  • As an engineer, I want to produce events without a response if my event is produced so I can collect "fire and forget" analytics data.
  • As an engineer, I want to batch produce many events at once so mobile apps can produce events after an offline period.
  • As an engineer, I want good client libraries to produce events so that I don't have to write them myself.
  • As an engineer, I want client libraries that can query for sampling settings and other schema topic usage metadata.
  • As an engineer/analyst, I want to be able to trigger events in a developer mode, regardless of sampling settings, so I can more easily debug problems.
  • As an engineer, I want to restrict usages of my event topic via authentication and authorization so that I don't get garbage data from internet trolls.
  • As an analyst, I want alarms for implausible value combinations (e.g. if field A has value X, then field B has to be > 200), though this might not be part of event intake; more likely it is stream processing.

Event Timeline

Ottomata triaged this task as Medium priority. Aug 2 2018, 6:37 PM
Ottomata created this task.
  • As an engineer, I want to be able to guarantee production of the event and be able to retry until the event is indeed produced T120242

As an engineer, I want good client libraries to produce events so that I don't have to write them myself.

Could this be extended to include client-side validation of events as an explicit requirement or would you prefer a separate AC?

As an engineer, I want good client libraries to produce events so that I don't have to write them myself.

Could this be extended to include client-side validation of events as an explicit requirement or would you prefer a separate AC?

Could you elaborate on that? There would be validation upon event message intake (i.e. that they conform to a particular schema). Are you referring to extra validation on the client that could fall into the category of client-side event filtering?

More generally, I don't think client-side validation should be a requirement because the platform must ensure events are valid if they are in the system.

The EventLogging JavaScript (optionally?) does some client-side validation. If this fails, the event is not sent to the server, and the client has more immediate feedback. This could be supported for the new system too.
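
For concreteness, here is a minimal sketch of what optional client-side validation could look like in a client library, using Ajv purely as an example validator and an invented schema; the real EventLogging code differs:

```typescript
import Ajv from 'ajv';

const ajv = new Ajv();

// Invented example schema, for illustration only.
const searchEventSchema = {
  type: 'object',
  properties: { query: { type: 'string' } },
  required: ['query'],
};
const validate = ajv.compile(searchEventSchema);

function sendIfValid(event: object): void {
  if (!validate(event)) {
    // Give the developer immediate feedback instead of sending a known-bad event;
    // the server still performs authoritative validation on whatever it receives.
    console.warn('Event failed client-side validation:', validate.errors);
    return;
  }
  navigator.sendBeacon('/beacon/event', JSON.stringify(event));
}
```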

Could you elaborate on that? There would be validation upon event message intake (i.e. that they conform to a particular schema).

As @Ottomata said in T201068#4481206, there is some client-side validation in place already for a subset of JSON Schema draft 3 (see T182094).

More generally, I don't think client-side validation should be a requirement because the platform must ensure events are valid if they are in the system.

I think that otherwise well-behaved clients sending a lot of invalid data due to a bug could be problematic as well as a waste of resources. That being said, client-side validation errors – at least the global rate? – would need to be visible to developers.

The client must have some knowledge of the schema registry in order to "query for sampling settings and other schema topic usage metadata". Since this is the case, it seems reasonable to extend that knowledge to be able to fetch the schema itself. Indeed, the choice of a platform-agnostic serialisation format (especially a text-based one) seems to support this by making it cheap to share validation rules between different parts of the platform.

I do agree, however, that it is ultimately the responsibility of the platform to validate events sent by any client. Perhaps this is actually a discussion for T201063: Modern Event Platform: Schema Repositories?
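
As a sketch of the idea that a client which already fetches stream metadata (sampling settings, etc.) could fetch the schema from the same registry; all URLs and response shapes here are hypothetical:

```typescript
// Hypothetical registry endpoints; not an actual schema registry API.
async function getStreamConfig(stream: string): Promise<{ samplingRate: number; schemaUri: string }> {
  const res = await fetch(`https://registry.example.org/streams/${stream}/config`);
  return res.json();
}

async function getSchema(schemaUri: string): Promise<object> {
  const res = await fetch(`https://registry.example.org/schemas/${encodeURIComponent(schemaUri)}`);
  return res.json();
}
```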

IMHO, relying on client libraries for validation is not really an option if we want to ensure the well-functioning of the platform, given its stated openness. In EventBus we currently have server-side validation which is an aspect that I think we should keep (whether in the current form or a different one).

Yaya, I don't think anyone wants to get rid of server side validation; that will always be present. The suggestion is to optionally also have the client libraries support client side validation, before the event is sent to the server.

Implementing a new event in the current Event Platform system made me think about a fairly random idea. We have a set of well-defined interfaces within MediaWiki, like RevisionRecord, User, etc. Most of the events we will be "intaking" directly from MediaWiki will eventually have these MW interfaces recorded - some events will include info about the revision, some about the user, some about a particular revision slot, etc.

What if we think of a standardized way to convert MW PHP constructs (or MW database schema constructs) into JSON? The easiest I can imagine is to make first-class MW interfaces JsonSerializable and make sure they all serialize into JSON conformant to the schemas this system would use.

A less drastic (towards MW) approach could be to create an MWJsonSerializer (or even MWDBLogSerializer) abstraction that would know how to print all the MW primary abstractions into a JSON representation following the schemas we define within the system, and commit to ONLY using these whenever we're incorporating information about something. In some wildest dreams, those serializers could even be auto-generated from the schema if we annotate it with some custom annotations (kinda like Jackson in the Java world) or accompany a schema with XSLT.

I might be overengineering this here, but we're shooting for the stars, right? :) For a more practical example of this idea, the current createRevisionRecordAttrs method in EventBus is the simplest approach to this, but we should think about using this approach for EVERY MediaWiki interface and only use these with absolutely minimal coding on top of it.

I like this idea, with the exception of making these classes JSON-serialisable. These objects may (and probably will) deliver more information than needed for our events, so we are really looking for a subset here, i.e. we should require they be EventBus-serialisable. Given this year's Platform Evolution programme's aim at rethinking interfaces inside MW, this can be part of that.

Maybe instead of JSON-serializable, they could be array (object/dict) serializable? Or have a toArray function? Then we could just JSON serialize the simple PHP array.
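
Sketching the "serialize only a schema-relevant subset" idea (in TypeScript purely for illustration; the real implementation would be PHP inside MediaWiki, and all names here are invented):

```typescript
// Invented interface: each first-class MW object exposes a plain-object view that
// conforms to the event schema, instead of serializing everything it knows about.
interface EventSerializable {
  toEventAttrs(): Record<string, unknown>;
}

class RevisionRecordAttrs implements EventSerializable {
  constructor(private id: number, private pageId: number, private comment: string) {}

  toEventAttrs(): Record<string, unknown> {
    // Only the schema-relevant subset is emitted; the JSON encoder then works on
    // this plain object rather than on the full domain object.
    return { rev_id: this.id, page_id: this.pageId, comment: this.comment };
  }
}
```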

I second @mobrovac's concerns. I think that mixing entities (user) with events (revision-create) is likely to give us event schemas in which it is very hard to find information (which is what happens now with the MW DB schema). I caution against importing MW entities directly, or attaching events like "user-changed-name" (invented) to the user object. Entities and state changes can both of course be represented in JSON, but an entity (user) ideally should be an immutable object and no state changes should be attached to the entity representation.

As an engineer, I want to be able to guarantee production of the event and be able to retry until the event is indeed produced T120242

@Pchelolo I'm not sure if we should include this one. Retrying at the EventBusish* level means there would need to be some kind of local queue and storage managed by EventBusish. We can and should tune the EventBusish Kafka client for maximum reliability, but if the Kafka cluster rejects the message, I don't want EventBusish to have to manage the retries (right?). I think T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth is more about retries for events at the client side and/or atomic event production via MySQL binlogs or some such thing.

(* using 'EventBusish' here because the new system may or may not be EventBus service)

Client-side validation feels like shipping test / debug code to production. It only prevents programmer errors. In my opinion, it would be ideal if validation were disabled by default and could be enabled with a query parameter.

Client-side validation feels like shipping test / debug code to production.

I disagree; it is a nice-to-have, because no developer can anticipate all the data the schema will handle. The developer deals with basic correctness, but when code that harvests data gets executed by hundreds of millions of clients, it is quite likely that errors come from the variety of data users report (programming errors too, of course). Any service will need to validate server-side, but validating early and often is of use.

I just wanted to clarify that I find client-side validation very useful and I hope it sticks around. My proposal is only to make it optional by query param and disabled by default in production, not to remove it.

It seems to me that once code is verified to be working properly and committed, we should generally assume it will continue to work correctly and send events assuming they will be recorded successfully on the server. The alternative is that all users must download extra validation code that only reports failures on the client that are invisible to the user except in the rare case that they have a developer console open. Even if we reported these client-side errors to the server, it would be redundant, because we already have a way to check for those validation errors if we send the events and let the server do its usual validation.

My reasoning is that the tradeoff of requiring debug mode to be enabled during development in order to see validation errors client-side is worth the data savings for users. Similarly, it's easier to debug unminified JavaScript, but the data savings of minified (if harder to read) JavaScript make it worthwhile to make minification the default production behavior and unminified code an opt-in query parameter.
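
A small sketch of that proposal, with the query parameter name and the validation helper invented for illustration:

```typescript
// validateAgainstSchema is a stand-in for whatever schema validation the library uses.
declare function validateAgainstSchema(event: object): boolean;

// Validation only runs when explicitly enabled, e.g. via ?debugEvents=1 (name invented).
const debugValidation = new URLSearchParams(window.location.search).has('debugEvents');

function logEvent(event: object): void {
  if (debugValidation && !validateAgainstSchema(event)) {
    console.warn('Event failed client-side validation; not sending');
    return;
  }
  navigator.sendBeacon('/beacon/event', JSON.stringify(event));
}
```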

I just wanted to clarify that I find client-side validation very useful and I hope it sticks around. My proposal is only to make it optional by query param and disabled by default in production, not to remove it.

Ah, got it, this work is happening right now here so your wish is granted: https://phabricator.wikimedia.org/T187207

I'm not sure if it needs to be written down explicitly, but one thing that product engineers will do when adding instrumentation (obviously) is test and QA event logging, validation, sampling, etc. locally in their MediaWiki and mediawiki-vagrant installs. QA/analysts/engineers sometimes also perform these tests in either the beta cluster or some cloud labs instance that the teams have set up.

Seems related to:

As an engineer/analyst, I want to be able to trigger events in a developer mode, regardless of sampling settings, so I can more easily debug problems.

But more along the lines of something like:

As an engineer/analyst/QA, I want to be able to trigger and verify events in local development, mw-vagrant labs instances, and staging environments, to verify that instrumentation code works correctly.

Ottomata renamed this task from Modern Event Platform: Scalable Event Intake to Modern Event Platform: Stream Intake Service. Sep 24 2018, 6:10 PM
Ottomata updated the task description. (Show Details)

I want to produce events and get a synchronous response if my event is produced so I can build production backend features.

I'm not entirely sure what this means (also: I find it easier to translate feature requests to technical implementations, instead of SCRUM stories, but it might be me).

Depending on what you mean, this can be a very difficult thing to achieve and allow engineers to follow antipatterns.

More specifically: I don't think there is any case in which acknowledging the writing of an event to the platform should block either the client or the server. I would say that being able to poll the server to know if a message was intaken could work in specific cases, but for producing events from any software serving web traffic, anything other than the "fire and forget" pattern is a potential cause of latencies and even of outages.

As an example, imagine MediaWiki can produce events and wait for acknowledgement. Now imagine the platform is for some reason responding slowly; that would cause latencies and pile up requests on the application servers waiting for their events to be acknowledged. Of course the cure for that would be to set a low timeout on the request on the client side, but that would translate to "fire and forget unless everything works ok", and I'm not sure what the value in that is.

For things that need acknowledgement, we should have a way to check asynchronously that the event we tried to submit has indeed been intaken correctly.

anything other than the "fire and forget" pattern is a potential cause of latencies and even of outages.

Instinctively, I agree, though I have not investigated the details here. But it seems that, unless we are using an event bus that has guaranteed delivery semantics, we can't ever be sure that an event is received by all consumers (or any consumer). Why then should we care whether it has been received by the intake/platform/proxy/gateway component? If it's all just "best effort" anyway, why block?

The only reason I can think of for this is to allow the intake to validate the event, and tell the sender whether it's valid. Is that the reason behind this requirement?

The only reason I can think of for this is to allow the intake to validate the event, and tell the sender whether it's valid. Is that the reason behind this requirement?

Yes. And actually, fire and forget should still be supported. I recently added a WIP architecture diagram to the parent MEP task: https://phabricator.wikimedia.org/T185233 You can see there is still a /beacon endpoint (probably still varnishkafka?) that can be used for fire-and-forget.

For backend (and possibly some frontend too), there are use cases where you want to know if the event made it into Kafka. The EventBus service does this now; it returns an HTTP error if the event is invalid, if there is some server-side error, if the event times out during producing to Kafka, etc.

For things that need acknowledgement, we should have a way to check asynchronously that the event we tried to submit has indeed been intaken correctly.

I believe the response to the Kafka client is returned and checked asynchronously; but in the context of the http request, this does mean that the request needs to stay alive until Kafka receives the message and an HTTP response is returned to the client. It doesn't mean that a kafka.send() call itself will block, but the whole request would have to stay alive until some async ack callback is fired.
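
To make the "request stays alive until the produce is acknowledged" pattern concrete, here is a minimal sketch using express and kafkajs purely as example libraries; this is not the actual service code, and the topic, broker, and port are placeholders:

```typescript
import express from 'express';
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'stream-intake-sketch', brokers: ['kafka:9092'] });
const producer = kafka.producer();

const app = express();
app.use(express.json());

app.post('/v1/events', async (req, res) => {
  try {
    // The HTTP request is held open until Kafka acknowledges the produce...
    await producer.send({
      topic: 'example_stream',
      messages: [{ value: JSON.stringify(req.body) }],
    });
    res.status(201).end();
  } catch (err) {
    // ...and a produce failure or timeout becomes an HTTP error for the client.
    res.status(500).json({ error: String(err) });
  }
});

producer.connect().then(() => app.listen(8080));
```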

As an example, imagine MediaWiki can produce events and wait for acknowledgement.

This is what happens now. Mediawiki will log an error to logstash (or wherever) if the event isn't produced by EventBus to Kafka.

I want to produce events and get a synchronous response if my event is produced so I can build production backend features.

More specifically: I don't think there is any case in which acknowledging the writing of an event to the platform should block either the client or the server

Agreed. That is why results of having sent an event are to be received on a callback if at all. I think the wording of use case needs to be changed to "asynchronous" from "synchronous"

Agreed. That is why results of having sent an event are to be received on a callback if at all. I think the wording of use case needs to be changed to "asynchronous" from "synchronous"

Both use cases are listed:

As an engineer, I want to produce events asynchronously without a response if my event is produced so I collect "fire and forget" analytics data.
As an engineer, I want to produce events and get a synchronous response if my event is produced so I can build production backend features.

Maybe I need to change the wording. These use cases are just calling out the difference between fire-and-forget (where you have no idea if the event was accepted) and fire-and-get-a-response. I don't mean synchronous in the block-everything-send-call sense, so perhaps I should change the wording.

Changed wording to:

  • As an engineer, I want to produce events and get an appropriate response if my event is produced so I can build production backend features.
  • As an engineer, I want to produce events without a response if my event is produced so I can collect "fire and forget" analytics data.

(also: I find it easier to translates feature requests to technical implementations, instead of SCRUM stories, but it might be me)

(btw me too but I am straddling different worlds here!)

For things that need acknowledgement, we should have a way to check asynchronously that the event we tried to submit has indeed been intaken correctly.

I believe the response to the Kafka client is returned and checked asynchronously; but in the context of the http request, this does mean that the request needs to stay alive until Kafka receives the message and an HTTP response is returned to the client. It doesn't mean that a kafka.send() call itself will block, but the whole request would have to stay alive until some async ack callback is fired.

That was my point; a healthy and decoupled workflow would be something like:

1 - client sends an event to the server
2 - the server responds with a request ID immediately (or maybe after doing something like local json validation)
3 - the client stores this request somewhere for use in subsequent USER requests
4 - A new client calls the server asking for the status of the specific request ID
5 - the server responds with the status of that request id

This is the only reliable way to get message acknowledgement without creating a potential tight coupling between the client and the availability/latency of the server. Any other system would need to set steep TCP timeouts on the calls from the server, or risk cascading failures, and I don't think we should offer primitives encouraging such behaviour (call - wait for ack - respond); you won't get eventbus blocked on kafka, but you might get MediaWiki blocked on eventbus.
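
From the client's point of view, the decoupled workflow above might look roughly like this; endpoints, parameters, and response fields are all invented for illustration:

```typescript
// Steps 1-2: submit the event and get a request ID back immediately.
async function submitEvent(event: object): Promise<string> {
  const res = await fetch('/v1/events?async=true', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(event),
  });
  const { requestId } = await res.json();
  return requestId;
}

// Steps 4-5: later (possibly from a different client), poll for the request's status.
async function checkStatus(requestId: string): Promise<'pending' | 'accepted' | 'failed'> {
  const res = await fetch(`/v1/events/status/${requestId}`);
  const { status } = await res.json();
  return status;
}
```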

This is the only reliable way to get message acknowledgement without creating a potential tight coupling between the client and the availability/latency of the server.

An error callback (with a timeout) will also not block MediaWiki in any case, as it is just a promise that either succeeds or fails (or times out), right? Seems a lighter way of achieving the same decoupling, right?
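
A sketch of that "promise with a timeout" shape (endpoint and timeout value are placeholders): the produce either succeeds, fails, or times out, and the caller just attaches handlers and carries on.

```typescript
// Race the produce against a timeout so the caller is never blocked indefinitely.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`no acknowledgement after ${ms}ms`)), ms)
  );
  return Promise.race([promise, timeout]);
}

withTimeout(fetch('/v1/events', { method: 'POST', body: '{}' }), 2000)
  .then((res) => { if (!res.ok) console.warn('event rejected'); })
  .catch((err) => console.warn('event not acknowledged:', err));
```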

As an engineer, I want to batch produce many events at once so mobile apps can produce events after an offline period.

Hey @phuedx @Mholloway @Jhernandez Q for yall. @Pchelolo and I are working on this and thinking about the API. Do you think you'll need to be able to batch produce (in a single HTTP request) events of different schemas and/or events that are destined to different streams (AKA Kafka topics?). E.g. would you need to produce something like POST [{user login event}, {user search event}, {user read event}, {user pref change event}]? Or will it always be more like POST [{user search event}, {user search event}, ...] then another POST [{user read event}, {user read event}, ...] ?

In the iOS app, events of all kinds are queued to be posted in batches at opportunistic times. Ideally we'd be able to POST [{user login event}, {user search event}, {user read event}, {user pref change event}], but we could split into distinct batches by event type if necessary.

but we could split into distinct batches by event type if necessary.

I think we should support intermingled events (schema-wise) so as not to drain users' batteries with many beacon requests.

heh, so it seems our beautiful idea for the API has just crashed into the rocks of reality @Ottomata

K here's our thoughts about this:

In order to support multiple event schemas and streams in the same POST, each event needs to indicate what its schema is, and what its destination stream (topic) is. We currently do this in eventlogging-service-eventbus with a meta.schema_uri and a meta.topic.

We are trying to build a generic open source service that is not schema or (WMF) implementation specific. If we want to support this use case, then the service needs to be somewhat opinionated (configurable at least) about this fact. It needs to require that the POSTed events have this contextual data with them. The fact that the data is stored in the event IS an opinion (a WMF one). It could be made a required convention for this service (that's what I've programmed it to do so far), but it feels wrong. The schema and especially the destination stream are not event data, they are information about the transfer of data. If we required that every POST had the same event data schema and destination, we could provide that information in the POST request params and/or headers, rather than inside the events themselves.

I'm a little waffly here! I agree that supporting multi-event-type POSTs is a really nice feature, and it isn't difficult to actually implement. It's just hard to implement agnostically.

Hm Hm hmmmm

Well, in reality there are two extremes: (i) all messages can go into one topic, i.e. you do need to know the schema in order to validate a particular event; and (ii) events in each topic can have only one schema assigned to them, i.e. the schema is predefined for each topic. Making the convention of having a field in the event itself is a sane and safe assumption as it can cover both of these extremes very easily. A third possibility is that people don't want to validate events at all (if we talk about users outside of WMF), but that could/should be an implementation configuration detail.

Let's see, couldn't you have a "fallback chain" to decide what schema/topic events are directed to? If a header like "destination-stream: EditRevision" is present you can use that (and events will be "clean" of schema info), but if you have a header like "destination-stream: multiple" we expect the topic/schema to be present in the events themselves. The absence of a header could also be an indicator.

This might work as a "convention" in our system, which is simpler (I think) than having explicit "configuration" as to where schema/topic mappings are to be found in the payload.
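
To illustrate the two conventions being weighed (the meta fields follow the eventlogging-service-eventbus example above; the header name and payloads are invented):

```typescript
// (a) Mixed batch: each event carries its own schema/topic metadata, so events with
// different schemas and destination streams can share one POST.
const mixedBatch = [
  { meta: { schema_uri: 'user/login/1', topic: 'user.login' }, user_id: 123 },
  { meta: { schema_uri: 'user/search/1', topic: 'user.search' }, query: 'kafka' },
];
fetch('/v1/events', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(mixedBatch),
});

// (b) Single-stream batch: the destination is declared once in a header and the
// events themselves stay "clean" of transfer metadata.
const searchBatch = [{ query: 'kafka' }, { query: 'event platform' }];
fetch('/v1/events', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'destination-stream': 'user.search' },
  body: JSON.stringify(searchBatch),
});
```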

As an engineer, I want to batch produce many events at once so mobile apps can produce events after an offline period.

Do you think you'll need to be able to batch produce (in a single HTTP request) events of different schemas and/or events that are destined to different streams (AKA Kafka topics?). E.g. would you need to produce something like POST [{user login event}, {user search event}, {user read event}, {user pref change event}]?

We (Readers Web, at least) have accepted the cost/limitation of one HTTP request per event. Even having a service that can take multiple events of one schema is exciting from a performance perspective!

If this endpoint were to be added to your service, the EventLogging Core JavaScript library would need a lot (?) of work to make it capable of batch sending events after some offline period, which doesn't appear to be in scope for this project.

If this endpoint were to be added to your service, the EventLogging Core JavaScript library would need a lot (?) of work to make it capable of batch sending events after some offline period, which doesn't appear to be in scope for this project

I think it is in scope for Modern Event Platform, although it will probably be a while before that work makes it in to active development.

If this endpoint were to be added to your service, the EventLogging Core JavaScript library would need a lot (?) of work to make it capable of batch sending events after some offline period, which doesn't appear to be in scope for this project.

Correct. Now, EventLogging JavaScript events are only one of many event venues we support; there is also eventbus, change prop, mediawiki avro events.... The intake service with the ability to batch events will be used by other clients initially, as soon as next quarter.

Ottomata updated the task description. (Show Details)