T185233: Modern Event Platform (TEC2) describes the high level components that will make up the Modern Event Platform program in FY2018-2019.
T201068: Modern Event Platform: Stream Intake Service is the parent task for Stream Intake service planning and implementation work. It also collects the user stories/requirements for this service that were collected in Q4 of FY2017-2018 during stakeholder interviews.
This RFC outlines the current state of our event intake systems and some possible ways forward. The outcome of this RFC should be to choose an implementation option. This RFC is not intended as a design document.
How things work now.
The original EventLogging system uses Varnish and URL encoded JSON in HTTP query parameters to intake events from external clients. varnishkafka then sends a log of the request to Kafka, which is consumed by EventLogging python processes to decode and validate the events. The events are then produced as JSON strings to Kafka topics for downstream systems.
The EventBus HTTP Proxy service is an internal Tornado HTTP server that listens for POSTs of event JSON body. The EventBus service itself is a part of the EventLogging python codebase. Incoming events are validated against a schema, and then produced to Kafka for downstream systems.
Eventlogging/Architecture Has a description of how the original EventLogging event intake system works.
EventBus has some more information on the evolution of this original idea in EventLogging for production (non-analytics) services.
What we want
More detailed user story/requirements are in T201068: Modern Event Platform: Stream Intake Service.
We want a scalable unified event stream intake system for both analytics and production events. Currently analytics events must be one submitted one at a time asynchronously (with no client feedback) to Varnish as HTTP GET URL encoded query parameters. 'production' events are POSTed to the separate EventBus service. We'd like to unify these systems and add some features.
This stream intake service will retrieve schemas from the Schema Repository described in T201063 to validate events.
I'd really like it if this service was generic and well written enough to be used by others than Wikimedia. There is a need in the Kafka community for an open source JSONSchema based component like this.
EventBus as it is today grew out of discussions in T88459 and T114443. Much of the debate there was around the issue this RFC wants to address: to adapt EventLogging or to write something new. (In hindsight, it would have been difficult to write a NodeJS based service at the time, due to the lack of a functioning Kafka client (fyi node-rdkafka has been working great for us for the last year or so).) I was on the fence between these options, but was swayed by arguments for using an existing system, as were many others. We chose to implement the EventBus service in EventLogging. In doing so, we added code to make EventLogging understand how to use a new non-Mediawiki specific schema type (with no EventCapsule). This was done with the intention of hopefully one day unifying the analytics+Mediawiki schemas into a more generic event-schemas like conventions. If that EventLogging codebase unification work is to be done, this Stream Event Intake component would include that work.
However, I think we should take this moment to re-evaluate our decision to use the EventLogging codebase for event stream intake (especially since we want to expose this endpoint publicly), and consider implementing something more specific to the use cases described in T201068.
Many of the desired features are either already handled by the EventBus service (part of EventLogging python), or could be added to it. However, we should take this moment and use the RFC to decide if we want to do so, or if we should start again with something else, either leveraging service-template-node or adapting something else, like Confluent's Kafka REST Proxy.
This RFC should be used to decide if we want to continue with EventBus (and EventLogging python codebase), or to use something else.
Option 1: Continue to use EventLogging + EventBus service
To do this, I'd include the following work:
- Refactor EventLogging service.py code to support Python 3, latest Tornado, and review the async/coroutine handlers.
- Move topic/config mapping out of event-schemas repository, and make service.py support remote topic configuration (from the Stream Configuration service.
- Refactor some EventLogging internals to decouple the Mediawiki specific parts.
- Refactor some EventLogging internals to decouple the on wiki schema registry
- Solve this Kafka timeout + EventBus bug: T180017: Timeouts on event delivery to EventBus
- (possibly) add client authentication and authorization
Option 2: Write a new service service-template-node
In 2015, our service-template-node based production services were fewer, and we had zero NodeJS Kafka client usages deployed in production. In 2018, we have several more nodejs prod services, are moving towards deploying them in Kubernetes, and have at least 2 NodeJS production services using Kafka. Given this situation, I believe that implementing a new service would not be more work than the changes needed to EventLogging.
Option 3: Adapt Confluent Kafka REST Proxy
I've been experimenting with REST Proxy to see if this would be possible, and I've concluded that it is.
To do this, we'd need to fork Kafka REST Proxy to add the following features:
- Pluggable JSON Schema validation
- This would be a pluggable API to allow JSONSchema lookup from a JSON value (and or topic name) to do validation of that value.
- configurable topic name transformation
- EventBus does some topic name transformation when it produces an event to prefix it with the data center name
- Restricted topic -> schema mapping (possibly via Stream Configuration service)
- Restricted API usage for frontend (no consumer API, /topics only)
- This might be done with a separate front end proxy that only allows certain REST URIs (e.g. POST /topics) through
Stop using EventLogging/EventBus in favor of something else.
Since this RFC's goal is to decide if we should use something other than EventLogging/EventBus, it is not necessary to officially decide between Option 3 or Option 2 here. After exploring the REST Proxy codebase, I currently lean towards adapting it for our needs (Option 3). We may still encounter something that would make us change our mind and go with Option 2 instead (SRE pushback for JVM, bad performance, etc.). REST Proxy is pretty widely used by the Kafka community, and by using and building in JSONSchema support we'd be doing this community a favor. We may or may not be able to upstream our changes to Confluent.