T185233: Modern Event Platform describes the high-level components that will make up the Modern Event Platform program in FY2018-2019.
In order to proceed with further system design RFCs, we need to take some time to decide on the type of event schemas we want to use.
How things work now
(Some of this is also summarized at https://wikitech.wikimedia.org/wiki/EventBus#Background)
EventLogging (for Analytics) has historically used different draft versions of JSONSchema to model events at different points in the event pipeline (see T46809 and T182094). It also requires that all schemas be stored on-wiki, on meta.wikimedia.org. These schemas are automatically wrapped in an [[ https://meta.wikimedia.org/wiki/Schema:EventCapsule | EventCapsule ]], which is MediaWiki specific. Since no event JSONSchema refers to the EventCapsule directly, the Python EventLogging codebase must stitch together the user-submitted event data with the capsule metadata. These schemas are therefore tightly coupled to MediaWiki, meta.wikimedia.org, and our production installation of EventLogging server-side processes.
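To illustrate, here is a minimal sketch (in Python) of the kind of stitching the server side does; the field names approximate the on-wiki EventCapsule schema and the values are illustrative, not exhaustive:

```
# Hypothetical sketch of EventLogging's EventCapsule wrapping.
# Field names approximate the on-wiki EventCapsule schema and are
# illustrative, not exhaustive.
user_event = {"action": "click"}  # data conforming to some on-wiki schema

capsule = {
    "schema": "ExampleSchema",  # on-wiki schema page title (illustrative)
    "revision": 1234,           # on-wiki schema revision (illustrative)
    "wiki": "enwiki",
    "event": user_event,        # user-submitted data nested inside the capsule
}
```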
The EventBus HTTP Proxy service (itself a piece of the Python EventLogging codebase) addresses some of these problems for production events. Instead of using the MediaWiki on-wiki draft 3 JSONSchemas and EventCapsule, it uses a local checkout of the mediawiki/event-schemas repository. These schemas are all self-contained, and went through thorough and extensive design discussions about consistency (AKA bike-shedding :p). Instead of requiring an external service to wrap the event data in a capsule schema, every event contains a common generic meta subobject.
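By contrast, a self-contained event might look like the sketch below; the meta field names follow mediawiki/event-schemas conventions, but the values are illustrative:

```
# Sketch of a self-contained event carrying its own meta subobject.
event = {
    "meta": {
        "topic": "mediawiki.revision-create",
        "id": "a2f34cae-0000-0000-0000-000000000000",  # unique event id
        "dt": "2018-08-01T00:00:00Z",                  # ISO 8601 event time
        "domain": "en.wikipedia.org",
    },
    "database": "enwiki",  # illustrative event-specific fields
    "rev_id": 123456,
}
```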
Motivations
The Modern Event Platform program's main motivations are:
- To unify WMF's event processing infrastructure
- Use best practices and existing open source tools (where possible) rather than writing our own
- Ease of use for engineers
Note that we aren't particularly concerned with performance here, at least not for the remainder of this RFC. (We use JSON for webrequest logs and are doing just fine.)
Decision points
Before making other tech decisions pertaining to Modern Event Platform, we first need to decide on an event schema and serialization technology. Schemas and serialization are not necessarily coupled, but the schema types we might consider all come with their own serialization formats and tooling, so we will consider the schema and serialization choices together.
This decision comes down to choosing between JSONSchema with JSON, a text-based serialization format, and a binary alternative. We will not consider text serializations other than JSON. There are multiple possible binary technologies, including Thrift, Protocol Buffers, and Apache Avro. When comparing these to JSONSchema in a streaming context, many of the pros and cons are the same. There are of course differences, but the ones that matter for Modern Event Platform are around ease of use and integration with existing technologies. Avro is a relatively ubiquitous choice in the event/streaming world, and has built-in support in stream processing platforms and datastore connectors. Given this larger context, we will not consider binary choices other than Avro.
Some detailed pros and cons of each solution are outlined below, but in general:
Avro would allow us to make use of existing open source technologies and strongly typed data, but comes with the pain of dealing with streaming binary data.
JSONSchema would require more integration implementation work now and leave us less confident in our loosely typed data, but JSON is much easier for downstream clients to consume.
Note on internal vs. external usage: we will continue to use JSON for serialization of external events. This includes intake of EventLogging-style events and publishing of events in EventStreams. If we choose a binary schema/serialization, we will need to convert both incoming and outgoing events. While entirely possible, this works against the ease-of-use motivation for Modern Event Platform.
Solution 1: Apache Avro
Avro is a schema specification with binary encoding and strong compatibility constraints.
Pros:
- compact, fast binary data format, as well as a JSON text encoding
- simple integration with dynamic languages
- backwards and forwards schema compatibility and evolution rules (binary Avro written with one schema version can be read with any other compatible version; see the sketch after this list)
- strongly typed, for easy integration with both typed and untyped languages and systems
- The Confluent platform has many open source components available for working with Avro in a streaming context:
  - Confluent Schema Registry
  - Confluent REST Proxy
  - Kafka Connect AvroConverter - allows for easy integration between Avro messages in Kafka and external systems; see also: https://www.confluent.io/product/connectors/
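As a sketch of the compatibility point above (assuming the fastavro Python library; the schemas are illustrative), data written with an older schema version can be read with a newer compatible one:

```
import io

import fastavro

# Writer schema (v1), and a reader schema (v2) that adds a field with a
# default value, a backwards compatible change under Avro's rules.
v1 = fastavro.parse_schema({
    "type": "record", "name": "Click",
    "fields": [{"name": "action", "type": "string"}],
})
v2 = fastavro.parse_schema({
    "type": "record", "name": "Click",
    "fields": [
        {"name": "action", "type": "string"},
        {"name": "button", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, v1, {"action": "save"})
buf.seek(0)
# The reader resolves v1-written bytes against its own v2 schema.
record = fastavro.schemaless_reader(buf, v1, reader_schema=v2)
assert record == {"action": "save", "button": None}
```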
Cons:
- Binary encoding means that a schema is required to read the data; binary data in a streaming context does not carry its schema with it. This requires some way to map a topic or a message to its Avro schema, and every language or tool that wants to read from Kafka needs to know how to do this (see the wire format sketch after this list).
- Avro-JSON encoding is non-standard; e.g. optional fields are encoded as a union type with null, and serialized as
  "optional_field": {"string": "a"}
  instead of
  "optional_field": null
- Event data sent from remote clients (browsers, apps, etc.) is unlikely to be binary Avro, so we'd need to parse Avro-JSON at intake and convert it to binary Avro (Confluent REST Proxy can do this).
- When exposing public events via EventStreams, we'd have to convert from binary Avro back to JSON, as binary messages are not easily consumable by external clients.
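To make the schema mapping con concrete: in the Confluent ecosystem, every Kafka message is framed with a magic byte and a 4-byte schema id that consumers must resolve against the Schema Registry before decoding. A minimal Python sketch (the function is hypothetical; the framing is Confluent's documented wire format):

```
import struct

def split_confluent_message(raw: bytes):
    """Split a Confluent-framed Kafka message into (schema_id, avro_payload)."""
    # Byte 0 is a magic byte (0); bytes 1-4 are a big-endian schema id.
    magic, schema_id = struct.unpack(">bI", raw[:5])
    if magic != 0:
        raise ValueError("not Confluent wire format")
    # A consumer must now fetch the writer schema from the registry,
    # e.g. GET /schemas/ids/{schema_id}, before decoding avro_payload.
    return schema_id, raw[5:]
```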
Solution 2: JSONSchema
Pros:
- We already use it
- 'Human readable' (it is just text)
- everyone knows how to parse JSON
- Don't need a special library just to deserialize a message (no external schema needed)
- mediawiki/event-schemas have had a lot of thought and work put into them
- We can remain partly backwards compatible with the current event systems that use JSON (EventLogging Analytics & EventBus), and more easily migrate usages of those systems to Modern Event Platform
Cons:
- Messages are large (the 'schema' is shipped with every message in the form of string JSON keys for each field)
- JSON parsing is not efficient (lots of JSON text -> binary -> JSON text de/serialization)
- JSON (even with JSONSchema) is 'loosely' typed, e.g. there is no distinction between integers and floats (they are just numbers)
- The full JSONSchema spec is too featureful and extensive for consistent type checking
  - We'd have to use a very limited subset of JSONSchema, e.g. no additionalProperties or patternProperties, etc. (see the validation sketch after this list)
- No existing Kafka Connect integration; we'd have to write this ourselves (PoC here).
- No existing generic schema registry; we'd need to write one ourselves or adapt the meta.wikimedia.org schema repo.
- Whatever schema repo solution we use, we'll likely need to do some custom work, as we also want a way of storing and editing metadata associated with schemas/topics (e.g. retention, sanitization policies, sampling rates, ownership, access ACLs, etc.)
- No schema evolution constraints other than the ones we can impose ourselves; e.g. we only allow additions of optional fields.
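As a sketch of the restricted subset mentioned above (assuming the Python jsonschema library; the schema is illustrative), validation against a deliberately closed schema looks like this:

```
import jsonschema

schema = {
    "type": "object",
    "additionalProperties": False,  # restricted subset: closed objects only
    "required": ["action"],
    "properties": {
        "action": {"type": "string"},
        "count": {"type": "integer"},
    },
}

jsonschema.validate({"action": "save", "count": 1}, schema)  # passes

try:
    jsonschema.validate({"action": "save", "extra": 1}, schema)
except jsonschema.ValidationError:
    pass  # rejected: 'extra' is not an allowed property
```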
Recommendation
Solution 2: JSONSchema
My own opinion has waffled back and forth between these options over the last few years. (I even wrote a blog post that summarizes some of these trade-offs at the beginning.) Given my recent investigations into the Kafka Connect codebase, my attempt at a JsonSchemaConverter PoC, and the interest in JSON / JSONSchema support in Kafka integration tools in the Kafka community, I now believe that JSONSchema is a better fit for WMF than Avro. It will require more engineering work up front to build a comprehensive system on JSONSchema (since one already exists for Avro), but I believe the long-term ease of use of such a system will outweigh the short-term gains of not having to implement anything. I also believe we can build open source integration plugins via Kafka Connect, as well as a generic JSONSchema and topic-metadata repository, in such a way that they will be usable for more than just WMF's production environment.