RFC: Modern Event Platform - Choose Schema Tech
Open, Normal, Public

Description

T185233: Modern Event Platform describes the high level components that will make up the Modern Event Platform program in FY2018-2019.

In order to proceed with further system design RFCs, we need to take some time to decide on the type of event schemas we want to use.

How things work now

(Some of this is also summarized at https://wikitech.wikimedia.org/wiki/EventBus#Background)

EventLogging (for Analytics) has historically used different draft versions of JSONSchema to model events at different points in the event pipeline (see T46809 and T182094). It also requires that all schemas be stored on-wiki in a single format. These schemas are expected to be automatically wrapped in an EventCapsule, which is Mediawiki specific. Since the EventCapsule is not referred to by any event JSONSchema directly, every event schema depends on the python EventLogging codebase to stitch the user-submitted event data together with the capsule metadata. These schemas are tightly coupled to Mediawiki, meta.wikimedia.org, and our production installation of EventLogging server-side processes.
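As an illustration of that stitching step (the capsule field names below are invented for this sketch, not the actual EventCapsule definition), the server-side process combines the user-submitted event with envelope metadata roughly like this:

```python
import uuid
from datetime import datetime, timezone


def encapsulate(schema_name, revision, event):
    """Wrap a raw client event in a capsule-style envelope.

    Illustrative only: the real EventCapsule lives on
    meta.wikimedia.org and is stitched in by the python
    EventLogging codebase, not by client code.
    """
    return {
        'schema': schema_name,        # on-wiki schema title
        'revision': revision,         # on-wiki schema revision
        'uuid': str(uuid.uuid4()),    # unique event id
        'dt': datetime.now(timezone.utc).isoformat(),
        'event': event,               # the user-submitted payload
    }


capsule = encapsulate('Popups', 12345, {'action': 'hover', 'duration_ms': 250})
```

The point of the example is the coupling: nothing in the event schema itself mentions the envelope, so only this server-side code knows how the two fit together.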

The EventBus HTTP Proxy service (itself a piece of the python EventLogging codebase) addresses some of these problems for production events. Instead of using the Mediawiki & EventCapsule on-wiki draft 3 JSONSchemas, it uses a local checkout of the mediawiki/event-schemas repository. These schemas are all self contained, and went through thorough and extensive consistency design discussions (AKA bike-shedding :p). Instead of requiring an external service to wrap the event data with a capsule schema, all events contain a common generic meta subobject.
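By contrast, a self-contained event carries its metadata in the common meta subobject, so no external wrapping service is needed. An illustrative example (field names follow the mediawiki/event-schemas convention, but the concrete values are invented):

```python
# A self-contained event: the schema referenced by meta.schema_uri
# describes the entire object, including the meta subobject itself.
event = {
    'meta': {
        'topic': 'mediawiki.revision-create',
        'schema_uri': 'mediawiki/revision/create/1',
        'uri': 'https://en.wikipedia.org/wiki/Example',
        'id': 'b0caf18d-6c7f-4403-947d-2712bbe28610',
        'dt': '2018-06-26T20:02:00Z',
        'domain': 'en.wikipedia.org',
    },
    # event-specific fields live alongside meta, described by the
    # same schema:
    'page_title': 'Example',
    'rev_id': 123456,
}
```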

Motivations

The Modern Event Platform program's main motivations are:

  • Unify WMF's event processing infrastructure
  • Use best practices and existing open source tools (where possible) rather than writing our own
  • Make the platform easy for engineers to use

Note that we aren't particularly concerned with performance here, at least not for the remainder of this RFC. (We use JSON for webrequest logs and are doing just fine.)

Decision points

Before making other tech decisions pertaining to Modern Event Platform, we first need to decide on an event schema and serialization technology. Schemas and serialization are not necessarily coupled together, but the schema types we might consider all come with their own serialization formats and tooling, so we will consider both the schema and serialization choice together.

This decision comes down to choosing between JSONSchema with JSON, a text based serialization format, and binary alternatives. We will not be considering text serializations other than JSON. There are multiple possible binary tech choices, including Thrift, Protobufs and Apache Avro. When comparing these to JSONSchema in a streaming context, many of the pros and cons are the same. There are of course differences, but the ones that matter for Modern Event Platform are around ease of use and integration with existing technologies. Avro is a relatively ubiquitous choice in the event/streaming world, and has built in support in stream processing platforms and datastore connectors. Given this larger context, we will not be considering binary choices other than Avro.

Some detailed pros and cons of each solution are outlined below, but in general:

Avro would allow us to make use of existing open source technologies and strongly typed data, but comes with the pain of dealing with streaming binary data.

JSONSchema would require us to do more integration implementation work now, and leaves us less confident in our loosely typed data integrations. JSON is, however, much easier for downstream clients to consume.

Note on internal vs external usage: we will continue to use JSON for serialization of external events. This includes intake of EventLogging style events and publishing of events in EventStreams. If we choose a binary schema/serialization, we will need to do conversion for both incoming and outgoing events. While very possible to do, this goes against the ease of use motivation for Modern Event Platform.

Solution 1: Apache Avro

Avro is a schema specification with binary encoding and strong compatibility constraints.

Pros:

  • compact, fast, binary data format, as well as a JSON text encoding
  • simple integration with dynamic languages
  • backwards and forwards schema compatibility and evolution rules (data written with one version of a schema can be read with any other compatible version)
  • strongly typed, for easy integration with both typed and untyped languages and systems
  • Confluent platform has many open source components available to work with Avro in a streaming context

Cons:

  • Binary encoding means that the schema is required to read the data. Binary data in a streaming context does not have a schema associated with it, so some way is needed to map a topic or a message to the message's Avro schema. Every language or tool that wants to read from Kafka needs to know how to do this.
  • Avro-JSON encoding is non standard; e.g. optional fields are encoded as a union type with null, and serialized as:
"optional_field": {"string": "a"}

instead of

"optional_field": null
  • Event data sent from remote clients (browsers, apps, etc.) is likely to not be binary Avro, so we'd need to parse Avro-JSON at this point and convert to binary Avro (Confluent REST Proxy can do this).
  • When exposing public events via EventStreams, we'd have to convert from binary Avro back to JSON, as the binary messages are not easily parseable.
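The "schema is required to read the data" con above is usually addressed with a schema registry plus a small message framing convention. As a sketch, assuming Confluent's wire format (a magic byte followed by a 4-byte big-endian schema id; the payload below is a fake stand-in for binary Avro):

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format magic byte


def frame(schema_id, avro_payload):
    """Prefix binary Avro with magic byte + big-endian schema id."""
    return struct.pack('>bI', MAGIC_BYTE, schema_id) + avro_payload


def unframe(message):
    """Split a framed Kafka message into (schema_id, avro_payload).

    A consumer would look schema_id up in a schema registry to
    obtain the writer's schema before decoding the payload.
    """
    magic, schema_id = struct.unpack('>bI', message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError('unknown magic byte: %d' % magic)
    return schema_id, message[5:]


msg = frame(42, b'\x02\x06foo')  # fake stand-in for an Avro payload
schema_id, payload = unframe(msg)
```

Every consumer in every language has to implement this convention (and be able to reach the registry), which is exactly the integration burden the con describes.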

Solution 2: JSONSchema

Pros

  • We already use it
  • 'Human readable' (it is just text)
  • everyone knows how to parse JSON
    • Don't need a special library just to deserialize a message (no external schema needed)
  • mediawiki/event-schemas have had a lot of thought and work put into them
  • We can be partly backwards compatible with current event systems which use JSON (EventLogging Analytics & EventBus), and more easily migrate usages of those systems to Modern Event Platform.

Cons

  • messages are large ('schema' is shipped with every message in the form of string JSON keys for each field)
  • JSON parsing is not efficient (lots of JSON text -> binary -> JSON text de/serialization)
  • JSON (even with JSONSchema) is 'loosely' typed. E.g. No distinction between integers and floats (they are just numbers.)
  • Full JSONSchema spec is too featureful and extensive for consistent type checking
    • We'd have to use a very limited version of JSONSchema, e.g. no additionalProperties or patternProperties, etc.
  • No existing Kafka Connect integration; we'd have to write this ourselves (PoC here).
  • No existing generic Schema Registry; we'd need to write one ourselves or adapt the meta.wikimedia.org schema repo.
    • Whatever schema repo solution we use, we'll likely need to do some custom work, as we also want a way of storing and editing metadata associated with schemas/topics (e.g. retention, sanitization policies, sampling rates, ownership, access ACLs etc.)
  • No schema evolution constraints, other than the one we can impose -- we only allow additions of optional fields.
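To illustrate the "very limited version of JSONSchema" idea above, here is a minimal, hypothetical validator that honors only type, properties, and required, and always rejects unknown fields (the effect of additionalProperties: false). A real implementation would use an existing JSONSchema library configured with similar restrictions:

```python
# Map restricted JSONSchema type names to Python types.
JSON_TYPES = {
    'string': str, 'integer': int, 'number': (int, float),
    'boolean': bool, 'object': dict, 'array': list, 'null': type(None),
}


def validate_strict(event, schema):
    """Return a list of validation errors (empty list == valid).

    Sketch only: supports flat schemas and a deliberately tiny
    subset of JSONSchema keywords.
    """
    props = schema.get('properties', {})
    errors = []
    for name in schema.get('required', []):
        if name not in event:
            errors.append('missing required field: %s' % name)
    for name, value in event.items():
        if name not in props:
            errors.append('unknown field: %s' % name)
        elif not isinstance(value, JSON_TYPES[props[name]['type']]):
            errors.append('bad type for field: %s' % name)
    return errors
```

Restricting the spec this way is what makes the "map a schema to a relational table" goal discussed later in this task tractable.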

Recommendation

Solution 2: JSONSchema

My own opinion has waffled back and forth between these options over the last few years. (I even wrote a blog post that summarizes some of these at the beginning.) Given my recent investigations into the Kafka Connect codebase, my attempt at a JsonSchemaConverter PoC, and the interest in JSON / JSONSchema support in Kafka integration tooling in the Kafka community, I now believe that JSONSchema is a better fit for WMF than Avro. It will require more engineering work up front to build a comprehensive system on JSONSchema (since one already exists for Avro), but I believe the long term ease of use of such a system will outweigh the short term gains of not having to implement anything. I also believe we can build open source integration plugins via Kafka Connect, as well as a generic JSONSchema and topic-metadata repository, in such a way that they will be usable for more than just WMF's production environment.

Ottomata created this task.Jun 26 2018, 8:02 PM
Restricted Application added a project: Analytics. · View Herald TranscriptJun 26 2018, 8:02 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Ottomata renamed this task from RFC: Modern Event Platform - Choose Schema Tech1 to RFC: Modern Event Platform - Choose Schema Tech.
demon added a subscriber: demon.EditedJun 29 2018, 2:03 AM

Have you looked into protocol buffers?

https://developers.google.com/protocol-buffers/

Ottomata added a comment.EditedJun 29 2018, 2:54 AM

Yeah, both protobufs and thrift are options, but neither have the advantages that Avro does, yet many of the same disadvantages.

Yeah, both protobufs and thrift are options, but neither have the advantages that Avro does, yet many of the same disadvantages.

The same is likely true for Cap'n Proto. However, Cap'n Proto also has the added disadvantage of poor JavaScript support – the only library I can find is Node.js only.

phuedx added a comment.EditedJun 29 2018, 9:09 AM

Full JSONSchema spec is too featureful and extensive for easy integration between different systems.

Could you go into more detail on this, particularly the latter part? AFAICT JavaScript, PHP, Python, and Java all have libraries for JSONSchema draft 6.

Ottomata added a comment.EditedJun 29 2018, 1:24 PM

AFAICT JavaScript, PHP, Python, and Java all have libraries for JSONSchema draft 6.

The languages do, yes, but we want to easily import events into various downstream state stores. For that to work, we need schemas to be predictable, types to be static, and field names to be SQL compatible (no spaces, '.', other weird characters, etc.).

Basically, it should be easy to map from a given JSONSchema to a relational DB table. If we can do that, then we can support pretty much any state store.
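As a hypothetical illustration of that mapping, a flat, restricted JSONSchema can be translated mechanically into a CREATE TABLE statement (the type names and SQL dialect here are illustrative; a real connector would be dialect-aware and would also have to flatten nested objects, e.g. meta.dt to meta_dt):

```python
# Illustrative JSONSchema-type -> SQL-type mapping.
SQL_TYPES = {
    'string': 'VARCHAR',
    'integer': 'BIGINT',
    'number': 'DOUBLE PRECISION',
    'boolean': 'BOOLEAN',
}


def schema_to_ddl(table, schema):
    """Render a flat JSONSchema as a CREATE TABLE statement."""
    required = set(schema.get('required', []))
    cols = []
    for name, spec in schema['properties'].items():
        null = 'NOT NULL' if name in required else 'NULL'
        cols.append('  %s %s %s' % (name, SQL_TYPES[spec['type']], null))
    return 'CREATE TABLE %s (\n%s\n);' % (table, ',\n'.join(cols))


ddl = schema_to_ddl('popups', {
    'properties': {'action': {'type': 'string'},
                   'duration_ms': {'type': 'integer'}},
    'required': ['action'],
})
```

This only works if field names are already SQL-safe and types are static, which is why the schema restrictions above matter.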

… we need schemas to be predictable, types to be static, and field names to be SQL compatible (no spaces, '.', other weird characters, etc.).

Thanks for clarifying. The comment makes sense now.

fdans triaged this task as Normal priority.
Joe added a subscriber: Joe.Jul 4 2018, 8:16 PM
Joe added a comment.Jul 5 2018, 5:55 AM

Yeah, both protobufs and thrift are options, but neither have the advantages that Avro does, yet many of the same disadvantages.

I don't really agree with this assessment.

Protocol buffers are widely used, widely adopted and integrated in quite a few modern RPC technologies like gRPC, which is becoming an industry standard for RPC services. It also has implementations in most languages.

Thrift has great support for even more languages than protobufs, and has an RPC protocol embedded into it.

The only "advantage" I see in avro is... being Confluent's choice, which without more context seem like a pretty mundane advantage.

Also, AFAIR, while both thrift and protobufs allow reading outdated or newer messages coming from the wire, thus allowing easier migrations, I could not find any such functionality in avro's documentation, but admittedly it was a brief search.

All that said, I do think we should seriously consider at least Thrift as an alternative if our goal is to optimize size and serialization/deserialization speeds.

The only "advantage" I see in avro is... being Confluent's choice, which without more context seem like a pretty mundane advantage.

True, but it is also true that Confluent is the entity mostly driving the development of tooling around Kafka. I think the main point that the task wanted to make wrt Avro is their ready availability. That said, I agree that protobufs and Thrift shouldn't be discarded. For example, one of the cool tools that Confluent provides is Kafka Connect, which allows you to connect Kafka with external systems such as databases, key-value stores, search indexes, and file systems. It natively uses Avro, but it allows you to define/implement your own (de)serialisation protocols. E.g. there is one for protobufs, even though it's not supported by default. The other tool that Confluent provides is Schema Registry but, unfortunately, it seems that at the time of writing it supports only Avro schemas. This implicitly means that how we will manage our schemas depends on the serialisation technology we choose. That honestly seems wrong to me and should be decoupled from this discussion.

Also, AFAIR, while both thrift and protobufs allow reading outdated or newer messages coming from the wire, allowing thus easier migrations, I could not find any such functionality in avro's documentation, but admittedly it was a brief search.

For completeness, Avro does support schema evolution.

All that said, I do think we should seriously consider at least Thrift as an alternative if our goal is to optimize size and serialization/deserialization speeds.

Also, there is nothing preventing us to have an internal format and convert between formats for the sake of clients.

For example, one of the cool tools that Confluent provides is Kafka Connect

Kafka Connect is an API and service that is part of Apache Kafka, not provided by Confluent

It natively uses Avro

Confluent's AvroConverter is built to work with Confluent's Schema Registry, but neither of these are part of Kafka Connect in Apache.

The only "advantage" I see in avro is... being Confluent's choice

It is Confluent's choice, but it is a very standard choice in the analytics / streaming / Hadoop world as well.

the main point that the task wanted to make wrt Avro is their ready availability

Indeed, ready availability is the main point behind all of the listed 'advantages' of Avro. Avro also has RPC features built in as well. But yes, the main advantage of Avro is that the upfront new work we'd have to do (building a schema registry, integrating with Kafka Connect, etc.) is much less than with JSONSchema, and also much less than with Thrift or Protobuf.

All that said, I do think we should seriously consider at least Thrift as an alternative if our goal is to optimize size and serialization/deserialization speeds.

I think this is not our goal. Our goal (from my point of view) is ease of use and integration with various systems.

This implicitly means that how we will manage our schemas depends on the serialisation technology we choose. That honestly seems wrong to me and should be decoupled from this discussion.

We do need to consider best/common practices and existing tooling (including our own) when we make decisions like this.

Still, it might be wise to consider Thrift and Protobuf, especially for completeness. Perhaps we can first decide if we even want to use something other than JSON at all? If we don't, then I don't think we need to spend too much time evaluating these other binary formats.

Tgr added a subscriber: Tgr.Jul 8 2018, 5:19 PM
Ottomata updated the task description. (Show Details)Jul 10 2018, 4:45 PM

I've updated the task description with some recent comments from Marko and others. I've mentioned other binary type formats, but made an argument that given the goals of Modern Event Platform, we shouldn't consider options other than JSONSchema and Avro.

Reminder: there is a meeting today, August 1st, at 2pm PST (22:00 UTC, 23:00 CET) in #wikimediaoffice

In today's RFC meeting, there was consensus to move forward with JSONSchema, with strict policies about what is allowed (e.g. no additionalProperties) and how to do schema evolution.
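A hypothetical sketch of the "only additions of optional fields" evolution policy mentioned in the task description, for flat schemas (names and structure here are illustrative, not an agreed-upon implementation):

```python
def is_compatible_evolution(old_schema, new_schema):
    """Allow only additions of optional fields between schema versions.

    Sketch only: assumes flat schemas with `properties` and
    `required`, and does not recurse into nested objects.
    """
    old_props = old_schema.get('properties', {})
    new_props = new_schema.get('properties', {})
    # Existing fields may not be removed or change type.
    for name, spec in old_props.items():
        if name not in new_props:
            return False
        if new_props[name].get('type') != spec.get('type'):
            return False
    # The required set may not grow: added fields must be optional,
    # and existing optional fields may not become required.
    return set(new_schema.get('required', [])) <= set(old_schema.get('required', []))
```

A check like this could run in CI on the schema repository, rejecting any schema change that would break existing consumers.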

IRC log is here: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-08-01-21.00.log.html

Thanks all!

kchapman moved this task from Request IRC meeting to Last Call on the TechCom-RFC board.EditedFri, Aug 10, 7:38 PM

This is being placed on Last Call, closing August 22nd at 2pm PST (21:00 UTC, 23:00 CET)