T185233: Modern Event Platform describes the high level components that will make up the Modern Event Platform program in FY2018-2019.
T201063: Modern Event Platform: Schema Repositories is the parent task for Schema Registry planning and implementation work. It also collects the user stories and requirements gathered in Q4 of FY2017-2018 during stakeholder interviews.
This RFC outlines the current state of our event schemas and repositories and some possible ways forward. The main outcome of this RFC should be to choose an implementation option, but also to reach a good understanding of which features belong in an MVP versus future versions. This RFC is not intended as a design document.
How things work now
EventLogging/Guide has an in-depth (and slightly out of date) description of how the original EventLogging system works, including the on-wiki schema registry. EventBus has more information on how this original idea evolved to support EventLogging for production (non-analytics) services.
EventLogging at WMF currently uses 2 separate schema repositories: one for analytics and one for production (via EventBus).
EventLogging Analytics Schemas
The analytics EventLogging schemas are all stored at https://meta.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=470 in the 'Schema' article namespace using the EventLogging Mediawiki Extension. This extension specifies a 'json' content format for the Schema namespace and uses the CodeEditor Mediawiki extension to allow JSONSchemas to be stored as article revisions. The extension also exposes a MW PHP API endpoint for consumption of the schemas.
Pros
- Schemas are easy for wiki users to edit and discover
- API availability for Schemas is as good as our production wiki sites are
- Talk pages are useful for schema documentation
Cons
- Each schema stored as a revision is not self contained: each of these schemas is expected to be wrapped by the EventCapsule schema. Schema and event composition happens only in the server-side EventLogging Python codebase. This makes it difficult to discern which real 'version' of a schema a given event should be validated against, as the effective schema is made up of two different schemas.
- Uses a pared-down draft 3 JSONSchema, which is very different from modern JSONSchema
- No non-production environment for schema development or CI (e.g. Mediawiki-Vagrant uses meta.wikimedia.org)
- No discovery of multiple versions of the same schema (revisions are not contiguous)
- Schemas cannot be reused between multiple extensions
- No schema evolution enforcement rules
- No good schema guidelines for modern event systems
- 1-to-1 mapping of schema to stream/event usage, making it impossible to reuse a schema for multiple purposes
- EventCapsule makes schemas Mediawiki specific and not adaptable to non-Mediawiki systems
- Talk pages are unstructured and cannot serve schema metadata via an API (except via hacky Mediawiki templating)
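To make the EventCapsule problem concrete, here is a minimal sketch (in Python, with illustrative field names rather than the exact production ones) of the kind of composition the server-side EventLogging code performs. Because the effective schema is assembled from two independently-revisioned schemas, no single revision number identifies it:

```python
# Hypothetical sketch of EventCapsule composition. Field names here are
# illustrative assumptions, not the exact production capsule fields.

# A pared-down capsule: common fields shared by all analytics events.
event_capsule = {
    "properties": {
        "schema": {"type": "string"},     # name of the wrapped schema
        "revision": {"type": "integer"},  # revision of the wrapped schema
        "wiki": {"type": "string"},
        "event": {"type": "object"},      # placeholder, filled in server-side
    }
}

# A per-instrument schema, stored as its own wiki page revision.
navigation_timing = {
    "properties": {
        "connectEnd": {"type": "integer"},
        "responseStart": {"type": "integer"},
    }
}

def compose(capsule, event_schema):
    """Wrap an event schema inside the capsule, as the server-side
    EventLogging Python code does. The result depends on TWO revisions
    (capsule + event schema), so the effective schema has no single
    self-contained version."""
    composed = {"properties": dict(capsule["properties"])}
    composed["properties"]["event"] = event_schema
    return composed

full_schema = compose(event_capsule, navigation_timing)
```

The composed schema only exists at validation time inside the EventLogging service; nothing stored on-wiki corresponds to it directly.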
EventLogging production (aka EventBus) schemas
EventLogging EventBus for production events uses a git repository of draft 4 JSONSchemas. This git repository is cloned wherever it is needed (EventBus production service nodes, Mediawiki-Vagrant, etc.) and used from the local filesystem. The EventBus service (itself part of the EventLogging Python codebase) is configured to read schemas from the local filesystem rather than reach out to the meta.wikimedia.org hosted schema repository. The EventBus service is also configured not to 'encapsulate' its schemas with the EventCapsule. EventBus schemas are mapped to specific Kafka topics, making them reusable for different usages.
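The topic-to-schema mapping can be sketched as follows. This is an illustrative assumption of the shape of that configuration (topic names, schema paths, and fields are examples, not the production config), and the validation shown is a deliberately shallow stand-in for full draft 4 JSONSchema validation:

```python
# Illustrative sketch of EventBus-style topic -> schema mapping.
# Topic names and schema paths below are assumptions for the example.

TOPIC_SCHEMAS = {
    "mediawiki.revision-create": "mediawiki/revision/create/2",
    "mediawiki.page-delete":     "mediawiki/page/delete/1",
}

# Stand-in for schema files loaded from the locally cloned git repo.
SCHEMAS = {
    "mediawiki/revision/create/2": {"required": ["meta", "page_title"]},
    "mediawiki/page/delete/1":     {"required": ["meta", "page_id"]},
}

def missing_fields(topic, event):
    """Look up the schema for a topic and do a shallow required-field
    check. The real service performs full draft 4 JSONSchema validation."""
    schema = SCHEMAS[TOPIC_SCHEMAS[topic]]
    return [f for f in schema["required"] if f not in event]

# A valid event for its topic has no missing fields.
assert missing_fields("mediawiki.revision-create",
                      {"meta": {}, "page_title": "Foo"}) == []
```

Because the mapping is topic -> schema rather than schema -> single use, the same schema can back multiple topics.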
Pros
- Schemas are not Mediawiki specific
- Schemas have been subjected to intense design discussions (AKA bike shedding :p)
- Schemas all share a common generic metadata structure (not a separate capsule)
- Schemas are reviewed for backwards compatibility constraints
- Git repository has commit based CI to enforce schema standards
- Schema development works the same way as code development -- in a git repo with code review
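As an illustration of the kind of rule the commit-based CI can enforce, here is a hedged sketch of a backwards-compatibility check (my own simplified formulation, not the actual CI code): a new schema version may only add fields, never remove or retype existing ones.

```python
# Hypothetical sketch of a backwards-compatibility rule a commit-triggered
# CI job could enforce on schema changes. Simplified: real checks would
# also consider 'required' lists, nested objects, etc.

def is_backwards_compatible(old_schema, new_schema):
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    for name, spec in old_props.items():
        if name not in new_props:
            return False  # removed field breaks existing consumers
        if new_props[name].get("type") != spec.get("type"):
            return False  # retyped field breaks existing consumers
    return True

v1 = {"properties": {"user": {"type": "string"}}}
v2 = {"properties": {"user": {"type": "string"},
                     "comment": {"type": "string"}}}  # additive: compatible
v3 = {"properties": {"user": {"type": "integer"}}}    # retyped: incompatible

assert is_backwards_compatible(v1, v2)
assert not is_backwards_compatible(v1, v3)
```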
Cons
- No active schema API service
- No GUI for schema editing and discovery
- No schema documentation outside of READMEs and field descriptions
- No stream configuration other than topic -> schema mapping
What we want
...is described in the parent task T201063: Modern Event Platform: Schema Repositories. The aim is to unify the production and analytics schemas into a single schema registry and service, and to support schema development without needing to modify schemas in a centralized production schema registry instance.
In addition to a schema registry, we need a way to dynamically configure uses of schemas without having to go through SWAT or a MW deploy train. This functionality is described in T205319: Modern Event Platform: Stream Configuration. The Schema Registry originally included this functionality, but after more discussion it was decided to split it out into a separate service.
This RFC is about choosing a backend storage for schemas, and describing how such a storage would be used to disseminate schemas to clients who need them.
The Schema Registry also needs to be fronted by a service API (for querying schemas over HTTP) and by a GUI that allows product managers and analysts to browse and search schemas and fields. It is fine if this GUI only allows read-only access to schemas. A GUI will likely use this API to display schemas alongside their stream configuration. This RFC is not about building this GUI.
Decision points
This RFC should be used primarily to decide on an implementation path for storage of schemas for the Schema Registry. Three options are proposed: 1. continue to use the centralized Mediawiki database, 2. use and adapt an existing open source (also centralized) schema registry, or 3. continue to use git as we do for EventBus.
Option 1: Continue to use Mediawiki
The main argument in favor of this option is that we already do it. Adapting the EventLogging extension to support the production (no EventCapsule) schema format would not be much work. However, the requirement of being able to develop schemas without modifying the production repository would be difficult to support with Mediawiki. Would every developer need a Mediawiki install with a local schema registry? I'm not sure how this would work.
Option 2: Adapt an existing schema registry
There are a few of these out there, but none of them support our use cases well. Most are either very Avro specific or come with a much larger (possibly bloated, depending on your perspective) software stack. Examples are:
- Confluent Schema Registry (with Landoop Schema Registry UI, example)
- Avro only, but has community desire for JSONSchema
- No metadata
- HortonWorks Registry
- Avro only, but could be extended.
- May have metadata / schema usage tracking (unclear)
- Iglu Schema Repository
- JSONSchema, but only if conforming to a custom 'self describing JSONSchema' spec.
- Nakadi
- JSONSchema, but comes with a full (bloated?) event proxy and streaming system. Possibly too opinionated.
All of these rely on centralized storage for schemas, so they share the main drawback that the Mediawiki option has today.
Option 3: Use git for schema storage, build read-only services that serve schemas from checked out git repo.
This would allow us to continue to use the git based schema storage we already use for EventBus. We'd extend the format and layout of the event-schemas repository to include other schema metadata, like ownership. It would also require us to build a schema service API (this might just be an HTTP filesystem endpoint) and to build something that serves these schemas in a GUI. The GUI could be anything, and might even fit into a Mediawiki extension, where analysts are already used to browsing for and documenting schemas.
In the existing event-schemas git repository, we don't use git for schema versioning. Instead, we create new schema versions as new files and do versioning manually. This allows all schemas to be accessed at git HEAD, and gives us finer control over versioning (documentation or comment changes don't require new schema versions). It is also simple: a schema service API could be as simple as an HTTP server's file tree browser, and a linked file hierarchy is easily indexed for search, making it easier to build a comprehensive GUI.
Option 3 would continue to use the same versioning technique.
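The "versions as files" technique can be sketched as follows. The file paths here are illustrative assumptions about the repository layout, not a specification of it; the point is that resolving "latest" is a trivial directory listing at git HEAD, which is all an HTTP file-tree service would need to do:

```python
# Sketch of version-as-files resolution: each schema version is a separate
# file under the schema's directory, and "latest" is just the highest
# version number present at git HEAD. Paths below are illustrative.

from pathlib import PurePosixPath

# Stand-in for the file listing of a checked-out event-schemas repo.
repo_files = [
    "jsonschema/mediawiki/revision/create/1.json",
    "jsonschema/mediawiki/revision/create/2.json",
    "jsonschema/mediawiki/page/delete/1.json",
]

def latest_version(schema_dir, files):
    """Return the highest numbered version file under schema_dir."""
    versions = [
        int(PurePosixPath(f).stem)
        for f in files
        if f.startswith(schema_dir + "/")
    ]
    return max(versions)

assert latest_version("jsonschema/mediawiki/revision/create", repo_files) == 2
```

Since every version is an ordinary file, old versions stay addressable forever without any git archaeology, and a static file server plus this kind of lookup is already most of a read-only schema API.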
Using git for storage satisfies the development use case, and allows us to use the same tools for review and CI that we do for code now.
Recommendation
Option 3: Use git for schema storage