
RFC: Modern Event Platform: Schema Registry
Closed, Resolved · Public · 8 Estimated Story Points

Description

T185233: Modern Event Platform describes the high level components that will make up the Modern Event Platform program in FY2018-2019.
T201063: Modern Event Platform: Schema Repositories is the parent task for Schema Registry planning and implementation work. It also collects the user stories and requirements gathered during stakeholder interviews in Q4 of FY2017-2018.

This RFC outlines the current state of our event schemas and repositories and some possible ways forward. The main outcome of this RFC should be to choose an implementation option, but also to reach a good understanding of which features will be in an MVP versus future versions. This RFC is not intended as a design document.

How things work now

Eventlogging/Guide has an in-depth (and slightly out of date) description of how the original EventLogging system works, including the on-wiki schema registry. EventBus has some more information on the evolution of this original idea in EventLogging for production (non-analytics) services.

EventLogging at WMF currently uses two separate schema repositories: one for analytics and one for production (via EventBus).

EventLogging Analytics Schemas

The analytics EventLogging schemas are all stored at https://meta.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=470 in the 'Schema' article namespace using the EventLogging Mediawiki extension. This extension specifies a 'json' content format for the Schema namespace and uses the CodeEditor Mediawiki extension to allow JSONSchemas to be stored as article revisions. The extension also exposes a MW PHP API endpoint for consuming the schemas.

Pros
  • Schemas are easy for wiki users to edit and discover
  • API availability for Schemas is as good as our production wiki sites are
  • Talk pages are useful for schema documentation
Cons
  • Each schema stored as a revision is not self-contained: every schema is expected to be wrapped by the EventCapsule schema, and the only place schema and event composition happens is the server-side EventLogging Python codebase (a rough sketch of this composition follows this list). This makes it difficult to discern which real 'version' of a schema a given event should be validated against, since that version is composed of two different schemas.
  • Uses a pared-down draft 3 JSONSchema, which is very different from modern JSONSchema
  • No non-production environment for schema development or CI (e.g. Mediawiki-Vagrant uses meta.wikimedia.org)
  • No discovery of multiple versions of the same schema (revisions are not contiguous)
  • Schemas cannot be reused between multiple extensions
  • No schema evolution enforcement rules
  • No good schema guidelines for modern event systems
  • 1-to-1 mapping of schema to stream/event usage, making it impossible to reuse a schema for multiple purposes
  • EventCapsule makes schemas Mediawiki-specific and not adaptable to non-Mediawiki systems
  • Talk pages are unstructured and cannot serve schema metadata via an API (except via hacky Mediawiki templating)
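To make the self-containment problem concrete, here is a purely illustrative Python sketch of the capsule/schema composition. The field names are simplified stand-ins, not the exact EventCapsule definition:

```python
# Illustrative only: simplified stand-ins for EventCapsule and an on-wiki
# schema revision, showing why neither is self-contained on its own.
capsule_schema = {
    "type": "object",
    "properties": {
        "schema": {"type": "string"},     # name of the wrapped schema
        "revision": {"type": "integer"},  # revision of the wrapped schema
        "wiki": {"type": "string"},
        "event": {"type": "object"},      # placeholder, replaced below
    },
}

wrapped_schema = {  # hypothetical on-wiki schema revision
    "type": "object",
    "properties": {"connectEnd": {"type": "integer"}},
}

def compose(capsule, event_schema):
    """Combine the capsule and the on-wiki schema, as the server-side
    EventLogging Python code does before validating an incoming event."""
    composed = dict(capsule, properties=dict(capsule["properties"]))
    composed["properties"]["event"] = event_schema
    return composed
```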

EventLogging production (aka EventBus) schemas

EventLogging EventBus for production events uses a git repository of draft 4 JSONSchemas. This git repository is cloned wherever it is needed (EventBus production service nodes, Mediawiki-Vagrant, etc.) and used from the local file system. The EventBus service (itself a part of the EventLogging python codebase) is configured to read schemas from the local filesystem rather than reach out to the meta.wikimedia.org hosted schema repository. EventBus service is also configured to not 'encapsulate' its schemas using the EventCapsule. EventBus schemas are mapped to specific Kafka topics, making them reusable for different usages.
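As a rough illustration of this model (the mapping, repository path, and filenames below are hypothetical, not the actual EventBus configuration), a service only needs a topic -> schema mapping and a local clone of the schema repository:

```python
import json
import os

# Hypothetical topic -> schema mapping and repository path, for illustration
# only. Note that the same schema URI can back multiple topics.
SCHEMA_REPO = "/srv/event-schemas/jsonschema"   # local git clone
TOPIC_TO_SCHEMA = {
    "mediawiki.revision-create": "mediawiki/revision/create/1",
    "mediawiki.page-delete": "mediawiki/page/delete/1",
}

def load_schema_for_topic(topic):
    """Resolve a topic to its schema file in the local checkout and load it
    (assuming .json filenames, which is an assumption of this sketch)."""
    relative_uri = TOPIC_TO_SCHEMA[topic]
    path = os.path.join(SCHEMA_REPO, relative_uri + ".json")
    with open(path) as f:
        return json.load(f)
```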

Pros
  • Schemas are not mediawiki specific
  • Schemas have been subjected to intense design discussions (AKA bike shedding :p)
  • Schemas all share a common generic metadata structure (not a separate capsule)
  • Schemas are reviewed for backwards compatibility constraints
  • Git repository has commit based CI to enforce schema standards
  • Schema development works the same way as code development -- in a git repo with code review
Cons
  • No active schema API service
  • No GUI for schema editing and discovery
  • No schema documentation outside of READMEs and field descriptions
  • No stream configuration other than the topic -> schema mapping

What we want

...is described in the parent task T201063: Modern Event Platform: Schema Repositories. The aim is to unify the production and analytics schemas into a single schema registry and service, and to support schema development without needing to modify schemas in a centralized production schema registry instance.

In addition to a schema registry, we need a way to dynamically configure uses of schemas without having to go through SWAT or a MW deploy train. This functionality is described in T205319: Modern Event Platform: Stream Configuration. The Schema Registry originally included this functionality, but after more discussion it was decided to split it out into a separate service.

This RFC is about choosing a backend storage for schemas, and describing how such a storage would be used to disseminate schemas to clients who need them.

The Schema Registry also needs to be fronted by a service API (for querying schemas over HTTP) and by a GUI that allows product managers and analysts to browse and search schemas and fields. It is fine if this GUI only allows read-only access to schemas. A GUI will likely use this API to display schemas alongside their stream configuration. This RFC is not about building this GUI.

Decision points

This RFC should be used primarily to decide on an implementation path for storing schemas in the Schema Registry. Three options are proposed: 1) continue to use the centralized Mediawiki database, 2) use and adapt an existing open source (also centralized) schema registry, or 3) continue to use git as we do for EventBus.

Option 1: Continue to use Mediawiki

The main argument in favor of this is that we already do it. Adapting the EventLogging extension to support the production (no EventCapsule) schema format would not be much work. However, the requirement of being able to develop schemas without modifying the production repository would be difficult to support with Mediawiki: would everyone need a Mediawiki install with a local schema registry? I'm not sure how this would work.

Option 2: Adapt an existing schema registry

There are a few of these out there, but none of them support our use cases well. Most are either very Avro specific or come with a much larger (possibly bloated, depending on your perspective) software stack. Examples are:

  • Nakadi
    • JSONSchema, but comes with a full (bloated?) event proxy and streaming system. Possibly too opinionated.

All of these rely on centralized storage for schemas, so they share the main drawback that the Mediawiki option has today.

Option 3: Use git for schema storage, build read-only services that serve schemas from checked out git repo.

This would allow us to continue to use the git based schema storage we already have for EventBus. We'd extend the format and layout of the event-schemas repository to include other schema metadata, like ownership. It would also require us to build a schema service API (this might just be an HTTP filesystem endpoint) and to build something that serves these schemas in a GUI. The GUI could be anything, and might even fit into a Mediawiki extension, where analysts are already used to browsing for and documenting schemas.

In the existing event-schemas git repository, we don't use git for schema versioning. Instead, we create new schema versions as new files and do the versioning manually. This keeps every schema version accessible in the git HEAD, and gives us finer control over versioning (documentation or comment changes don't require new schema versions). It is also simple: a schema service API could be as simple as an HTTP file tree browser, and such a file hierarchy is easily indexed for search, making it easier to build a comprehensive GUI.
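A minimal sketch of how a service could resolve versions under this file-per-version layout (the directory layout and .json extension are assumptions for illustration, not a decided format):

```python
import os
import re

def list_versions(repo_root, schema_name):
    """Return the numeric versions present for a schema, e.g.
    list_versions('/srv/event-schemas/jsonschema', 'mediawiki/revision/create')."""
    schema_dir = os.path.join(repo_root, schema_name)
    versions = []
    for filename in os.listdir(schema_dir):
        match = re.match(r"^(\d+)\.json$", filename)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)

def latest_version(repo_root, schema_name):
    """Highest numbered version visible in the git HEAD checkout."""
    return list_versions(repo_root, schema_name)[-1]
```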

Option 3 would continue to use the same versioning technique.

Using git for storage satisfies the development use case, and allows us to use the same tools for review and CI that we do for code now.

Recommendation

Option 3: Use git for schema storage

Event Timeline

Ottomata triaged this task as Medium priority. Aug 9 2018, 8:29 PM
Ottomata created this task.
Krinkle moved this task from Request IRC meeting to P1: Define on the TechCom-RFC board.
Krinkle subscribed.

Keeping in inbox until the next meeting per the triage process. But for the record: IRC meeting was requested in a thread on Wikitech-l.

fdans changed the point value for this task from 0 to 8. Aug 27 2018, 3:52 PM

There will be a discussion about this RFC on #wikimedia-office at 21:00 UTC today, that's in an hour and a bit. Sorry for the late notice.

Copy/pasting this comment for visibility:

Dan and I had been contemplating some implementation details around this and the event intake components. We were struggling with some conflicts between a few use cases, and so today had a meeting with Petr, Marko, Sam Smith and Joaquin to discuss.

Currently, the EventLogging meta.wikimedia.org based schema repository is centralized. There is a single installation, and all code in all environments is expected to use schemas from this centralized location. This allows schemas to be edited in a GUI, and they don't have to be synced anywhere else.

However, this means that every use needs its schemas created or modified in 'production', even while still under development. It also tightly couples production services to meta.wikimedia.org. When designing EventBus years ago, we decided to use a git repository to manage and distribute schemas. This got us the same code review and continuous integration processes we use for all other development. It also means that development environments like MW-Vagrant can clone the repository locally for development purposes.

Our meeting today was about schema and metadata storage possibilities around these two different options: centralized database vs. decentralized git repository. We can solve all use cases using the git repository except for one: GUI editable schemas. In the meeting today we all decided that as long as we can drop that one use case, using a git repository made the most sense. We'd still build GUI and API services that assist in schema and metadata browsing, searching, etc, but editing this data would have to be done via git.

I need to check with some other product owners to make sure this will be OK, but for now the plan is to use git for schema storage, and possibly for metadata storage too.

Ottomata renamed this task from RFC: Modern Event Platform: Schema Registry / Metadata Service to RFC: Modern Event Platform: Schema Registry. Aug 30 2018, 5:44 PM
Ottomata updated the task description.

I've edited the RFC description to account for:

  • add more info about schema usage metadata vs schema registry
  • modify RFC to be about git based storage for schemas
  • add info about file based versioning

@kchapman I'm not totally familiar with the RFC process. What's next? Does this need to go through another RFC IRC meeting?

To recap what I said in last week's IRC meeting: This kind of decision should not fall under TechCom's authority. It is not just a technical matter affecting software architecture and engineers' work; it also has significant product aspects affecting users of the event data infrastructure, especially data analysts and product managers. Like any such work, it should evaluate the proposed solutions not just for their technical impact, but also based on how well they would work for those users and their needs.

21:10:12 <HaeB> I'm curious why this decision is considered to be in the realm of TechCom. It seems that it will have major impact not just on engineers/developers, but also on users of this data
21:11:01 <HaeB> I posted some comments here: https://phabricator.wikimedia.org/T201063#4543616

..

21:12:10 <HaeB> ...TLDR: it seems that the task and discussion has been conducted almost exclusively from an engineering perspective. As a data analyst, I don't see my perspective and needs represented

To follow up on a question @Krinkle raised in the RfC meeting:

21:17:49 <Krinkle> HaeB: However, I do genuinely want to know - is it common for you or other analysts to edit schemas on meta-wiki?

A quick look at who has edited schemas in the last 30 days (the maximum time visible in recent changes):

  • 3 data analysts
  • 1 product manager
  • 1 developer

Just had a shortish meeting with Tilman and Josh Minor. I don't think we resolved much, but Josh is going to work with Tilman and others via the Better Use of Data Working group to address some of Tilman's concerns. In the meantime, @Tbayer it'd be helpful if you were able to list the use cases that you feel are not captured in T201063: Modern Event Platform: Schema Repositories, and/or ones for which you think using git for schema storage would make things difficult.

To recap what I said in last week's IRC meeting: This kind of decision should not fall under TechCom's authority.

This kind of thing certainly should go through a TechCom RFC. But TechCom approval is often not the only kind of approval necessary for a change. Changes should be approved in one way or another by all stakeholders. TechCom RFCs are a structured way to obtain approval from engineers. But engineers approving may well not be sufficient for a change to go ahead. I hope this clears up any confusion about TechCom's authority.

@daniel what's next for the RFC? How do I keep it moving?

@Milimetric what's your take on this? Can this go to last call, or should it see more discussion?

I acknowledge my bias here, as a member of the team that's going to implement this feature. We have meetings to make sure we understand and include the use cases that Tilman's thinking of for the design of the metadata management part of this project. I believe that's the only remaining item, and there's a path to resolve it, so in my opinion, yes, this can go to last call as far as Tech Com is concerned. The decision to use git as storage seems noncontroversial and what happens with metadata seems to not interest TechCom too much, though obviously it should be handled with care.

What happens with metadata will probably interest TechCom (e.g. allowing remote clients to configure their sampling rates via a service), but we've removed that discussion from this RFC. I think this one can be moved along too.

Just had a shortish meeting with Tilman and Josh Minor. I don't think we resolved much, but Josh is going to work with Tilman and others via the Better Use of Data Working group to address some of Tilman's concerns.

Just to clarify for the record: That meeting didn't quite happen as planned, because some participants had to withdraw on short notice, so it was just an impromptu conversation between us three. There is a meeting happening today instead, although @JMinor will still be OOO, so we'll need to wait a bit longer for the "more thorough comment based on the Better Use of Data stories" he has been planning to provide (T201063#4543519).

In the meantime, @Tbayer it'd be helpful if you were able to list the use cases that you feel are not captured in T201063: Modern Event Platform: Schema Repositories, and/or ones for which you think using git for schema storage would make things difficult.

Neil has since pointed out, from the data analyst perspective, a major such use case that had been overlooked (T201063#4561468). Thanks for adding it to the task description there (T201063#4561468)! I'll try to flesh it out a bit more.

Last week Dan and I met with the Product Analysts and discussed the question about using a git repository. My impression is that this is generally supported, as long as we have good systems and standards for documentation and schema review and merging (analysts should be able to merge changes to schemas).

Neil had a great idea about editing schemas in the git repository. We will likely not use git history for schema versioning. Instead, we will do as we do now in the event-schemas repository: make a new numerically versioned copy of each schema file every time we change the schema. Making a new copy of each file isn't great for code review, as it isn't easy to see what has changed. Neil's idea was to always edit a 'latest' version of a schema, and have a post commit or merge git hook that would create the new numerically versioned copy of the file and commit it. I'd love to have this in the event-schemas repository we use for production events now.
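A rough sketch of what such a hook could look like (the file layout, naming, and hook wiring are assumptions; this is an idea, not an implemented design):

```python
#!/usr/bin/env python3
# Sketch of Neil's idea: when a schema's 'latest' file changes, copy it to
# the next numeric version and commit that copy automatically.
import os
import re
import shutil
import subprocess
import sys

def next_version(schema_dir):
    """Next unused numeric version in a schema directory (assumes N.json files)."""
    versions = [int(m.group(1))
                for f in os.listdir(schema_dir)
                for m in [re.match(r"^(\d+)\.json$", f)] if m]
    return max(versions, default=0) + 1

def materialize(latest_path):
    """Copy the edited 'latest' schema to a new numbered file and commit it."""
    schema_dir = os.path.dirname(latest_path)
    versioned = os.path.join(schema_dir, "%d.json" % next_version(schema_dir))
    shutil.copyfile(latest_path, versioned)
    subprocess.check_call(["git", "add", versioned])
    subprocess.check_call(
        ["git", "commit", "-m", "Materialize %s from latest" % versioned])

if __name__ == "__main__":
    for changed_latest in sys.argv[1:]:  # paths to changed 'latest' files
        materialize(changed_latest)
```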

I'm documenting here a question from myself that came up in the IRC discussion hour and in TechCom meetings.

What does the configuration story look like for MediaWiki in the Git model?

The current model of Meta-Wiki as the central registry makes development and iteration difficult because it requires online connectivity to create (draft) versions of a schema, where each revision is also permanently and publicly stored in a way that isn't easily distinguishable from "real" versions. It also makes code review difficult and requires online connectivity to fetch from WMF servers. This means a local installation with all software installed and set up isn't able to start up the way you'd expect (e.g. when working on a plane, otherwise offline, or in a fire-walled development environment).

The way software is usually developed via Git has the potential to make this a lot easier, so this isn't much of a concern; rather, I'm just curious what the rough plan is to connect the dots between the schema registry and MediaWiki, and whether and how one would, for example, draft a commit in one repository in a way that MediaWiki-EventLogging can discover (e.g. will it require an API service and connect via hostname+port, or file path to git clone, or dedicated clone as submodule in the MW extension, something else, etc.).

whether and how one would, for example, draft a commit in one repository in a way that MediaWiki-EventLogging can discover (e.g. will it require an API service and connect via hostname+port, or file path to git clone, or dedicated clone as submodule in the MW extension, something else, etc.).

Committed and merged schemas will be publicly available in production via HTTP, likely via the Stream Config Service, or some standalone HTTP filesystem webserver. Your development environment would be configured to either read the schemas directly from the filesystem git checkout, or via a local install of the same service we deploy in production, configured to serve schemas from your local git checkout.

To develop a new schema or a schema change, you make all your changes to both schemas and code locally and test them. You then submit the changes for review. Schema changes will need to be merged (and probably auto deployed) before associated code changes are deployed, but since all schema changes will need to be backwards compatible this shouldn't be an issue.

In the EventBus system, all events have a meta.schema_uri field, which we populate with the relative URI path of a schema, e.g. mediawiki/revision/create/1. This URI can then be prefixed with a base URI to form a fully addressable schema URL. If your schemas are checked out locally, the base URI could be something like file:///srv/event-schemas/jsonschema. If you need to look them up from a remote HTTP server, the base URI could be http://schemas.wm.org/jsonschema (or whatever).
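A minimal sketch of that resolution (the base URIs are just the examples above, not final endpoints):

```python
def resolve_schema_url(base_uri, schema_uri):
    """Prefix a relative meta.schema_uri with a base URI to get a fetchable URL."""
    return base_uri.rstrip("/") + "/" + schema_uri.lstrip("/")

# Local development, reading from a git checkout:
resolve_schema_url("file:///srv/event-schemas/jsonschema",
                   "mediawiki/revision/create/1")
# -> 'file:///srv/event-schemas/jsonschema/mediawiki/revision/create/1'

# Production, via an HTTP schema service:
resolve_schema_url("http://schemas.wm.org/jsonschema",
                   "mediawiki/revision/create/1")
# -> 'http://schemas.wm.org/jsonschema/mediawiki/revision/create/1'
```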

Moving to inbox so TechCom members can review if needed, and move to last call if there are no further questions or problems to work out.

TechCom is placing this on Last Call ending Wednesday December 5th 10pm PST (December 6th 06:00 UTC, 07:00 CET)

awight subscribed.

I added a negative: multiple extensions cannot use the same schema, even when at the same revision level.