
Modern Event Platform: Schema Registry: Implementation
Open, Normal · Public · 0 Story Points

Description

Ticket proliferation disambiguation!

This ticket will be used to track and task implementation work for the Schema Registry.

Description

Since we are moving forward with git as the canonical storage for schemas, we can base the implementation work planned for Q2 2018-2019 on the existing event-schemas repository. This repository currently contains Draft 4 JSON schemas with some minimal CI jobs to ensure schema consistency. Implementation work for this task will mostly involve git commit/merge hooks and CI improvements.

We may also want to build an HTTP service to serve schemas. If so, this service might be as simple as an HTTP file server that exposes the hierarchy and schemas of the git repository (or repositories).

In either case, schemas will always be addressable via URIs, whether those schemas are checked out on the local filesystem (file://) or via HTTP (http://).
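
As a rough illustration of URI-addressable schemas, the sketch below loads a schema from either a file:// or an http:// URI with the same code path. It assumes the schema is JSON-encoded (a YAML schema would need a YAML parser such as PyYAML instead), and the function name load_schema is hypothetical, not part of any existing tool.

```python
import json
from urllib.request import urlopen


def load_schema(schema_uri: str) -> dict:
    """Fetch and parse a JSON-encoded schema from a file:// or http(s):// URI.

    urlopen handles both schemes, so callers do not need to care whether the
    schema lives in a local checkout or behind an HTTP file server.
    """
    with urlopen(schema_uri) as response:
        return json.loads(response.read().decode("utf-8"))
```

A local checkout could then be addressed with something like pathlib.Path(checkout_dir, "revision/create/3").as_uri(), where checkout_dir is wherever the repository happens to be cloned.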

Technical Requirements

  • Up to date JSONSchema support (Draft 7?)
  • All schema versions maintained in HEAD commit (we won't be using git history to version schemas)
  • CI for ensuring that all schemas have a consistent meta field
  • CI for ensuring schema backwards compatibility
  • CI for schema linting, e.g. no camelCase, only snake_case, etc.
  • 'latest' schema version is editable, and changes to it are reviewable using the usual git review tools
  • Post-commit or post-merge git hooks to create new versioned file copies of schemas
  • Schemas can be in YAML or JSON format, but files should not have file extensions so relative schema_uris don't need to include (or append) a proper file extension
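
The linting requirement above could be sketched roughly as follows. This is only an illustration: the regular expression is one possible definition of snake_case, the function name find_non_snake_case is hypothetical, and the check only walks nested "properties" maps (it does not look inside array items or other keywords).

```python
import re

# One possible definition of snake_case: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_non_snake_case(schema: dict, path: str = "") -> list:
    """Return dotted paths of property names that are not snake_case."""
    bad = []
    for name, subschema in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        if not SNAKE_CASE.match(name):
            bad.append(field_path)
        # Recurse into nested object schemas.
        if isinstance(subschema, dict):
            bad.extend(find_non_snake_case(subschema, field_path))
    return bad
```

A CI job could fail the build whenever this returns a non-empty list for any schema file.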

Other ideas

On 2018-10-12, @Pchelolo and @Ottomata brainstormed implementation ideas. Much of the implementation work to be done is around CI and development workflows. Some of this is already done for mediawiki/event-schemas, but we need to do more. I'll try to collect some of the things we need to implement here.

  • editing of schemas should be done to the current schema version.
  • JSON $ref pointers can be used only in the current schema version.
  • $ref pointers to other schemas must be strongly versioned. E.g. if we factor out the meta schema, every event that uses it will point to a specific version of meta, e.g. meta/3, or meta/4.
    • versioned $ref pointers in schemas must be manually upgraded by editing the schema and creating a new schema version.
  • This will ensure that any changes to referenced schemas will not affect user schemas until they manually update the referenced version. (This is how dependencies normally work anyway.)
  • git hooks will dereference the $ref pointers in current to generate standalone, explicitly committed, versioned schema files.
  • next schema version number can be computed from upstream branch
    • e.g. if upstream origin/master has revision/create/3 as the latest, a change to revision/create/current will generate revision/create/4 for review. If the local checkout of master has revision/create/4, but upstream origin/master still only has revision/create/3, a change to revision/create/current will regenerate revision/create/4.
  • if only a code comment or description field changes in the current schema, don't generate a new schema version.
  • the backwards compatibility library (T206889) will ensure changes are backwards compatible, both in the git hook and in CI.
  • Should we use smarter versioning than just incrementing numbers? Semver might be nice and more flexible, especially for those times when we need to force a backwards incompatible change.
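
Two of the ideas above (computing the next version from upstream, and skipping description-only changes) could be sketched as below. This is a minimal illustration under assumptions: versions are plain integers, the upstream version list has already been read from origin/master by some other means, and "description" and "$comment" are treated as the documentation-only keys. All function names are hypothetical.

```python
# Keys assumed to carry documentation only (an assumption for this sketch).
DOC_ONLY_KEYS = {"description", "$comment"}


def next_version(upstream_versions) -> int:
    """Next integer version, based only on the versions present upstream.

    Local-only versions are ignored, so re-running the hook before the change
    is merged regenerates the same version number.
    """
    return max(upstream_versions, default=0) + 1


def strip_docs(node, in_properties: bool = False):
    """Return a copy of a schema with documentation-only keys removed.

    Keys directly inside a "properties" map are field names, so a field that
    happens to be called "description" is kept.
    """
    if isinstance(node, dict):
        return {
            key: strip_docs(value, key == "properties" and not in_properties)
            for key, value in node.items()
            if in_properties or key not in DOC_ONLY_KEYS
        }
    if isinstance(node, list):
        return [strip_docs(item) for item in node]
    return node


def needs_new_version(latest: dict, current: dict) -> bool:
    """True if current differs from the latest version in more than documentation."""
    return strip_docs(latest) != strip_docs(current)
```

With upstream at revision/create/3, next_version([1, 2, 3]) yields 4 regardless of what the local checkout contains. For the comparison to be meaningful, current would presumably be dereferenced first, since the versioned files are standalone.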

Event Timeline

Ottomata created this task. Oct 11 2018, 6:59 PM
Ottomata triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Oct 11 2018, 6:59 PM
Ottomata added a comment. Edited Oct 11 2018, 7:00 PM

Q: should we use the term 'repository' or 'registry' here? I'm considering retitling the tickets to 'repository' since we will be using git repositories. However, there may be some extra features on a potential HTTP service that serves schemas. If we have that, would we call that the 'registry'?

Q: Analytics has a use case to add extra jsonschema features to be able to know more about the contextual 'types' of fields, namely: dimension (low cardinality) vs measure (value), and also time dimensions. Having this context in schemas will allow us to automate ingestion into analytics systems like Druid, and also even Prometheus which makes the same distinctions between fields (labels vs values). Do we need to use a custom meta JSONSchema for this, or can we just add type information outside of the JSONSchema spec in the schemas? We'd want to do something like:

dt:
  type: string
  format: date-time
  context_tags: [dimension, time]
domain:
  type: string
  context_tags: [dimension]
buttons_clicked:
  type: integer
  context_tags: [measure]
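
To illustrate how such tags might drive automated ingestion, here is a small sketch that groups fields by their context_tags. The FIELDS dict mirrors the example above; the function names and the shape of the returned summary are made up for illustration and do not correspond to any Druid or Prometheus configuration format.

```python
# Mirrors the example field definitions above.
FIELDS = {
    "dt": {"type": "string", "format": "date-time", "context_tags": ["dimension", "time"]},
    "domain": {"type": "string", "context_tags": ["dimension"]},
    "buttons_clicked": {"type": "integer", "context_tags": ["measure"]},
}


def fields_tagged(properties: dict, tag: str) -> list:
    """Names of fields whose context_tags include the given tag."""
    return [name for name, spec in properties.items() if tag in spec.get("context_tags", [])]


def ingestion_summary(properties: dict) -> dict:
    """Group field names by role (hypothetical output shape)."""
    return {
        "time_fields": fields_tagged(properties, "time"),
        "dimensions": fields_tagged(properties, "dimension"),
        "measures": fields_tagged(properties, "measure"),
    }
```

Fields without context_tags are simply left out of every group, so untagged schemas would keep working unchanged.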

Ottomata updated the task description. Oct 11 2018, 7:07 PM
Ottomata updated the task description. Oct 11 2018, 7:34 PM
Pchelolo updated the task description. Oct 11 2018, 9:12 PM

Do we need to use a custom meta JSONSchema for this, or can we just add type information outside of the JSONSchema spec in the schemas?

We would need to use a custom meta-schema: http://json-schema.org/latest/json-schema-core.html#rfc.section.6.4

I'm wondering if the HTTP service should be able to serve both extended and standard schema depending on the accept header the client provided?

Up to date JSONSchema support (Draft 7?)

+1, but we need to evaluate whether most of the languages have good libraries with support for draft 7.

Speaking of node.js, ajv, the best node JSON schema validator (based on testing/benchmarking from ~1.5 years ago), supports it. However, it generates JS code from the schema and evals it, so now that we're opening event production to the public, we need to conduct a security review of this library.

Q: should we use the term 'repository' or 'registry' here? I'm considering retitling the tickets to 'repository' since we will be using git repositories. However, there may be some extra features on a potential HTTP service that serves schemas. If we have that, would we call that the 'registry'?

I'd stick with 'registry' as the name for "the service", whatever we include in that term, to free the term 'repository' for the git repo itself. Reusing 'repository' for both could be confusing.

How are we satisfying this requirement:

As an engineer, I want to be able to share schemas in development so that others can run and test my code.

We'd need to support branch URIs for that, or do you have something else in mind?

I'm wondering if the HTTP service should be able to serve both extended and standard schema depending on the accept header the client provided?

I think it would be based on the value of the $schema field

+1, but we need to evaluate whether most of the languages have good libraries with support for draft 7.

AJV uses draft 7 by default. We don't need JSONSchema validation elsewhere, just JSON (schema) parsing.

We'd need to support branch URIs for that, or do you have something else in mind?

No, I think the local EventBus would just use whatever is checked out locally. So if someone wants to test out a new schema, they just check out / cherry-pick / download the relevant patch or branch.

Ottomata renamed this task from Modern Event Platform: Event Schema Registry: Implementation to Modern Event Platform: Schema Registry: Implementation. Oct 25 2018, 1:48 PM
Ottomata updated the task description.
Ottomata updated the task description. Oct 25 2018, 1:50 PM
Ottomata moved this task from Backlog to In Progress on the EventBus board. Dec 5 2018, 10:04 PM
Ottomata moved this task from In Progress to Next Up on the EventBus board.

@Pchelolo, so aside from the eventual HTTP based schema registry idea, we will still need (at least) one more git schema repository for analytics. This repo should use the same CI pipeline we build for event-schemas, but more people will have commit and merge access to it.

This quarter we want to start producing the monolog avro events (CirrusSearchRequestSet and ApiAction) to an eventgate instance. These events currently go through kafka-jumbo, and I think they should continue to do so. The eventgate-analytics deployment will (for now?) also just use kafka-jumbo. We need a place to store these new schemas. Perhaps mediawiki/event-schemas is not it? Should we create a new schema repo now for analytics purposes, or should we just use mediawiki/event-schemas for now and create a new repo later when it is time?

Should we create a new schema repo now for analytics purposes, or should we just use mediawiki/event-schemas for now and create a new repo later when it is time?

Creating a new one will result in premature bikeshedding on naming, structure of the repo, etc. I'm ok with using event-schemas for now.

Ottomata moved this task from Next Up to In Progress on the EventBus board. Apr 19 2019, 4:17 PM