This task will be the parent for work to build a schema repository component for the Modern Event Platform program. The working name of this component will be Event Schema Repositories.
In {T198256}, it was decided to continue using JSONSchema as we do now. As such, we need to either write new schema registry software, or find one to adapt to our needs. We've collected a lot of requirements and wishes from analysts, engineers and product managers for this component. I'll summarize those as user stories here. We can then discuss how to satisfy those stories in a particular implementation and design.
NOTE: EventLogging schemas are currently coupled with their usage. That is, any given schema can only have one usage (Kafka topic, MySQL/Hive table, etc.). We want to decouple schemas and their usage instances. A stream of events is a single 'usage' of a schema, in that every event in a stream will have the same schema. A schema may be used by multiple streams.
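To make the decoupling above concrete, here is a minimal sketch of the relationship between schemas and streams. The class names, field names, stream names and URI format are all hypothetical and are not part of any existing implementation; they only illustrate that a stream has exactly one schema while a schema may back many streams.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Schema:
    """A schema revision, identified by a URI (hypothetical format)."""
    uri: str
    definition: Dict


@dataclass
class Stream:
    """One 'usage' of a schema: every event in the stream has this schema."""
    name: str
    schema_uri: str


def streams_using(schema: Schema, streams: List[Stream]) -> List[str]:
    """Return the names of all streams that use the given schema."""
    return [s.name for s in streams if s.schema_uri == schema.uri]


# Hypothetical example: one schema used by two independent streams.
page_issues = Schema(
    uri="/schemas/example/PageIssues/1",
    definition={"type": "object", "properties": {"action": {"type": "string"}}},
)
streams = [
    Stream(name="example.page-issues.mobile", schema_uri=page_issues.uri),
    Stream(name="example.page-issues.desktop", schema_uri=page_issues.uri),
]
```

Here `streams_using(page_issues, streams)` lists both stream names, which is the one-to-many relationship the current EventLogging coupling does not allow.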
NOTE: This task originally described both the Event Schema Repository and the Stream Configuration Service components. Stream Configuration Service has been separated out and moved to {T205319}. See also: https://phabricator.wikimedia.org/T185233#4611779
# User Stories
### MVP
- As an **engineer**, I want to develop new code that uses schemas without committing changes to the production schema registry so that I don't endanger production during development.
- As an **engineer**, I want a queryable (read-only) service API so that I can discover schemas.
- As an **engineer**, I want each schema and schema revision to have a unique ID in the form of a publicly accessible URI.
- As a **data analyst** or **product manager**, I want a canonical place where I can easily draft schema definitions and implementation details in collaboration with product engineers during implementation ([[https://meta.wikimedia.org/w/index.php?title=Schema:PageIssues&action=history|example]]), document and access them once a schema is live, and correct and amend them later as needed.
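As a starting point for discussing the URI and read-only API stories above, here is a rough sketch of what a revision URI and a read-only lookup could look like. The base URL, the `name/revision` path layout and the `ReadOnlySchemaIndex` class are assumptions made for illustration only; no URI format has been decided in this task.

```python
from types import MappingProxyType
from typing import Dict, List


def schema_uri(base_url: str, name: str, revision: int) -> str:
    """Build a URI for one schema revision (hypothetical path layout)."""
    return f"{base_url.rstrip('/')}/{name}/{revision}"


class ReadOnlySchemaIndex:
    """In-memory index that exposes lookups but no way to modify schemas."""

    def __init__(self, schemas: Dict[str, dict]) -> None:
        # Copy the input and wrap it in a read-only view.
        self._schemas = MappingProxyType(dict(schemas))

    def get(self, uri: str) -> dict:
        """Return the schema stored under the given URI (KeyError if absent)."""
        return self._schemas[uri]

    def search(self, term: str) -> List[str]:
        """Return sorted URIs that contain the term, ignoring case."""
        return sorted(u for u in self._schemas if term.lower() in u.lower())


# Hypothetical base URL and schema content.
BASE = "https://schema.example.org"
index = ReadOnlySchemaIndex({
    schema_uri(BASE, "PageIssues", 1): {"type": "object"},
    schema_uri(BASE, "PageIssues", 2): {"type": "object"},
})
```

Keeping the revision in the URI means a consumer that stores the URI always resolves the exact revision it was written against.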
### Future version
- As an **engineer**, I want strict and clear schema policies enforced so that I don't create event data that is difficult for consumers to integrate.
- As an **engineer**, I want schema changes to be enforced as backwards compatible so that I don't break downstream consumers of events.
- As an **analyst**, I want clear analytics schema guidelines and conventions for schema design so that schemas are more consistent, maintainable and easier to collaborate on.
- As an **analyst/engineer**, I want clear analytics schema guidelines and conventions so that integration into analytics datastores and dashboards is easy.
- As an **engineer**, I want to be able to share schemas in development so that others can run and test my code.
- As an **engineer**, I want other Modern Event Platform components to function if the Schema service is offline (via cached schemas, local copies, etc.) so that event systems are reliable and highly available.
- As an **engineer**, I want to be able to reuse and reference schemas from one another using the aforementioned URI ID in order to avoid copy-pasting.
- As an **analyst/product manager**, I want to be able to search through existing schemas to find which data is being collected and how the data is defined in the event system.
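The backwards-compatibility story above could be enforced with a check along these lines. This is only a sketch under an assumed policy: a new revision is treated as compatible if it removes no properties, changes no property types and adds no new required properties. The actual policy is still to be decided, and the function name and rules here are hypothetical. It only inspects top-level `properties` and `required` of a JSONSchema object.

```python
from typing import List


def incompatibilities(old: dict, new: dict) -> List[str]:
    """List the ways `new` breaks compatibility with `old` (assumed policy)."""
    problems = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    # Existing properties must survive with the same type.
    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"removed property: {name}")
        elif spec.get("type") != new_props[name].get("type"):
            problems.append(f"changed type of property: {name}")

    # Newly required properties would reject events from existing producers.
    added_required = set(new.get("required", [])) - set(old.get("required", []))
    for name in sorted(added_required):
        problems.append(f"new required property: {name}")

    return problems


# Hypothetical revisions of one schema.
old = {
    "type": "object",
    "properties": {"action": {"type": "string"}},
    "required": ["action"],
}
compatible = {
    "type": "object",
    "properties": {"action": {"type": "string"}, "page_id": {"type": "integer"}},
    "required": ["action"],
}
breaking = {
    "type": "object",
    "properties": {"action": {"type": "integer"}, "page_id": {"type": "integer"}},
    "required": ["action", "page_id"],
}
```

An empty result from `incompatibilities(old, compatible)` would let the change through, while the non-empty result for `breaking` could be reported back to the schema author before the revision is accepted.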