Modern Event Platform: Schema Registry
Open, Normal · Public · 0 Story Points

Description

This task will be the parent for work to build a schema repository component for the Modern Event Platform program. The working name of this component will be Event Schema Registry.

In T198256: RFC: Modern Event Platform - Choose Schema Tech, it was decided to continue using JSONSchema as we do now. As such, we need to either write new schema registry software, or find one to adapt to our needs. We've collected a lot of requirements and wishes from analysts, engineers and product managers for this component. I'll summarize those as user stories here. We can then discuss how to satisfy those stories in a particular implementation and design.

NOTE: EventLogging schemas are currently coupled with their usage. That is, any given schema can only have one usage (Kafka topic, MySQL/Hive table, etc.). We want to decouple schemas and their usage instances. A stream of events is a single 'usage' of a schema, in that every event in a stream will have the same schema. A schema may be used by multiple streams.
NOTE: This task originally described both the Event Schema Repository and the Stream Configuration Service components. Stream Configuration Service has been separated out and moved to T205319: Modern Event Platform: Stream Configuration. See also: https://phabricator.wikimedia.org/T185233#4611779

User Stories

MVP

  • As an engineer, I want to develop new code that uses schemas without committing changes to the production schema registry so that I don't endanger production during development.
  • As an engineer, I want a queryable (read only) service API so that I can discover schemas
  • As an engineer, I want each schema/(schema revision) to have a unique ID in the form of a publicly accessible URI
  • As a data analyst or product manager, I want a canonical place where I can easily draft schema definitions and implementation details in collaboration with product engineers during implementation (example), document and access them once a schema is live, and correct and amend them later as needed.
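
To make the URI story above concrete, here is a minimal Python sketch of how a schema revision could map to a unique URI. The base URL, the `<title>/<version>` path layout, and the function name `schema_uri` are assumptions for illustration only; the task does not specify a URI format.

```python
from urllib.parse import quote, urljoin

# Hypothetical base URL; the task does not say where schemas would be served.
BASE_URI = "https://schema.example.org/"


def schema_uri(title: str, version: str) -> str:
    """Build a URI that identifies exactly one revision of one schema."""
    # Quote each segment so unusual characters cannot escape the base path.
    path = f"{quote(title, safe='')}/{quote(version, safe='')}"
    return urljoin(BASE_URI, path)


# "resource-change" is a schema name mentioned later in this task.
uri = schema_uri("resource-change", "2")
```

Because the version is part of the path, two revisions of the same schema never share an ID, which is what lets other schemas and clients reference a specific revision.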

Future version

  • As an engineer, I want strict and clear schema policies enforced so that I don't create event data that is difficult for consumer integration.
  • As an engineer, I want enforcement of schema changes to be backwards compatible so that I don't break downstream consumers of events.
  • As an analyst, I want clear analytics schema guidelines and conventions for schema design so that schemas are more consistent, maintainable and easier to collaborate on.
  • As an analyst/engineer, I want clear analytics schema guidelines and conventions so that integration into analytics datastores and dashboards is easy.
  • As an engineer, I want to be able to share schemas in development so that others can run and test my code.
  • As an engineer, I want other Modern Event Platform components to function if the Schema service is offline (via cached schemas, local copies, etc.) so that event systems are reliable and highly available.
  • As an engineer, I want to be able to reuse and reference schemas from one another using the aforementioned URI ID in order to avoid copy-pasting.
  • As an analyst/product manager, I want to be able to search through existing schemas to find which data is being collected and how the data is defined in the event system.
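
As an illustration of the backwards-compatibility story in this list, the following Python sketch shows one possible rule: a new revision may only add optional fields. The rule itself and the function name are assumptions, not something decided in this task, and the check only looks at top-level `properties` of a JSONSchema object.

```python
def is_backwards_compatible(old: dict, new: dict) -> bool:
    """Return True if `new` only adds optional top-level fields to `old`."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    # Every existing field must survive with the same declared type.
    for name, spec in old_props.items():
        if name not in new_props:
            return False
        if new_props[name].get("type") != spec.get("type"):
            return False

    # Newly required fields would invalidate events that were valid
    # under the old revision, so they are rejected.
    if set(new.get("required", [])) - set(old.get("required", [])):
        return False
    return True


old_schema = {"properties": {"user": {"type": "string"}}, "required": ["user"]}
added_optional = {
    "properties": {"user": {"type": "string"}, "action": {"type": "string"}},
    "required": ["user"],
}
removed_field = {"properties": {}, "required": []}
```

A check like this could run in continuous integration, so that an incompatible change is rejected before it reaches consumers. A real implementation would also need to recurse into nested objects.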

Event Timeline

Ottomata triaged this task as Normal priority.Aug 2 2018, 6:19 PM
Ottomata created this task.
  • As an engineer I want each schema/(schema revision) to have a unique ID in the form of a publicly accessible URI
  • As an engineer I want to be able to reuse and reference schemas from one another using the aforementioned ID in order to avoid copy-pasting the code.
Ottomata added a comment.EditedAug 7 2018, 1:39 PM

The features listed here have definitely grown from what I had originally considered when I proposed this program. That's fine and good! But it does make this component more complicated. Not only does it need to host schemas and a little bit of topic config, it needs to be an editable GUI that will dynamically change the behavior of remote clients (e.g. sampling rate, user bucketing, etc.). I think we need to separate some of these uses into two categories:

  • MVP - what is required for internal/production.
  • 1.0 - what is required for product analytics.

I think we can build an MVP that is usable for internal/production in the short term, while still planning for and building the full 1.0 within this annual program.

We also need to make some technical decisions on how this service will be built. It is really just a website GUI and API (highly available for remote clients). Many of the features of the GUI are like a wiki (editable schemas and configs with versioned history), so it might make sense to extend or rebuild the existing meta.wikimedia.org schema repository. On the other hand, it would be nice to decouple this from MediaWiki and make it a more general open-source project. There's certainly desire for this in the larger Kafka/streaming community.

Ottomata added a comment.EditedAug 7 2018, 1:40 PM

it needs to be an editable GUI that will dynamically change the behavior of remote clients

Hm, it also might make sense to separate the schema registry and the metadata and schema->topic mapping into different services. Needs more discussion.

Ottomata updated the task description. (Show Details)Aug 9 2018, 4:59 PM

Dan and I had been contemplating some implementation details around this and the event intake components. We were struggling with some conflicts between a few use cases, and so today had a meeting with Petr, Marko, Sam Smith and Joaquin to discuss.

Currently, the EventLogging meta.wikimedia.org based schema repository is centralized. There is a single installation, and all code in all environments is expected to use schemas in this centralized location. This allows for schemas to be editable in a GUI. They also then don't have to be synced anywhere else.

However, this means that all uses need to have their schemas created or modified in 'production', even if only under development. It also tightly couples production services to meta.wikimedia.org. When designing EventBus years ago, we decided to use a git repository to manage and distribute schemas. This got us the same code review and continuous integration processes we use for all other development. It also means that development environments like MW-Vagrant can clone the repository locally for development purposes.

Our meeting today was about schema and metadata storage possibilities around these two different options: centralized database vs. decentralized git repository. We can solve all use cases using the git repository except for one: GUI editable schemas. In the meeting today we all decided that as long as we can drop that one use case, using a git repository made the most sense. We'd still build GUI and API services that assist in schema and metadata browsing, searching, etc, but editing this data would have to be done via git.
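
As a rough illustration of what reading schemas from a local clone could look like, here is a minimal Python sketch. The directory layout (`<repo_root>/<title>/<version>.json`) and the function name `load_schema` are assumptions; the thread does not define how the repository would be organized.

```python
import json
import tempfile
from pathlib import Path


def load_schema(repo_root: Path, title: str, version: str) -> dict:
    """Read one schema revision from a local checkout of the schema repository."""
    # Assumed layout: <repo_root>/<title>/<version>.json
    schema_path = repo_root / title / f"{version}.json"
    with schema_path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Demo: a temporary directory stands in for a cloned repository.
with tempfile.TemporaryDirectory() as checkout:
    root = Path(checkout)
    (root / "resource-change").mkdir()
    (root / "resource-change" / "1.json").write_text(
        json.dumps({"title": "resource-change", "type": "object"}),
        encoding="utf-8",
    )
    loaded = load_schema(root, "resource-change", "1")
```

With this approach a development environment needs only a clone of the repository, and nothing has to be created in production while a schema is still being drafted.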

I need to check with some other product owners to make sure this will be OK, but given the above, we'll plan to use git for schema storage, and possibly for metadata storage too.

@JMinor Adam suggested I ask you about ^. Feel free to respond here or schedule a meeting with me and/or Dan. :)

Our meeting today was about schema and metadata storage possibilities around these two different options: centralized database vs. decentralized git repository. We can solve all use cases using the git repository except for one: GUI editable schemas. In the meeting today we all decided that as long as we can drop that one use case, using a git repository made the most sense. We'd still build GUI and API services that assist in schema and metadata browsing, searching, etc, but editing this data would have to be done via git.

FWIW I'm a fan of this solution for authoritative data

+1 also on using git for editing

  • As an analyst, I want clear analytics schema guidelines and conventions for schema design so that schemas are more consistent, maintainable and easier to collaborate on.

How is this blocked on implementing a (new) schema registry? Isn't it more important to actually write up and document those guidelines (e.g. by moving https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines out of draft status, which, speaking from recent experience on T202751, would be valuable)?

  • As a product manager/analyst/engineer, I want to set the privacy whitelist settings of schema topic usage event fields so that I can retain non-PII data for longer than 90 days.

Here too we should be clear that this is already possible and widely practiced in the current system.

  • As an analyst, I want to know the schema, sampling, and other metadata settings that an event was emitted with so that I can account for these changes in analysis.

The schema is already recorded with all events. Recording the sampling rate and sampling method with every event is an important idea for which I can see various benefits (and some downsides), but how would it depend on the new schema registry? Wouldn't it actually rather be about moving some kind of information out of the central schema documentation and into the individual event logs?

  • As a product manager/analyst/engineer, I want to set and discover the ownership of schemas and schema topic usages so I can track governance over time and know when a schema topics usage can be decommissioned.

The SchemaDoc template currently in use for the Meta-wiki based system does exactly that already for ownership. One simply consults the talk page of the schema (example) and looks for the "Maintainers" field.

@Ottomata I've taken a first pass look, but will be doing a more thorough comment based on the Better Use of Data stories in the next week. This is a great start, though, and much of it is already well covered...

Ottomata added a comment.EditedAug 29 2018, 8:23 PM

How is this blocked on implementing a (new) schema registry?

It isn't totally blocked, but there may be new implementation details about things like JSON $ref pointers and/or using meta sub-objects instead of the EventCapsule. Many of the analyst driven conventions could work for both existing EventLogging schemas and the new system, but ones that have to do with backend auto-ingestion and aggregation are informed by the tech choices we make.

Wouldn't it actually rather be about moving some kind of information out of the central schema documentation and into the individual event logs?

Ya it might be!

The SchemaDoc template currently in use for the Meta-wiki based system does exactly that already for ownership. One simply consults the talk page of the schema (example) and looks for the "Maintainers" field.

A schema does not map one to one with a schema-usage. It is one to many. So while we need to track ownership/authorship of schemas, we also need to track ownership of schema-usage.

In general, it's good that we track these requirements as we think about a new system, even if some are already satisfied in the current system.

On a general note:

The future schema registry (or currently the schema pages system on Meta-wiki) is a very important interface between engineers on the one hand (both analytics/backend and product/frontend) and data users (data analysts, PMs) on the other hand.

Yet the task here and the surrounding discussion seem to be written and conducted almost exclusively by engineers. I see some user stories for the second group in the task, but 1.) these don't necessarily correspond to the most important needs, 2.) as discussed above, much of these just describe current practice and capabilities and don't support the need for a switch to a new platform 3.) they all seem to be relegated to the "Future version", with the MVP focusing on the engineering studies.

Ottomata renamed this task from Modern Event Platform: Schema Registry to Modern Event Platform: Schema Registry + Schema Usage Metadata Configuration Service.Aug 30 2018, 5:25 PM

1.) these don't necessarily correspond to the most important needs
2.) as discussed above, much of these just describe current practice and capabilities and don't support the need for a switch to a new platform
3.) they all seem to be relegated to the "Future version", with the MVP focusing on the engineering studies

The larger picture does justify it though. The important needs that you see are from an analyst perspective, and indeed, the ones we see are from an engineering perspective. That is why we spent all of Q4 2018 interviewing engineers, analysts, researchers and product managers. The user stories were collected from those interviews.

However, in order to support the needs of all parties, we need to first build the foundation of a new system. That is why the MVP doesn't include many of the fancy user stories in it. Those are easier features to build once the core of the system (storage, APIs, etc.) are in place.

In yesterday's meeting for T201643: RFC: Modern Event Platform: Schema Registry, we decided to split out the Schema Usage Configuration Service from the Schema Registry component, at least for RFC purposes. Before I make an RFC for a configuration service, I'd like to have a little more back and forth discussion here (perhaps a meeting?) to flesh it out a bit more.

The Schema Usage Configuration Service needs to do the following:

  • Configure particular usages of a schema. This includes things like
    • ownership e.g. Discovery team owns topic search-index-resource-change (which uses schema resource-change).
    • time bounded usage e.g. user-click-experiments should only send events between Sept. 1 and Sept. 30 2018.
    • sampling settings e.g. user-click-experiments should group logged in users into 100 buckets and only send events from users in the first bucket.
    • retention settings e.g. user-click-experiments should purge the username but keep all other fields after 90 days
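
The settings listed above could be modelled roughly as follows. This is only a Python sketch: the class name, field names, and the `user-click` schema name are hypothetical, and the examples reuse the topics mentioned in the list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class SchemaUsage:
    """One usage (topic) of a schema; several usages may share a schema."""

    topic: str
    schema: str
    owner: Optional[str] = None
    active_from: Optional[date] = None
    active_until: Optional[date] = None
    bucket_count: int = 1      # users are grouped into this many buckets
    buckets_sampled: int = 1   # events are sent from the first N buckets
    retention_days: int = 90   # after this, fields in purged_fields are dropped
    purged_fields: Tuple[str, ...] = ()

    def is_active(self, today: date) -> bool:
        """Return True if events may be sent on the given day."""
        if self.active_from and today < self.active_from:
            return False
        if self.active_until and today > self.active_until:
            return False
        return True


search = SchemaUsage(
    topic="search-index-resource-change",
    schema="resource-change",
    owner="Discovery",
)
experiment = SchemaUsage(
    topic="user-click-experiments",
    schema="user-click",  # hypothetical schema name, not given in the task
    active_from=date(2018, 9, 1),
    active_until=date(2018, 9, 30),
    bucket_count=100,
    buckets_sampled=1,
    purged_fields=("username",),
)
```

Keeping the schema name as a plain field on the usage record is what allows one schema to be referenced by many usages.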

With the exception of topic -> schema mapping, and possibly ownership, product folks want to be able to easily change the schema usage configuration dynamically, without having to do a SWAT or MW train deploy. They don't necessarily want these things to be editable in a GUI, but they do want their engineers/analysts to be able to change these settings at will. For example, if a sampling rate is changed, say from 1/1000 to 1/100, clients should start sampling differently.
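
One way a client could apply such a sampling setting is deterministic bucketing, sketched below in Python. The hashing scheme and the function names are assumptions for illustration; the thread does not say how clients would assign buckets.

```python
import hashlib


def bucket_for(user_id: str, bucket_count: int) -> int:
    """Map a user to a stable bucket in the range [0, bucket_count)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count


def should_send(user_id: str, bucket_count: int, buckets_sampled: int) -> bool:
    """Send events only for users who fall in the first `buckets_sampled` buckets."""
    return bucket_for(user_id, bucket_count) < buckets_sampled
```

Under this sketch, moving from 1/1000 to 1/100 means changing `bucket_count` from 1000 to 100 while keeping `buckets_sampled` at 1. Because the bucket is derived from the user ID, a given user gets the same answer on every pageview for as long as the configuration is unchanged.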

The decentralization we get from using git for schema storage will help a lot with development use cases. However, it might not be as necessary to have a decentralized storage for this type of configuration. I could see a centralized configuration storage database/service where these things are modified. I could also see all of this configuration living in git. Either would be fine. In either case there will need to be a read-only GUI that allows product managers to know what e.g. sampling settings are at any given time.

@Krinkle yesterday you had some thoughts about this service. I know you were worried about forcing clients to phone home, but I think that could be optional. Not every schema-usage will need to have its clients dynamically configured. $wgEventLoggingPhoneHome = false :) Also, perhaps MW (or whatever) could do the phoning-home to get configuration when rendering the page load response, instead of having the client send a separate request via Javascript later?

Nuria added a comment.Aug 30 2018, 7:16 PM

Also, perhaps MW (or whatever) could do the phoning-home to get configuration when rendering the page load response, instead of having the client send a separate request via Javascript later?

Right, so, say "schema settings" (such as "sampling") could be bootstrapped on the page upon rendering, as the MW code will retrieve those from, say, storage X as the page is composed. Once those settings are "bootstrapped" they are used on that pageview as needed. A subsequent pageview (full rendering) will update those if needed. (I'm simplifying here, as not every pageview can hit storage every time, but the point is that retrieval happens upon page composition inside the document.)

I think this is similar to how mw.config works now, other than that mw.config is backed by git and delivered via a JavaScript request. The "schema settings" can also be delivered in this way, the difference is that as envisioned on this system they are not being edited in git. Also, of course, these two ideas have the drawback of adding weight to the first page load, and it is probably the case that you have been trying to reduce page load forever now. Another idea is to send these "schema settings" back, piggybacking on the load requests (or other), in a custom header reserved for this purpose like "WMF-Schema-Settings:<some-json-here>", where WMF-Schema-Settings is the custom header. That way we save on page weight and round trips.
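
A minimal Python sketch of the custom-header idea follows. The header name comes from the comment above; the shape of the settings payload and the function names are assumptions for illustration.

```python
import json

HEADER_NAME = "WMF-Schema-Settings"  # header name taken from the comment above


def encode_settings(settings: dict) -> tuple:
    """Serialize settings as compact JSON suitable for a single header value."""
    # ensure_ascii keeps the value within the ASCII range headers expect.
    value = json.dumps(
        settings, separators=(",", ":"), ensure_ascii=True, sort_keys=True
    )
    return HEADER_NAME, value


def decode_settings(headers: dict) -> dict:
    """Parse settings from response headers; return {} when the header is absent."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(HEADER_NAME.lower())
    return json.loads(raw) if raw else {}


# Hypothetical payload: per-topic sampling settings.
name, value = encode_settings({"user-click-experiments": {"sampling": "1/100"}})
settings = decode_settings({name.lower(): value})
```

Header size limits vary between servers and proxies, so this approach would suit a small number of settings per response rather than the full configuration.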

the difference is that as envisioned on this system they are not being edited in git.

Not necessarily, they could still be in git, just not mw-config for SWAT deployment.

1.) these don't necessarily correspond to the most important needs
2.) as discussed above, much of these just describe current practice and capabilities and don't support the need for a switch to a new platform
3.) they all seem to be relegated to the "Future version", with the MVP focusing on the engineering studies

The larger picture does justify it though.

I wasn't commenting on how to ultimately assess the tradeoffs between what engineers want and what data users want - simply saying that the latter needs to be evaluated too. If after doing that there would be a collective decision in the end saying that the former has to be prioritized over the latter for the time being, I would obviously not be happy either, but at least it would be an informed decision.

The important needs that you see are from an analyst perspective, and indeed, the ones we see are from an engineering perspective. That is why we spent all of Q4 2018 interviewing engineers, analysts, researchers and product managers. The user stories were collected from those interviews.

I appreciate that these interviews were being conducted, but as mentioned in last week's IRC meeting: I participated too and recall them being about the proposed new platform and EventLogging in general - without discussing user stories about the existing on-wiki systems or requirements for a schema registry, and certainly without evaluating the concrete proposals that are now on the table. In any case, I haven't seen any data users (analysts or PMs) in the discussion here or in the RfC meeting (apart from Josh saying that he will weigh in later), only engineers. That's a clear indicator that the present approach hasn't been working for the purpose of making sure data users' needs are represented.

However, in order to support the needs of all parties, we need to first build the foundation of a new system. That is why the MVP doesn't include many of the fancy user stories in it. Those are easier features to build once the core of the system (storage, APIs, etc.) are in place.

But the present decisions here surely shape what will be possible afterwards.

In any case, I haven't seen any data users (analysts or PMs) in the discussion here or in the RfC meeting (apart from Josh saying that he will weigh in later), only engineers. That's a clear indicator that the present approach hasn't been working for the purpose of making sure data users' needs are represented.

Not necessarily. I've been subscribed to this task since the beginning, and I haven't commented mainly because the proposal generally seemed sensible to me and I didn't see anything important I could add.

This isn't to say your concerns aren't valid. You've worked with EventLogging a lot more than me, and you're doing the rest of us analysts a service by bringing that experience to the discussion here.

But I would really suggest you focus more on the substance of the proposal and less on the process. Or perhaps you could propose concrete steps: for example, perhaps @Ottomata could commit to presenting the plans to the Product Analytics team before they're final?

Neil_P._Quinn_WMF added a comment.EditedSep 6 2018, 12:20 AM

Actually, @Ottomata, I did think of one additional use case:

  • As a data analyst, I want a canonical place where I can easily document schema definitions and implementation details.

There are a lot of random implementation details that crop up around a complex schema. For example, with the Edit schema:

  • The visual editor logs an init event, but the 2010 wikitext editor doesn't (see T203619—I don't understand why quite yet but you bet I'll want to document it once I do!)
  • The mobile editors generally won't log abort events because, unlike on desktop, the onunload handler isn't reliable.
  • We determine whether the platform is phone or desktop based on the skin used, not on any fact about the device.

Without a canonical place to document these, anyone analyzing the data will essentially have to discover these from scratch. And ideally, this canonical place won't be comments in a git-tracked schema file. Code files aren't good places for long-form documentation, and the write-commit-review workflow raises the barrier enough that most of the time people just won't bother to document things they learn.

A wiki page could be a good place, as long as there's a canonical page for each schema and as long as that page is discoverable from the schema registry and config repository. But other options might be possible.

Ya makes a lot of sense. In addition to high level type documentation you are mentioning, the Better Use of Data Working group is calling out a need for what they are calling a 'data dictionary'. This might also end up being wiki documentation with good conventions, but more likely it will end up being something more queryable. I agree it makes sense to document these things in a canonical place separate from the schemas, especially since the schemas will no longer be mapped directly to their usages.

There's a larger concept of schema-usage metadata and configuration here, which ties in a lot to schema governance. Some of the feedback I got during the Schema Registry RFC meeting was that I needed to separate out the metadata/configuration use from the schema storage itself. That's been done, and now the RFC task is all about canonical storage of schemas, but not at all about how those schemas are browsed, discovered, configured for use (sampling, topic mapping, etc.), or documented. Who knows, that may very well still be on a wiki (with read-only schemas).

Ottomata updated the task description. (Show Details)Sep 6 2018, 12:33 AM
Tbayer updated the task description. (Show Details)Sep 17 2018, 3:23 PM

In any case, I haven't seen any data users (analysts or PMs) in the discussion here or in the RfC meeting (apart from Josh saying that he will weigh in later), only engineers. That's a clear indicator that the present approach hasn't been working for the purpose of making sure data users' needs are represented.

Not necessarily. I've been subscribed to this task since the beginning, and I haven't commented mainly because the proposal generally seemed sensible to me and I didn't see anything important I could add.

I see that changed shortly afterwards (T201063#4561468) ;) Thanks for adding this important use case, I just tried to flesh it out a bit more.

This isn't to say your concerns aren't valid. You've worked with EventLogging a lot more than me, and you're doing the rest of us analysts a service by bringing that experience to the discussion here.
But I would really suggest you focus more on the substance of the proposal and less on the process.

The process clearly shapes the outcome. I appreciate @Ottomata's willingness to listen to your and my input, but the fact of the matter is that you and I can't speak for all users either (at least not without some more thorough product research that normally is done by people driving such a project themselves). E.g. I understand @JMinor is still going to provide substantial input from the PM perspective (T201063#4543519). And the new instrumentation DACI envisages several other kinds of people who will need to access and review schemas.

Or perhaps you could propose concrete steps: for example, perhaps @Ottomata could commit to presenting the plans to the Product Analytics team before they're final?

That might be a good idea! @Ottomata, any thoughts on this?

@Ottomata could commit to presenting the plans to the Product Analytics team before they're final?

Sure! Can you invite me to the Product Analytics team meeting next Tuesday the 25th?

As a data analyst or product manager, I want a canonical place where I can easily draft schema definitions and implementation details in collaboration with product engineers during implementation (example), document and access them once a schema is live, and correct and amend them later as needed.

Hm, I'd split this up into different use cases. How about:

  • As a data analyst, I want a place where I can easily draft schema definitions and implementation details in collaboration with product engineers during implementation

and Neil's original:

  • As a data analyst/product manager, I want a canonical place where I can easily document schema definitions and implementation details.

I left out the product manager user in the one about editing/drafting schemas, only because the few I've talked to haven't had the need to edit them. We can ask around and see if I missed that and there is actually a use case for them here.

Sure! Can you invite me to the Product Analytics team meeting next Tuesday the 25th?

Oh and invite @Milimetric too plz :)

Sure! Can you invite me to the Product Analytics team meeting next Tuesday the 25th?

Oh and invite @Milimetric too plz :)

Done and done! Thanks for taking the time 😁

Ottomata renamed this task from Modern Event Platform: Schema Registry + Schema Usage Metadata Configuration Service to Modern Event Platform: Event Schema Repository.Sep 24 2018, 6:29 PM
Ottomata renamed this task from Modern Event Platform: Event Schema Repository to Modern Event Platform: Schema Registry.
Ottomata renamed this task from Modern Event Platform: Schema Registry to Modern Event Platform: Event Schema Registry.
Ottomata updated the task description. (Show Details)
Ottomata updated the task description. (Show Details)
Tbayer added a comment.EditedSep 25 2018, 6:34 PM

As a data analyst or product manager, I want a canonical place where I can easily draft schema definitions and implementation details in collaboration with product engineers during implementation (example), document and access them once a schema is live, and correct and amend them later as needed.

Hm, I'd split this up into different use cases. How about:

  • As a data analyst, I want a place where I can easily draft schema definitions and implementation details in collaboration with product engineers during implementation

and Neil's original:

  • As a data analyst/product manager, I want a canonical place where I can easily document schema definitions and implementation details.

They seemed to be closely connected to me, also because of the "correct and amend them later as needed" part. But we can split them if you prefer.

I left out the product manager user in the one about editing/drafting schemas, only because the few I've talked to haven't had the need to edit them. We can ask around and see if I missed that and there is actually a use case for them here.

Following up on the quick check earlier about who had actually edited schemas in the last 30 days (T201643#4560117: 3 data analysts, 1 product manager, 1 developer), I ran a larger query for all users from the last 12 months.

While engineers and data analysts seem to form the majority, I see several product managers and also folks like @leila who have a different job title but may have been acting in a similar capacity here.

Source: https://quarry.wmflabs.org/query/29946
Includes talk page edits. Some duplication because of people using both their staff and volunteer accounts

mpopov added a subscriber: mpopov.

Added per our meeting:

As an analyst or product manager I want to be able to search through existing schemas to find which data is being collected and how the data is defined in the event system.

Ottomata renamed this task from Modern Event Platform: Event Schema Registry to Modern Event Platform: Schema Registry.Oct 24 2018, 5:34 PM
Ottomata moved this task from Backlog to Parent Tasks on the Event-Platform board.Dec 5 2018, 10:06 PM
JMinor removed a subscriber: JMinor.Feb 6 2019, 9:28 PM
Ottomata edited projects, added Analytics; removed Analytics-Kanban.Mar 4 2019, 4:59 PM
Ottomata updated the task description. (Show Details)Jul 1 2019, 5:59 PM
Ottomata updated the task description. (Show Details)

While I would love to argue for a .NET deployment, so me and everyone I love can enjoy programming again, what do we need from schemastore? I didn't think we needed any fancy features outside of Stream Config