
Figure out how to $ref common schema across schema repositories
Open, HighPublic

Description

We are mostly ready and able to create new schema repositories for uses other than production mediawiki events. However, while writing some documentation, I realized that we need to figure out how e.g. the analytics event schema repository will include the /common schema, currently defined in the mediawiki/event-schemas repository.

Thus far, $refs in mediawiki/event-schemas are all confined to the local schema repository. When e.g. mediawiki/revision/create/current.yaml is materialized, the $refs it contains are resolved relative to the configured schemaBaseUris, which defaults to the configured schemaBasePath, which is just the local working copy of the repository.

If we want a new 'analytics/event-schemas' (or whatever) repository to $ref: /common/1.0.0, we'll need a way for jsonschema-tools to resolve that $ref.

Ideas:

  • Remote common: $ref: https://schemas.wikimedia.org/repositories/mediawiki/jsonschema/common/1.0.0. This works, but kinda sucks because it centralizes the common schema. The common schema MUST be resolvable over http when dependent schemas are materialized.
  • Cached common: Create a /common/1.0.0 in analytics/event-schemas that has just $ref: https://schemas.wikimedia.org/repositories/mediawiki/jsonschema/common/1.0.0. This is slightly better in that other analytics/event-schemas schemas can then use a local (cached) $ref /common/1.0.0, but if any changes need to be made to the canonical common schema, it must be done in the remote repository, and then the local /common/1.0.0 must be materialized.
  • git submodule: Use a git submodule to checkout a common schema repository and use it in local $refs. This might be nice if we can get the directory hierarchy to be consistent in all schema repositories, so the resolved $ref paths work the same in all of them.
  • npm dependency: Event schema repositories would use npm package.json to specify a dependency on the common schema repositories they want to use. We could make postinstall symlink the node_modules/wikimedia-common-schemas/jsonschema directory to something like ./repositories/wikimedia-common/jsonschema/, and then include that path in schemaBaseUris.
  • copy/paste: Just copy/paste the /common schema into each schema repository. This might be the simplest solution. We'd still have greatly reduced copy/paste within any given repository; the common schema would only be duplicated once per repository. This might be confusing for our relative $ref URIs though: /common/1.0.0 in analytics/event-schemas would not be exactly the same as /common/1.0.0 in mediawiki/event-schemas. A pro AND con of this is that each schema repository could define its own common schema. Perhaps this would be more flexible anyway. We could still add some tests (perhaps via jsonschema-tools) to ensure that the minimal common fields exist ($schema, meta.dt, meta.id, meta.stream).
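The field check mentioned in the copy/paste idea could look roughly like the sketch below. It is only an illustration: `missing_common_fields` and `REQUIRED_COMMON_FIELDS` are hypothetical names, not part of jsonschema-tools, and the sketch assumes it is run against an already materialized schema (all $refs inlined) loaded as a plain dict.

```python
from typing import Any, Dict, List

# Minimal common fields named in the task description.
REQUIRED_COMMON_FIELDS = ["$schema", "meta.dt", "meta.id", "meta.stream"]


def missing_common_fields(schema: Dict[str, Any]) -> List[str]:
    """Return the dotted field names that the schema does not declare.

    Hypothetical helper: walks nested "properties" objects only, so it
    assumes $refs have already been resolved.
    """
    missing = []
    for dotted in REQUIRED_COMMON_FIELDS:
        node: Any = schema
        for part in dotted.split("."):
            properties = node.get("properties", {}) if isinstance(node, dict) else {}
            if part not in properties:
                missing.append(dotted)
                break
            node = properties[part]
    return missing


# Example: a schema that forgot meta.id.
example = {
    "properties": {
        "$schema": {"type": "string"},
        "meta": {
            "type": "object",
            "properties": {"dt": {"type": "string"}, "stream": {"type": "string"}},
        },
    }
}
print(missing_common_fields(example))  # ['meta.id']
```

A per-repository test could then fail whenever this list is non-empty, which would let each repository keep its own common schema while still guaranteeing the shared minimum.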

Event Timeline

Ottomata created this task.Fri, Sep 20, 3:49 PM
Ottomata renamed this task from Figure out how to $ref common schema across schema repositorise to Figure out how to $ref common schema across schema repositories.Fri, Sep 20, 3:49 PM
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
Ottomata moved this task from Backlog to Next Up on the Event-Platform board.Mon, Sep 23, 2:58 PM
Ottomata triaged this task as High priority.Mon, Sep 23, 3:26 PM
Ottomata moved this task from Next Up to In Progress on the Event-Platform board.Mon, Sep 23, 6:18 PM

I think using git submodules would be the least confusing. In all the other options, there is some 'duplication' of either materialized version files, or even worse of current.yaml schemas.

If we did submodules, it'd be nice if we could have all event schemas use the same local URIs for common schemas. I think we can accomplish this with an extra URI in schemaBaseUris in event schema repositories. For example:

Let's say we create a new schema repository: wikimedia-common-event-schemas. Let's say it has the schema /wikimedia/common/1.0.0 at jsonschema/wikimedia/common/1.0.0.

We add the common git submodule in mediawiki/event-schemas at jsonschema/repositories/wikimedia. So the actual path to wikimedia/common schemas inside of mediawiki/event-schemas is ./jsonschema/repositories/wikimedia/jsonschema/wikimedia/common/1.0.0.

In .jsonschema-tools.yaml, we configure schemaBaseUris: [./jsonschema, ./jsonschema/repositories/wikimedia/jsonschema]. Then, when jsonschema-tools attempts to resolve $ref: /wikimedia/common/1.0.0, it will look inside of the checked out submodule.
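To make the lookup order concrete, here is a minimal sketch of resolving a $ref against an ordered list of base paths. It is not how jsonschema-tools is actually implemented: `resolve_ref` is a hypothetical helper, it handles local directories only, and the assumption that a materialized version may exist with no extension or with a `.yaml` or `.json` extension is mine.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_ref(ref: str, schema_base_uris: Sequence[str]) -> Optional[Path]:
    """Return the first existing file for `ref` under the given base paths.

    Hypothetical helper: bases are tried in the order they are listed,
    so the local repository wins over the submodule if both define a path.
    """
    relative = ref.lstrip("/")
    for base in schema_base_uris:
        # Assumption: a version file may be stored with or without an extension.
        for suffix in ("", ".yaml", ".json"):
            candidate = Path(base) / (relative + suffix)
            if candidate.is_file():
                return candidate
    return None


# The two base paths from the example above.
SCHEMA_BASE_URIS = [
    "./jsonschema",
    "./jsonschema/repositories/wikimedia/jsonschema",
]
```

With this ordering, `resolve_ref("/wikimedia/common/1.0.0", SCHEMA_BASE_URIS)` would miss in ./jsonschema and then find the file inside the checked out submodule, while a $ref to a version that has not been pulled yet would return `None`.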

I think this works and reduces the duplication of actual schema files in multiple git repositories.

@Pchelolo whatcha think?

Ottomata updated the task description. Tue, Sep 24, 3:35 PM

Added another idea to the task description: npm dependency. This would be effectively the same as a git submodule, in that to update the version of the common schema repository, you have to make a git commit to change the SHA or version checked out. I.e. package.json's dependency would have "wikimedia-common-event-schemas": "<git-url>@sha" (or "^X.Y.Z" if we published to npm).

I think this idea is functionally equivalent to the git submodule idea; it just uses npm rather than git submodules directly to clone the dependency repository. I'm not sure which of the two is the least confusing though.

I agree the npm and submodule ideas are the best two. I prefer the submodule idea, after working through what I think are likely scenarios:

  1. using npm
    • user does $ref: /wikimedia/common/1.3.0
    • jsonschema-tools fails on git add because it can't find that file
    • remember to update npm package or read docs to re-learn
    • annoyingly long amount of time remembering how to update just one dependency if other npm dependencies aren't pinned
  2. using submodule
    • much more likely to have the most recent version because people keep their repository up to date
    • user does $ref: /wikimedia/common/1.3.0 and worst case it's not there
    • git submodule update --init

It just seems more natural and easier to me, because the schemas are more content than code: their versions might change quickly sometimes and slowly sometimes, and you almost always want the latest. If they were in npm, I would expect to be able to pin a version of them in package.json and be fine for a year or so. So, my vote is for submodules.

Ok, submodules it is.

Next question: Should we create a new 'common wikimedia' schema repository and use it from both mediawiki/event-schemas and from any newly created repos? Or, should we just say that mediawiki/event-schemas is our 'main' repo and use it as a submodule in other repos?

Talked with @Milimetric today, we concluded that a new 'common' (to-be-named) repository that can be shared by both mediawiki/event-schemas and analytics/event-schemas makes sense.

This common repository would include the definition of the current common schema (with meta and $schema), but could also be used as a 'data dictionary', where commonly used fields like page_title or rev_id could be defined and $refed. In this way, not only can common sub schemas be included, but individual fields themselves can be defined and reused (if desired).
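As a rough illustration of the 'data dictionary' idea, the sketch below inlines $refs to individual field definitions. Everything in it is made up for the example: the `/wikimedia/fields/...` URIs, the field definitions, and the `inline_refs` helper are hypothetical, and the dictionary lives in memory rather than in a schema repository.

```python
import copy
from typing import Any, Dict

# Hypothetical data dictionary: URIs and definitions are illustrative only.
FIELD_DEFINITIONS: Dict[str, Dict[str, Any]] = {
    "/wikimedia/fields/page_title/1.0.0": {
        "type": "string",
        "description": "Normalized title of the page",
    },
    "/wikimedia/fields/rev_id/1.0.0": {
        "type": "integer",
        "description": "Revision id",
    },
}


def inline_refs(node: Any, definitions: Dict[str, Dict[str, Any]]) -> Any:
    """Return a copy of `node` with every {"$ref": uri} replaced by its definition.

    In this sketch, sibling keys next to "$ref" are dropped and an unknown
    URI raises KeyError.
    """
    if isinstance(node, dict):
        if "$ref" in node:
            return inline_refs(copy.deepcopy(definitions[node["$ref"]]), definitions)
        return {key: inline_refs(value, definitions) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(item, definitions) for item in node]
    return node


# An event schema that reuses both fields instead of redefining them.
event_schema = {
    "type": "object",
    "properties": {
        "page_title": {"$ref": "/wikimedia/fields/page_title/1.0.0"},
        "rev_id": {"$ref": "/wikimedia/fields/rev_id/1.0.0"},
    },
}
materialized = inline_refs(event_schema, FIELD_DEFINITIONS)
```

After inlining, `materialized["properties"]["rev_id"]` carries the shared type and description, so every schema that $refs the field stays consistent with the dictionary.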

As such, this common schema repository won't be just for event (platform) metadata like meta.dt and $schema, but also serve as a place for commonly used mediawiki & wikimedia field definitions.

Ok, time for naming bike shed!

We need to name both the new 'common' schema repository, as well as a new 'analytics' schema repository. 'common' and 'analytics' are working names, not necessarily to be used as the actual names.

The only existing repository is specifically for 'mediawiki' event schemas, and as such it is named mediawiki/event-schemas. Should we try to keep this repo namespacing convention? E.g. analytics/event-schemas, common/event-schemas or maybe better wikimedia-common/event-schemas? Or, should we make a new namespacing convention for schema repositories, like event-schemas/analytics, event-schemas/common? I think I prefer keeping all schema repositories in the same namespace.


First the 'common' bikeshed:
This repo will contain the common event schema. Additionally, it may also include other common pieces like the http field schema (e.g. in mediawiki/api/request). This repo might also be usable as a 'data dictionary' with canonical definitions of common fields like 'page_title'. I'm not sure if these would belong in an uber common repo or not, but it might get cumbersome to later on add another 'data dictionary' schema repository to contain such fields, if we decided we wanted to use JSONSchema fields for the data dictionary.

Name ideas for common repo:

  • event-schemas/common
  • event-schemas/wikimedia
  • event-schemas/wikimedia-common

Bikeshedding the name of the common repo will inform the choice of names for the 'analytics' schema repo. Let's keep that in mind but bikeshed that one after this one.

I like a single namespace, especially because having "common" as a root would be too vague. This might be useful:

  • schemas/event/common
  • schemas/event/wikimedia (better name than "mediawiki" for our schemas, in my opinion)
  • schemas/event/analytics

> schemas/event/wikimedia (better name than "mediawiki" for our schemas, in my opinion)

Well, kinda? The schemas themselves do represent mediawiki specific events here.

Also, we will need to find a home for stuff like https://gerrit.wikimedia.org/r/c/mediawiki/event-schemas/+/541563

This is a non mediawiki and non analytics type event! I think for now this could go in the analytics schema repo since it will be mostly used for that purpose, but I could see in the future someone wanting to emit some non mediawiki type event for use with a production service feature.

I wouldn't worry too much about how others are using this in their own mediawiki installs. Not that it's not important, just that we can't possibly guess as to how they might want to do that. Just having three repos with some common stuff will allow for plenty of flexibility and refactoring later on.

I care more about putting schemas all in one place (like schemas/event or event-schemas/ or similar) so we can easily find them than I do about the specific repo names.

Re the 'mediawiki schemas' vs 'wikimedia-schemas', I kind of agree with you, but I think @Pchelolo does not! They are used by change prop to update RESTbase stuff, and they wanted all that stuff to be usable outside of Wikimedia. :p

> I care more about putting schemas all in one place

I agree, but I think it is unlikely that we will rename mediawiki/event-schemas. Hm. Or could we?