
Modern Event Platform: Schema Registry: Implementation
Open, Medium, Public, 13 Story Points

Description

Ticket proliferation disambiguation!

This ticket will be used to track and task implementation work for the Schema Registry.

Description

Since we are moving forward with git as the canonical storage of schemas, we can base implementation to be done for Q2 2018-2019 on the existing event-schemas repository. This repository currently contains Draft 4 JSON schemas with some minimal CI jobs to ensure schema consistency. Implementation work for this task will mostly be around git commit/merge hooks and CI improvements.

We also may want to build an HTTP service to serve schemas. If so, this service might be as simple as just an HTTP file server that exposes the git repository (or repositories) hierarchy and schemas.

In either case, schemas will always be addressable via URIs, whether those schemas are checked out on the local filesystem (file://) or via HTTP (http://).
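
As a rough illustration of the URI addressing described above, the sketch below resolves the same relative schema URI against a local checkout and against an HTTP base. Both base URIs and the helper name are hypothetical and only for illustration; they are not actual deployment values.

```python
def resolve_schema_uri(base_uri: str, relative_uri: str) -> str:
    """Join a relative schema URI onto a base URI (file:// or http://)."""
    # Normalize slashes so exactly one separates the base from the relative path.
    return base_uri.rstrip("/") + "/" + relative_uri.lstrip("/")


# Hypothetical base URIs: a local git checkout and an HTTP file server.
LOCAL_BASE = "file:///srv/event-schemas/jsonschema"
REMOTE_BASE = "http://schema.example.org/event-schemas/jsonschema/"

relative = "/mediawiki/revision/create/1.0.0"
local_uri = resolve_schema_uri(LOCAL_BASE, relative)
remote_uri = resolve_schema_uri(REMOTE_BASE, relative)
print(local_uri)
print(remote_uri)
```

The relative URI stays the same in both cases; only the base changes, which is what lets a local EventBus and a remote consumer refer to a schema by the same name.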

Technical Requirements

  • Up to date JSONSchema support (Draft 7?)
  • All schema versions maintained in HEAD commit (we won't be using git history to version schemas)
  • CI for ensuring schema backwards compatibility
  • CI for schema linting, e.g. no camelCase, only snake_case, etc.
  • CI for schema field annotations (dimension vs measure, PII, etc.)
  • 'latest' schema version is editable and changes to it are reviewable using usual git review tools - T206812
  • Post commit or merge git hooks to create new versioned file copies of schemas - T206812
  • Schemas can be in YAML or JSON format, but files should not have file extensions so relative schema_uris don't need to include (or append) a proper file extension - T206812
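
The linting requirement above could be checked with something like the following minimal sketch. It assumes schemas are already parsed into Python dicts, only walks nested "properties", and exempts "$schema" by default; the function name and the exemption list are assumptions, not an existing tool.

```python
import re

# snake_case: lowercase letters/digits separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_non_snake_case(schema, path="", exempt=frozenset({"$schema"})):
    """Return dotted paths of property names that are not snake_case."""
    bad = []
    for name, subschema in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        if name not in exempt and not SNAKE_CASE.match(name):
            bad.append(field_path)
        # Recurse into nested object schemas.
        if isinstance(subschema, dict):
            bad.extend(find_non_snake_case(subschema, field_path, exempt))
    return bad


example = {
    "properties": {
        "$schema": {"type": "string"},
        "page_id": {"type": "integer"},
        "pageTitle": {"type": "string"},
        "meta": {"type": "object", "properties": {"requestId": {"type": "string"}}},
    }
}
print(find_non_snake_case(example))
```

A CI job could fail the build whenever the returned list is non-empty.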

Other ideas

On 2018-10-12, @Pchelolo and @Ottomata brainstormed implementation ideas. Much of the implementation work to be done is around CI and development workflows. Some of this is already done for mediawiki/event-schemas, but we need to do more. I'll try and collect some of the things we need to implement.

  • editing of schemas should be done to the current schema version.
  • JSON $ref pointers can be used only in the current schema version.
  • $ref pointers to other schemas must be strongly versioned. E.g. if we factor out the meta schema, every event that uses it will point to a specific version of meta, e.g. meta/1.0.0 or meta/1.2.0.
    • versioned $ref pointers in schemas must be manually upgraded by editing the schema and creating a new schema version.
  • This will ensure that any changes to referenced schemas will not affect user schemas until they manually update the referenced version. (This is how dependencies normally work anyway.)
  • git hooks will dereference current to generate standalone explicitly committed versioned schema files.
  • schema version number is manually modified and set in current's $id field.
  • if only a code comment or description field changes in the current schema, don't generate a new schema version.
  • a backwards compatibility library ensures changes are backwards compatible, both in the git hook and in CI.
  • Schemas are versioned with semver.
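
The "strongly versioned $ref" rule from the list above could be enforced by a hook along these lines. This is only a sketch: it assumes a versioned reference is one whose path ends in a MAJOR.MINOR.PATCH segment (as in meta/1.0.0), treats fragment-only references such as "#/definitions/x" as local, and uses a hypothetical function name.

```python
import re

# A reference counts as versioned if its path ends in /MAJOR.MINOR.PATCH.
SEMVER_SUFFIX = re.compile(r"/\d+\.\d+\.\d+$")


def unversioned_refs(node):
    """Collect $ref values that point at other schemas without a semver suffix."""
    found = []
    if isinstance(node, dict):
        ref = node.get("$ref")
        # Fragment-only refs stay inside the same document, so skip them.
        if isinstance(ref, str) and not ref.startswith("#"):
            target = ref.split("#", 1)[0]
            if not SEMVER_SUFFIX.search(target):
                found.append(ref)
        for value in node.values():
            found.extend(unversioned_refs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(unversioned_refs(item))
    return found


example = {
    "properties": {
        "meta": {"$ref": "/meta/1.0.0"},
        "user": {"$ref": "/user/latest"},
        "local": {"$ref": "#/definitions/local"},
    },
    "allOf": [{"$ref": "/common"}],
}
print(unversioned_refs(example))
```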

Event Timeline

Ottomata triaged this task as Medium priority. Oct 11 2018, 6:59 PM
Ottomata created this task.
Ottomata added a comment. (Edited) Oct 11 2018, 7:00 PM

Q: should we use the term 'repository' or 'registry' here? I'm considering retitling the tickets to 'repository' since we will be using git repositories. However, there may be some extra features on a potential HTTP service that serves schemas. If we have that, would we call that the 'registry'?

Q: Analytics has a use case to add extra jsonschema features to be able to know more about the contextual 'types' of fields, namely: dimension (low cardinality) vs measure (value), and also time dimensions. Having this context in schemas will allow us to automate ingestion into analytics systems like Druid, and also even Prometheus which makes the same distinctions between fields (labels vs values). Do we need to use a custom meta JSONSchema for this, or can we just add type information outside of the JSONSchema spec in the schemas? We'd want to do something like:

dt:
  type: string
  format: date-time
  context_tags: [dimension, time]
domain:
  type: string
  context_tags: [dimension]
buttons_clicked:
  type: integer
  context_tags: [measure]
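
To illustrate how such annotations might be consumed, here is a small sketch that groups fields by their context_tags, roughly the split an ingestion job for Druid or Prometheus would need. The context_tags keyword is only the proposal from this comment, not a standard JSON Schema keyword, and the function and group names are hypothetical.

```python
def classify_fields(properties: dict) -> dict:
    """Group field names into measures, time dimensions and plain dimensions."""
    groups = {"dimensions": [], "time_dimensions": [], "measures": []}
    for name, spec in properties.items():
        tags = set(spec.get("context_tags", []))
        if "measure" in tags:
            groups["measures"].append(name)
        elif "dimension" in tags and "time" in tags:
            groups["time_dimensions"].append(name)
        elif "dimension" in tags:
            groups["dimensions"].append(name)
        # Fields without context_tags are left unclassified.
    return groups


# The same fields as in the YAML above, written as a parsed dict.
fields = {
    "dt": {"type": "string", "format": "date-time", "context_tags": ["dimension", "time"]},
    "domain": {"type": "string", "context_tags": ["dimension"]},
    "buttons_clicked": {"type": "integer", "context_tags": ["measure"]},
}
print(classify_fields(fields))
```
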
Ottomata updated the task description. Oct 11 2018, 7:07 PM
Ottomata updated the task description. Oct 11 2018, 7:34 PM
Pchelolo updated the task description. Oct 11 2018, 9:12 PM

Do we need to use a custom meta JSONSchema for this, or can we just add type information outside of the JSONSchema spec in the schemas?

We would need to use custom meta-schema: http://json-schema.org/latest/json-schema-core.html#rfc.section.6.4

I'm wondering if the HTTP service should be able to serve both extended and standard schema depending on the accept header the client provided?

Up to date JSONSchema support (Draft 7?)

+1, but we need to evaluate whether most of the languages have good libraries with support for draft 7.

Speaking about node.js, the absolute best (based on testing/benchmarking from ~1.5 years ago) node JSON schema validator ajv supports it. However, this one actually builds JS code and evals it based on the schema, so now, since we're opening event production to the public, we need to conduct a security review of this lib.

Q: should we use the term 'repository' or 'registry' here? I'm considering retitling the tickets to 'repository' since we will be using git repositories. However, there may be some extra features on a potential HTTP service that serves schemas. If we have that, would we call that the 'registry'?

I'd stick with 'registry' as a name for "the service" whatever we include in this term, to free the term repository for speaking about the git repo itself. Reusing 'repository' for both can be confusing.

How're we satisfying the requirement of

As an engineer, I want to be able to share schemas in development so that others can run and test my code.

We'd need to support branch URIs for that, or do you have something else in mind?

I'm wondering if the HTTP service should be able to serve both extended and standard schema depending on the accept header the client provided?

I think it would be based on the value of the $schema field

+1, but we need to evaluate whether most of the languages have good libraries with support for draft 7.

AJV uses draft 7 by default. We don't need JSONSchema validation elsewhere, just JSON (schema) parsing.

We'd need to support branch URIs for that, or do you have something else in mind?

No, I think the local EventBus would just use whatever is checked out locally. So if someone wants to test out a new schema, they just check out / cherry-pick / download the relevant patch or branch.

Ottomata renamed this task from Modern Event Platform: Event Schema Registry: Implementation to Modern Event Platform: Schema Registry: Implementation. Oct 25 2018, 1:48 PM
Ottomata updated the task description.
Ottomata updated the task description. Oct 25 2018, 1:50 PM
Ottomata moved this task from Backlog to In Progress on the Event-Platform board. Dec 5 2018, 10:04 PM
Ottomata moved this task from In Progress to Next Up on the Event-Platform board.

@Pchelolo, so aside from the eventual HTTP based schema registry idea, we will still need (at least) one more git schema repository for analytics. This repo should use the same CI pipeline we build for event-schemas, but more people will have commit and merge access to it.

This quarter we want to start producing the monolog avro events (CirrusSearchRequestSet and ApiAction) to an eventgate instance. These events currently go through kafka-jumbo, and I think they should continue to do so. The eventgate-analytics deployment will (for now?) also just use kafka-jumbo. We need a place to store these new schemas. Perhaps mediawiki/event-schemas is not it? Should we create a new schema repo now for analytics purposes, or should we just use mediawiki/event-schemas for now and create a new repo later when it is time?

Should we create a new schema repo now for analytics purposes, or should we just use mediawiki/event-schemas for now and create a new repo later when it is time?

Creating a new one will result in premature bikeshedding on naming, structure of the repo etc.. I'm ok with using event-schemas for now.

Ottomata moved this task from Next Up to In Progress on the Event-Platform board. Apr 19 2019, 4:17 PM

Change 525609 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Allow eventgate-analytics to get schemas from remote schema.svc if not present locally

https://gerrit.wikimedia.org/r/525609

Change 525617 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Increase date-time maxLength to 128 for all schemas

https://gerrit.wikimedia.org/r/525617

Change 525617 merged by Ottomata:
[mediawiki/event-schemas@master] Increase date-time maxLength to 128 for all schemas

https://gerrit.wikimedia.org/r/525617

Change 525609 merged by Ottomata:
[operations/deployment-charts@master] Allow eventgate-analytics to get schemas from remote schema.svc if not present locally

https://gerrit.wikimedia.org/r/525609

Ottomata updated the task description. Sep 4 2019, 7:11 PM
Ottomata updated the task description.
Ottomata updated the task description. Sep 20 2019, 2:19 PM

Based on discussions in T233432: Figure out how to $ref common schema across schema repositories, I am considering creating 2 new git repositories:

  • event-schemas/common
  • event-schemas/analytics

It would be nice if we could also rename our existing schema repo to event-schemas/mediawiki (instead of mediawiki/event-schemas), but I'll leave that for some future task.

@Pchelolo @jlinehan @Neil_P._Quinn_WMF any preferences here?

Ottomata added a project: Analytics-Kanban.
Ottomata changed the point value for this task from 0 to 13.
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Just had a discussion with Jason and we have another idea.

The reason we need multiple schema repositories is really just for different merge rights permissions! Analysts need to be able to work like they do now on meta.wikimedia.org. They should be able to submit, review and merge their own schema changes.

mediawiki/event-schemas is used for production events, so we want to keep the merge rights pretty restricted. It currently has more than just 'mediawiki' things too.

Perhaps we should name these repos according to their reason for existence; not after what they might contain?

  • event-schemas/analytics
  • event-schemas/production

(Or something like that?)

analytics could $ref schemas in production, but not the other way around. This would mean we'd only (ever?) need 2 schema repositories. It would also make it easier to figure out where a schema belongs. If it is used for a production 'feature', it belongs in production. If it is used for analytics purposes (only) it belongs in analytics.

If we did this, we'd intend to move schemas currently in mediawiki/event-schemas into event-schemas/production, and eventually remove mediawiki/event-schemas altogether.

I'm not sure what @Pchelolo and @mobrovac think about this. I know one of the original intentions of EventBus and mediawiki/event-schemas and change-prop was to allow 3rd parties to set up MediaWiki with the same distributed production event processing WMF uses. I doubt this would actually be done, and it is making it harder for us to structure our WMF specific code and schemas, but maybe I'm wrong?

That makes sense, I guess, since, as you point out, the main point of contention is write access (or the update workflow, to be more precise). We could set up an intricate ACL system and put it all in one repo, but that smells of over-engineering in this instance. As for the names, I'd suggest event-schemae/system and event-schemae/user.

Ok! Great.

I don't love the names system and user, as that isn't quite right. The schemas in the analytics repository will be pretty permanent and maintained schemas, it's just that they won't be used for building any production features. The schemas in the production repository will, but they can and will also be used for analytics too. It is production and non-production, really...

But I'm all for bikeshedding these names (as you know I always am!) so if there are other ideas out there that work let's find them. I keep trying to think of better ones but haven't thought of any.

Lol, I know you're always interested in discussions involving bikes of any type :P

On a more serious note, while I agree that currently access levels coincide with use cases, it doesn't mean that will always be the case. I'm advocating for the alternative naming so as to emphasise the access level rather than use cases, since in practice users will likely need to use schemae from both repos regardless of their names. That said, I don't feel that strongly about the names either way.

Ya, indeed. I think 'production' is an ok name, but you are right in that 'analytics' might not be very future proof. Production & ____ something else better I don't know what!

production schemas & instrumentation schemas?

I'm fine with just two repos, and the way you outline using them. Two thoughts on naming:

  • schema-event is better if we have other schema-* repositories in the future, like schema-database or something, so they sort together.
  • if a schema is in production it could imply it's deployed (like the production branch of the puppet repo). I would go with schemae-event/system or schemae-event/core and schemae-event/instrument. But I know I'm gonna keep forgetting and using schemas instead of schemae :)

I see nothing wrong with schemaS! Wait what is schemaE? A quick search just tells me it's a Latin plural. The plurals I see are schemas and schemata.

  • schemas/event/production
  • schemas/event/analytics

Kinda like you suggested earlier. Or we could be non-plural and prefix with schema/

I think the naming we are having trouble with though is the production vs analytics dichotomy. I'm ok with production, but analytics might not be quite right. We need a word that means 'non production'.

Alternatives for 'production' schema repo

  • primary
  • major
  • cardinal
  • main
  • essential
  • critical
  • vital
  • principal
  • crucial
  • indispensable

Alternatives for 'analytics' schema repo

  • secondary
  • minor
  • ancillary
  • supplementary
  • auxiliary
  • adjunct
  • unessential, inessential, nonessential (apparently these are all real words)
  • extrinsic
  • extraneous
  • extra
  • peripheral
  • optional
  • accessory

Some good pairs:

  • primary / secondary
  • primary / auxiliary
  • primary / supplementary
  • main / auxiliary
  • essential / nonessential
  • essential / supplementary

I think I like schemas/event/primary and schemas/event/supplementary the best. 'essential' is ok, but I'm not sure it is 100% descriptive. mediawiki/revision/score is not essential. 'supplementary' is nice because it is a positive sounding adjective, vs. something like unessential.

Supplemental is also a word. supplementary? supplemental. Oh man...english.

Whatcha think?

jlinehan added a comment. (Edited) Tue, Nov 19, 7:56 PM

I'd like to see the schema repositories either bear a direct relation to the name of the EventGate instance that will probably be used, or bear no relation at all and be completely orthogonal (like @Ottomata's latest proposal attempts).

If the main concern with having distinct repos is to give a distinct set of people the ability to do code review, I'm not sure why there is a problem with having the repositories named after the "genre" of schema(ta?) that they contain (e.g. 'production,' 'analytics', 'logging'...). I can imagine other "genres" that might appear.

Thoughts about these pairs:

primary/main/major/principal
Given the proliferation of events that are not from EventBus, doesn't this seem like the tail wagging the dog? This system is not built around EventBus, even if that's how it started. Baking in a priority judgement is confusing to novices. Why not just accommodate distinct use-cases equally?

critical/vital/essential
Schema just dictate the shape of events. Events are vital to whatever depend on them. MediaWiki depends on the events produced by EventBus, and MediaWiki is the most important software we run, but why conflate the importance of MediaWiki with the importance of these schema? Again, it seems like an unnecessary priority/value judgement when a descriptive one would do fine.

optional/supplementary/auxiliary/nonessential/minor/extra/extraneous
As with above, I think too many of these words imply a distinct policy as to the events that carry these schema. Even in the cases where this is true, these schema are not where such a policy is exercised, so I think it's confusing. On a more obvious note (and I know this is just a thesaurus dump), I think these are probably not the best words to use here if we want people to be happy using the platform.

'production'/'mediawiki' events
This is a bit of a side-note, but when we talk about events that are currently being sent from EventBus, I think the words I hear most often are either "production" or "mediawiki". These words just aren't descriptive, and I'm constantly having to explain them to people. In my opinion:

  • The word "production" is inadequate; I think what it's trying to capture is "platform" (as in "core platform"), because there are other things in production (such as analytics or logging).
  • Using "mediawiki" or "wikimedia" is also a problem, since obviously analytics and logging takes place in mediawiki (among other places), and certainly all of this is part of Wikimedia.

I'll note here that 'analytics events' has a high degree of acceptance and general understanding (this may be due to sampling bias though). I think something like 'platform events', which would map similarly to core platform, might have a similar impression.

Propose
I think right now I would favor something like
platform / analytics

But alternatives might be
production => platform/application/command/process/signal/instruction/message/special
analytics => analytics/instrumentation

I think the dichotomy trying to be emphasized in prior posts is between events that are operational, in the sense that they drive the state of an application, versus events that are observational, in that they report the state of a program at a point in time, but do not interfere with its operation? I see the dichotomy, but I'm not sure if end users will.

Well, anyway I sat down to write some documentation the other day and had to give up because of how weird some of the terminology has become, so I support the move to shed a few bikes.

Ottomata added a comment. (Edited) Tue, Nov 19, 8:08 PM

Hm, but remember that 'analytics' will $ref sub schemas out of 'production'. If we didn't have the requirement to have different merge rights for these different types of schemas, we wouldn't have multiple schema repos.

I'd be all for naming the repos after what they contain IF we could think of a good and future proof name. It isn't clear that 'analytics' will only contain schemas for analytics or even just instrumentation purposes. I think even 'platform' is a bit blurry here too. Where should a schema that is used for a machine learning pipeline go? We might use that data to train some models in Hadoop, but the output might be used for 'production' features. Nuria's distinction between 'tier 1' and 'tier 2' feels right to me, even if I don't like those names.

'core' is not bad. core and ? I still like supplementary.

  • schemas/event/core / schemas/event/supplementary

?

I'm not sure why there is a problem with having the repositories named after the "genre" of schema(ta?) that they contain (e.g. 'production,' 'analytics', 'logging'...). I can imagine other "genres" that might appear.

The main problem with it is dealing with so many different repos. Aside from the merge permissions, we don't have any real need to separate schemas by functionality like this.

Wow this bike shed is going to look great! Just had a good IRC discussion with Jason and Dan. Our contenders are:

  • schemas/event/primary and schemas/event/secondary
  • schemas/event/core and schemas/event/accessory
jlinehan added a comment. (Edited) Wed, Nov 20, 1:51 PM

New perspectives for the new day:

Take 1

I could see a good argument for keeping a dichotomy, but re-casting it as, essentially,

  • Parts that are "locked down" versus
  • Parts that are "still being figured out" or are "limited time things"

These two repos could still have different +2 rights, but the dichotomy would be more like "production" versus "beta". Perhaps something like:

  • standard/core/production/canonical/normal/regular/library/permanent and
  • experimental/variable/beta/development/special/heretical/trial

I could see pairs like core/experimental or permanent/temporary or production/trial or main/extra or things like that working. You get the idea.

Going this route could help us keep things relatively clean, since it would catch all of the one-off things in one repo that could be periodically archived or pruned. Things that have proved their worth could be folded into the main repository. The non-main repository could consist of nothing but top-level schemas, without functional (or any other kind) of directory structure. This (along with purges) would discourage long-term usage.

But could we just use a branch to do this?

Take 2

One repo with subdirectories, but the 'production/core/platform/' (or whatever) subdirectory is actually a submodule.

That way there are special +2 rights for the EventBus schema, and everything else has some other set of +2 rights (probably not as wide as people are fearing though, see below). The data dictionary and EventGate common stuff would not live in the production submodule, but in the main repository (see below for reasons).

Things might look like:

  • /standard/... for the base schema that EventGate needs, and field definitions and other parts of a data dictionary
  • /platform/... or /protected/... or /system/... for all of the EventBus things (actually a submodule)
  • /<function>/... for everything else, e.g. 'analytics', 'research', 'fundraising', what have you.
    • or the ontology could be changed so that rather than functional area you simply describe the schema, so rather than /logging/error/client, you do /error/client idk. Either way, it's a directory.

It's okay for the 'platform' stuff to depend on data dictionary and other things that aren't in their repo?
I was thinking about this, and, isn't this why we have versioning? If there is a change to the dependency, it will be versioned, right? It won't break the code in the privileged repo. If there is an incompatible change to the dependency, there can be another patch or, like, communication to figure out what happened, no?

Won't a lot of people have +2 in the base repo though?
Well right now things are on-wiki and that's pretty public. But since we are moving to version control, with CR in place, etc., there is going to already be a lot less freedom in terms of potential changes, even without the distinct service levels. Plus, I don't think that many people even need to have +2 on the base repo, because...

Wide schema +2 is not really a strategic goal I think
I think the strategy overall is to see how much of the volatility we can push to the stream configuration, right? That means that we WANT the schema to be less volatile. There are plans in product to see if we can cut our analytics schema down to a small, standard set of building blocks that can be re-used -- with the exception for this or that one-off schema probably. In this scenario, the lack of wide +2 access would be a feature, not a bug. We could use the friction to encourage re-using a set of standardized schema that get hammered out.

Just some thoughts to kick around.

Hm, interesting.

The main problem I see with Take 2 is the URIs. I want the relative schema URIs to stay meaningful descriptions of what the schema is (closely related to its title). That means all $refs and event $schema IDs need to be relative to a 'schema base path' in a repository. For the existing mediawiki/event-schemas, this schema base path is ./jsonschema/[1]. If we treat these 2 repos as if we only had one repo, then I think the URIs are going to get confusing. How would mediawiki/revision/create $ref the common meta schema? What would it be relative to?

Wide schema +2 is not really a strategic goal I think

As it is now, there is no deployment process for schemas (well, for schemas in the http schema service). Puppet just pulls the latest master of the schema repo. This generally should work because schema version files are (should be mostly) immutable. eventgate-analytics and Hadoop are the only main users of the http schema service, so a breaking change in a production schema wouldn't break anything in real production services until e.g. eventgate-main is manually redeployed with the breaking change.

I think we need to talk more about this. It would be nice if we could just have a single schema repository. We should think about the groups of people that actually need to have +2.

BTW, we had planned to use submodules to resolve $refs to the production schema repo anyway, so Take 2 isn't too different than what we are going to do. I just don't think it would be a good idea to make the production repo depend on anything in the analytics repo. Yes things are versioned, but allowing a main part of a restricted schema to be changed by someone with unrestricted rights defeats the purpose of the restricted schema repo.

[1] Hm! I was about to say I wanted to keep the ./jsonschema/ subdir there for some reasons...but as I think about it those aren't good reasons...so nevermind!


Tangential idea: should we start namespacing schemas similar to how Java (and Avro) does namespacing? E.g.

org/wikimedia/analytics/cool-button/click
org/wikimedia/sparql/query
org/wikimedia/mediawiki/page/view
org/mediawiki/revision/create

(As I write this example I'm having trouble coming up with a consistent convention, but we could bike shed on that too.)

This would mean events would have to set e.g. `$schema: /org/mediawiki/revision/create/1.0.0`?

I dunno, just a thought; it might make it easier for teams to feel more at ease making changes in some namespace they have purview over.

jlinehan added a comment. (Edited) Wed, Nov 20, 3:26 PM

That means all $refs and event $schema IDs need to be relative to a 'schema base path' in a repository

Hmm, I need to read more about this and see some more examples and we can follow up.

I think having meaningful descriptions (rather than functional areas) is good. I guess what I'm thinking is that it would also be nice for the directories to help organize certain dependencies. For example, if we use a different meta field for analytics events, it would be nice for it to be clear which schema are using that, by having them all live in a common directory. The schema that they all descend from which has this special meta field would either be in that directory and have a certain designated name, or be in a common directory but the schema has a name that identifies the subdirectory which uses it (e.g. common/analytics is used by schema in analytics/).

If we instead want to be fully committed to the "descriptive" ontology style, then this is obviously not going to happen (what's an "analytics?"), but it might make it difficult to do an audit or for people to get the lay of the land, at least without some additional tooling.

Other things about dependencies that might or might not be true:

  • I think there will probably not be references between schema within the same top-level subdirectory, with the exception perhaps of all the schema referencing a common schema to define e.g. the meta field.
    • While it's possible to extend various schema OOP-style, I don't know that we necessarily want to encourage such a web of dependencies?
  • That means that the references for schema within e.g. an /analytics subdirectory could look like (rough example):
    • /standard/common, the local common schema would extend this global one
    • /analytics/common, for the locally-defined meta field
    • /standard/field/useragent, /standard/field/wiki, etc. for various "data dictionary" things used by various schema e.g. /analytics/fooclick/

I think we need to talk more about this. It would be nice if we could just have a single schema repository. We should think about the groups of people that actually need to have +2.

Agreed. I think I prefer the "take 1" that I put up, just because it side-steps the issues with how the subdirectories are organized, how the ref dependencies and access levels interact, and about "prioritizing" certain sets of schema. These schema currently map to functional areas, which makes these labels kind of political, but I think that is kind of a coincidence and will go away, if we decide to simply treat things as "permanent" and "temporary". There's nothing that says an analytics schema can't be "permanent," subject to the additional scrutiny of the other events. It would also be easier to make a regular process around "productionizing" schema that we can dial-in. Maybe we don't need a heavy process with a separate repo, or maybe we do. We can defer that decision until later, but we know how we'd handle it if the time comes.

I just don't think it would be a good idea to make the production repo depend on anything in the analytics repo.

I agree about the principle, but I'm just not sure about the dependency. If we have versioning and CI or git hooks that e.g. prevent altering a prior version, then doesn't that (theoretically) mean that patches can't alter existing dependencies? It would certainly be possible for a malicious individual, but I think it would be pretty unlikely for a reasonable engineer given the various safeguards (versioning, hooks, still requiring CR from somebody, etc.).

Tangential idea: should we start namespacing schemas similar to how Java (and Avro) does namespacing? E.g.

Let's explore it! There needs to be some kind of convention here I think.

it might make it easier for teams to feel more at ease making changes in some namespace they have purview over.

This is something important.

Hm, you might be onto something with this permanent vs experimental idea. For example, should things like VirtualPageView and NavigationTiming live in permanent repo? Whereas temporary A/B test type schemas would live in experimental?

but the dichotomy would be more like "production" versus "beta"

This idea isn't quite production vs. beta. beta implies an intended graduation to production, which wouldn't be true for all schemas there. 'exploratory'?

Hm, another idea: I think we can grant different merge rights to different branches in gerrit. We could make master restricted, but have an exploratory branch. We could even have exploratory (and master too?) checked out in the http schema service.

Ottomata added a comment. (Edited) Wed, Nov 20, 6:57 PM

Ok no branch, Dan convinced me, I think that will just confuse more people than 2 repos.

We could just go with one repo for now (a new one, not mediawiki/event-schemas), and see if we don't need to have a second just for different merge permissions. We will need to do some schema namespacing within the repos anyway; even if we have multiple repos, we will resolve $refs in them as if they were all in a single repo.

So, what about schema/event/main, with new 'analytics' schemas in some to-be-bikeshed namespace? We've already got a /common and a /mediawiki namespace (these are part of the relative schema URIs). We could consider modifying those here too, but I'm not sure it is worth it. Where would these analytics schemas go? I'm ok with /instrumentation.

If we decide we do need more repos for permissions reasons later, we can move the /instrumentation schemas (and others?) to it then.

FYI, to clarify a point here: I think repos should not be named ontologically, but 'namespaces' should if they can. These namespaces are actually part of the schema titles. Repositories are just decentralized stores for schemas.

I'm also liking the idea of experimental, and I think experiment is actually really nice and concise. It is fun and easy and kind of the same thing as instrument but less specific so it stays more flexible.

So I vote for core / experiment. Final answer.

@Milimetric how do you feel about 'exploratory' vs 'experimental'? While I like the idea that the non-restricted repo should be used for 'temporary' things, Nuria said she thinks that is a bad requirement. Schemas will more realistically live there forever. Exploratory seems ok to me, as it doesn't imply temporariness.

Also, thoughts on https://phabricator.wikimedia.org/T206789#5679422 ?

Nuria added a comment. Wed, Nov 20, 9:22 PM

Some ideas:
foundational/experiments -> my favorite
foundational/analytical
foundational/secondary

core/experiments (dan's suggestion)
core/analytical
core/secondary

primary/secondary

principal/auxiliary

@Nuria I thought you didn't like 'experiments'? Doesn't that imply that the schemas there are intended to be for short term usage?

I don't like 'foundational', it just sounds weird, especially given some of the other options. I do like 'primary', 'secondary' is ok, but does imply some kind of lineage. My fav is 'primary' or 'main', and 'supplemental' or 'ancillary' or 'accessory'.

Another thought: there's no reason we couldn't make the 'main' schema repo be just 'schema/event', and any additional repos in a sub hierarchy. 'schema/event/exploratory'.

If we are going with the exploratory type of name, I like 'schema/event/main' and 'schema/event/exploratory' the best.

Nuria added a comment. Edited Wed, Nov 20, 10:41 PM

Doesn't that imply that the schemas there are intended to be for short term usage?

mmm..this is hard but still voting for "core"/"experiment" and "core/secondary"

We will need to do some schema namespacing within the repos anyway; even if we have multiple repos, we will resolve $refs in them as if they were all in a single repo.

...... !

Another thought: there's no reason we couldn't make the 'main' schema repo be just 'schema/event', and any additional repos in a sub hierarchy. 'schema/event/exploratory'.

If we did only one named repo, I think I would argue that the named one should be the more restrictive one, since I think it will contain fewer events and be used by fewer people, and be more of a special case in general. Especially if the repo name is not visible in $refs, I think it would be better labeled something like protected or reserved, to emphasize that this policy is particular to the git review process and not to the schemas themselves. (Policies relative to the schemas, such as standard or core, should be namespaces.)

If we did two distinct repos, I would suggest something like protected/public, closed/open, reserved/open, etc. Again to emphasize that this is primarily access control, without implying any distinction about how the schemas are used. Those distinctions should be communicated with namespaces.

Hmm, I'm leaning towards two here because it doesn't confuse the hierarchy inside the repo with the path in the URL of the repo. @Ottomata let me know if I'm understanding correctly that the $ref will abstract away the name of the distinct repos so these names won't be visible in any of the (relative?) references.

jlinehan added a comment. Edited Thu, Nov 21, 2:10 PM

If we can't come to any other consensus, I think the best compromise looks like primary/secondary. It's neutral enough that we can shape the definitions in documentation or conversation. It has the flavor of @Nuria's tier-ness ideas which I think is useful, but I don't know if that mapping will be durable in this particular location. They don't say much about access control on their own, and I think they communicate the wrong priority in some senses. But being boring counts for a lot, and maybe it will just not be that visible for people to get confused. If it gets out of control, we could always change them. I think the system would survive :)

I'm good with primary / secondary