
Make it possible to use $ref in JSONSchemas
Closed, ResolvedPublic3 Estimated Story Points

Description

The current event-schemas repo suffers from a lot of copy-pasta: all the meta subobject schemas are copied into every schema, and all the common properties related to revisions/pages/users are copied between files. JSONSchema has concepts that could eliminate the need for copy-paste: references, schema inheritance, merging of different schemas, etc.

These features are part of the standard, so they should be supported by the majority of clients. If we're worried they are not, we could use the features only in the master schema and expand it to a full schema in a pre-commit hook (T206812).

Another question is what to do if one of the core sub-schemas gets updated: do we automatically create a new version for all the schemas using it? I think that would be the wrong approach. Instead, we could reference a specific version of the sub-schema in each schema and leave the decision about updating to the schema maintainers. This gives us more flexibility than trying to create sophisticated CI tests for checking all the copy-pasted documents. For example, we would be able to update the meta schema without simultaneously updating all the schemas in the registry (and all the producers) to comply with the new meta format, and then slowly adapt use cases.

I personally think using these features will be beneficial in the long run, as maintaining potentially hundreds of copy-pasted JSON documents will get out of control pretty quickly, and it makes my eyes bleed to look at this abomination of the DRY principle.
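As an illustration (the fragment path, version, and field names below are hypothetical, not the repo's actual layout), a schema could pin a specific version of a shared fragment instead of copying it:

```json
{
  "title": "mediawiki/revision/create",
  "type": "object",
  "properties": {
    "meta": { "$ref": "/fragment/common/1#/properties/meta" },
    "rev_id": { "type": "integer" }
  }
}
```

Bumping the common fragment to a hypothetical /fragment/common/2 would then be an explicit, per-schema decision made by that schema's maintainers.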

Event Timeline

Pchelolo created this task.

If we do adopt the policy of having the latest schema use references and then rendering it into full schemas in versioned files, so that clients are not required to support all the fancy features, we'd need to make the pre-commit hook an executable script to satisfy development requirements like:

As an engineer, I want to develop new code that uses schemas without committing changes to the production schema registry so that I don't endanger production during development.

Wow, two really great ideas in this task description!

we could use the features only in the master schema and expand it to a full schema in a pre-commit hook T206812

we could reference a specific version of the sub-schema in each of the schemas and leave the decision regarding updating to schema maintainers.

If we do both of these, schemas would be 100% resolvable by their generated version files, and we wouldn't need to worry about referenced schema changes because they also would have resolved versioned files!

I think we can do this then! Love it!

I wholeheartedly agree that doing copy/pasta for shared parts of the schemas is a bad design choice.

These features are part of the standard, so they should be supported by the majority of clients. If we're worried they are not, we could use the features only in the master schema and expand it to a full schema in a pre-commit hook (T206812).

While they are part of the standard, I think we should make it as easy as possible for clients to resolve what the schema is, even for those not using libraries (e.g. a human being trying to discern what a message should contain). However, expanding a schema with common parts during the commit process seems like simply a hack around manually doing the copy/paste, as it suffers from similar problems, such as the local git repo not being up to date and hence potentially missing changes to common parts. Since we are going to have a schema registry system (and, I assume, a service that goes with it), perhaps the best way would be to keep the references in the original schema and have the service expand them automatically before delivering it to clients?

Another question is what to do if one of the core sub-schemas gets updated: do we automatically create a new version for all the schemas using it? I think that would be the wrong approach. Instead, we could reference a specific version of the sub-schema in each schema and leave the decision about updating to the schema maintainers.

This is a valid point. If we are too strict, we risk breaking things unintentionally. On the other hand, leaving it up to maintainers to update their references can easily lead to having multiple versions of the common parts, which defeats the purpose of having common parts. I think the answer to that question should be yes: we have to enforce the common parts to be equal everywhere. Enforcing this rule could be dangerous, so I think we would need to create an upgrade process that ensures no breakage. As a first idea, any changes to common parts could be planned and announced well in advance. This would give us enough time to work with schema maintainers and event producers to adapt to the changes.

expanding a schema with common parts during the commit process seems like simply a hack around manually doing the copy/paste, as it suffers from similar problems, such as the local git repo not being up to date and hence potentially missing changes to common parts.

I don't think it's a hack. It is a trade-off: we are trading complexity and fragility in production systems for complexity during the commit process. One of the driving choices to use git in T201643: RFC: Modern Event Platform: Schema Registry was decentralization. We didn't want a centralized production schema location, as it complicated all phases of the schema development lifecycle.

Since we are going to have a schema registry system (and, I assume, a service that goes with it), perhaps the best way would be to keep the references in the original schema and have the service expand them automatically before delivering it to clients?

We may have an HTTP service for the schema registry for convenience, but I want it to be as simple as possible. Many prod services likely shouldn't even use it directly, as that couples them to another service. I like the way eventlogging-service-eventbus works now with the local checkout of the git schemas repository. It simplifies the production systems.

On the other hand, leaving it up to maintainers to update their references can easily lead to having multiple versions of the common parts, which defeats the purpose of having common parts. I think the answer to that question should be yes: we have to enforce the common parts to be equal everywhere.

I strongly disagree with this. We do this now with the EventLogging EventCapsule schema, and it makes any changes to the common schema extremely fragile. Changes to referenced common schemas should also be backwards compatible, but if they aren't, we don't want to have to update every schema producer. This system will be used by hundreds of analytics schemas. Having the referenced versions explicitly set everywhere reduces the danger of changing any common schemas, and lets us adopt those changes in explicit places. This is kind of like how versioned software dependencies work; you don't just upgrade all dependencies to their latest versions every time they change.

BTW, we should probably be setting $id on schemas to their versioned relative schema_uris. That will make them easier to use in $refs. E.g. $id: mediawiki/revision/create/3.
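For example (the exact URI conventions here are illustrative, not settled), a schema carrying its versioned relative URI as $id:

```json
{
  "$id": "mediawiki/revision/create/3",
  "title": "mediawiki/revision/create",
  "type": "object"
}
```

A referencing schema could then simply use { "$ref": "mediawiki/revision/create/3" }, and a resolver that has registered the schema would look it up by $id rather than by file path.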

Hm, a tricky bit about $refs and generating fully dereferenced schemas with AJV:

https://github.com/epoberezkin/ajv/issues/336

AJV doesn't actually create any dereferenced schemas. It just uses the $ref pointers during validation.
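To make the distinction concrete, here is a minimal sketch (plain Node.js, not AJV, with no handling of reference cycles or cross-file refs) of what "creating a dereferenced schema" would actually mean: recursively inlining local #/… JSON Pointer refs into a new document. AJV only follows such pointers at validation time and never emits an expanded schema.

```javascript
// Resolve a local JSON Pointer like "#/definitions/meta" against the root schema.
function resolvePointer(root, pointer) {
  const parts = pointer
    .slice(2) // drop "#/"
    .split('/')
    .map((p) => p.replace(/~1/g, '/').replace(/~0/g, '~')); // JSON Pointer unescaping
  return parts.reduce((node, key) => node[key], root);
}

// Return a copy of the schema with every local "#/..." $ref inlined.
// Note: a real implementation would also guard against circular refs.
function dereference(root, node = root) {
  if (Array.isArray(node)) return node.map((n) => dereference(root, n));
  if (node && typeof node === 'object') {
    if (typeof node.$ref === 'string' && node.$ref.startsWith('#/')) {
      return dereference(root, resolvePointer(root, node.$ref));
    }
    const out = {};
    for (const [k, v] of Object.entries(node)) out[k] = dereference(root, v);
    return out;
  }
  return node;
}

const schema = {
  definitions: { meta: { type: 'object', properties: { id: { type: 'string' } } } },
  type: 'object',
  properties: { meta: { $ref: '#/definitions/meta' } }
};

const expanded = dereference(schema);
console.log(JSON.stringify(expanded.properties.meta));
// → {"type":"object","properties":{"id":{"type":"string"}}}
```

The original schema object is left untouched; the expanded copy is what would be written out as a versioned, fully resolved schema file.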

I think by now we've all reached the agreement to use references.

All the schema paths must be absolute (starting with /).
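For illustration (the fragment path is hypothetical), an absolute cross-file reference would then look like:

```json
{ "$ref": "/fragment/common/1#/properties/meta" }
```

rather than a relative path such as ../common/1, which would resolve differently depending on where the referencing schema lives.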

I'm inclined to close the ticket. What do you think?

I think we can close this. How we actually use them hasn't been decided though. We'll need to find a schema $ref resolver library, or write one. AJV won't do it.

A quick update here.

We have managed to create a global schema resolver, so now we will use $ref to other files and avoid code duplication in schemas.

The last question to be answered here: do we want to use local, in-file references (they start with #)? If so, I believe AJV will be smart enough not to try using the global resolver for such references. Do we want to expose those to consumers as well, or resolve them before pushing the new latest version of schemas? (Reminder: we intend to remove cross-file references as a CI step.)

I believe that could be useful. However, we first need to test AJV's behavior when we combine cross-file references with local references, and see how possible conflicts are resolved.

If they work, I'm fine with them as long as...

Do we want to expose those to consumers as well or resolve them before pushing the new latest version of schemas

Yes. I'd like all non-latest schema versions to generate fully dereferenced schemas. The schemas need to be really simple to make mapping them to e.g. Hive schemas or Kafka Connect data easy to implement.

We have managed to create a global schema resolver, so now we will use $ref to other files and avoid code duplication in schemas.

I can't remember this...we did this?

Ottomata renamed this task from "Decide whether to use schema references in the schema registry" to "Make it possible to use $ref in JSONSchemas". May 17 2019, 3:21 PM

For posterity, this is being done in https://github.com/wikimedia/jsonschema-tools with json-schema-ref-parser.

Change 523745 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Use jsonschema-tools, add common schema, $ref it from test/event

https://gerrit.wikimedia.org/r/523745

Change 523745 merged by Ppchelko:
[mediawiki/event-schemas@master] Use jsonschema-tools, add common schema, $ref it from test/event

https://gerrit.wikimedia.org/r/523745

Nuria changed the point value for this task from 0 to 3.