Page MenuHomePhabricator

Develop a library for JSON schema backwards incompatibility detection
Open, Stalled, NormalPublic

Description

I order to make sure the schemas are evolving in a backward-compatible way we need an automatic tool to detect violations of the rule. I've been trying to find an existing library, but have failed to do it, so I believe we will have to develop one of our own to integrate it into the CI and pre-commit hooks for the schema registry.

Event Timeline

Pchelolo triaged this task as Normal priority.Oct 12 2018, 11:17 PM
Pchelolo created this task.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 12 2018, 11:17 PM

We could perhaps hack around it in the following way. Each time a new version of the schema is created, a sample message also needs to be committed. That message can then be checked with existing JSONSchema validators against both the new and the old version of the schema.

So, some search results. I will not include only things that are remotely applicable to us:

Unfortunately, I couldn't really find any more or any more mature projects, so I believe we will have to develop an in-house solution :(

Oook thanks Petr!

Ottomata moved this task from Backlog to Next Up on the EventBus board.Dec 5 2018, 10:04 PM
Ottomata removed Ottomata as the assignee of this task.Dec 5 2018, 10:10 PM
Nuria removed the point value for this task.Dec 5 2018, 10:21 PM
Nuria added a comment.Dec 5 2018, 10:25 PM

Ping @Ottomata @Pchelolo is this work we are iaming to do for next quarter (january onwards) as we are also working on dockerization?

Ping @Ottomata @Pchelolo is this work we are iaming to do for next quarter (january onwards) as we are also working on dockerization?

This has a very narrow scope and I do not expect this to be very hard. I already have some (fairly basic) code in event-schemas tests. We could just extract that to be a library and call it a day. However, I think we would want to make it a bit more sophisticated. This would be required for the CI in the event-schemas repo.

So yes, I think we can do it next quarter. It's a nice self-contained piece of work.

Nuria added a comment.Dec 5 2018, 10:33 PM

Sounds good, this item seems somewhat connected to this one : https://phabricator.wikimedia.org/T206812 but I can see how both are independent pieces of work. Please assign to yourself when you feel you have the bandwith to work on it.

Ottomata assigned this task to Pchelolo.Jan 17 2019, 9:30 PM
Pchelolo added a comment.EditedJan 22 2019, 6:23 PM

Before we begin, some bike-shedding needs to be done.

  1. Language. I propose node as we already have a little bit of code in node that does it, plus I'm more familiar with node. Since this only will run in CI for testing, performance of this doesn't matter.
  2. Since it will be an independent library (will it? should it? I think it should) we will need a name for it. I'm terrible in naming, so suggestions are welcome. json-schema-compatibility-checker seems too long.
  3. The interface. Should it just expose a single function returning a boolean or better return all the incompatibilities found. I would prefer the latter even though it's harder to implement.
  4. I think, at least in the first iteration, it should not resolve $ref - in our use case we would have a feature to resolve references as a hook, so we can easily pass in full schemas with already resolved references.

Yes node is fine with me!

Agree report on incompatibilities would be good. Or, at least the first found. Something that gives folks an idea of what is wrong. I'm ok if it stops on the first incompatibility found. Reporting all incompatibilities is even nicer.

Names! jsonschema-compatibility jsonschema-compat?

There's already json-schema-compatibility lib on npm, but there's always some libs with a similar name to anything we could imagine, so jsonschema-compatibility is fine by me.

Last piece of a bikeshed - gerrit vs github? This feels like it could be generically useful outside of the WMF, so I'm leaning to github. Thoughts?

I think github too, but am not opinionated on this one. :)

  1. Language. I propose node as we already have a little bit of code in node that does it, plus I'm more familiar with node. Since this only will run in CI for testing, performance of this doesn't matter.

It would be preferable for people to also use/run it as a pre-commit hook in git, but I agree that it also needs to run in CI.

  1. Since it will be an independent library (will it? should it? I think it should) we will need a name for it. I'm terrible in naming, so suggestions are welcome. json-schema-compatibility-checker seems too long.

+1 to both.

  1. The interface. Should it just expose a single function returning a boolean or better return all the incompatibilities found. I would prefer the latter even though it's harder to implement.

I agree with @Ottomata that stopping on the first incompatibility is enough. I think in most cases the first error will be intuitive enough for people to figure out all the things they have to check.

  1. I think, at least in the first iteration, it should not resolve $ref - in our use case we would have a feature to resolve references as a hook, so we can easily pass in full schemas with already resolved references.

Agreed. This could be done in a second step.

Last piece of a bikeshed - gerrit vs github? This feels like it could be generically useful outside of the WMF, so I'm leaning to github. Thoughts?

If this will be a generic checker, then yes, I agree GH makes it more discoverable.

I think, at least in the first iteration, it should not resolve $ref - in our use case we would have a feature to resolve references as a hook, so we can easily pass in full schemas with already resolved references.

Also agree.

json-schema-compatibility-checker

+1 to this, this is how you would for search for this functionality on google

The topic appears to be deeper than I thought. Please read https://docs.confluent.io/current/schema-registry/docs/avro.html for the context.

So, for Avro they identify several types of schema compatibility.

  • BACKWARD - "consumers using the new schema can read data produced with the last schema"
  • BACKWARD_TRANSITIVE - same, but with 2 older schemas
  • FORWARD - "data produced with a new schema can be read by consumers using the last schema, even though they may not be able to use the full capabilities of the new schema"
  • FORWARD_TRANSITIVE - same, but with 2 newer schemas
  • FULL - both BACKWARD and FORWARD
  • FULL_TRANSITIVE - both BACKWARD_TRANSITIVE and FORWARD_TRANSITIVE

Given that we are intending to check compatibility between all schema versions, *_TRANSITIVE is not really important for us.

Given that producers in our system will use the exact schema version, we can ignore them in the equation - no schema change will break producers any longer. However, we need to decide which kind of compatibility we want. The implications are:

  • BACKWARD or BACKWARD_TRANSITIVE: there is no assurance that consumers using older schemas can read data produced using the new schema. Therefore, upgrade all consumers before you start producing new events.
  • FORWARD or FORWARD_TRANSITIVE: there is no assurance that consumers using the new schema can read data produced using older schemas. Therefore, first upgrade all producers to using the new schema and make sure the data already produced using the older schemas are not available to consumers, then upgrade the consumers.

FULL compatibility is probably not an option since it's way too restrictive. I assume we actually want FORWARD compatibility in confluent terminology. Is that correct? Should the library support checking for all 3 types?

Given that we are intending to check compatibility between all schema versions, *_TRANSITIVE is not really important for us.

+1

Given that producers in our system will use the exact schema version, we can ignore them in the equation - no schema change will break producers any longer.

Hmmm you seem to assume we control all the producers, but I'm not sure that will be true for this new system. Since it will also replace EventLogging, this might not be the case.

FULL compatibility is probably not an option since it's way too restrictive. I assume we actually want FORWARD compatibility in confluent terminology. Is that correct?

Even though we won't control all producers, it is safe to assume we will control most of them, so I agree that forward compatibility is the better option here.

Should the library support checking for all 3 types?

Perhaps we can start with forward, and then add others if/when needed?

Hmmm you seem to assume we control all the producers, but I'm not sure that will be true for this new system. Since it will also replace EventLogging, this might not be the case.

We will require all events to include the $schema property with a URI pointing to a specific version of the schema and the event will be validated against the specific version.

Perhaps we can start with forward, and then add others if/when needed?

It's easier to design it with all options in mind right away.

Hm, I'm not sure if we need to try and do what Avro does here. All we need is a fancier 'is subset schema' feature (with checking of required fields), right? We just need to make sure that a new schema version is a superset of the previous (or all previous) versions, right?

With JSON, consumers don't really 'read data' with the schema. The schema allows us to ensure that the data the consumers get will be backwards compatible; meaning old consumer code will be able to read data produced with the new schema. We'll eventually use Kafka Connect with these schemas to ingest into downstream strongly typed systems (e.g. Hive, etc.), and in that case the new schema versions need to be compatible with the old ones more explicitly, but ultimately it is really about ensuring the data is compatible.

With JSON, consumers don't really 'read data' with the schema. The schema allows us to ensure that the data the consumers get will be backwards compatible; meaning old consumer code will be able to read data produced with the new schema.

This is exactly what avro calls 'forward compatible' and exactly what I'm doing

Hm... After hacking this around a little bit more, I think creating something general-purpose was actually a bad idea. JSON schema is much too powerful for us and what we really need is something very specific for us making sure nobody uses features of the schema we do not want people to use. As an example, json schema supports declarations like type: [ 'string', 'number' ] - which means the property could be either a string or a number. That would clearly break hadoop refinery, so we want to disallow this feature.

My WIP code for the library is here, but the more I look into it the more I think we need to abandon the idea of the general-purpose library and just verify bits and pieces we need.

Thoughts?

Nuria added a comment.Feb 8 2019, 4:40 AM

My WIP code for the library is here, but the more I look into it the more I think we need to abandon the idea of the general-purpose library and just verify bits and pieces we need.

Much agree, I think we should focus on our use cases while not trying to over-generalize.

Ya I think up to your discretion. I'd be ok with a generic library that only checked that old fields are not modified, e.g. the 'superset' schema thing. We'd still need more stuff for sure, so if you think it makes sense to not bother developing a generic thing too, that's fine.

Although, I'm not sure if we should bother using CI to enforce JSONSchema restraints. If we can, that's awesome, but I wouldn't make it a priority. Good guidelines and schema review should be enough I think.

Pchelolo changed the task status from Open to Stalled.Feb 8 2019, 4:59 PM

Although, I'm not sure if we should bother using CI to enforce JSONSchema restraints. If we can, that's awesome, but I wouldn't make it a priority. Good guidelines and schema review should be enough I think.

CI is definitely better cause small bugs can get unnoticed. However, here it seems that we have to start developing the guidelines before the code that would check that since due to the need to import the events into hadoop we will likely have a lot of very specific rules in those guidelines. I'm inclined to move this task to 'stalled' for now

Hm, can we do this piece by piece? I'm fine with aborting the idea of a flexible generic library. We do need a something standalone (outside of event-schemas repo) that we can apply anywhere for schema CI. Can we start with something that verifies the superset schema stuff? That types of fields are never changed?

That I already have, just need to revert a couple of commits that went too far.

That types of fields are never changed?

Actually, they can change. For example 'number' -> 'integer' - totally compatible. [ 'string', 'null' ] -> 'string' - same :)

Ya, but I think we shouldn't allow that. Since JSONSchema doesn't have a decimal type, we are going to assume that all numbers are decimals (they will be created in Hive as doubles) and all integers are ints.

[string, null] is technically compatible with [string], but I really think we should just enforce that types don't change. If we want to make exceptions later, we can add them (Unless you've already implemented this, that's fine then!) Composite types like that are going to be discouraged anyway.