[Story] Versioning in JSON output
Open, Medium, Public

Description

From story time meeting on 03-17:

Different use cases for JSON format version info: API, dump-files, Special:EntityData.

Idea by @daniel and @adrianheine:
Have a generic mechanism for a meta-info header that contains version(s) info, license info, etc.
For each of the use cases outlined above, the meta-info would go to a specific location:

  • Special:EntityData would have it inlined
  • API responses would have it side by side with the entity info
  • JSON dump files would be accompanied by a separate .meta.json file
NOTE: Versioning of the *model* (DataValue, Wikibase) is separate from the versioning of the *serialization*. Ideally, our meta-data would contain both.
NOTE: There should be a separate task for having JSON version info in the database (we will not do it for now).
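
A minimal sketch of the idea above, assuming a single meta-info header that is built once and then placed differently per use case. All key names (`serialization-version`, `model-version`, `license`, `meta`), the values, and the helper names are hypothetical; nothing here is an agreed format.

```python
import json

def build_meta(serialization_version, model_version, license_name):
    """Assemble one meta-info header (all key names are hypothetical)."""
    return {
        "serialization-version": serialization_version,
        "model-version": model_version,
        "license": license_name,
    }

def attach_meta(document, meta):
    """Special:EntityData / API: return a copy of the output with the header added."""
    out = dict(document)
    out["meta"] = meta
    return out

def meta_file_name(dump_name):
    """JSON dump: name of the separate .meta.json file accompanying the dump."""
    stem = dump_name[:-len(".json")] if dump_name.endswith(".json") else dump_name
    return stem + ".meta.json"

meta = build_meta("1.0", "1.0", "CC0-1.0")
print(json.dumps(attach_meta({"entities": {}}, meta), indent=2))
print(meta_file_name("wikidata-all.json"))
```

The header carries both the serialization version and the model version, as the first NOTE suggests; only the placement differs between the three outputs.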

Event Timeline

Tobi_WMDE_SW raised the priority of this task from to Medium.
Tobi_WMDE_SW updated the task description.
daniel set Security to None.

This should be very easy with the datamodel-serialization component now: we could simply specify the version of the component in or near the output.

In terms of the API, we would also want to bind the other information that we add around the serialization of an entity to that version number.

Lydia_Pintscher renamed this task from Versioning in JSON output to [Story] Versioning in JSON output. Sep 10 2015, 12:37 PM

Right now, the JSON dump format is a sequence of JSON objects. Each of these JSON objects is a Wikidata entity. There is nothing preventing the dump format from having the first JSON object be information about the dump, including version of the dump format, version of wikidata format, time of dump, etc.

As long as this JSON object did not conform to the form of JSON objects that encode Wikidata entities this change would not be a breaking change! (I do think that it would be better for it to not have any of the names that are currently being used in JSON objects that encode Wikidata entities.)

> There is nothing preventing the dump format from having the first JSON object be information about the dump, including version of the dump format, version of wikidata format, time of dump, etc.
>
> As long as this JSON object did not conform to the form of JSON objects that encode Wikidata entities this change would not be a breaking change!

Of course it would be a breaking change. There is no formal spec of the JSON dump beyond the spec for the individual entities, but we have always said that the dump is a set (an array) of entities. Putting something in there that is not an entity will break consumers.

If we are going to break the format, I prefer to introduce a proper envelope with a clear place for meta-data.

Note however that for the Special:EntityData interface, we have a similar but different problem: There we have only a single entity object, with no array or other structure around it. We can easily put the meta-info into the object itself, but that is semantically ugly. We are already mixing info about the page (revision, timestamp, etc) with the item data. Adding meta-info about the file would be possible, but would increase the mess.

So we might want to introduce a similar envelope structure there, which would be a pretty huge breaking change to the interface we use to resolve URIs. That's not to be taken lightly. Even if we introduce version info into the URL, we can't change the URIs, so clients would still get an unexpected data structure.

All this considered, the original proposal is probably still the best:

  • Special:EntityData would have the version info inlined
  • API responses would have it side by side with the entity info
  • JSON dump files would be accompanied by a separate .meta.json file

This is ugly because it means the version info will be in a different place depending on how you retrieve the data, but at least it wouldn't be a breaking change.

> Of course it would be a breaking change. There is no formal spec of the JSON dump beyond the spec for the individual entities, but we have always said that the dump is a set (an array) of entities. Putting something in there that is not an entity will break consumers.

Well I think that it should be a breaking change, but I read the stable interface policy as saying that it isn't. Well-behaved consumers are supposed to be tolerant of extra information. Adding a new item to the dump array is adding extra information. This extra information does not change the meaning of the existing information in any way.
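
To make the "tolerant consumer" reading concrete, here is a rough sketch of a dump reader that skips anything in the array that does not look like an entity. The entity test (presence of `id` and `type` keys) and the `dump-format-version` key are assumptions for illustration, not part of any spec.

```python
import json

def is_entity(obj):
    # Assumption: entity objects carry both an "id" and a "type" key.
    return isinstance(obj, dict) and "id" in obj and "type" in obj

def read_entities(dump_text):
    """Yield only the entity objects from a dump given as one JSON array."""
    for obj in json.loads(dump_text):
        if is_entity(obj):
            yield obj

dump = '[{"dump-format-version": "1.0"}, {"id": "Q42", "type": "item"}]'
print([entity["id"] for entity in read_entities(dump)])
```

A consumer that instead assumes every array element is an entity (for example by reading `obj["id"]` unconditionally) would fail on the first object, which is the case the breaking-change argument is about.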

I'm looking at https://www.wikidata.org/wiki/Special:EntityData/Q42.json structure, and I see it is:

{
  "entities": {
    "Q42": { ...stuff... }
  }
}

So, why can't we have:

{
  "entities": {
    "Q42": { ...stuff... }
  },
  "@metadata": {
     "version": "1.0",
     "phase-of-moon": "gibbous"
  }
}

etc.? I don't think well-behaved clients should mind? But even if they do, it shouldn't be hard for them to make a one-time change to ignore @metadata if they don't want it? After all, we already have other items besides entities, e.g. in the API response there's success.
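
As a sketch of what "well-behaved" could mean here: a client that only looks up the `entities` key gets the same result with or without the extra `@metadata` member. The helper name is made up for this example.

```python
import json

def extract_entities(response_text):
    """Return the "entities" mapping, ignoring any other top-level keys."""
    doc = json.loads(response_text)
    return doc.get("entities", {})

old = '{"entities": {"Q42": {"id": "Q42"}}}'
new = ('{"entities": {"Q42": {"id": "Q42"}},'
       ' "@metadata": {"version": "1.0", "phase-of-moon": "gibbous"}}')

# Both shapes yield the same entity data.
print(extract_entities(old) == extract_entities(new))
```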

The same can be done with wbgetentities, etc. APIs. For the dump it's a bit harder, as we don't have any kind of encompassing structure beyond entity data... Maybe we should make it use an entities property too, like the API format? It will be a breaking change, but at least it breaks the stalemate and allows us to easily add more info later. It also won't be hard to distinguish between the formats, including by streaming readers: you actually need only the first character to know which one you're dealing with, and converting the new one into the old one or back should be rather easy.
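
The first-character check could look roughly like this, assuming the current dump starts with `[` and an enveloped dump would start with `{`. The function name and the returned labels are made up for this sketch.

```python
import io

def detect_dump_format(stream):
    """Classify a dump by its first non-whitespace character."""
    while True:
        ch = stream.read(1)
        if ch == "":
            raise ValueError("empty dump")
        if not ch.isspace():
            break
    if ch == "[":
        return "array"      # current format: a bare array of entities
    if ch == "{":
        return "envelope"   # proposed format: an object with an "entities" property
    raise ValueError("unexpected first character: %r" % ch)

print(detect_dump_format(io.StringIO('[\n{"id": "Q1"}\n]')))
print(detect_dump_format(io.StringIO('{"entities": {}}')))
```

Only one character is consumed beyond leading whitespace, so a streaming reader can decide which parser to use before reading any entity data.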

@Smalyshev Oh right, I forgot that we do already output an envelope from Special:EntityData! That makes things easier.

For the dump, yea, we have the choice to break the format really hard, or associate a secondary file containing the meta-data.

A metadata file would maybe be fine too; then it should look the same as the API format without entities, I think, e.g. we'd have:

{
  "@metadata": {
     "version": "1.0",
     "phase-of-moon": "gibbous"
  }
}

That'd make parsing it easier, I think.
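
A small sketch of why the shared shape helps: if the separate metadata file uses the same `@metadata` member as the API-style output, one reader can serve both. The function name is hypothetical.

```python
import json

def read_metadata(text):
    """Return the "@metadata" object from an API-style response or a metadata file."""
    return json.loads(text).get("@metadata", {})

api_style = '{"entities": {"Q42": {}}, "@metadata": {"version": "1.0"}}'
meta_file = '{"@metadata": {"version": "1.0"}}'

# The same call works for both inputs.
print(read_metadata(api_style) == read_metadata(meta_file))
```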

> Well I think that it should be a breaking change, but I read the stable interface policy as saying that it isn't. Well-behaved consumers are supposed to be tolerant of extra information. Adding a new item to the dump array is adding extra information. This extra information does not change the meaning of the existing information in any way.

I would definitely treat it as a breaking change, but you are right that the policy isn't very clear about that. The reason I think it is breaking is that if we have a concept of "list of X", you can't add a Y to it (if Y is not an X) without breaking the format. Adding an incompatible Y to the list means that the list is no longer homogeneous. In my mind, that changes the interpretation. But I'll add a note-to-self to clarify this point. Thanks!

> Metadata file maybe would be fine too, then it should look the same as the API format without entities, I think:

agreed

Concerning the dumps, it should be possible to add versioning information on a per-entity basis, for instance by adding the revision id in the JSON serialization of the entity, as is currently done in Special:EntityData. This would arguably be more useful than a per-dump versioning, given that the dump generation process is not atomic. It would also be less of a breaking change: it would just amount to making the JSON serialization of entities more uniform. This is debated in T87283.
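
For illustration, a consumer could then read the revision of each entity straight from the dump. This sketch assumes a dump with one entity per line inside a JSON array, and a `lastrevid` key on each entity as in Special:EntityData output; both are assumptions about the layout, and entities without the key simply map to `None`.

```python
import json

def revisions_by_entity(dump_lines):
    """Map entity id to revision id for a line-per-entity dump."""
    revisions = {}
    for line in dump_lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(line)
        revisions[entity["id"]] = entity.get("lastrevid")
    return revisions

lines = ['[', '{"id": "Q1", "lastrevid": 10},', '{"id": "Q2"}', ']']
print(revisions_by_entity(lines))
```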

This will likely get worked on, at least for API output, with the work on federation that should happen toward the end of this year.