[Story] Versioning in JSON output
Open, Medium, Public

Assigned To

None

Authored By

	Tobi_WMDE_SW
	Mar 17 2015, 2:41 PM

Description

From story time meeting on 03-17:

Different use cases for JSON format version info: API, dump-files, Special:EntityData.

Idea by @daniel and @adrianheine:
Having a generic mechanism for a meta-info header that contains version(s) info, license info, etc.
For each of the use cases outlined above, the meta-info would go to a specific location:

Special:EntityData would have it inlined
API responses would have it side by side with the entity info
JSON dump files would be accompanied by a separate .meta.json file

NOTE: Versioning of the *model* (DataValue, Wikibase) is separate from the versioning of the *serialization*. Ideally, our meta-data would contain both.

NOTE: There should be a separate task for having JSON version info in the database (we will not do it for now).

Related Objects

Mentioned In: T149410: For consistency MediaInfo serialization should use "claims" as key, rather than "statements"
T87283: Wikidata dumps should have revision ID or other sequence mark
T142746: Add format version to JSON export
T142084: Document interface stability policy for Wikibase
T48556: [Epic] Wikidata 3rd party client (Instant Wikidata)
Mentioned Here: T87283: Wikidata dumps should have revision ID or other sequence mark

Event Timeline

Tobi_WMDE_SW created this task. Mar 17 2015, 2:41 PM

Tobi_WMDE_SW raised the priority of this task to Medium.

Tobi_WMDE_SW updated the task description.

Tobi_WMDE_SW added subscribers: Tobi_WMDE_SW, daniel, JanZerebecki, adrianheine.

Restricted Application added a subscriber: Aklapper. Mar 17 2015, 2:41 PM

Tobi_WMDE_SW added a parent task: T56085: [Task] EntityIdValues should be serialized as strings, not type/number structures.. Mar 17 2015, 2:45 PM

MC8 added a project: MediaWiki-extensions-WikibaseRepository. Mar 17 2015, 2:50 PM

MC8 subscribed.

MC8 unsubscribed.

Lydia_Pintscher added a project: Wikidata. Mar 17 2015, 2:54 PM

Lydia_Pintscher subscribed.

daniel updated the task description. Mar 17 2015, 3:29 PM

daniel set Security to None.

Lydia_Pintscher moved this task from incoming to ready to go on the Wikidata board. Mar 17 2015, 3:51 PM

This should be very easy with the datamodel-serialization component now: we could simply specify the version of the component in or near the output.

In terms of the API, we would also want to bind the other information that we add around the serialization of an entity to said version number.

Addshore mentioned this in T48556: [Epic] Wikidata 3rd party client (Instant Wikidata). Aug 24 2015, 10:19 AM

Ricordisamoa subscribed. Aug 26 2015, 4:49 PM

Lydia_Pintscher renamed this task from Versioning in JSON output to [Story] Versioning in JSON output. Sep 10 2015, 12:37 PM

JanZerebecki added a project: Story. Sep 25 2015, 8:32 PM

Tobi_WMDE_SW added a project: Wikidata-Sprint-2016-04-26. Apr 25 2016, 1:37 PM

Addshore awarded a token. Apr 25 2016, 2:06 PM

Tobi_WMDE_SW removed a parent task: T56085: [Task] EntityIdValues should be serialized as strings, not type/number structures.. Apr 26 2016, 1:09 PM

Tobi_WMDE_SW removed a project: Wikidata-Sprint-2016-04-26.

daniel mentioned this in T142084: Document interface stability policy for Wikibase. Aug 5 2016, 3:59 PM

aude mentioned this in T142746: Add format version to JSON export. Aug 11 2016, 8:42 PM

Smalyshev merged a task: T142746: Add format version to JSON export. Aug 11 2016, 8:57 PM

Smalyshev awarded a token.

Smalyshev added subscribers: Smalyshev, aude.

Right now, the JSON dump format is a sequence of JSON objects. Each of these JSON objects is a Wikidata entity. There is nothing preventing the dump format from having the first JSON object be information about the dump, including version of the dump format, version of wikidata format, time of dump, etc.

As long as this JSON object did not conform to the form of JSON objects that encode Wikidata entities this change would not be a breaking change! (I do think that it would be better for it to not have any of the names that are currently being used in JSON objects that encode Wikidata entities.)

In T92961#2577993, @Pfps wrote:

There is nothing preventing the dump format from having the first JSON object be information about the dump, including version of the dump format, version of wikidata format, time of dump, etc.

As long as this JSON object did not conform to the form of JSON objects that encode Wikidata entities this change would not be a breaking change!

Of course it would be a breaking change. There is no formal spec of the JSON dump beyond the spec for the individual entities, but we have always said that the dump is a set (an array) of entities. Putting something in there that is not an entity will break consumers.

If we are going to break the format, I prefer to introduce a proper envelope with a clear place for meta-data.

Note however that for the Special:EntityData interface, we have a similar but different problem: There we have only a single entity object, with no array or other structure around it. We can easily put the meta-info into the object itself, but that is semantically ugly. We are already mixing info about the page (revision, timestamp, etc) with the item data. Adding meta-info about the file would be possible, but would increase the mess.

So we might want to introduce a similar envelope structure there, which would be a pretty huge breaking change to the interface we use to resolve URIs. That's not to be taken lightly. Even if we introduce version info into the URL, we can't change the URIs, so clients would still get an unexpected data structure.

All this considered, the original proposal is probably still the best:

Special:EntityData would have the version info inlined
API responses would have it side by side with the entity info
JSON dump files would be accompanied by a separate .meta.json file

This is ugly because it means the version info will be in a different place depending on how you retrieve the data, but at least it wouldn't be a breaking change.

Of course it would be a breaking change. There is no formal spec of the JSON dump beyond the spec for the individual entities, but we have always said that the dump is a set (an array) of entities. Putting something in there that is not an entity will break consumers.

Well I think that it should be a breaking change, but I read the stable interface policy as saying that it isn't. Well-behaved consumers are supposed to be tolerant of extra information. Adding a new item to the dump array is adding extra information. This extra information does not change the meaning of the existing information in any way.
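As a rough illustration of the "tolerant consumer" reading, a dump reader could skip any array element that does not look like an entity. This is only a sketch: the `looks_like_entity` check (requiring `id` and `type` keys) and the `dump-format-version` header in the sample are assumptions made up for this example, not something the stable interface policy or the dump format defines.

```python
import json


def looks_like_entity(obj):
    # Assumption: an entity object is a dict with string "id" and "type" keys.
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("id"), str)
        and isinstance(obj.get("type"), str)
    )


def entities_only(dump_items):
    """Yield entity objects and silently skip anything else."""
    for item in dump_items:
        if looks_like_entity(item):
            yield item


# Hypothetical dump whose first element is a header rather than an entity.
sample = json.loads(
    '[{"dump-format-version": "1.0"},'
    ' {"id": "Q42", "type": "item"},'
    ' {"id": "P31", "type": "property"}]'
)
entity_ids = [e["id"] for e in entities_only(sample)]
```

A consumer that instead reads `item["id"]` on every element without checking would fail on the header object, which is the kind of breakage the "list of entities" reading is worried about.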

I'm looking at https://www.wikidata.org/wiki/Special:EntityData/Q42.json structure, and I see it is:

{
  "entities": {
    "Q42": { ...stuff... }
  }
}

So, why can't we have:

{
  "entities": {
    "Q42": { ...stuff... }
  },
  "@metadata": {
     "version": "1.0",
     "phase-of-moon": "gibbous"
  }
}

etc.? I don't think well-behaved clients should mind. Even if they do, it shouldn't be hard for them to make a one-time change to ignore @metadata if they don't want it. After all, we already have other items besides entities, e.g. in the API response there's success.
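A client that tolerates the extra key might look roughly like the following. This is a minimal sketch, assuming the `@metadata` key from the example above (it is only a proposal in this thread, not an existing part of the output); `split_envelope` is a hypothetical helper name.

```python
import json


def split_envelope(text):
    """Return (entities, metadata) from an EntityData-style envelope.

    "@metadata" is the key proposed in this thread; it is treated as
    optional so responses without it still parse the same way.
    """
    doc = json.loads(text)
    entities = doc.get("entities", {})
    metadata = doc.get("@metadata", {})
    return entities, metadata


# One response without the proposed key and one with it.
without_meta = '{"entities": {"Q42": {"id": "Q42"}}}'
with_meta = (
    '{"entities": {"Q42": {"id": "Q42"}},'
    ' "@metadata": {"version": "1.0"}}'
)
```

Because the function only looks up the keys it knows about, a client written this way would read both shapes without changes.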

The same can be done with wbgetentities and the other APIs. For the dump it's a bit harder, as we don't have any kind of encompassing structure beyond the entity data. Maybe we should make it use an entities property too, like the API format? It would be a breaking change, but at least it breaks the stalemate and allows us to easily add more info later. It also wouldn't be hard to distinguish between the formats, even for streaming readers: you only need the first character to know which one you're dealing with, and converting the new format into the old one or back should be rather easy.
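The "first character" check could be sketched as follows, assuming the current dump is a bare JSON array (starting with `[`) and the proposed format would be an object with an entities property (starting with `{`). The function name and the "array" / "envelope" labels are made up for this example.

```python
import io


def sniff_dump_format(stream):
    """Classify a text stream by its first non-whitespace character."""
    while True:
        ch = stream.read(1)
        if ch == "":
            raise ValueError("empty input")
        if ch.isspace():
            continue
        if ch == "[":
            return "array"     # current dump: a bare array of entities
        if ch == "{":
            return "envelope"  # hypothetical: object with an "entities" key
        raise ValueError(f"unexpected leading character: {ch!r}")


old_style = io.StringIO('[\n{"id": "Q42", "type": "item"}\n]')
new_style = io.StringIO('  {"entities": {"Q42": {}}}')
```

Only one character beyond any leading whitespace is consumed, so a streaming reader can decide how to proceed before parsing anything else.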

@Smalyshev Oh right, I forgot that we do already output an envelope from Special:EntityData! That makes things easier.

For the dump, yea, we have the choice to break the format really hard, or associate a secondary file containing the meta-data.

A metadata file would maybe be fine too; then it should look the same as the API format without entities, I think, e.g. we'd have:

{
  "@metadata": {
     "version": "1.0",
     "phase-of-moon": "gibbous"
  }
}

That'd make parsing it easier, I think.
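Reading such a sidecar file could then reuse the same lookup as for the API envelope. This is a sketch under assumptions: the `@metadata` key is the one proposed above, and how the `.meta.json` file would be named or located relative to the dump has not been decided in this thread, so the caller passes the path explicitly.

```python
import json
from pathlib import Path


def load_sidecar_metadata(meta_path):
    """Return the "@metadata" object from a sidecar file, or {} if unavailable."""
    path = Path(meta_path)
    if not path.is_file():
        # No sidecar next to the dump: behave as if no metadata was given.
        return {}
    with path.open(encoding="utf-8") as fh:
        doc = json.load(fh)
    return doc.get("@metadata", {})
```

Returning an empty dict for a missing file keeps consumers of older dumps (which have no sidecar) on the same code path.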

In T92961#2579283, @Pfps wrote:

Well I think that it should be a breaking change, but I read the stable interface policy as saying that it isn't. Well-behaved consumers are supposed to be tolerant of extra information. Adding a new item to the dump array is adding extra information. This extra information does not change the meaning of the existing information in any way.

I would definitely treat it as a breaking change, but you are right that the policy isn't very clear about that. The reason I think it is breaking is that if we have a concept of "list of X", you can't add a Y to it (if Y is not an X) without breaking the format. Adding an incompatible Y to the list means that the list is no longer homogeneous. In my mind, that changes the interpretation. But I'll add a note-to-self to clarify this point. Thanks!

In T92961#2580231, @Smalyshev wrote:

A metadata file would maybe be fine too; then it should look the same as the API format without entities, I think:

agreed

Addshore mentioned this in T87283: Wikidata dumps should have revision ID or other sequence mark. Aug 28 2018, 7:55 AM

Concerning the dumps, it should be possible to add versioning information on a per-entity basis, for instance by adding the revision id to the JSON serialization of the entity, as is currently done in Special:EntityData. This would arguably be more useful than per-dump versioning, given that the dump generation process is not atomic. It would also be less of a breaking change: it would just amount to making the JSON serialization of entities more uniform. This is debated in T87283.

This will likely get worked on, at least for API output, with the work on federation that should happen toward the end of this year.

Addshore mentioned this in T149410: For consistency MediaInfo serialization should use "claims" as key, rather than "statements". Jun 22 2019, 1:57 PM

Addshore unsubscribed. Jun 27 2023, 12:42 PM
