While Wikidata JSON exports (https://www.wikidata.org/wiki/Special:EntityData/Q1.json) currently include a revision ID, Wikidata JSON dumps do not have any revision IDs or similar markers. That makes it harder for third-party services both to integrate the data and to verify that it is up to date. I think we need some identifier (maybe a MediaWiki revision ID, a running counter, a timestamp, or anything else) that would be present in all recommended ways of getting data from Wikibase and would allow tracking how up to date a data entry is.
|Open||None||T88728 Improve Wikimedia dumping infrastructure|
|Open||None||T88991 improve Wikidata dumps [tracking]|
|Open||None||T87283 Wikidata dumps should have revision ID or other sequence mark|
I am wondering what the status of this is: is more discussion needed about what version information to include, or are we simply waiting for a patch?
I vote for returning the same serialization as in Special:EntityData: this would provide timestamp, revision id, page id, so that consumers can use whatever they want and get a consistent output in the API.
If there is consensus for that, and if directed to the relevant part of the code, I could contribute a patch.
OK, it seems to be a bit unclear whether this was asking for revision IDs on individual entities or on the dump as a whole. I think that we need both, but the patch above seems to add the revision ID to entities. I think it makes sense.
@Smalyshev okay! Sorry if this is not the right place: I would be happy to migrate the patch to another ticket. Indeed the patch only adds entity-level metadata, not dump-level metadata. I think this would be less of a breaking change, given that it does not require changing the dump structure (and of course it is more useful to me, haha!)
@Lydia_Pintscher we would need your thoughts about this.
In a nutshell, the proposal is to add the lastrevid field currently exposed in Special:EntityData and in the API (action=wbgetentities) to the JSON dumps too. This field stores the id of the revision which contains the entity as serialized. Currently, no revision information is included in the dumps at all. Because this only adds an extra field to JSON objects, without changing the rest of the structure, this should not break any reasonable consumer, especially if they are able to parse JSON representations from the API as well.
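To illustrate what this would mean for a consumer, here is a minimal sketch of reading the field from a dump. It assumes the usual dump layout (a JSON array with one entity per line, each line followed by a comma) and assumes that `lastrevid` appears as a top-level key of each entity, as it does in Special:EntityData. The function name and the sample lines are made up for illustration. Using `.get()` keeps the reader working on older dumps that lack the field.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_entity_revisions(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[int]]]:
    """Yield (entity ID, lastrevid) pairs from the lines of a JSON dump.

    lastrevid is None when the entity carries no revision information.
    """
    for raw in lines:
        # Each entity line ends with a comma, except the last one.
        line = raw.strip().rstrip(",")
        # Skip the array brackets and any blank lines.
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        yield entity["id"], entity.get("lastrevid")


# Hypothetical sample: the second entity mimics a dump without the new field.
sample = [
    "[",
    '{"type":"item","id":"Q1","labels":{},"lastrevid":1234},',
    '{"type":"item","id":"Q2","labels":{}}',
    "]",
]

for entity_id, revision in iter_entity_revisions(sample):
    print(entity_id, revision)
```

A consumer that ignores unknown keys would not need any change at all, which is the point made above about the field being a non-breaking addition.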
The patch is here: https://gerrit.wikimedia.org/r/500806 (I'll fix the issues picked up by Jenkins if you think the change is useful)
Thanks for working on this! :)
Yeah this seems fine. I think keeping the naming consistent with the same field in Special:EntityData is good.
Lucas and I also quickly talked about whether this should go at the end or the beginning of the line, and we're leaning towards the end so that the first thing you see is the entity ID and type. But that's more a preference than a must.
Regarding dump-level metadata, it would be super useful to know what timestamp should be passed to EventStreams for catching up with user edits made after the dump was produced. To find this timestamp, can clients extract the entity ID with the highest lastrevid from a Wikidata dump, and then retrieve the corresponding timestamp via Special:EntityData like this? Or would a sync-up client lose some edits if it were to do this? (For example, if dumps get produced by parallel workers, they’d probably have to agree on a cut-off revision before starting the dumping process; otherwise, the JSON file wouldn’t necessarily contain all changes before the highest lastrevid in the dump file... correct?)
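For concreteness, the lookup I have in mind could be sketched roughly as follows. This assumes that each dump line carries a top-level `lastrevid`, that Special:EntityData accepts a `revision` query parameter, and that its JSON response exposes the timestamp as `modified` under `entities`. The helper names and sample data are made up, and the sketch does not settle whether edits could be lost with parallel dump workers.

```python
import json
from typing import Iterable, Optional, Tuple

# Assumed URL shape for fetching one specific revision of an entity.
ENTITY_DATA_URL = (
    "https://www.wikidata.org/wiki/Special:EntityData/{id}.json?revision={rev}"
)


def newest_entity(lines: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (entity ID, lastrevid) for the highest lastrevid in the dump."""
    best: Optional[Tuple[str, int]] = None
    for raw in lines:
        line = raw.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        rev = entity.get("lastrevid")
        if rev is not None and (best is None or rev > best[1]):
            best = (entity["id"], rev)
    return best


def entity_data_url(entity_id: str, rev: int) -> str:
    """Build the Special:EntityData URL for that entity and revision."""
    return ENTITY_DATA_URL.format(id=entity_id, rev=rev)


def modified_timestamp(response: dict, entity_id: str) -> str:
    """Pull the revision timestamp out of a parsed Special:EntityData response."""
    return response["entities"][entity_id]["modified"]


# Hypothetical dump excerpt; the revision numbers are invented.
dump_lines = [
    "[",
    '{"type":"item","id":"Q1","lastrevid":10},',
    '{"type":"item","id":"Q2","lastrevid":42},',
    '{"type":"item","id":"Q3","lastrevid":7}',
    "]",
]

found = newest_entity(dump_lines)
if found is not None:
    print(entity_data_url(*found))
```

The timestamp returned by `modified_timestamp` would then be the starting point handed to EventStreams, if the cut-off concern above turns out not to be a problem.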