
Wikidata dumps should have revision ID or other sequence mark
Open, Medium, Public

Description

While Wikidata JSON exports (https://www.wikidata.org/wiki/Special:EntityData/Q1.json) currently include a revision ID, the Wikidata JSON dumps do not carry any revision IDs or similar markers. That makes it harder both to integrate the data and to verify how up to date the data is for third-party services. I think we need some identifier - maybe a MediaWiki revision ID, a running counter, a timestamp, or anything else - that is present in all recommended ways of getting data from Wikibase and allows tracking how up to date a data entry is.
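
For illustration, a minimal Python sketch (assuming the current Special:EntityData JSON layout) of the revision information that the per-entity export already exposes and that the dump entities lack:

```python
# Minimal sketch: Special:EntityData already reports a revision ID and a
# modification timestamp per entity; entities in the JSON dump carry no such marker.
import json
import urllib.request

url = "https://www.wikidata.org/wiki/Special:EntityData/Q1.json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

entity = data["entities"]["Q1"]
print(entity.get("lastrevid"))  # revision ID of the serialized entity
print(entity.get("modified"))   # timestamp of that revision (ISO 8601)
```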

Event Timeline

Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description. (Show Details)
Smalyshev subscribed.
Pintoch subscribed.

I am wondering what the status of this is: is more discussion needed about what version information to include, or are we simply waiting for a patch?

I vote for returning the same serialization as in Special:EntityData: this would provide the timestamp, revision ID, and page ID, so that consumers can use whichever they want and get output consistent with the API.

If there is consensus for that, and if someone points me to the relevant part of the code, I could contribute a patch.

Change 500806 had a related patch set uploaded (by Pintoch; owner: Pintoch):
[mediawiki/extensions/Wikibase@master] dumps: Add lastrevid to JSON entity dumps.

https://gerrit.wikimedia.org/r/500806

This comment was removed by Smalyshev.

OK, it seems to be a bit unclear whether this was asking for revision IDs on each particular entity or on the dump as a whole. I think we need both, but the patch above seems to add the revision ID to entities. I think that makes sense.

@Smalyshev okay! Sorry if this is not the right place: I would be happy to migrate the patch to another ticket. Indeed the patch only adds entity-level metadata, not dump-level metadata. I think this would be less of a breaking change, given that it does not require changing the dump structure (and of course it is more useful to me, haha!)

It's a 4-year-old task, so I myself am not 100% clear which one it was back then. So I think having it here is fine.

@Lydia_Pintscher we would need your thoughts about this.

In a nutshell, the proposal is to add the lastrevid field currently exposed in Special:EntityData and in the API (action=wbgetentities) to the JSON dumps too. This field stores the ID of the revision from which the entity was serialized. Currently, no revision information is included in the dumps at all. Because this only adds an extra field to the JSON objects, without changing the rest of the structure, it should not break any reasonable consumer, especially one that can also parse JSON representations from the API.
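
As a rough sketch of what a consumer could do with that field (assuming the usual one-entity-per-line layout of the Wikidata JSON dumps; the file name is made up):

```python
import json

def iter_entities(path):
    """Yield entity objects from a Wikidata-style JSON dump
    (a top-level array with one entity object per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)

for entity in iter_entities("wikidata-dump.json"):
    # With the proposed change, each entity carries the revision it was
    # serialized from, under the same name as in Special:EntityData.
    print(entity["id"], entity.get("lastrevid"))
```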

The patch is here: https://gerrit.wikimedia.org/r/500806 (I'll fix the issues picked up by Jenkins if you think the change is useful)

Thanks for working on this! :)

Yeah this seems fine. I think keeping the naming consistent with the same field in Special:EntityData is good.
Lucas and I also quickly talked about whether this should go at the end or the beginning of the line, and we're leaning towards the end, so that the first thing you see is the entity ID and type. But that's more a preference than a must.

Ok great! I'll move the field to the end and try to make Jenkins happy then.

Change 500806 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] dumps: Add lastrevid to JSON entity dumps.

https://gerrit.wikimedia.org/r/500806

Thanks all for your patience with this! Excited to see my first commit make it into Wikibase \o/

Regarding dump-level metadata, it would be super useful to know what timestamp should be passed to EventStreams for catching up with user edits made after the dump was produced. To find this timestamp, can clients extract the entity with the highest lastrevid from a Wikidata dump, and then retrieve the corresponding timestamp via Special:EntityData like this? Or would a sync-up client lose some edits if it were to do this? (For example, if dumps get produced by parallel workers, they'd probably have to agree on a cut-off revision before starting the dumping process; otherwise, the JSON file wouldn't necessarily contain all changes before the highest lastrevid in the dump file... correct?)
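
A sketch of the catch-up approach described above (purely illustrative and under the same assumptions as before; it does not settle whether edits can be missed, which depends on how the dump cut-off is chosen):

```python
import json
import urllib.request

def newest_revision(dump_path):
    """Return (entity_id, lastrevid) for the highest lastrevid found in the dump."""
    best_id, best_rev = None, -1
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            entity = json.loads(line)
            rev = entity.get("lastrevid", -1)
            if rev > best_rev:
                best_id, best_rev = entity["id"], rev
    return best_id, best_rev

entity_id, rev = newest_revision("wikidata-dump.json")

# Caveat: Special:EntityData returns the *current* revision of the entity,
# whose "modified" timestamp may be newer than revision `rev` in the dump.
url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
timestamp = data["entities"][entity_id]["modified"]

print("catch up on EventStreams from approximately:", timestamp)
```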