
Output some metadata about the Wikidata JSON dump
Open, Needs Triage, Public

Description

https://stackoverflow.com/questions/48762413/how-to-get-the-cutoff-timestamp-or-lastrevid-for-a-given-wikidata-json-dump

People want to know the most recent revision that is included in the Wikidata JSON dump.

I imagine this isn't as simple as it would seem due to sharding of the dump process, but definitely still possible.

Event Timeline

Adding @Smalyshev, who has been working with the Wikidata weekly dumps recently; if someone else is a better contact person, please feel free to remove yourself and add them.

Personally, I would love to have for each item in the dump a timestamp when it was created and a timestamp when it was last modified.

Related: https://phabricator.wikimedia.org/T278031

I realized I have exactly the same need as the poster on Stack Overflow: get a dump and then use the real-time feed to keep it updated. But you have to know where to start with the real-time feed through EventStreams, using historical consumption to resume from the point the dump was made.
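For illustration, resuming from a known point could look roughly like the sketch below. It assumes the public `recentchange` stream and its `since` query parameter for timestamp-based historical consumption, and that Wikidata events carry `"wiki": "wikidatawiki"`. The helper names and the example timestamp are made up here; the actual cut-off timestamp is exactly what this task asks the dump to expose.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public recent-changes stream.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"


def resume_url(since_iso: str) -> str:
    """Build a stream URL that asks for events from `since_iso` onward."""
    return f"{STREAM_URL}?{urlencode({'since': since_iso})}"


def is_wikidata_event(event: dict) -> bool:
    """Keep only events for Wikidata (assumes the `wiki` field is present)."""
    return event.get("wiki") == "wikidatawiki"


# Placeholder timestamp; the real value would be the dump's cut-off.
url = resume_url("2018-02-13T00:00:00Z")
```

A consumer would open `url` with any server-sent-events client and drop events for which `is_wikidata_event` returns false.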

To find the timestamp of the last Wikidata change that went into a dump file, couldn’t one, while processing the dump, extract the entity and revision ID with the highest lastrevid value in the entire dump, and then retrieve the corresponding modified timestamp for that single edit via Special:EntityData like in this query? The lastrevid field seems to have been added to dumps by T87283 in changeset 500806.
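A minimal sketch of that idea, assuming the usual dump layout (a JSON array with one entity per line, each line ending in a comma) and that every entity carries `id` and `lastrevid`. The function names are invented for this example, and the URL pattern with a `revision` parameter is how I understand Special:EntityData to work, so treat it as an assumption.

```python
import json
from typing import Iterable, Optional, Tuple


def max_lastrevid(lines: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (entity_id, lastrevid) for the highest lastrevid seen, or None."""
    best: Optional[Tuple[str, int]] = None
    for line in lines:
        # Each entity sits on its own line, followed by a comma.
        text = line.strip().rstrip(",")
        if text in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(text)
        revid = entity.get("lastrevid")
        if revid is not None and (best is None or revid > best[1]):
            best = (entity["id"], revid)
    return best


def entity_data_url(entity_id: str, revid: int) -> str:
    """URL for one specific revision of an entity (assumed URL pattern)."""
    return (
        "https://www.wikidata.org/wiki/Special:EntityData/"
        f"{entity_id}.json?revision={revid}"
    )
```

For a real dump one would pass a decompressing file object, for example `bz2.open(path, "rt", encoding="utf-8")`, as `lines`, then fetch `entity_data_url(*result)` and read the entity's `modified` field from the response. Note that this yields the newest revision present in the file, which is not necessarily a consistent cut-off for every item.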

Are you sure lastrevid works like that for the whole dump? I think the dump is made from multiple shards, so lastrevid might not be consistent across all items.

Hm, good point. Could the dumps be made consistent? Maybe like this: Before starting a dump, find the current last revision; pass this cut-off revision ID to the dumping shards; change the dump-producing code to not consider changes after the cut-off revision. But I wouldn’t know how hard this would be. Actually, DumpEntities already seems to take a last-page-id flag, but I don’t know if/where that is getting set in production (and if that’s really enough).
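To make the cut-off idea concrete, here is a toy model in Python. It is not how DumpEntities works; the in-memory revision history and both function names are hypothetical. It only shows the selection rule each shard would need to apply: for every entity, take the newest revision at or below the shared cut-off, and leave out entities that did not exist yet.

```python
from typing import Dict, List, Optional


def revision_at_cutoff(revision_ids: List[int], cutoff: int) -> Optional[int]:
    """Newest revision ID that is <= cutoff, or None if there is none."""
    eligible = [rev for rev in revision_ids if rev <= cutoff]
    return max(eligible) if eligible else None


def snapshot(history: Dict[str, List[int]], cutoff: int) -> Dict[str, int]:
    """Map each entity to the revision a cut-off-consistent dump would contain."""
    result: Dict[str, int] = {}
    for entity_id, revs in history.items():
        chosen = revision_at_cutoff(revs, cutoff)
        if chosen is not None:  # entity was created after the cut-off
            result[entity_id] = chosen
    return result


# Hypothetical revision histories; every shard receives the same cutoff.
history = {"Q1": [10, 40, 90], "Q2": [55, 120], "Q3": [130]}
print(snapshot(history, cutoff=100))
```

The point of the sketch is that skipping entities edited after the cut-off is not enough: for an entity like `Q2` above, the dumper has to load an older revision rather than the current one.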

I am proactively adding @hoo as he can provide some insight and perhaps tag others as well.