
Wikidata dumps should have revision ID or other sequence mark
Open, Medium, Public

Description

While Wikidata JSON exports (https://www.wikidata.org/wiki/Special:EntityData/Q1.json) currently include a revision ID, the Wikidata JSON dumps do not carry any revision IDs or similar markers. That makes it harder both to integrate the data and to verify how up to date the data is for third-party services. I think we need some identifier - maybe a MediaWiki revision ID, a running counter, a timestamp, or anything else - that is present in all recommended ways of getting data from Wikibase and allows tracking how up to date a data entry is.
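
For illustration, a minimal Python sketch (assuming the current Special:EntityData JSON layout) of the revision information that the per-entity export already exposes and that the dump entities lack:

```python
# Minimal sketch: Special:EntityData already reports a revision ID and a
# modification timestamp per entity; entities in the JSON dump carry no such marker.
import json
import urllib.request

url = "https://www.wikidata.org/wiki/Special:EntityData/Q1.json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

entity = data["entities"]["Q1"]
print(entity.get("lastrevid"))  # revision ID of the serialized entity
print(entity.get("modified"))   # timestamp of that revision (ISO 8601)
```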

Event Timeline

Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description. (Show Details)
Smalyshev subscribed.
Pintoch subscribed.

I am wondering what the status of this is: is more discussion needed about what version information to include, or are we simply waiting for a patch?

I vote for returning the same serialization as in Special:EntityData: this would provide the timestamp, revision ID, and page ID, so that consumers can use whichever they want and get output consistent with the API.

If there is consensus for that, and if someone points me to the relevant part of the code, I could contribute a patch.

Change 500806 had a related patch set uploaded (by Pintoch; owner: Pintoch):
[mediawiki/extensions/Wikibase@master] dumps: Add lastrevid to JSON entity dumps.

https://gerrit.wikimedia.org/r/500806

This comment was removed by Smalyshev.

OK, it seems to be a bit unclear whether this was asking for revision IDs on each particular entity or on the dump as a whole. I think we need both, but the patch above seems to add the revision ID to entities. I think that makes sense.

@Smalyshev okay! Sorry if this is not the right place: I would be happy to migrate the patch to another ticket. Indeed the patch only adds entity-level metadata, not dump-level metadata. I think this would be less of a breaking change, given that it does not require changing the dump structure (and of course it is more useful to me, haha!)

It's a 4-year-old task, so I myself am not 100% clear which one it was back then. So I think having it here is fine.

@Lydia_Pintscher we would need your thoughts about this.

In a nutshell, the proposal is to add the lastrevid field currently exposed in Special:EntityData and in the API (action=wbgetentities) to the JSON dumps too. This field stores the ID of the revision from which the entity was serialized. Currently, no revision information is included in the dumps at all. Because this only adds an extra field to the JSON objects, without changing the rest of the structure, it should not break any reasonable consumer, especially one that can also parse JSON representations from the API.
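
As a rough sketch of what a consumer could do with that field (assuming the usual one-entity-per-line layout of the Wikidata JSON dumps; the file name is made up):

```python
import json

def iter_entities(path):
    """Yield entity objects from a Wikidata-style JSON dump
    (a top-level array with one entity object per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)

for entity in iter_entities("wikidata-dump.json"):
    # With the proposed change, each entity carries the revision it was
    # serialized from, under the same name as in Special:EntityData.
    print(entity["id"], entity.get("lastrevid"))
```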

The patch is here: https://gerrit.wikimedia.org/r/500806 (I'll fix the issues picked up by Jenkins if you think the change is useful)

Thanks for working on this! :)

Yeah this seems fine. I think keeping the naming consistent with the same field in Special:EntityData is good.
Lucas and I also quickly talked about whether this should go at the end or the beginning of the line, and we're leaning towards the end, so that the first thing you see is the entity ID and type. But that's more a preference than a must.

Ok great! I'll move the field to the end and try to make Jenkins happy then.

Change 500806 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] dumps: Add lastrevid to JSON entity dumps.

https://gerrit.wikimedia.org/r/500806

Thanks all for your patience with this! Excited to see my first commit make it into Wikibase \o/

Regarding dump-level metadata, it would be super useful to know what timestamp should be passed to EventStreams for catching up with user edits made after the dump was produced. To find this timestamp, can clients extract the entity with the highest lastrevid from a Wikidata dump, and then retrieve the corresponding timestamp via Special:EntityData like this? Or would a sync-up client lose some edits if it were to do this? (For example, if dumps get produced by parallel workers, they'd probably have to agree on a cut-off revision before starting the dumping process; otherwise, the JSON file wouldn't necessarily contain all changes before the highest lastrevid in the dump file... correct?)
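
A sketch of the catch-up approach described above (purely illustrative and under the same assumptions as before; it does not settle whether edits can be missed, which depends on how the dump cut-off is chosen):

```python
import json
import urllib.request

def newest_revision(dump_path):
    """Return (entity_id, lastrevid) for the highest lastrevid found in the dump."""
    best_id, best_rev = None, -1
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            entity = json.loads(line)
            rev = entity.get("lastrevid", -1)
            if rev > best_rev:
                best_id, best_rev = entity["id"], rev
    return best_id, best_rev

entity_id, rev = newest_revision("wikidata-dump.json")

# Caveat: Special:EntityData returns the *current* revision of the entity,
# whose "modified" timestamp may be newer than revision `rev` in the dump.
url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
timestamp = data["entities"][entity_id]["modified"]

print("catch up on EventStreams from approximately:", timestamp)
```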