
Two kinds of JSON dumps?
Open, Lowest, Public

Description

On the #wikimedia-de-tech IRC channel, the idea of having two kinds of JSON dumps came up: compact dumps and fully expanded dumps. One difference would be that full dumps would include snak hashes (see T171607: Main snak and reference snaks do not include hash in JSON output), but there might be other things we could include/exclude as well.
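To make the difference concrete, here is a rough sketch (Python, purely for illustration; the values are invented and the real dump serialization may differ in detail) of the same main snak in compact versus expanded form:

```
import json

# Sketch only: field names follow the Wikibase JSON format, but the values
# below are invented for illustration.
compact_snak = {
    "snaktype": "value",
    "property": "P31",
    "datatype": "wikibase-item",
    "datavalue": {
        "value": {"entity-type": "item", "numeric-id": 5},
        "type": "wikibase-entityid",
    },
}

# The expanded form would additionally carry the snak hash (cf. T171607).
expanded_snak = dict(compact_snak, hash="<40-character hexadecimal SHA-1>")

print(json.dumps(expanded_snak, indent=2))
```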

I’ll leave this for the CC’d people to discuss :)

Event Timeline

Some questions for those who know the details:

  • Are the snak hashes calculated at dump time or is this just another static field to be dumped?
  • What other fields are under consideration?
  • How much longer would it take to do these runs?
  • How much bigger would they be for downloaders?

Are the snak hashes calculated at dump time or is this just another static field to be dumped?

Hashes are not meant to be stored in the database, but calculated every time they are needed.
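For context, a minimal sketch of what computing such a hash at dump time could look like, assuming it is a SHA-1 over some canonical serialization of the snak (Wikibase hashes its own internal serialization, so the exact input, and therefore the resulting values, may differ):

```
import hashlib
import json

def snak_hash(snak: dict) -> str:
    """Hypothetical: SHA-1 over a canonical JSON serialization of the snak.
    Wikibase derives the hash from its own internal serialization, so real
    hash values will not match this sketch."""
    canonical = json.dumps(snak, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```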

What other fields are under consideration?

I assume this refers to secondary values, e.g. normalized quantity values (inches normalized to metres, for example) and full URIs for external identifiers. These should not be included in a minimal dump, but might be included in an expanded dump.
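As a rough sketch of what that could look like for a datavalue (the "normalized" and "full-uri" keys are placeholders I made up, not actual Wikibase field names, and the unit item IDs and identifier are only examples):

```
# Hypothetical expanded forms; the extra keys are for illustration only.
expanded_quantity = {
    "value": {"amount": "+12", "unit": "http://www.wikidata.org/entity/Q218593"},  # inches (example unit item)
    "type": "quantity",
    "normalized": {"amount": "+0.3048", "unit": "http://www.wikidata.org/entity/Q11573"},  # metres
}

expanded_external_id = {
    "value": "nm0000122",  # example identifier value
    "type": "string",
    "full-uri": "https://www.imdb.com/name/nm0000122/",  # resolved via the property's formatter URL
}
```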

How much longer would it take to do these runs?

Runtime is not much of a problem, as far as I'm aware.

How much bigger would they be for downloaders?

That's a good question. I assume it might be somewhere between 1% and 10%. The hashes we are talking about here are mostly SHA-1 hashes in their 40-character hexadecimal form, which do not compress particularly well.
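One way to get a rough number before touching the dumper itself would be to take a sample of entities from an existing dump and compare compressed sizes with and without hashes added. A sketch (bz2 stands in for whatever compression the published dumps actually use, and the hash input is the same assumption as in the sketch above):

```
import bz2
import hashlib
import json

def estimate_overhead(entities: list[dict]) -> float:
    """Fractional size increase from adding a (hypothetical) per-snak hash
    to every main snak, measured on bz2-compressed JSON."""
    def with_hashes(entity):
        entity = json.loads(json.dumps(entity))  # cheap deep copy
        for statements in entity.get("claims", {}).values():
            for statement in statements:
                snak = statement["mainsnak"]
                canonical = json.dumps(snak, sort_keys=True, separators=(",", ":"))
                snak["hash"] = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        return entity

    plain = bz2.compress(json.dumps(entities).encode("utf-8"))
    hashed = bz2.compress(json.dumps([with_hashes(e) for e in entities]).encode("utf-8"))
    return len(hashed) / len(plain) - 1.0
```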

I'd like to see a comparison of the run times and download sizes; if there's "not much difference" (for some value of "not much"), maybe we just want to run the fuller dumps exclusively.

I would vote for simply including hashes in the dumps. They would make the dumps bigger, but they would be consistent with the output of EntityData, which currently includes hashes for all snaks.