
Two kinds of JSON dumps?
Open, Lowest, Public

Description

On the #wikimedia-de-tech IRC channel, the idea of having two kinds of JSON dumps came up: compact dumps and fully expanded dumps. One difference would be that full dumps would include snak hashes (see T171607: Main snak and reference snaks do not include hash in JSON output), but there might be other things we could include/exclude as well.
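To make the difference concrete, here is a rough sketch (Python, purely for illustration; the values are invented and the real dump serialization may differ in detail) of the same main snak in compact versus expanded form:

```
import json

# Sketch only: field names follow the Wikibase JSON format, but the values
# below are invented for illustration.
compact_snak = {
    "snaktype": "value",
    "property": "P31",
    "datatype": "wikibase-item",
    "datavalue": {
        "value": {"entity-type": "item", "numeric-id": 5},
        "type": "wikibase-entityid",
    },
}

# The expanded form would additionally carry the snak hash (cf. T171607).
expanded_snak = dict(compact_snak, hash="<40-character hexadecimal SHA-1>")

print(json.dumps(expanded_snak, indent=2))
```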

I’ll leave this for the CC’d people to discuss :)

Event Timeline

Some questions for those who know the details:

  • Are the snak hashes calculated at dump time or is this just another static field to be dumped?
  • What other fields are under consideration?
  • How much longer would it take to do these runs?
  • How much bigger would they be for downloaders?

Are the snak hashes calculated at dump time or is this just another static field to be dumped?

Hashes are not meant to be stored in the database, but calculated every time they are needed.
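For context, a minimal sketch of what computing such a hash at dump time could look like, assuming it is a SHA-1 over some canonical serialization of the snak (Wikibase hashes its own internal serialization, so the exact input, and therefore the resulting values, may differ):

```
import hashlib
import json

def snak_hash(snak: dict) -> str:
    """Hypothetical: SHA-1 over a canonical JSON serialization of the snak.
    Wikibase derives the hash from its own internal serialization, so real
    hash values will not match this sketch."""
    canonical = json.dumps(snak, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```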

What other fields are under consideration?

I assume this refers to secondary values, e.g. normalized quantity values (inches normalized to metres, for example) and full URIs for external identifiers. These should not be included in a minimal dump, but might be included in an expanded dump.
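As a rough sketch of what that could look like for a datavalue (the "normalized" and "full-uri" keys are placeholders I made up, not actual Wikibase field names, and the unit item IDs and identifier are only examples):

```
# Hypothetical expanded forms; the extra keys are for illustration only.
expanded_quantity = {
    "value": {"amount": "+12", "unit": "http://www.wikidata.org/entity/Q218593"},  # inches (example unit item)
    "type": "quantity",
    "normalized": {"amount": "+0.3048", "unit": "http://www.wikidata.org/entity/Q11573"},  # metres
}

expanded_external_id = {
    "value": "nm0000122",  # example identifier value
    "type": "string",
    "full-uri": "https://www.imdb.com/name/nm0000122/",  # resolved via the property's formatter URL
}
```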

How much longer would it take to do these runs?

Runtime is not much of a problem, as far as I'm aware.

How much bigger would they be for downloaders?

That's a good question. I assume it might be somewhere between 1% and 10%. The hashes we are talking about here are mostly SHA-1 hashes in their 40-character hexadecimal form, which do not compress particularly well.
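One way to get a rough number before touching the dumper itself would be to take a sample of entities from an existing dump and compare compressed sizes with and without hashes added. A sketch (bz2 stands in for whatever compression the published dumps actually use, and the hash input is the same assumption as in the sketch above):

```
import bz2
import hashlib
import json

def estimate_overhead(entities: list[dict]) -> float:
    """Fractional size increase from adding a (hypothetical) per-snak hash
    to every main snak, measured on bz2-compressed JSON."""
    def with_hashes(entity):
        entity = json.loads(json.dumps(entity))  # cheap deep copy
        for statements in entity.get("claims", {}).values():
            for statement in statements:
                snak = statement["mainsnak"]
                canonical = json.dumps(snak, sort_keys=True, separators=(",", ":"))
                snak["hash"] = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        return entity

    plain = bz2.compress(json.dumps(entities).encode("utf-8"))
    hashed = bz2.compress(json.dumps([with_hashes(e) for e in entities]).encode("utf-8"))
    return len(hashed) / len(plain) - 1.0
```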

I'd like to see a comparison of the run times and download sizes; if there's "not much difference" (for some value of "not much"), maybe we just want to run the fuller dumps exclusively.

I would vote for simply including hashes in the dumps. They would make the dumps bigger, but they would be consistent with the output of EntityData, which currently includes hashes for all snaks.