
consider a serialization that supports random access for the dump
Closed, Duplicate · Public

Description

In addition to, or as an alternative to, T85101 (create index for each dump), and similar to T119612 (Consider a serialization that supports random access for storage in the DB for Wikidata; see there for which formats to consider): instead of JSON, consider a serialization for the dump that supports random access and could possibly even be used as the in-memory representation.

Event Timeline

JanZerebecki raised the priority of this task from to High.
JanZerebecki updated the task description.
JanZerebecki added a project: Wikidata.
JanZerebecki added a subscriber: JanZerebecki.
Restricted Application added a subscriber: Aklapper. · Nov 25 2015, 11:07 AM
daniel added a subscriber: daniel. · Nov 26 2015, 12:20 PM

What granularity of random access? Entities? Or Statements?

CDB is probably a good fit for compact write-once random-access files. We could just put the regular JSON blobs in there. We could also have the statements as separate top level elements, if we want to.

CDB wouldn't work for the dumps, as a CDB file is limited to 4 GB, according to https://en.wikipedia.org/wiki/Cdb_%28software%29 .

I was thinking of both Entities and Statements within Entities, because the latter would also help with deferred deserialization. This might be solved by two different formats nested in each other. But what this would actually be used for, and by whom, is not yet clear.

From T119612:

Something like BJSON, Protocol Buffers or EXI, where you can know the length of something without searching.
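To illustrate what "know the length of something without searching" could mean at entity granularity, here is a minimal sketch. It is not one of the formats named above: the record layout (a 4-byte big-endian length followed by a JSON blob) and the helper names `write_records` and `read_record` are assumptions made up for this example.

```python
import io
import json
import struct

# Hypothetical record layout: 4-byte big-endian length, then the JSON bytes.
LENGTH_PREFIX = struct.Struct(">I")


def write_records(stream, entities):
    """Write each entity as a length-prefixed JSON blob; return {id: offset}."""
    index = {}
    for entity in entities:
        payload = json.dumps(entity).encode("utf-8")
        # Remember where this record starts so it can be fetched directly later.
        index[entity["id"]] = stream.tell()
        stream.write(LENGTH_PREFIX.pack(len(payload)))
        stream.write(payload)
    return index


def read_record(stream, offset):
    """Seek to a known offset and read exactly one record, without scanning."""
    stream.seek(offset)
    (length,) = LENGTH_PREFIX.unpack(stream.read(LENGTH_PREFIX.size))
    return json.loads(stream.read(length).decode("utf-8"))


if __name__ == "__main__":
    buf = io.BytesIO()
    offsets = write_records(buf, [{"id": "Q1"}, {"id": "Q2"}])
    print(read_record(buf, offsets["Q2"]))
```

The same idea could be nested (length-prefixed Statements inside a length-prefixed Entity) to allow deferred deserialization, at the cost of maintaining an index alongside the dump.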

We could split the dump to stay below the 4 GB limit... Not nice, but probably still better than inventing our own format.
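A rough sketch of how such a split might be planned, assuming we know the serialized size of each entity up front. The function name `assign_shards` is hypothetical, and the sketch ignores CDB's own per-file and per-record overhead, so a real limit would need some headroom.

```python
def assign_shards(records, limit=4 * 1024**3):
    """Greedily map (key, size) pairs to shard numbers, keeping each shard below limit bytes."""
    placement = {}
    shard, used = 0, 0
    for key, size in records:
        if size >= limit:
            raise ValueError(f"{key} alone does not fit below the shard limit")
        # Start a new shard when adding this record would reach the limit.
        if used + size >= limit:
            shard, used = shard + 1, 0
        placement[key] = shard
        used += size
    return placement


if __name__ == "__main__":
    # Tiny limit just to make the behaviour visible.
    print(assign_shards([("Q1", 40), ("Q2", 50), ("Q3", 30)], limit=100))
```

A reader would then need the key-to-shard mapping (or a deterministic rule, such as hashing the entity ID) to know which file to open.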