In addition to, or as an alternative to, T85101 (create index for each dump), and similar to T119612 (consider a serialization that supports random access for storage in the DB for Wikidata; see there for which formats to consider): instead of JSON, consider a serialization for the dump that supports random access and can possibly even be used as the in-memory representation.
|Open||None||T88728 Improve Wikimedia dumping infrastructure|
|Open||None||T88991 improve Wikidata dumps [tracking]|
|Duplicate||None||T119613 consider a serialization that supports random access for the dump|
What granularity of random access? Entities? Or Statements?
CDB is probably a good fit for compact write-once random-access files. We could just put the regular JSON blobs in there. We could also have the statements as separate top level elements, if we want to.
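To make the "JSON blobs in a write-once random-access file" idea concrete, here is a minimal sketch. It is not CDB and does not use any CDB library. It is a hand-rolled stand-in that appends each entity's JSON blob to one file, then writes an id-to-(offset, length) index and a fixed-size footer. The function names (`write_dump`, `load_index`, `read_entity`), the footer layout and the choice of 64-bit offsets are all assumptions made for illustration.

```python
import io
import json
import struct

FOOTER = struct.Struct("<QQ")  # hypothetical footer: index offset, index length (64-bit each)


def write_dump(fh, entities):
    """Write each entity as a JSON blob, then an offset index, then the footer."""
    index = {}
    for entity_id, entity in entities.items():
        blob = json.dumps(entity, separators=(",", ":")).encode("utf-8")
        index[entity_id] = (fh.tell(), len(blob))  # where the blob starts, how long it is
        fh.write(blob)
    index_offset = fh.tell()
    index_blob = json.dumps(index).encode("utf-8")
    fh.write(index_blob)
    fh.write(FOOTER.pack(index_offset, len(index_blob)))


def load_index(fh):
    """Read the footer at the end of the file and return the id -> [offset, length] index."""
    fh.seek(-FOOTER.size, io.SEEK_END)
    index_offset, index_length = FOOTER.unpack(fh.read(FOOTER.size))
    fh.seek(index_offset)
    return json.loads(fh.read(index_length).decode("utf-8"))


def read_entity(fh, index, entity_id):
    """Seek straight to one entity and deserialize only that blob."""
    offset, length = index[entity_id]
    fh.seek(offset)
    return json.loads(fh.read(length).decode("utf-8"))
```

Usage would look like `index = load_index(fh)` followed by `read_entity(fh, index, "Q2")`. Only the index and the requested blob are parsed, and the rest of the file is never read. A real format would likely want a more compact index than JSON, but the access pattern is the same.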
CDB wouldn't work for dumps as it is limited to 4GB, according to https://en.wikipedia.org/wiki/Cdb_%28software%29 .
I was thinking of both Entities and Statements within Entities, because the latter would also help deferred deserialization. These could be handled by two different formats, one nested inside the other. But what this would actually be used for, and by whom, is not yet clear.
Something like BJSON, Protocol Buffers or EXI, where you can know the length of something without searching.
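The property that matters here, knowing the length of something without searching, can be illustrated with plain length-prefixed framing. The sketch below is not the wire format of BJSON, Protocol Buffers or EXI. It is a hypothetical layout in which every statement is stored as a 4-byte little-endian length followed by its payload, so a reader can hop over statements it does not need without parsing them. The function names and the use of JSON for the payload are assumptions for illustration.

```python
import io
import json
import struct

LENGTH = struct.Struct("<I")  # hypothetical 4-byte little-endian length prefix


def pack_statements(statements):
    """Serialize statements as a sequence of (length, payload) records."""
    out = io.BytesIO()
    for statement in statements:
        payload = json.dumps(statement).encode("utf-8")
        out.write(LENGTH.pack(len(payload)))
        out.write(payload)
    return out.getvalue()


def read_statement(buf, n):
    """Return the n-th statement (0-based), skipping earlier ones by length only."""
    pos = 0
    for _ in range(n):
        (length,) = LENGTH.unpack_from(buf, pos)
        pos += LENGTH.size + length  # jump over the payload without decoding it
    (length,) = LENGTH.unpack_from(buf, pos)
    start = pos + LENGTH.size
    return json.loads(buf[start:start + length].decode("utf-8"))
```

With this framing, `read_statement(packed, 2)` reads two length prefixes, skips the corresponding payloads and decodes only the third statement. That is the kind of deferred deserialization that statement-level access inside an entity would rely on.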