
Neunhoef (Max Neunhöffer)
User

Projects

User does not belong to any projects.

User Details

User Since
Feb 6 2015, 2:24 PM (481 w, 2 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
Neunhoef [ Global Accounts ]

Recent Activity

Feb 6 2015

Neunhoef added a comment to T88549: Investigate ArangoDB for Wikidata Query.

@Smalyshev: Please also note: in my tests today the 4 indexes for 3M documents needed 672 MB of data, which is a reasonable amount of 226 bytes per document. That should be about 3GB (optimistically!) for your 16M documents (forgetting about edges!). If you really have 2000 indexes, then you would have to calculate with 1.5 TB of index data, which is probably totally impossible. Therefore it will only be possible with sparse indexes, and your attributes really have to be sparse (for each attribute only a low percentage of documents has a value set). Then you will have considerably lower memory usage for the indexes. However, this will obviously also reduce the insertion time, so your 30s per index (for 16M) and thus the 16h are unrealistic in this scenario. I would imagine that the insertion time will be linear with the amount of memory the indexes use, and thus under control.
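
For reference, a minimal back-of-envelope sketch of this arithmetic in Python (only the 672 MB / 3M-document / 4-index measurement is taken from above; the rest is simple scaling and comes out slightly less optimistic than the rounded figures):

    # Back-of-envelope index memory estimate, using the measurement above:
    # 4 indexes over 3M documents took 672 MB in total.
    measured_bytes   = 672 * 1000**2        # 672 MB for 4 indexes, 3M documents
    measured_docs    = 3_000_000
    measured_indexes = 4

    bytes_per_doc     = measured_bytes / measured_docs          # ~224 B/doc for 4 indexes
    per_doc_per_index = bytes_per_doc / measured_indexes        # ~56 B/doc/index

    target_docs = 16_000_000
    print(f"4 indexes,    16M docs: {bytes_per_doc * target_docs / 1e9:.1f} GB")              # ~3.6 GB
    print(f"2000 indexes, 16M docs: {per_doc_per_index * target_docs * 2000 / 1e12:.1f} TB")  # ~1.8 TB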

Feb 6 2015, 10:29 PM · MediaWiki-Core-Team, Wikidata-Query-Service
Neunhoef added a comment to T88549: Investigate ArangoDB for Wikidata Query.

Do you think sparse indexes are going to give us enough performance with thousands of these indexes, with similar startup times to what we see now? Is that something we can hack around by using fewer indexes and searching them in more interesting ways? Say we just make one index for all the badges, and for every document with a badge we index an entry for its wiki, for the badge name, and for the wiki and badge name combined. That way I can query all entries with "enwiki" badges. Or all entries with "featured" badges. Or all entries with "enwiki_featured" badges. We might be able to play similar clever tricks with the attributes.
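
A minimal sketch of what such a combined-key scheme could look like (the document shape and field names are made up for illustration; the point is that one indexed array of keys answers all three query forms):

    # Hypothetical document: which badges an item has on which wikis.
    doc = {"id": "Q42", "badges": {"enwiki": ["featured"], "dewiki": ["good"]}}

    def badge_keys(doc):
        """Derive the index entries for one document."""
        keys = set()
        for wiki, badges in doc["badges"].items():
            keys.add(wiki)                      # e.g. "enwiki"
            for badge in badges:
                keys.add(badge)                 # e.g. "featured"
                keys.add(f"{wiki}_{badge}")     # e.g. "enwiki_featured"
        return sorted(keys)

    print(badge_keys(doc))
    # ['dewiki', 'dewiki_good', 'enwiki', 'enwiki_featured', 'featured', 'good']
    # Stored in one indexed array field, a single index then serves lookups by
    # wiki, by badge, and by wiki+badge, instead of one index per combination.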

Feb 6 2015, 10:16 PM · MediaWiki-Core-Team, Wikidata-Query-Service
Neunhoef added a comment to T88549: Investigate ArangoDB for Wikidata Query.

Assuming we're OK with just planning for large server deployments: Does the memory requirement scale linearly with the size of the data? How does that play with sharding and replication? How large are the largest ArangoDB clusters?

  • The memory requirements for the data files will scale linearly with the actual data size.
  • The memory requirements for the shapes will probably even scale sub-linearly with the actual data size; this is what I observed today and it sounds reasonable, since later data sets can reuse existing shapes from earlier data sets.
  • The memory requirements for each index scale essentially linearly with the data, but not continuously so: for example, the memory requirement of a hash table jumps when it has to rehash. Skip lists do not show this behaviour.
  • Replication simply replays the actions of one server on another, so the memory usage will be identical on each machine.
  • Sharding distributes the data to different machines, each of which indexes only its own part. ArangoDB does not have actual global indexes for a sharded collection. Queries are run against each local index on each shard and the results are merged (a small sketch of this scatter/gather pattern follows below).
  • We do not know what the actual largest deployments of ArangoDB are. However, scalability will crucially depend on the queries that actually hit the database. For example, simply finding all documents with a specified value or range in one indexed field is trivially shardable and will be efficient on huge clusters. Certain unfortunate joins can be more problematic; for example, graph traversals in a graph that is sharded in an unfavourable way can be a disaster.
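
A purely illustrative sketch of that scatter/gather pattern (shard layout and function names are invented; ArangoDB's coordinator performs this fan-out and merge internally):

    # Two shards, each with its own (local) index; no global index exists.
    shards = [
        {"Q1": {"wiki": "enwiki"}, "Q7": {"wiki": "dewiki"}},
        {"Q3": {"wiki": "enwiki"}, "Q9": {"wiki": "frwiki"}},
    ]

    def query_shard(shard, field, value):
        # Stand-in for a lookup in the shard-local index.
        return [key for key, doc in shard.items() if doc.get(field) == value]

    def query_collection(field, value):
        results = []
        for shard in shards:                  # fan the query out to every shard
            results.extend(query_shard(shard, field, value))
        return sorted(results)                # merge the partial results

    print(query_collection("wiki", "enwiki"))   # ['Q1', 'Q3']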

Feb 6 2015, 10:01 PM · MediaWiki-Core-Team, Wikidata-Query-Service
Neunhoef added a comment to T88549: Investigate ArangoDB for Wikidata Query.

OK. Maybe it's just a function of not using a super nice machine for testing. We really do want the system to scale down to work with less RAM and cheap, big spinning disks so folks can run it on their laptop. That'll encourage experimentation.

Well, ArangoDB is designed to be a "mostly-in-memory" database, which means that it really likes to keep all its data including indexes in main memory. The actual data is persisted using memory mapped files, but all access patterns essentially assume that all data is available in RAM. By the way, most of your candidate graph databases (Neo4j, OrientDB, Wikidata Query service) use this approach. Only Titan/Cassandra would run well with less RAM than data.
Therefore, if you want this, you have to use a corresponding engine, and this will probably mean Titan.
However, you also said that you want to index by essentially all attributes. If this is true, then you will face a dilemma with any database solution: either at least some queries are intolerably slow, because the actual data or the particular index you access needs to be swapped in, or everything fits into RAM, in which case it will be cached there. This is because you have essentially unpredictable random accesses to your data.
So I would expect that even with an essentially disk-based database engine you will run into problems, and due to the virtual memory features of modern OSes you will get the same performance behaviour with a mostly-in-memory database like ArangoDB for your scenario.
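
In other words, the question mostly reduces to a simple capacity check; a trivial sketch (all numbers are placeholders to be replaced by real measurements):

    # With unpredictable random access, the whole working set (data + shapes +
    # indexes) should fit into RAM, whichever engine you pick.
    data_gb   = 16_000_000 * 1000 / 1e9   # placeholder: ~1 kB per document
    index_gb  = 3.6                        # rough index estimate from earlier
    shapes_gb = 0.5                        # placeholder
    ram_gb    = 64                         # the machine you plan to deploy on

    working_set_gb = data_gb + index_gb + shapes_gb
    verdict = "fits in RAM" if working_set_gb < ram_gb else "will swap and thrash"
    print(f"working set ~{working_set_gb:.1f} GB vs {ram_gb} GB RAM: {verdict}")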

Feb 6 2015, 9:51 PM · MediaWiki-Core-Team, Wikidata-Query-Service
Neunhoef added a comment to T88549: Investigate ArangoDB for Wikidata Query.

Disclaimer: Sorry, I forgot to introduce myself: My name is Max and I also work for ArangoDB.

Feb 6 2015, 3:04 PM · MediaWiki-Core-Team, Wikidata-Query-Service
Neunhoef added a comment to T88549: Investigate ArangoDB for Wikidata Query.

Further experiments:

Feb 6 2015, 2:53 PM · MediaWiki-Core-Team, Wikidata-Query-Service
Neunhoef added a comment to T88549: Investigate ArangoDB for Wikidata Query.

This is a report about an actual experiment.

Feb 6 2015, 2:41 PM · MediaWiki-Core-Team, Wikidata-Query-Service