
Investigate ArangoDB for Wikidata Query
Closed, Resolved · Public

Event Timeline

Manybubbles assigned this task to Smalyshev.
Manybubbles raised the priority of this task from to Needs Triage.
Manybubbles updated the task description.
Manybubbles moved this task to In Dev/Progress on the Wikidata-Query-Service board.
Manybubbles subscribed.
Manybubbles set Security to None.
Manybubbles renamed this task from Investigate ArangoDB to Investigate ArangoDB for Wikidata Query. (Feb 4 2015, 5:20 PM)
Manybubbles moved this task from Backlog to In Dev/Progress on the MediaWiki-Core-Team board.

Preliminary notes:

  • A graph is a set of vertex collections and edge collections, both of which are document collections with some special attributes.
  • ArangoDB keeps indexes only in memory and rebuilds them on collection load, which takes about 1 hour for a dump of 3M objects with 4 indexes.
  • Only one fulltext index per collection.
  • Non-indexed search is not very fast, so we do need indexes.
  • Blueprints support is for 2.5 and seems to be deprecated, so it's not clear whether TinkerPop 3 support will ever happen. They have "generic graph support" which does not follow TinkerPop but uses their internal idioms.

I have doubts that their model would scale to our data sizes. Maybe sharding would help, but we'd need quite a lot of shards then.

Looks like it's out! Thanks for the report! Can you put it on the wiki
page and grab the next one on the list? Tinkerpop 3 supports Neo4j so that
might be quick to test, just replace titan with gremlin-neo4j in
wikidata-gremlin. Maybe.

Hi, I'm the CTO of ArangoDB, so my comments are most certainly biased. I would still like to tell you our opinions on the raised issues, namely full-text indexes and Blueprints.

(1) We do not believe that TP is helpful in a sharded environment. Gremlin is a nice language, but it requires you to move a lot of data into the client. This works very well if you can embed the database and keep it in the same process space. As soon as you need to shard the data and spread it across many servers, you will move a lot of data between Gremlin and the DB servers. Therefore we decided to create a Javascript version of Gremlin which runs directly on the shards, thus minimising the amount of moved data. So it is indeed true that we did not add support for TP3, because we believe it will be of limited use.

(2) Fulltext indexes are not our main expertise. We think that search engines like ElasticSearch or Solr are much better at this, especially when it comes to stemming, different languages, and phonetic searches. There is an ElasticSearch plugin to use ElasticSearch as the fulltext search engine for ArangoDB. The fulltext index is indeed very slow when building. We want to speed up the process and hopefully can improve there over time (see also the next bullet point). I assume that you are using a fulltext index in your example, right?

(3) We decided to keep the indexes only in memory. The reasons are as follows.

There are various possibilities:

(1) use memory only indexes (this is currently implemented in ArangoDB)
(2) use disk-based indexes (this is currently implemented in CouchDB)
(3) disk-backed with a file-system-like clean flag
(4) other solutions, like keeping only parts in memory or using memory as a cache, are also possible

There is a trade-off:

Runtime behaviour:

(1) this is the fastest solution
(2) this is the slowest solution because you need to ensure that there are no inconsistencies even in case of a server crash. If you have a look at what CouchDB does you will see what I mean. You need to do much more syncing than in (1).
(3) could be nearly as fast as (1)

Startup behaviour:

(1) this is the slowest solution
(2) this is the fastest solution
(3) depends: with a clean shutdown as fast as (2), with a crash as slow as (1)

So if you expect your server to crash often, then (1) might not be a good idea. If you expect your server to run stably, then (1) might be much faster during normal operations. The best of all worlds would be (3). ArangoDB currently uses (1), but we want to switch to (3).

@Fceller many thanks to you for your explanations!

I certainly understand the reasons for your choices, however I'm still not sure how our data sizes (~16M vertices, ~100M edges, 2-3 thousand indexed fields) can be supported in the current ArangoDB model. Maybe I am missing some opportunity or capability; please feel free to correct me.

I assume that you are using a fulltext index in your example, right?

No, I used two hash indexes and two skiplist indexes, on 2961954 documents (only vertices, no edges). Of course, it is a rough test, but we'd have to have many more indexes on about an order of magnitude more data, so given that the loading time right now is over an hour, it is concerning: it means that if a server goes down, the system may be out of commission for hours at least.

@Smalyshev May I ask how much main memory your server has?

Also, can you elaborate a little bit on what you mean by "2-3 thousand indexed fields"? Are you going to have JSON documents with 2000 attributes that you are going to index?

Thanks

Frank

@Fceller this was tested on my work machine, so it has about 8G of memory. Are you saying that giving it more memory would significantly improve things? How much memory would be enough to get loading data of this size under 10 minutes?

what you mean by "2-3 thousand indexed fields"? Are you going to have JSON documents with 2000 attributes that you are going to index?

There are thousands of indexable attributes, but no single document has all of them; a typical one would have tens, rarely up to a hundred. We need to be able to search by any of the attributes, though, and by any combination of them. Some of them will need only a direct match (hash), some will need range/comparison or geospatial matching.

There are thousands of indexable attributes, but no single document has all of them; a typical one would have tens, rarely up to a hundred. We need to be able to search by any of the attributes, though.

I see. That could be the root cause of the problem. The test dataset that you used, did it contain a lot of documents with unset attributes for the hash index? Currently we do not support sparse indexes, but my colleague has started work on enhancing them to allow this. I'll check with him.

@Fceller for the test, I used two of the attributes that are present in each document for the hash indexes, and two that may or may not be present for the skiplist ones. In a typical situation, most of the indexable attributes would not be present, but some will always be present (e.g. internal ID and type).

@Smalyshev Is this a test dataset? Is it possible to get the generator and/or the data?

@Fceller it's a partial import of the JSON dump from http://dumps.wikimedia.org/other/wikidata/ (for some reason not all of the data loaded, maybe due to some error in the process, but it was enough for my purposes) with these indexes:

{"id":"36457899233","type":"hash","unique":true,"fields":["id"]}
{"id":"36479132897","type":"hash","unique":false,"fields":["type"]}
{"id":"36568589537","type":"skiplist","unique":false,"fields":["sitelinks.enwiki"]}
{"id":"36700251361","type":"skiplist","unique":false,"fields":["sitelist.enwiki.badges"]}

Indexes are pretty random (except for id/type which will be one of the required ones). The import command used was:

gunzip -c ~/Downloads/20150126.json.gz | grep -v '^\[' | sed -e 's/},$/}/g' |  arangoimp --file - --type json --collection wikidata --overwrite true

For production use we'd probably have to restructure this data, since in dump form most of it is not easily indexable, but for a minimal viability test the dump seems to be enough.

This is a report about an actual experiment.

I downloaded 20150126.json.gz to an AWS r3.2xlarge instance (61GB RAM, 80 GB SSD, 8 vCPUs, shared hardware) and then used ArangoDB V 2.4.1 to do the import of the first 2961954 documents in the file. I created the same indexes as you and I used this command:

gunzip -c 20150126.json.gz | grep -v '^\[' | sed -e 's/},$/}/g' | head -n 2961954 | time arangoimp --file - --type json --collection wikidata --overwrite true

Here are the usage statistics:

  • The import took 15 minutes on that machine.
  • The database used at most 10.0 GB resident memory during the import.
  • After the WAL was flushed following the actual import, it used 9.2 GB.
  • This comes from:
    • 6,811,237,720 bytes of actual data (shaped), which is about 2300 b/document
    • 489,390,288 bytes of shape data, which is about 165 b/document
    • 672,169,048 bytes of data for all four indexes together, which is about 226 b/document
  • The actual data files on disk (data + shapes, without indexes): 7,280,736,336 bytes, which is about 2692 b/document

For comparison: the raw (unzipped) JSON data for these documents was 7,390,007,296 bytes (as reported by arangoimp).

When I shut down the database server and restart it so that this collection is loaded (which rebuilds the 4 indexes in memory), this takes about 263 seconds, which is well below 10 minutes. Explicit unloading is pretty fast, and reloading the collection in the still-running server takes about as long.

Further experiments:

I dropped each of these indexes and recreated it. This isolates the rebuilding of the indexes from the loading (and scanning) of the collection files. Here are the timing results for the recreation (dropping is instantaneous):

  • unique hash index on "id": 3.0 s
  • hash index on "type": 2.1 s
  • skip list index on "sitelinks.enwiki": 215.8 s
  • skip list index on "sitelist.enwiki.badges": 25.8 s

Note that index generation is single threaded in version 2.4.

Disclaimer: Sorry, I forgot to introduce myself: My name is Max and I also work for ArangoDB.

Analysis

The 3M documents need around 11 GB of main memory. If you have less, then you see a lot of swapping, because the insert operation in the indexes will essentially do random accesses to the data files, since the indexed attribute data are not copied into the index but remain in the data files. This explains why you needed over an hour on an 8GB machine (which almost certainly does not have 8GB free!).

Given enough RAM, this effect does not happen and the building of the indexes is considerably faster, well below the 10 minutes given as an upper limit. Interestingly, the second experiment suggests that it is the skiplist indexes that essentially take the time, which is not surprising since inserting into a skiplist of length N has complexity O(log(N)).

My guess is that the "sitelist.enwiki.badges" attribute is considerably sparser in this dataset, therefore the skiplist will quite often insert in the first position (inserting "null"). Once we have sparse indexes (we will try to show you a first version of this to experiment with soon), the time for insertion of a document without a certain attribute into the corresponding index should be considerably faster.

Disclaimer: Sorry, I forgot to introduce myself: My name is Max and I also work for ArangoDB.

Thanks!

The 3M documents need around 11 GB of main memory. If you have less, then you see a lot of swapping, because the insert operation in the indexes will essentially do random accesses to the data files, since the indexed attribute data are not copied into the index but remain in the data files. This explains why you needed over an hour on an 8GB machine (which almost certainly does not have 8GB free!).

OK. Maybe it's just a function of not using a super nice machine for testing. We really do want the system to scale down to work with less RAM and cheap, big spinning disks so folks can run it on their laptop. That'll encourage experimentation.

Assuming we're OK with just planning for large server deployments: Does the memory requirement scale linearly with the size of the data? How does that play with sharding and replication? How large are the largest ArangoDB clusters?

Do you think sparse indexes are going to give us enough performance with thousands of these indexes, with similar startup times to what we see now? Is that something we can hack around by using fewer indexes and searching them in more interesting ways? Say we just make one index for all the badges, and for every document with a badge we index an entry for its wiki, the badge name, and its wiki_badge name combination. That way I can query all entries with "enwiki" badges. Or all entries with "featured" badges. Or all entries with "enwiki_featured" badges. We might be able to play similar clever tricks with the attributes.
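
As a rough sketch of that composite-key idea (plain Python over the dump's sitelinks shape; the helper and the "featured" badge value are illustrative only, not an ArangoDB feature), something like this could derive one indexable array per document, covered by a single index:

# Sketch: derive composite badge keys so one index can answer
# "has any enwiki badge", "has a featured badge anywhere", and "enwiki_featured".
def badge_keys(doc):
    keys = set()
    for wiki, sitelink in doc.get("sitelinks", {}).items():
        for badge in sitelink.get("badges", []):
            keys.add(wiki)                   # e.g. "enwiki"
            keys.add(badge)                  # e.g. "featured"
            keys.add(wiki + "_" + badge)     # e.g. "enwiki_featured"
    return sorted(keys)

doc = {"id": "Q23", "sitelinks": {"enwiki": {"badges": ["featured"]}}}
print(badge_keys(doc))  # ['enwiki', 'enwiki_featured', 'featured']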

Hi, I'm the CTO of ArangoDB, so my comments are most certainly biased. I would still like to tell you our opinions on the raised issues, namely full-text indexes and Blueprints.

Thanks for replying!

(1) We do not believe that TP is helpful in a sharded environment. Gremlin is a nice language, but it requires you to move a lot of data into the client. This works very well if you can embed the database and keep it in the same process space. As soon as you need to shard the data and spread it across many servers, you will move a lot of data between Gremlin and the DB servers. Therefore we decided to create a Javascript version of Gremlin which runs directly on the shards, thus minimising the amount of moved data. So it is indeed true that we did not add support for TP3, because we believe it will be of limited use.

Have a look at what they are working on now in their master branch - I think they've struck on a good notion: they are _heavily_ deprecating predicates in favor of anonymous filters. And those should be possible to optimize.

(2) Fulltext indexes are not our main expertise. We think that search engines like ElasticSearch or Solr are much better at this, especially when it comes to stemming, different languages, and phonetic searches. There is an ElasticSearch plugin to use ElasticSearch as the fulltext search engine for ArangoDB. The fulltext index is indeed very slow when building. We want to speed up the process and hopefully can improve there over time (see also the next bullet point). I assume that you are using a fulltext index in your example, right?

That makes sense. I don't plan on using full text indexes in this project at all unless something unexpected comes up. Even so, we have much more experience with Elasticsearch and Lucene so it'd make sense to go there.

(3) We decided to keep the indexes only in memory. The reasons are as follows.

There are various possibilities:

(1) use memory only indexes (this is currently implemented in ArangoDB)
(2) use disk-based indexes (this is currently implemented in CouchDB)
(3) disk-backed with a file-system-like clean flag
(4) other solutions, like keeping only parts in memory or using memory as a cache, are also possible

There is a trade-off:

Runtime behaviour:

(1) this is the fastest solution
(2) this is the slowest solution because you need to ensure that there are no inconsistencies even in case of a server crash. If you have a look at what CouchDB does you will see what I mean. You need to do much more syncing than in (1).
(3) could be nearly as fast as (1)

Startup behaviour:

(1) this is the slowest solution
(2) this is the fastest solution
(3) depends: with a clean shutdown as fast as (2), with a crash as slow as (1)

So if you expect your server to crash often, then (1) might not be a good idea. If you expect your server to run stably, then (1) might be much faster during normal operations. The best of all worlds would be (3). ArangoDB currently uses (1), but we want to switch to (3).

You could also go with a Lucene-like write-once behavior. I don't know that it'd be a good match at all, though. It matches well with the infrequently updated, asynchronous nature of full text search, but it feels like it'd be more troubling for something like ArangoDB. Also probably more work to implement than a clean shutdown. Anyway, I'm sure you've spent more time thinking about it than I have.

OK. Maybe it's just a function of not using a super nice machine for testing. We really do want the system to scale down to work with less RAM and cheap, big spinning disks so folks can run it on their laptop. That'll encourage experimentation.

Well, ArangoDB is designed to be a "mostly-in-memory" database, which means that it really likes to keep all its data including indexes in main memory. The actual data is persisted using memory mapped files, but all access patterns essentially assume that all data is available in RAM. By the way, most of your candidate graph databases (Neo4j, OrientDB, Wikidata Query service) use this approach. Only Titan/Cassandra would run well with less RAM than data.
Therefore, if you want this, you have to use a corresponding engine, and this will probably mean Titan.
However, you also said that you want to index by essentially all attributes. If this is true, then you will face a dilemma with any database solution: either at least some queries are intolerably slow, because the actual data or the particular index you access needs to be swapped in, or everything fits into RAM, in which case it will be cached there. This is because you have essentially unpredictable random accesses to your data.
So I would expect that even with an essentially disk-based database engine you will run into problems and due to the virtualisation features of modern OSes you will get the same performance behaviour with a mostly-in-memory database like ArangoDB for your scenario.

Assuming we're OK with just planning for large server deployments: Does the memory requirement scale linearly with the size of the data? How does that play with sharding and replication? How large are the largest ArangoDB clusters?

The memory requirements for the data files will scale linearly with the actual data size.
The memory requirements for the shapes will probably even scale sub-linearly with the actual data size, this is what I observed today and it sounds reasonable, since later data sets will be able to reuse existing shapes from earlier data sets.
The memory requirements for each index scales essentially linearly with the data but not continuously so, for example the memory requirement of a hash table jumps when it has to rehash. Skip lists do not show this behaviour.
Replication simply replays the actions of one server on another one, the memory usage will be identical on each machine.
Sharding distributes the data to different machines, which will each index their part only. ArangoDB does not have actual global indexes for a sharded collection. Queries are run against each local index on each shard and the results are merged.
We do not know what the actual largest deployments of ArangoDB are. However, scalability will crucially depend on the queries that actually hit the database. For example, simply finding all documents with a specified value or range in one indexed field is trivially shardable and will be efficient with huge clusters. Certain unfortunate joins can be more problematic. For example, graph traversals in a graph that is sharded in an unfavourable way can be a disaster.
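
To illustrate the "local index per shard, merge at the coordinator" behaviour described above, here is a minimal scatter/gather sketch in plain Python (toy data, not ArangoDB internals):

# Toy scatter/gather: each shard answers from its own local index,
# and the coordinator merges the partial results.
shards = [
    [{"id": "Q1", "type": "item"}, {"id": "P31", "type": "property"}],
    [{"id": "Q5", "type": "item"}],
    [{"id": "Q23", "type": "item"}, {"id": "P39", "type": "property"}],
]

def query_shard(shard, field, value):
    # Stand-in for a lookup in one shard's local index.
    return [doc for doc in shard if doc.get(field) == value]

def query_collection(shards, field, value):
    results = []
    for shard in shards:                       # scatter the query
        results.extend(query_shard(shard, field, value))
    return results                             # gather and merge

print(query_collection(shards, "type", "item"))  # the three "item" documents: Q1, Q5, Q23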

The thing which worries me the most is the non-persistent indexes. Note that the real data size would be not 3M nodes but about 20M nodes and over 100M edges, which all need to be indexed. Or, if we convert edges to nodes for indexing, that'd be about 150M nodes and a comparable number of edges, with indexing mostly happening on the nodes. Which means, if we take about 3-4K per node, it's about 60-80G of memory, not counting the edges. So it would probably run on a 64G+ server, but you can pretty much forget about running it on a desktop/non-server machine. Even then, the question is: what if our data size doubles?

Now, looking at the index times: if we take the optimistic time of 3 s per 3M nodes, then a full-data index rebuild would take about 30 s per index. If we have about 2000 properties to index, that brings us to roughly 16 hours of startup time, which doesn't sound good. Of course, that is assuming the times add up linearly. If we take the skiplist index times, which may be necessary on some of the properties as they are numbers/quantities/dates, we'd have 200 to 2000 s per index, assuming linear scaling, which for thousands of indexed properties looks like it would take days to finish. Again, this is all under the assumption that the numbers behave linearly, for which I have no empirical substantiation. This is why I am worried about non-persistent indexes.
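
For the record, the arithmetic behind that estimate, as a sketch only (it simply assumes the per-index timings above scale linearly with data size, which is unverified):

# Back-of-envelope startup-time estimate, assuming linear scaling.
num_indexes = 2000                 # rough number of indexable properties

hash_rebuild_s = 30                # optimistic full-data rebuild per hash index
print(hash_rebuild_s * num_indexes / 3600.0, "hours")              # ~16.7 hours

for skiplist_rebuild_s in (200, 2000):                              # skiplist range quoted above
    print(skiplist_rebuild_s * num_indexes / 3600.0 / 24, "days")   # ~4.6 to ~46 days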

If we want to make the scale test, we could find a 64G server and try to load some synthetic test with nodes, edges and at least the roughly estimated number of indexes. It'd probably require writing a new tool for transforming the data, as the current one uses Tinkerpop3 and probably won't work in this case, but it's probably possible to do in a couple of days.

Do you think sparse indexes are going to give us enough performance with thousands of these indexes, with similar startup times to what we see now? Is that something we can hack around by using fewer indexes and searching them in more interesting ways? Say we just make one index for all the badges, and for every document with a badge we index an entry for its wiki, the badge name, and its wiki_badge name combination. That way I can query all entries with "enwiki" badges. Or all entries with "featured" badges. Or all entries with "enwiki_featured" badges. We might be able to play similar clever tricks with the attributes.

First of all, a sparse index will only consume memory for those documents which have the indexed attribute set. This is what makes it feasible at all to have "thousands of indexes". Secondly, inserting a document into thousands of indexes can take considerable time (see the timings above); however, a sparse index has to do nothing for a document that has its attribute value unset. So sparse indexes are needed for the insertion time as well.

I cannot really answer your question, in particular since it will depend on whether you have only "thousands of hash indexes" or even "thousands of skiplist indexes", as the above timings suggest. For 16M documents, the difference between O(1) and O(log(n)) complexity really matters (log(16M) is about 24, after all...). Furthermore, the actual sparsity of the attribute values for your indexes will matter.

Therefore, playing clever tricks to reduce the amount of indexes like you describe is definitely a good idea. A database engineer (of any flavour) should at least be a tiny bit scared when he reads "thousands of indexes", because no DB engine I know of is really happy about this prospect. Cassandra for example will, as far as I know, duplicate the data many (thousands of?) times to offer this type of indexing...

Furthermore, we have not yet talked about edges. How large is the data about your 100M edges? Do the edges carry substantial amounts of data themselves? Is there a sample of this data available online anywhere? Do you need to index the edges in any way? Please keep in mind that the edge collection will need at least the "edge-index" of its own...

Finally, for an informed decision about the database engine one would have to know what kind of queries will hit the database later in production. In particular for graph-like queries and queries mixing graph- with index lookups and possibly joins, one has to look carefully to see how they would perform, in particular with sharding. Do you have any information about the needed queries for your use case?

@Smalyshev: Please also note: in my tests today the 4 indexes for 3M documents needed 672 MB of data, which is a reasonable amount of 226 bytes per document. That should be about 3GB (optimistically!) for your 16M documents (forgetting about edges!). If you really have 2000 indexes then you would have to calculate with 1.5 TB of index data, which is probably totally impossible. Therefore it will only be possible with sparse indexes, and your attributes really have to be sparse (for each attribute only a low percentage of documents has a value set). Then you will have considerably lower memory usage for the indexes. However, this will obviously also reduce the insertion time. So your 30s per index (for 16M) and thus the 16h are unrealistic in this scenario. I would imagine that the insertion time will be linear with the amount of memory the indexes use, and thus under control.
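
The rough arithmetic behind those figures, as a sketch (it assumes dense, non-sparse indexes throughout and simply extrapolates the measured per-document cost):

# Memory estimate for dense (non-sparse) indexes, extrapolated from the
# measured 226 bytes/document for 4 indexes on 3M documents.
bytes_per_doc_4_indexes = 226
docs = 16_000_000

print(bytes_per_doc_4_indexes * docs / 1024.0**3, "GB for 4 indexes")      # ~3.4 GB

per_index = bytes_per_doc_4_indexes / 4.0        # ~56 bytes/document/index
print(per_index * docs * 2000 / 1024.0**4, "TB for 2000 dense indexes")    # ~1.6 TB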

So in short: Either your sparsity assumptions hold and you use sparse indexes, or you are doomed anyway, simply because of the memory requirements. With sparsity, the startup time should not be such an issue.

I cannot really answer your question, in particular since it will depend on whether you have only "thousands of hash indexes" or even "thousands of skiplist indexes", as the above timings suggest. For 16M documents, the difference between O(1) and O(log(n)) complexity really matters (log(16M) is about 24, after all...). Furthermore, the actual sparsity of the attribute values for your indexes will matter.

We'd need the indexes that can do range queries - skiplist I presume.

Reality dictates sparsity here - we'll be pretty sparse. Properties that only make sense on people will rarely be on abstract concepts. There isn't anything preventing it from time to time, but it should be rare.

Therefore, playing clever tricks to reduce the amount of indexes like you describe is definitely a good idea. A database engineer (of any flavour) should at least be a tiny bit scared when he reads "thousands of indexes", because no DB engine I know of is really happy about this prospect. Cassandra for example will, as far as I know, duplicate the data many (thousands of?) times to offer this type of indexing...

Lucene handles maintaining lots of indexes quite well. You can't query thousands of indexes at a time (you have to play tricks on that end) but you can maintain thousands of indexes - especially if most documents don't contain the fields.

Furthermore, we have not yet talked about edges. How large is the data about your 100M edges? Do the edges carry substantial amounts of data themselves? Is there a sample of this data available online anywhere? Do you need to index the edges in any way? Please keep in mind that the edge collection will need at least the "edge-index" of its own...

I can't think offhand of any indexes we'll need on edges, but Stas probably knows a few.

Finally, for an informed decision about the database engine one would have to know what kind of queries will hit the database later in production. In particular for graph-like queries and queries mixing graph- with index lookups and possibly joins, one has to look carefully to see how they would perform, in particular with sharding. Do you have any information about the needed queries for your use case?

Lots of stuff. Lots of graph traversal stuff.

"List the 10 cities with the most population that have female mayors and are in Europe" <-- currently cities are actually listed as being in counties or regions so we'd either have to flatten that hierarchy or traverse. We'll still have to traverse to the mayor and check its gender.

"Find me all of the humans that were born before 1880 and don't have a date of death"

"Find me all of the humans who's father doesn't have that human listed as a child" "The mother"

"Return the family tree of George Washinton (say we know his id, good old Q23)"

"How many humans (instanceOf Q1) do we have data for?"

@Neunhoef In the current data model, each edge carries a primary value, a boolean flag and a small set (usually well under 10, in most cases 1-3 or none) of secondary values, each of which needs to be indexed. It can also keep a set of auxiliary values for each of those, which are not indexed but may be used when filtering the results in complex lookups. The edges can, of course, be converted to nodes linked by property-less (or almost property-less) edges; the current model is used because of Titan's storage model, which stores nodes and edges together, so looking up an edge property is much cheaper than traversing to a different node.

The query load would probably include targeted lookups (is X a human? Is X alive or dead? Who is the current president of country X?), wider traversals (give me the list of all ape species, give me the list of countries sorted by population), and even wider lists (give me the list of people born before 1800 that have no date of death, give me the list of all female British writers), etc. So the intent is to make indexed lookups very fast, but there may be a need to navigate a significant number of nodes, and unfortunately there is no "natural" way of sharding the data as far as I can see.

Maybe we can take a step back and ignore the ArangoDB specifics for the moment. I also organise NoSQL conferences and consult on NoSQL in general.

Still, I must admit that I'm not familiar with the internal data model of Wikipedia. I've checked with George Washington (Q23) that he has a lot of properties associated with him. However, I fail to see how the traversals you mentioned are defined. For example "Give me the list of countries sorted by population?". What does the data model look like? Is "population" an attribute of "country"? Are all countries connected to a "special" node via an edge? Or are countries identified by a special property? If there is a special "world" node containing edges to all countries, then there is no need for indexes. If there is a "world" node connecting everything and you need to filter the edges, indexes might help.

In general, graph models are very useful if you have paths of different lengths occurring in your query. For example, find a descendent with a given property. On the other hand, if your path always has a fixed length, it will be much faster to use some sort of index. Graph queries are fast if you have a natural start node. If you have to find nodes with a given property, it is much better to use document databases (see your example "find a city with a female mayor"). Sometimes it is possible to combine both approaches. For example, find cities with female mayors and then do a traversal from these cities. That is what I coin "multi-model": being able to switch between models in a query. It is different from multi-personality approaches, where you have a database engine that can be used as a document store or as a graph store - but not as both.

Having said that, I currently would not know which solution I would recommend to you. I'm sure I do not completely understand your data model and where graphs are useful and where they are a hindrance. The same is true for the hardware. On one hand you want cheap hardware and spinning disks, preferably even on a single node; on the other hand your dataset might require a cluster setup. Some of these requirements could be fulfilled by ArangoDB, for some we would need to improve stuff (like finishing the sparse indexes). On the other hand you might be better off with something like ElasticSearch (if the graph searches are mostly fixed paths) or a combination.

Maybe we can take a step back and ignore the ArangoDB specifics for the moment. I also organise NoSQL conferences and consult on NoSQL in general.

Still, I must admit that I'm not familiar with the internal data model of Wikipedia. I've checked with George Washington (Q23) that he has a lot of properties associated with him. However, I fail to see how the traversals you mentioned are defined. For example "Give me the list of countries sorted by population?". What does the data model look like? Is "population" an attribute of "country"? Are all countries connected to a "special" node via an edge? Or are countries identified by a special property? If there is a special "world" node containing edges to all countries, then there is no need for indexes. If there is a "world" node connecting everything and you need to filter the edges, indexes might help.

Sure. Firstly, this is Wikidata, not Wikipedia. From a software perspective Wikipedia is "just" a fully tricked out MediaWiki instance - its data model concerns itself with things like pages, revisions of pages, and templates. It's a wiki, a tool for building pages.

Wikidata is different. It's a tool for collaboratively editing structured data. At its core it's written on MediaWiki as well, and somewhere in its depths it serializes the structured data to JSON blobs and saves it in the same table that stores revisions on Wikipedia (different actual database, same table). That technical background is only really relevant in that it tells you that the native storage isn't particularly queryable.

So this tool's job isn't to be the primary data store at all. Its job is to make Wikidata searchable as a knowledge graph.

The simplest way to find all countries (assuming you have a graph database) is to start with country (Q6256) and then trace all incoming "instance of" (P31) links.
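
As an illustration only (a toy in-memory edge list, not wikidata-gremlin or an ArangoDB query), that traversal amounts to:

# Toy version of "find all countries": follow incoming "instance of" (P31)
# edges that point at country (Q6256). The edge list here is made up.
edges = [
    {"from": "Q183", "property": "P31", "to": "Q6256"},   # Germany -> country
    {"from": "Q142", "property": "P31", "to": "Q6256"},   # France  -> country
    {"from": "Q23",  "property": "P31", "to": "Q5"},      # George Washington -> human
]

countries = [e["from"] for e in edges
             if e["property"] == "P31" and e["to"] == "Q6256"]
print(countries)  # ['Q183', 'Q142']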

In general, graph models are very useful if you have paths of different lengths occurring in your query. For example, find a descendent with a given property. On the other hand, if your path always has a fixed length, it will be much faster to use some sort of index. Graph queries are fast if you have a natural start node. If you have to find nodes with a given property, it is much better to use document databases (see your example "find a city with a female mayor"). Sometimes it is possible to combine both approaches. For example, find cities with female mayors and then do a traversal from these cities. That is what I coin "multi-model": being able to switch between models in a query. It is different from multi-personality approaches, where you have a database engine that can be used as a document store or as a graph store - but not as both.

I think of that as a matter of optimization though. I'm not sure I can justify collapsing mayor gender into all cities, for example. I can almost certainly justify collapsing country into cities, though. This isn't anything unique to a particular technology though.

Having said that, I currently would not know which solution I would recommend to you. I'm sure I do not completely understand your data model and where graphs are useful and where they are a hindrance. The same is true for the hardware. On one hand you want cheap hardware and spinning disks, preferably even on a single node; on the other hand your dataset might require a cluster setup. Some of these requirements could be fulfilled by ArangoDB, for some we would need to improve stuff (like finishing the sparse indexes). On the other hand you might be better off with something like ElasticSearch (if the graph searches are mostly fixed paths) or a combination.

The thing has to be installed in multiple places - certainly in production, where we can spend money on RAM and SSDs and such. But we'll also install it in labs, which is VMs on shared storage. It's acceptable for labs to perform worse. It's not acceptable for the install to be impossible. The same goes for machines for researchers/tinkerers/bot authors. Those can be slow so long as they can use it at all.

Still, I must admit that I'm not familiar with the internal data model of Wikipedia. I've checked with George Washington (Q23) that he has a lot of properties associated with him. However, I fail to see how the traversals you mentioned are defined.

If you're interested, you can look at the Titan-based data model here: https://www.mediawiki.org/wiki/Wikibase/Indexing/Data_Model but in simpler terms, each entity (like Q23, aka George Washington) has a number of statements about it (which are something like "born in", "instance of", "part of", "located in", "served in office as", "cause of death", etc. - these are P31, etc.). These statements have a main value, which can be a scalar value or a link to another entity. The statement can also have qualifiers (i.e. when it happened, where it happened, etc.) and references (i.e. where we know it from, when that data was retrieved, etc.)
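
A heavily simplified sketch of what one entity with a single statement might look like under that description (illustrative only, not the actual Wikibase or Titan serialization; the reference record is made up):

# Simplified, illustrative entity with one statement
# (main value + qualifiers + references), not the real serialization.
entity = {
    "id": "Q23",                              # George Washington
    "statements": [
        {
            "property": "P39",                # position held
            "value": "Q11696",                # President of the United States
            "qualifiers": {
                "P580": "1789-04-30",         # start time
                "P582": "1797-03-04",         # end time
            },
            "references": [
                {"source": "https://www.wikidata.org", "retrieved": "2015-01-26"},
            ],
        },
    ],
}
print(entity["statements"][0]["value"])  # Q11696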

The search may involve traversals of unknown length (i.e. going from a specific place to a country), one-step traversals (i.e. knowing whether the entity is "human" or "female", or whether the office that the person held is "the president of the USA"), and filters (i.e. "it happened in the 19th century", "it is located in Europe"), which can be applied to both values and qualifiers (and, potentially, references).

Right now the data model is schema-neutral - i.e. it does not distinguish between claims and properties and delegates to the query engine to figure out, for example, how to go from a town to a country. This may change in the future, as we may add post-processing that creates additional schema links - such as a direct link from a town to a country - which are not present in the source data (i.e. Wikidata). This, however, will never cover all queries, as any set of properties can be queried against and any path can be traversed. So we need a model that allows us to support such lookups. Understandably, some of them will be easier and some harder, so we need a model that makes frequent cases easy/fast and complex cases possible.