
[Epic] Evaluate alternatives to Blazegraph
Open, High, Public

Description

Since the Blazegraph project no longer seems to be active (last commit two years ago at https://github.com/blazegraph/database), we need to evaluate whether we want to switch to a graph DB project that is more actively supported and developed.

The requirements should be:

  • Full SPARQL 1.1 support, including SPARQL Update
  • Open source
  • Can load and run queries on full Wikidata database

Related Objects

Event Timeline


@Lydia_Pintscher mentioned a conversation with the Data Commons team at Google; they have this open-source codebase that's somewhat in this area: https://github.com/datacommonsorg/mixer

It looks like Data Commons is not RDF native. And its "SPARQL endpoint" is an approximation of a SPARQL endpoint. That is, it is missing lots of things that make SPARQL endpoints useful to work with.

e.g. Here is a query that works:

curl -X POST 'https://api.datacommons.org/query' \
  -d '{"sparql": "SELECT ?state ?name WHERE { ?state typeOf State . ?state name ?name }"}'
  • but it can't handle select *
  • notice how the predicates don't use a prefix
  • it doesn't respect a header like 'Accept: text/csv'

In fairness they do say:

Graph Query/SPARQL: Given a subgraph where some of the nodes are variables, retrieve possible matches. This corresponds to a subset of the graph query language SPARQL.

https://docs.datacommons.org/api/

And in the code I see:
// Translate takes a datalog query and translates to GoogleSQL query based on schema mapping.
https://github.com/datacommonsorg/mixer/blob/b7518499a19360cc4ee33898feb8e26f81cc060b/internal/translator/translate.go#L853

But if you look how the Translate function gets called I think they mean "takes a SPARQL query."


+1 for Oxigraph. @Tpt has been putting in a ton of good effort, research, features, and stability. I'm sponsoring him on GitHub as well for his effort.
As it's developed in Rust, it automatically takes advantage of data streaming in places that utilize intrinsic functions (forwarded through LLVM compiler IR) in CPUs. Java 17 is only now getting into a better position with its new Vector API. On top of that, the RIO parser, which he also graciously maintains in Rust, is one of the fastest RDF parsers I've seen run on my system.

For the record.

At the time of our first rendezvous re Wikidata hosting, handling 20 billion+ triples would have typically required our Cluster Edition (a commercial-only offering). That was the deal-breaker back at the time of the initial Blazegraph selection for Wikidata, i.e., Blazegraph offered an open-source-based cluster edition.

Anyway, in recent times, our Open Source Edition has evolved to handle some 80 billion+ triples (exemplified by the live Uniprot instance), where performance and scale are primarily a function of available memory. Fundamentally, the current 13 billion triple size of Wikidata and its future growth all lie well within the range of Virtuoso's Open Source Edition.

Also note, based on our experience hosting live DBpedia and Wikidata instances, we do have configuration best practices in place for uptime and scalability without the need for our Cluster Edition (which is really for dealing with massive setups in the 100 Billion Triples or higher range).

I hope this helps.

Related

[1] Our Live Wikidata SPARQL Query Endpoint
[2] Google Spreadsheet about various Virtuoso Configurations associated with some well-known public endpoints
[3] This query doesn't complete with the current Blazegraph-based Wikidata endpoint
[4] Same query completing when applied to the Virtuoso-based endpoint
[5] About loading Wikidata's datasets into a Virtuoso instance
[6] Various demos shared via Twitter over the years regarding Wikidata
[7] Uniprot SPARQL Endpoint Presentation

Since writing this, I have thought more about SPARQL vs Gremlin. The advantages of the former seem to be

  1. we already are accustomed to it and have it hard-coded in a lot of tools
  2. it is an open standard

A move to a Tinkerpop based infrastructure (thus aligning with the big players in the big data graph database industry) might be possible while keeping SPARQL also! See https://github.com/tinkerpop/gremlin/wiki/SPARQL-vs.-Gremlin where Gremlin over Sail is mentioned.
See https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation#mapping-rdf-to-a-property-graph and https://github.com/joshsh/graphsail <- not maintained since 2017 :/

Changing away from SPARQL this late in the endeavor would incur a lot of extra cost, which we should be able to avoid, if in no other way then by transpiling our queries to Gremlin (I found no transpiler that supports property paths; see https://github.com/LITMUS-Benchmark-Suite/sparql-to-gremlin).

I really hope we can find or build a suitable, stable and scalable backend that can handle the load of the next 10 years.

I recommend the Search Team attend ApacheCon to gather insights from others using Big Data successfully:
https://www.apachecon.com/acah2021/tracks/bigdata.html
e.g.:

  • Containing an Elephant: How we moved Hadoop/HBase into Kubernetes and Public Cloud

[1] Our Live Wikidata SPARQL Query Endpoint

Thanks, Kingsley. We tried your Wikidata SPARQL endpoint, in particular some queries with an ORDER BY in the end, which require the computation of a fairly large intermediate result. These queries are hard because even if you want just the top-ranked items in the end, you have to compute all of them first in order to find out which of them are the top-ranked ones.

Your Wikidata SPARQL endpoint computed results for those queries surprisingly quickly, but the results do not seem to be correct.

Here is an example query (all people and their professions, ordered by the number of professions). The top result returned by your SPARQL endpoint is "Janet Jackson" with 15 professions. But there are a number of people with more professions. For example, Johann Wolfgang von Goethe (http://www.wikidata.org/entity/Q5879) with 30 professions. One can also find him with your endpoint by adding FILTER (?person_id = <http://www.wikidata.org/entity/Q5879>) to the query below. So something strange is going on. Can you explain?

SELECT ?person_id ?person (COUNT(?profession_id) AS ?count) (GROUP_CONCAT(?profession; separator=", ") AS ?professions) WHERE {
  ?person_id wdt:P31 wd:Q5 .
  ?person_id wdt:P106 ?profession_id .
  ?profession_id rdfs:label ?profession .
  ?person_id rdfs:label ?person .
  FILTER (LANG(?person) = "en") .
  FILTER (LANG(?profession) = "en")
}
GROUP BY ?person_id ?person
ORDER BY DESC(?count)
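The difficulty described here can be made concrete: to return even a single top-ranked group under ORDER BY DESC(?count), every group's count must be computed first. A minimal Python sketch of the same group-then-rank pattern (the data is made up for illustration, not taken from Wikidata):

```python
import heapq
from collections import defaultdict

# Hypothetical (person, profession) pairs standing in for the
# ?person_id wdt:P106 ?profession_id bindings in the query above.
rows = [
    ("goethe", "poet"), ("goethe", "novelist"), ("goethe", "lawyer"),
    ("jackson", "singer"), ("jackson", "dancer"),
    ("curie", "physicist"), ("curie", "chemist"),
]

# GROUP BY ?person_id: every row must be seen before any count is final.
counts = defaultdict(set)
for person, profession in rows:
    counts[person].add(profession)

# ORDER BY DESC(?count) LIMIT 1: only now can the top entry be chosen.
top = heapq.nlargest(1, counts.items(), key=lambda kv: len(kv[1]))
print(top[0][0], len(top[0][1]))  # goethe 3
```

Only after the full pass over the data can the winner be picked, which is why adding a LIMIT clause does not by itself make such queries cheap.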

That's a consequence of the "Anytime Query" feature in Virtuoso, which provides partial solutions in situations where a query cannot be completed within a specific timeframe. This timeframe takes the form of a configurable timeout, and it is a critical feature for enabling global ad-hoc query access, 24/7, 365 days a year, for the likes of DBpedia and Wikidata.

When a partial result is returned, this information is delivered via the HTTP response headers, as per:

curl -I "https://wikidata.demo.openlinksw.com/sparql?default-graph-uri=http%3A%2F%2Fwww.wikidata.org%2F&query=PREFIX+parl%3A+%3Chttps%3A%2F%2Fid.parliament.uk%2Fschema%2F%3E%0D%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0D%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0APREFIX+wikibase%3A+%3Chttp%3A%2F%2Fwikiba.se%2Fontology%23%3E%0D%0APREFIX+bd%3A+%3Chttp%3A%2F%2Fwww.bigdata.com%2Frdf%23%3E%0D%0APREFIX+dct%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0APREFIX+dbpedia%3A+%3Chttp%3A%2F%2Fdbpedia.org%2F%3E+%0D%0APREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0APREFIX+wds%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2Fstatement%2F%3E%0D%0APREFIX+wdv%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fvalue%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0APREFIX+p%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0D%0APREFIX+ps%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fstatement%2F%3E%0D%0APREFIX+pq%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fqualifier%2F%3E%0D%0A%0D%0ASELECT+%3Fperson_id+%3Fperson+%28COUNT%28%3Fprofession_id%29+AS+%3Fcount%29+%28GROUP_CONCAT%28%3Fprofession%3B+separator%3D%22%2C+%22%29+AS+%3Fprofessions%29+WHERE+%7B%0D%0A++%3Fperson_id+wdt%3AP31+wd%3AQ5+.%0D%0A++%3Fperson_id+wdt%3AP106+%3Fprofession_id+.%0D%0A++%3Fprofession_id+rdfs%3Alabel+%3Fprofession+.%0D%0A++%3Fperson_id+rdfs%3Alabel+%3Fperson+.%0D%0A++FILTER+%28LANG%28%3Fperson%29+%3D+%22en%22%29+.%0D%0A++FILTER+%28LANG%28%3Fprofession%29+%3D+%22en%22%29%0D%0A%7D%0D%0AGROUP+BY+%3Fperson_id+%3Fperson%0D%0AORDER+BY+DESC%28%3Fcount%29&format=text%2Fx-html%2Btr&timeout=30000000&signal_void=on&signal_unconnected=on"

Which returns:

HTTP/1.1 200 OK
Date: Thu, 09 Sep 2021 17:53:51 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 479290
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/08.03.3320 (Linux) x86_64-generic-linux-glibc25 VDB
Accept-Ranges: bytes
X-SPARQL-default-graph: http://www.wikidata.org/
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1.075M rnd 530.1K seq 16.78K same seg 109.7K same pg 14.13K same par 0 disk 0 spec disk 0B / 0
X-Exec-Milliseconds: 2031
X-Exec-DB-Activity: 1.075M rnd 530.1K seq 16.78K same seg 109.7K same pg 14.13K same par 0 disk 0 spec disk 0B / 0 messages 0 fork
Content-disposition: filename=sparql_2021-09-09_17-53-51Z.html
Expires: Thu, 09 Sep 2021 18:53:51 GMT
Cache-Control: max-age=3600
Strict-Transport-Security: max-age=15768000
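Since this truncation signal is delivered only in the HTTP headers, a client has to inspect them explicitly. A minimal Python sketch (the X-SQL-State header and the S1TAT state code are taken from the response above; treating S1TAT as the truncation marker is my assumption based on this one response, not on Virtuoso documentation):

```python
def is_partial_result(headers: dict) -> bool:
    """Return True if a response carries the X-SQL-State value that the
    response above uses to signal an anytime-query truncation."""
    # HTTP header names are case-insensitive; normalize keys first.
    h = {k.lower(): v for k, v in headers.items()}
    return h.get("x-sql-state", "").strip() == "S1TAT"

# Headers from the curl -I response above (abridged).
resp = {
    "X-SQL-State": "S1TAT",
    "X-SQL-Message": "RC...: Returning incomplete results, "
                     "query interrupted by result timeout.",
    "X-Exec-Milliseconds": "2031",
}
print(is_partial_result(resp))  # True
```

A check like this could let client tooling surface the "partial result" warning that, as noted below, the HTML result page itself does not show.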

Related

[1] DBpedia Fair Use Note
[2] Virtuoso Anytime Query Tips & Tricks Note

In some cases that feature could be useful, though I suspect many would be surprised to get results from an ordered query that might not be correct with respect to the entire data set.

I tried the mentioned query, setting a timeout of 0, which disables this feature, and I got a 504 gateway timeout after ten minutes when using the JSON format. This also happened if I set LIMIT 10.

I then reset the timeout to a short one and got results similar to those reported by Hannah. After doing this, querying with a longer timeout returned a quick result, within a few seconds, but it looked to be based on the previous result set (or at least, did not return anyone with more than 15 professions), even if I increased the limit, which seems odd. I'd expect it to keep searching until the timeout.

In fairness, I do not know if this is a query for which it would be reasonable to expect a complete result. Wikidata Query Service times out repeatedly on the same query after one minute.

Thanks, Kingsley, that explains it!

I think it's a useful feature, but it's quite confusing that this important bit of information (whether the query result is correct or just an approximation) is contained only in the result header, but neither on the HTML result page, nor when one downloads the result as, say, TSV. Wouldn't it be easy to add this bit of information at least to the HTML result page? For example, in the form of a message on top of the table.

It also leads to seemingly non-deterministic behavior. When I tried various queries the other day, I sometimes got an empty result (which led me to believe that I mistyped something, but I didn't), and sometimes not. A message would have helped.

What also confused me was that the query SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o } always gives an empty result. That is a very common and useful query to check the size of the knowledge graph. And it's actually very easy to answer. On https://query.wikidata.org , which is otherwise not known for super-fast query times, one gets the result in an instant.

A few things:

  1. This behavior is configurable i.e., we just have a restrictive timeout associated with the current public instance
  2. We could also include HTML-level messaging to complement the HTTP-level messaging

This feature is the result of a fundamental challenge associated with the following, when you publish a query for ad-hoc query access to the Web:

  1. Unpredictable Query Complexity
  2. Unpredictable Query Solution Size
  3. Unpredictable Number of User Agents triggering any combination of the above.

Conventional DBMS products can't handle this problem, hence the creation of this feature at the time Virtuoso was created circa 1998.

Alternatively, as many do, you can simply throw lots of machines at the problem via horizontal partitioning using shards which sets you on the path to massive data centers (the norm worldwide these days).

SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }

Is a typical expensive query that falls into the list I presented above. It's no different to

SELECT * FROM {Some-Table}

in a typical SQL-compliant RDBMS which is why you don't see those published to Web -- although Virtuoso instances handle this too using the same "Anytime Query" functionality with a configurable timeout.

Getting a result vs an actual latest and greatest result are different things. There is no way on earth that

SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }

is happening in realtime on the Blazegraph instance. We could return a value from our statistics table too, but that isn't the global interpretation of a solution for that query. For instance, we could put all the relevant stats in a VoID graph, with scheduled updates to it if need be.

With Virtuoso, everything can be configured to suit the interaction behavior desired. It just so happens that ad-hoc querying, for global 24/7 access, irrespective of solution size, complexity, or origins, is a fundamental challenge that isn't generally understood, since it's pegged to the emergence of the Web :)

Anyway, here's a SPARQL Query Results Page driven by a Query using a 0 milliseconds timeout (rather than the default 3000) that provides an instant solution.

Long story short, you disable the "Anytime Query" feature by setting timeout to "0" .

curl -I "https://wikidata.demo.openlinksw.com/sparql?default-graph-uri=http%3A%2F%2Fwww.wikidata.org%2F&query=SELECT+%28COUNT%28*%29+AS+%3Fcount%29+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D&format=text%2Fhtml&timeout=0&signal_void=on&signal_unconnected=on"
HTTP/1.1 200 OK
Date: Fri, 10 Sep 2021 18:27:42 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 1080
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/08.03.3320 (Linux) x86_64-generic-linux-glibc25 VDB
Accept-Ranges: bytes
X-SPARQL-default-graph: http://www.wikidata.org/
Content-disposition: filename=sparql_2021-09-10_18-27-42Z.html
Expires: Fri, 10 Sep 2021 19:27:42 GMT
Cache-Control: max-age=3600
Strict-Transport-Security: max-age=15768000

In response to this thread, I've requested we set the default to "0" rather than "30000" which should avert some of the confusion experienced.

For whoever is interested, I wrote more about the QLever SPARQL engine on this thread: https://phabricator.wikimedia.org/T290839 .

I imported the wikidata-DB into neo4j and it works quite well.


Can you be more specific? When we tested Wikidata on Neo4j several years ago, it worked in principle, but the performance was unacceptable. In particular, Neo4j does not efficiently support all kinds of JOIN operations that occur in typical SPARQL queries. Could you time a few SPARQL queries on your Neo4j instance and report the results here? That would be very helpful. For starters, you can simply pick some example queries from https://query.wikidata.org
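For anyone taking up this timing request, a small harness along the following lines helps keep comparisons honest. The query callable here is a placeholder (my invention, not an actual endpoint call); repeated runs help distinguish cold-cache from warm-cache behavior:

```python
import time

def time_query(run_query, repetitions=3):
    """Run a query callable several times and return wall-clock seconds
    per run. The first run typically reflects a cold cache, later runs
    a warm one, so report all of them rather than just the best."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings

# Placeholder standing in for an HTTP call to a SPARQL or Cypher endpoint.
def dummy_query():
    sum(range(100_000))

timings = time_query(dummy_query)
print(len(timings), all(t >= 0 for t in timings))
```

Reporting per-run numbers (rather than a single average) also makes it visible when an engine is answering from a cached result set, as discussed earlier in this thread.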

Create a query set (extracted from the WDQS log) and a Wikidata data subset to benchmark graph databases against (in the style of TPC).
Ask graph database vendors to test their products and publish the results to the community.

See
http://tpc.org/
https://github.com/socialsensor/graphdb-benchmarks

Consider relational databases with particular schema as graph backends

Daniel Hernández, Aidan Hogan, Cristian Riveros, Carlos Rojas, Enzo Zerega. "Querying Wikidata: Comparing SPARQL, Relational and Graph Databases". In the Proceedings of the 15th International Semantic Web Conference (ISWC), Kobe, Japan, October 17–21, 2016

Consider property graph back ends such as Neo4J and TigerGraph

Kovács, T., Simon, G., & Mezei, G. (2019). Benchmarking Graph Database Backends—What Works Well with Wikidata?. Acta Cybernetica, 24(1), 43-60. https://doi.org/10.14232/actacyb.24.1.2019.5

@So9q @AndreasKuczera @Versant.2612 why are you polluting the thread by suggesting projects/products that clearly do not meet the requirements? This includes Ontop, JanusGraph, TigerGraph, Neo4J etc.

“Pollution” is a strong word that comes off as needlessly hostile. It seems prudent and rational to get a broad sense of the landscape (and where it is moving). The Wikidata data model is not trivially 1:1 with RDF/SPARQL and there may be scope for hybrid solutions.

@DanBri I would agree if this issue was not specifically about "alternatives to BlazeGraph" (RDF triplestore), with explicit requirements. Finding such alternative will already be difficult if not impossible, mostly due to the open-source requirement.

If you want non-RDF solutions be evaluated as well, then I think a separate issue should be created. But I doubt it has a chance of being completed within any reasonable timeframe.

Hey all, apologies if this has already been covered elsewhere, but I'm curious why Apache Jena Fuseki is not on the list of Blazegraph alternatives? It seems to meet the requirements. We've used Jena from time to time and really like it (it has a lot of features out of the box), but if there's been a previous analysis and it was not worth considering for WDQS's needs, I'd love to learn from that.


I think only because so far no-one has brought it up. Please add a ticket for it with additional information.

I am taking the liberty to pollute the thread with a reference to "MillenniumDB: A Persistent, Open-Source, Graph Database" (https://arxiv.org/pdf/2111.01540.pdf) from November 2021. MillenniumDB may have some serious limitations with respect to the requirements set out above, but interestingly they write "However, MillenniumDB was designed with the complete version of Wikidata – including qualifiers, references, etc. – in mind." and their benchmarks seem strong. They compare against Blazegraph, Jena, Virtuoso and Neo4J.


Thanks for the pointer! Here are my first impressions from reading the paper:

  1. The engine is based on similar ideas as QLever. However, QLever has been around for 5 years already, which the authors fail to acknowledge. I am sure they didn't do it on purpose though. I wrote to them.
  2. Like QLever, their engine is currently read-only and does not support SPARQL Update operations. Given the design of their engine, this is not something that will be easy to add.
  3. Their engine is currently very far away from SPARQL 1.1 support. In the current version, even basic features like GROUP BY and mathematical expressions are missing. I am not sure whether they actually strive for SPARQL 1.1 support, since the motivation expressed in the paper goes more in the direction of a more general data model that is independent of a particular query language. Anyway, adding full SPARQL 1.1 support would be a lot of work, as we know from experience.
  4. I find the evaluation misleading. Right at the beginning of their evaluation section, in Section 5.1, they claim that their engine is 30 times faster than Virtuoso for very simple queries (consisting of a single triple). We know Virtuoso very well and have compared it with QLever extensively. Virtuoso is a very mature and efficient engine and hard to beat, even on more complex queries. On simple queries, there are natural barriers to what can be achieved, and Virtuoso often (though not always) does the optimal thing. I think the authors either did not configure Virtuoso optimally or they stumbled on an artefact without being aware of it. Namely, Virtuoso is rather slow when it has to produce a very large output. That is not a weakness of its query processing engine, but of the way it translates its internal IDs to output IRIs and literals.

@KingsleyIdehen maybe you can provide some feedback concerning point 4, in particular, the last two sentences.

We can be objective about feature support.

The working group tests for SPARQL 1.1 (updated for RDF 1.1) are maintained by the community: https://w3c.github.io/rdf-tests/.

They have reasonable coverage of features.

In addition, engines can and do support more of "XPath and XQuery Functions and Operators 3.1" than the minimal required by the SPARQL REC.

https://www.w3.org/TR/xpath-functions-3/


also, any thoughts on https://cambridgesemantics.com/anzograph/ ?

"Horizontally Scalable Graph Database Built for Online Analytics and Data Harmonization"

it looks like anzograph could handle 1 trillion triples back in 2016.

Are there any timescale/triple scale goals currently being stated?

With a baseline minimum of 1B triples/3 months, and assuming a 5-10 year goal for any choice, that gets to 36B-56B triples minimum and it could easily exceed that.
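For what it's worth, the arithmetic behind those figures works out if one assumes a current size of roughly 16B triples (my assumption; the comment above does not state its baseline):

```python
# Assumed current graph size (not stated in the thread) and the stated
# growth rate of 1B triples per 3 months, i.e. 4B per year.
current_triples_b = 16   # billions; assumption for illustration
growth_per_year_b = 4    # 1B per quarter

for years in (5, 10):
    projected = current_triples_b + growth_per_year_b * years
    print(f"{years} years: ~{projected}B triples")  # ~36B and ~56B
```

As the comment notes, this is a minimum: any acceleration in growth pushes the target capacity higher.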

Query performance is an important point to consider - I found a query that will run one million times slower in one database engine than in another one


Claims without evidence, such as that quoted above, are generally not helpful for evaluations such as this.

It would be helpful to all if you would post the query you describe, as well as the details of your testing — such as which engine(s) you tested (including name and version), on which OS (including version), on what hardware (including processor, bitness, and RAM), whether the engine & data were in a "hot/warm" state or just past cold start, etc.

Testing your query against current public endpoints and posting details of those results would also be helpful.
