[Epic] Evaluate alternatives to Blazegraph
Open, High · Public

Description

Since the Blazegraph project no longer seems to be active (last commit 2 years ago at https://github.com/blazegraph/database), we need to evaluate whether we want to switch to a graph DB project that is more actively supported and developed.

The requirements should be:

  • Full SPARQL 1.1 support, including SPARQL Update
  • Open source
  • Can load and run queries on full Wikidata database
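
To make the "including SPARQL Update" requirement concrete, here is a minimal sketch of how the two kinds of request differ under the SPARQL 1.1 Protocol: a read-only query can go out as a GET with a `query` parameter, while an update is a POST with an `update` parameter. The endpoint URLs are hypothetical placeholders, and the requests are only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URLs, used only for illustration.
QUERY_ENDPOINT = "https://example.org/sparql"
UPDATE_ENDPOINT = "https://example.org/update"


def build_query_request(sparql: str) -> Request:
    """Read-only query: a GET with a URL-encoded `query` parameter."""
    url = QUERY_ENDPOINT + "?" + urlencode({"query": sparql})
    return Request(
        url,
        headers={"Accept": "application/sparql-results+json"},
        method="GET",
    )


def build_update_request(sparql_update: str) -> Request:
    """Write operation: a form-encoded POST with an `update` parameter."""
    body = urlencode({"update": sparql_update}).encode("utf-8")
    return Request(
        UPDATE_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


query_req = build_query_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
update_req = build_update_request(
    'INSERT DATA { <http://example.org/s> <http://example.org/p> "o" }'
)
print(query_req.get_method(), query_req.full_url)
print(update_req.get_method(), update_req.data)
```

A store that only answers the first kind of request would need a full reload to pick up edits, which is the gap the Update requirement is meant to rule out.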

Event Timeline

I would clarify the requirements to “SPARQL support, including SPARQL Update”. For example, Sage boasts stable response times and general responsiveness, which would be useful for us, but its backing store is HDT, a read-only RDF serialization format: since HDT files cannot be efficiently updated, Sage is read-only, so we can’t use it for a live-updating query service.

Smalyshev triaged this task as Medium priority. Oct 17 2018, 4:25 PM

A few wishes I have from an operations point of view for any replacement. Those are not necessarily mandatory, but we should evaluate them at some point:

  • ability to scale both read and write load across multiple nodes
  • ability to limit resource consumption to fail gracefully

I think it is important to keep in mind that significant efforts are being made to unite the RDF and Property Graph communities. One aspect of this is the development of "RDF*" and "SPARQL*" (SPARQL star). Blazegraph played and continues to play a positive role in this. This is the main paper explaining the concepts behind RDF* and SPARQL*: https://arxiv.org/pdf/1406.3399.pdf

A more recent position paper by Olaf Hartig:
https://blog.liu.se/olafhartig/2019/01/10/position-statement-rdf-star-and-sparql-star/
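
For readers who have not seen RDF* before, the core idea is that a triple can itself be the subject of another triple, written `<< s p o >>`. The sketch below only formats strings in that style; the `ex:` names are made up for illustration and are not the actual Wikidata RDF model.

```python
from typing import Dict, List


def quoted_triple(s: str, p: str, o: str) -> str:
    """Render a triple in the quoted form used by RDF*: << s p o >>."""
    return f"<< {s} {p} {o} >>"


def annotate(s: str, p: str, o: str, annotations: Dict[str, str]) -> List[str]:
    """Return the asserted triple plus one line per annotation on that triple."""
    lines = [f"{s} {p} {o} ."]
    for ann_p, ann_o in annotations.items():
        lines.append(f"{quoted_triple(s, p, o)} {ann_p} {ann_o} .")
    return lines


# Hypothetical example data: a statement with one qualifier-like annotation.
lines = annotate(
    "ex:alice", "ex:educatedAt", "ex:college", {"ex:endTime": '"1974"'}
)
print("\n".join(lines))
```

This is roughly the shape a qualifier on a Wikidata statement could take in RDF*, instead of going through a separate statement node.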

I represent a research group at the Computer Science and Artificial Intelligence Laboratory at MIT. For the past year, we've been doing a lot of work with local copies of Wikidata and have experienced our own share of frustrations with Blazegraph. We are currently looking for alternatives as well. We'd love to hear about the directions being taken here, give our own input as to what capabilities we would hope to find in an alternative, and perhaps volunteer our services to help with the development/transition. Our work is a little niche, so our recommendations may not be representative of the general need, and our group is staffed mostly by undergraduate researchers, but we'd love to help if we can.

Please let me know who I can talk to specifically about this: jecummin@csail.mit.edu

We're always interested in collaborations! We don't have documented formal requirements (that's part of what we need to do), but what comes to mind right now:

  • horizontal scaling
  • supports SPARQL (this might be a constraint that we could drop if we can't find a solution, but this would mean a world of pain for our users and for the whole ecosystem around WDQS)
  • supports SPARQL services, or a way to emulate them (we might want to review this requirement as well: a backend service that depends on external services is problematic, so we might want to implement services in a frontend instead)
  • open source (obviously)
  • good performance in a context with both heavy reads and heavy writes
  • probably a lot of other things as well (it's late here, we need more time to formalize)

Feel free to join the Search Platform office hours to discuss this more synchronously!
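
As a rough illustration of what "implementing services as a frontend" in the list above could mean, the sketch below adds labels to result rows after the backend has answered, rather than inside the query engine. Everything here is an assumption for illustration: the `LABELS` table, the `add_labels` helper, and the simplified row format are hypothetical, not an existing WDQS component.

```python
from typing import Dict, List

# Hypothetical label lookup; in practice this would be a separate store or API.
LABELS: Dict[str, str] = {
    "http://example.org/entity/E1": "first example",
    "http://example.org/entity/E2": "second example",
}


def add_labels(rows: List[Dict[str, str]], variable: str) -> List[Dict[str, str]]:
    """Add a `<variable>Label` column to each row, outside the backend.

    Rows whose IRI has no known label fall back to the raw IRI.
    The input rows are not modified.
    """
    labelled = []
    for row in rows:
        new_row = dict(row)
        iri = row.get(variable)
        if iri is not None:
            new_row[variable + "Label"] = LABELS.get(iri, iri)
        labelled.append(new_row)
    return labelled


# Rows as a backend might return them (simplified: variable name -> IRI).
backend_rows = [
    {"item": "http://example.org/entity/E1"},
    {"item": "http://example.org/entity/E3"},
]
result = add_labels(backend_rows, "item")
print(result)
```

The point of the design is that the backend only has to speak plain SPARQL, and anything that needs external lookups lives in a layer that can fail or time out independently.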

Gehel raised the priority of this task from Medium to High. Sep 30 2020, 1:52 PM

Another option to keep an eye on is QLever. It doesn’t support SPARQL Update yet (and while the stated Wikidata reload time of less than 24 hours is impressive, it’s not enough to replace live updates, especially since I believe it takes us more than 24 hours to produce an RDF dump anyways), but I’m told that update support is being worked on.

I tried to formulate a Wikidata use case for the development of the RDF* / RDF star specification: https://github.com/w3c/rdf-star/issues/29

Interesting. From my (limited) experience, neo4j seems to get a lot of attention.

BTW: there was a discussion about triple store experiences at SMWCon last year, including Blazegraph, Virtuoso and Jena: https://youtu.be/AB_dwxG_vEs

I just wrote the developers of QLever regarding SPARQL Update support and encouraged them to write a grant proposal, see https://github.com/ad-freiburg/QLever/issues/375

https://github.com/neo4j/neo4j is written in Java. I personally don't like Java software. In my experience it seldom scales well, and you always have the VM overhead. I also think the language might be dying; IMO it's not a good choice for anything new these days.

Rust, on the other hand, seems very attractive from a horizontal scaling viewpoint, and I really like that the compiler is so strict that errors that would otherwise appear at runtime often turn up at compile time instead. That's a huge plus over e.g. C++, which seldom complains at compile time, so errors pop up at runtime instead.

See this simple comparison for an idea how efficient Rust is compared to Java for backend services.

neo4j: I spoke with the CIO, Emil, last year in Stockholm and pinged Lydia about it, see {T234431#5936337} ... Emil said to just send him an email if we would like to move this forward.

I don't understand the "I don't like Java" argument. Blazegraph - which is under discussion here - is also written in Java, as are most (all?) other triplestores or graph databases.

Are you suggesting to write a new storage backend from scratch?

Here's one in Rust, which claims to have a level of Wikibase compatibility (i.e. SPARQL endpoint and loader) and better simple query performance - but it may not compete on complex queries yet.

Oxigraph implements the following specifications:

  • SPARQL 1.1 Query, SPARQL 1.1 Update, and SPARQL 1.1 Federated Query. (+SPARQL Graph Store 1.1, but without POST)
  • Turtle, TriG, N-Triples, N-Quads, and RDF/XML serialization formats for both data ingestion and retrieval, using the Rio library.
  • SPARQL Query Results XML Format, SPARQL 1.1 Query Results JSON Format and SPARQL 1.1 Query Results CSV and TSV Formats.

Still a work in progress, but that progress looks exciting. Its current persistent key-value backends are RocksDB or Sled. It has an HTTP server plus Python and JS bindings (the latter via WASM). Licensed under Apache 2.0/MIT.

Another Rust option is IndraDB, but it appears to be a lower-level solution than what is being looked for here.

Today I presented a newly found alternative to Blazegraph and the bridge in the Telegram Wikidata chat.
See https://github.com/ontop/ontop
Can it do the job?
It seems there would be no need for SPARQL Update or for moving data around.

Could someone take a closer look?

Well, the first issue is that it doesn't support many features of WDQS that I would consider "kinda important" like property paths, federation and parts of GeoSPARQL: https://ontop-vkg.org/guide/compliance.html
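
To show why property paths matter and why they are awkward for a SPARQL-to-SQL layer, here is a minimal sketch of what a path like `?x ex:subClassOf* ?y` asks for: a traversal of unbounded depth, which does not map onto a fixed number of joins. The triples and the `ex:` names are toy data made up for illustration, and this says nothing about how any particular engine implements paths.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def reachable(triples: List[Triple], start: str, predicate: str) -> Set[str]:
    """Evaluate `start predicate* ?x`: nodes reachable over zero or more hops."""
    # Index outgoing edges for the one predicate we follow.
    edges: Dict[str, List[str]] = {}
    for s, p, o in triples:
        if p == predicate:
            edges.setdefault(s, []).append(o)

    seen = {start}  # zero hops: the start node matches itself
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:  # also guards against cycles
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical toy data.
data = [
    ("ex:cat", "ex:subClassOf", "ex:mammal"),
    ("ex:mammal", "ex:subClassOf", "ex:animal"),
    ("ex:cat", "ex:likes", "ex:fish"),
]
result = reachable(data, "ex:cat", "ex:subClassOf")
print(sorted(result))
```

Queries of this shape (walking a class or location hierarchy to any depth) are common on WDQS, so missing path support would break a lot of existing queries.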

The second (and probably much more serious) issue is performance: RDF stores are optimized for SPARQL queries, SQL DBs are not. I haven't tested it, but I would bet with confidence that Ontop can't come close to Blazegraph's level of performance. Their own paper lists several perf comparison studies, but all of them involve only similar (virtual knowledge graph) solutions and no real RDF stores. Raises an eyebrow, if you ask me. :)

@Lydia_Pintscher mentioned a conversation with the Data Commons team at Google. They have an open-source codebase that's somewhat in this area: https://github.com/datacommonsorg/mixer