Feb 26 2022
To add to this, the two-index approach has another rather beautiful property:
Feb 5 2022
@Pulquero AFAIK the two main problems with Blazegraph are:
@Hannah_Bast maybe this interests you? Do you think this system would perform well considering the load on WDQS and the type of queries we have?
@Pulquero Thank you for this interesting piece of information. I have a few questions:
Dec 11 2021
Oct 15 2021
Oct 8 2021
@DD063520: You find some details at https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md .
Oh, ok. Could you give an example of a query that has no "highly selective triples" so I can test it on QLever vs. BG?
Oct 2 2021
@Justin0x2004 Thanks, Justin. QLever already supports something like named subqueries: you can simply have the same subquery in multiple places, and it will be evaluated only once; for the other occurrences, the result is reused.
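As an illustrative sketch (the properties P19/P20 and the query shape are just an example, not from this thread), the identical subquery "all humans" appears in both branches of a UNION, and an engine that reuses subquery results only needs to evaluate it once:

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?person ?place WHERE {
  {
    { SELECT ?person WHERE { ?person wdt:P31 wd:Q5 } }  # subquery: all humans
    ?person wdt:P19 ?place .                            # place of birth
  } UNION {
    { SELECT ?person WHERE { ?person wdt:P31 wd:Q5 } }  # same subquery, result reused
    ?person wdt:P20 ?place .                            # place of death
  }
}
```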
Sep 30 2021
I have now revised QLever's Quickstart page: https://github.com/ad-freiburg/qlever
We looked a bit into Apache Rya. A couple of observations:
Sep 28 2021
I will provide a detailed reply later today, also to the other thread. Four things for now:
Sep 17 2021
It's of course up to you (the Wikidata team) to decide this. But I wouldn't dismiss this idea so easily.
Sep 15 2021
Wikibase doesn’t store data in RDF, so dumping the data set means parsing the native representation (JSON) and writing it out again as RDF, including some metadata for each page.
Can you or anyone else explain why the data dump takes so long, Lukas? One would expect dumping a (snapshot of a) dataset to be much easier than building a complex data structure from it. Also, dumping and compression are easily parallelized. And the sheer volume isn't that large (< 100 GB compressed).
I agree with Kingsley that you don't need a distributed SPARQL engine when the knowledge graph fits on a single machine and will continue to do so in the future. This is clearly the case for Wikidata, since it is even the case for the roughly ten times larger UniProt (which at the time of this writing already contains over 90 billion triples).
PS: Note that high query throughput is not a problem for a SPARQL engine that runs on a single standard PC or server. Depending on the overall demand, you can just run multiple instances on separate machines and trivially distribute the queries among them. What's more important, I think, is the processing time for individual queries, because you cannot easily distribute the processing of an individual query. And it makes quite a difference for the user experience whether a query takes seconds, minutes, or hours. The current SPARQL endpoint for Wikidata (realized using Blazegraph) times out a lot when the queries are a bit harder.
Yes, QLever is developed in our group at the University of Freiburg. I presented it to the Wikidata team in March. You can try out a demo on the complete Wikidata on https://qlever.cs.uni-freiburg.de/wikidata . You can also select other interesting large knowledge graphs there, for example, the complete OpenStreetMap data.
Sep 14 2021
I have already talked about Sage with Lukas last November. I don't think that Sage is an option for Wikidata. The focus of Sage is on the ability to pause and resume SPARQL queries (which is a very useful feature), not on efficiency. For example, if you run the people-professions query from https://phabricator.wikimedia.org/T206560 on their demo instance of Wikidata http://sage.univ-nantes.fr/#query (which has only 2.3B triples), it takes forever. Even simple queries are quite slow. For example, the following query (all humans) produces results at a rate of around a thousand rows per second:
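The query body did not survive in this excerpt; the standard "all humans" query on Wikidata is presumably something like:

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 .   # ?person is an instance of (P31) human (Q5)
}
```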
Sep 13 2021
Would it be an option for one of these two backends to use a SPARQL engine that does not support the SPARQL Update operation, but instead rebuilds its index periodically, for example, every 24 hours?
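Operationally, such a periodic rebuild could be as simple as a scheduled job (the paths and script name here are hypothetical, just to sketch the idea):

```
# Hypothetical crontab entry: rebuild the index from the latest dump every night at 02:00
0 2 * * *  /opt/qlever/rebuild-index.sh >> /var/log/qlever-rebuild.log 2>&1
```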
Sep 9 2021
Thanks, Kingsley, that explains it!