Evaluate QLever as a time lagging SPARQL backend to offload the BlazeGraph cluster
Closed, Resolved (Public)

Description

Based on the discussions in T290839 I suggest we evaluate whether QLever can offload the query-pressure on BlazeGraph to some degree as a second option for users.

The hypothesis is that most users do not really NEED real-time or near-real-time data when querying WDQS.

Benefits:

  • By offering QLever as an alternative, the WMF can give users many query results that are impossible to obtain today because of the 60s timeout on WDQS.
  • By having two systems, the load should distribute between them, especially since QLever is much faster than WDQS.

Drawbacks:

  • QLever does not support all the types of queries BG can do, see limitations below.

Options:

  • Restrict WDQS to logged-in users only. This would probably move a big chunk of the current load from WDQS to QLever. As a side effect, it could possibly lead to more active contributors in Wikidata.
  • Impose a penalty for WDQS queries, e.g. display a blocking banner in the UI for 10 seconds that encourages the user to run the query on QLever if possible.
  • Automatically check each query before running it; if it could run on QLever, run it there and inform the user that the results are time-lagged. Give the user the option to run it on near-real-time data instead.

Limitations:
I did some investigating into which types of queries QLever supports. It does not seem to support subqueries or queries that have "WITH".
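A minimal sketch of the third option (checking a query before deciding where to run it), assuming the only routing criteria are whether the user asked for near-real-time data and whether the query uses Blazegraph's named-subquery syntax (`WITH { ... } AS %name` / `INCLUDE %name`). The function name `choose_backend` and the backend labels are hypothetical, and the regular expression is a rough heuristic, not a SPARQL parser:

```python
import re

# Blazegraph named subqueries are written `WITH { SELECT ... } AS %name`
# and referenced with `INCLUDE %name`. This pattern is a heuristic only:
# it can be fooled by the same text inside a string literal or a comment.
_NAMED_SUBQUERY = re.compile(r"\bWITH\s*\{|\bINCLUDE\s+%\w+", re.IGNORECASE)


def choose_backend(query: str, needs_live_data: bool = False) -> str:
    """Return a hypothetical backend label for the given SPARQL query."""
    if needs_live_data:
        # The user explicitly opted for near-real-time data.
        return "blazegraph"
    if _NAMED_SUBQUERY.search(query):
        # Blazegraph-specific syntax: keep the query on Blazegraph.
        return "blazegraph"
    # Everything else goes to the time-lagged but faster backend.
    return "qlever"


if __name__ == "__main__":
    plain = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"
    named = "SELECT * WITH { SELECT ?s WHERE { ?s ?p ?o } } AS %r WHERE { INCLUDE %r }"
    print(choose_backend(plain))   # routed to the time-lagged backend
    print(choose_backend(named))   # kept on Blazegraph
```

A real check would have to parse the query properly and track which features each backend supports at the time, since that list changes.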

Event Timeline

So9q renamed this task from "Evaluating QLever as a time lagging SPARQL backend to offload the BlazeGraph cluster" to "Evaluate QLever as a time lagging SPARQL backend to offload the BlazeGraph cluster". Sep 28 2021, 8:19 AM

Given the limitations, many Scholia queries need BG to work or will need to be rewritten to avoid subqueries.

I will provide a detailed reply later today, also to the other thread. Four things for now:

  1. The basic SPARQL features are all supported already, including LIMIT, OFFSET, ORDER BY, GROUP BY, HAVING, COUNT, DISTINCT, SAMPLE, GROUP_CONCAT, FILTER, REGEX, LANG, OPTIONAL, UNION, MINUS, VALUES, BIND.
  2. Subqueries and predicate paths are also supported. Where is it written that they are not?
  3. The missing SPARQL features in QLever that have nothing to do with SPARQL Update are easy to add and will be added very soon.
  4. Before you start your evaluation, let me update the GitHub repository with a Makefile and instructions for how to build indices and start the engine very conveniently. We already have that in place, but just haven't updated the GitHub master yet.

I have now revised QLever's Quickstart page: https://github.com/ad-freiburg/qlever

It allows you to build the code, build an index for a given set of triples, and start the engine with just a few copy&paste operations. It comes with two example datasets, one small (120 Years of Olympics, 1.8M triples) and one large (the complete Wikidata, 12B triples). On a standard PC, building the index for the small dataset takes around 20 seconds, and for the complete Wikidata around 20 hours.

@So9q : To clarify, QLever does not rely on any third-party software for storing its triples. QLever has its own data structures and query engine. In fact, that is what QLever is about and that is why it is so fast (also for queries that lack highly selective triples or that require the computation of large intermediate or final results, which are very hard for other SPARQL engines).

Subqueries and predicate paths are also supported. Where is it written that they are not?

@Hannah_Bast
I think they mean this feature of BG:
https://github.com/blazegraph/database/wiki/NamedSubquery

The missing SPARQL features in QLever that have nothing to do with SPARQL Update are easy to add and will be added very soon.

That is great to hear!

@Justin0x2004 Thanks, Justin. QLever already supports something like named subqueries. You can simply have the same subquery in multiple places; it will be evaluated only once, and the result will be reused for the other occurrences.
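To illustrate the idea of evaluating a repeated subquery only once, here is a toy sketch. It is not QLever's implementation: the class `SubqueryCache`, the whitespace-based cache key, and the `evaluate` callback are all assumptions made for this example.

```python
class SubqueryCache:
    """Toy cache: identical subquery text is evaluated once, then reused."""

    def __init__(self, evaluate):
        self._evaluate = evaluate   # hypothetical callback that runs a subquery
        self._results = {}
        self.evaluations = 0        # counts how often the callback really ran

    @staticmethod
    def _key(subquery: str) -> str:
        # Collapse whitespace so that formatting differences do not matter.
        # Simplification: this would also alter whitespace inside string literals.
        return " ".join(subquery.split())

    def result(self, subquery: str):
        key = self._key(subquery)
        if key not in self._results:
            self.evaluations += 1
            self._results[key] = self._evaluate(subquery)
        return self._results[key]


if __name__ == "__main__":
    cache = SubqueryCache(lambda q: [("Q42",)])  # stand-in for a real engine
    first = cache.result("SELECT ?x  WHERE { ?x ?p ?o }")
    second = cache.result("SELECT ?x WHERE {\n  ?x ?p ?o }")
    print(first is second, cache.evaluations)
```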

We don't yet support the "WITH" syntax, but that will be easy to add. As I wrote on https://phabricator.wikimedia.org/T290839#7354220 about two weeks ago:

"QLever does not yet have full SPARQL 1.1 support, but we are approaching that and will be there soon. The basic features are all there and what's missing are mostly small things."

  1. Subqueries and predicate paths are also supported. Where is it written that they are not?

Oh, it was a misunderstanding on my part. I copy-pasted a query from Scholia that had the "WITH" syntax and got a syntax error. I seem to have mixed up the two concepts: subquery and named subquery (using WITH).

I have now revised QLever's Quickstart page: https://github.com/ad-freiburg/qlever

It allows you to build the code, build an index for a given set of triples, and start the engine with just a few copy&paste operations. It comes with two example datasets, one small (120 Years of Olympics, 1.8M triples) and one large (the complete Wikidata, 12B triples). On a standard PC, building the index for the small dataset takes around 20 seconds, and for the complete Wikidata around 20 hours.

Fantastic! That makes it easier to set up on Toolforge, which I would like to try if no one beats me to it.

@So9q : To clarify, QLever does not rely on any third-party software for storing its triples. QLever has its own data structures and query engine. In fact, that is what QLever is about and that is why it is so fast (also for queries that lack highly selective triples or that require the computation of large intermediate or final results, which are very hard for other SPARQL engines).

Oh, ok. Could you give an example of a query that has no "highly selective triples" so I can test it on QLever vs. BG?

Oh, ok. Could you give an example of a query that has no "highly selective triples" so I can test it on QLever vs. BG?

Here is a relatively simple query without a highly selective triple. It asks for the 100 people with the most professions. It requires a JOIN of the first triple (around 9 million people) with the second triple (all people and their professions, around 8.5 million triples). And there is no easy way around computing the full join result because we want the people with the most professions in the end and you cannot know in advance which people these are. The query deliberately does not have a large query result. So if it takes long, it's not because the output is so large.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name (COUNT(?profession) AS ?count) WHERE {
  ?person wdt:P31 wd:Q5 .
  ?person wdt:P106 ?profession .
  ?person rdfs:label ?name .
  FILTER (LANG(?name) = "en")
}
GROUP BY ?person ?name
ORDER BY DESC(?count)
LIMIT 100

PS: Here you can see the performance on QLever: https://qlever.cs.uni-freiburg.de/wikidata/cYvT6w
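For anyone who wants to time the query above on both engines, here is a rough sketch using only the Python standard library and the standard SPARQL protocol (HTTP GET with a `query` parameter). The WDQS endpoint is the public one; the QLever API URL is an assumption and should be checked against the QLever documentation. The function names and the User-Agent string are made up for this example.

```python
import json
import time
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# Assumption: API URL of the public QLever Wikidata instance; verify before use.
QLEVER_ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": "qlever-vs-bg-timing-sketch/0.1 (example)",
    }
    return urllib.request.Request(url, headers=headers)


def time_query(endpoint: str, query: str, timeout: float = 70.0):
    """Return (elapsed seconds, number of result rows) for one query run."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as resp:
        rows = json.load(resp)["results"]["bindings"]
    return time.perf_counter() - start, len(rows)


if __name__ == "__main__":
    query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"  # replace with the query above
    for name, endpoint in [("WDQS", WDQS_ENDPOINT), ("QLever", QLEVER_ENDPOINT)]:
        try:
            seconds, rows = time_query(endpoint, query)
            print(f"{name}: {rows} rows in {seconds:.2f}s")
        except OSError as err:  # covers timeouts and HTTP errors
            print(f"{name}: failed ({err})")
```

Wall-clock time measured this way includes network latency and result transfer, so it is only a coarse comparison.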

Hi Hannah,

Nice work! I have some questions about the hardware requirements. Could you tell me:

  1. the machine specs that you need to index the data (I read somewhere it takes 24h)
  2. the final index size for the current Wikidata dump?

Thanks
D063520

@DD063520: You can find some details at https://github.com/ad-freiburg/qlever/blob/master/docs/quickstart.md .

For the current Wikidata, indexing takes around 24 hours on a (cheap) AMD Ryzen 9 5900X with 128 GB of RAM and HDDs (which are cheap compared to SSDs). Our goal is an indexing time of at most 1 hour / 1 billion triples and we are not far from that. The latest version of Wikidata (including the lexeme data) has around 16B triples.

The current index size is 750 GB, but this will soon be cut in half by some further compression. With efficient support only for queries where each predicate has a fixed value and is not a variable (which is the case for most queries), the index size is 250 GB. Being super space-efficient was not among our high-priority goals so far, since space is relatively cheap.
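As a quick back-of-the-envelope check of the figures above, assuming that the 24 hours and the 750 GB both refer to the 16B-triple dataset and that GB means 10^9 bytes (neither is stated explicitly):

```python
# Figures quoted in the comment above.
TRIPLES_BILLIONS = 16            # latest Wikidata, including lexeme data
INDEXING_HOURS = 24              # current indexing time
GOAL_HOURS_PER_BILLION = 1.0     # stated goal: at most 1 hour per 1 billion triples
INDEX_SIZE_GB = 750              # current full index
FIXED_PREDICATE_INDEX_GB = 250   # index when predicates are never variables

# Current indexing rate and the total time the stated goal would imply.
hours_per_billion = INDEXING_HOURS / TRIPLES_BILLIONS
goal_hours = GOAL_HOURS_PER_BILLION * TRIPLES_BILLIONS

# With 1 GB = 1e9 bytes, GB per billion triples equals bytes per triple.
bytes_per_triple = INDEX_SIZE_GB / TRIPLES_BILLIONS
bytes_per_triple_after_halving = bytes_per_triple / 2
bytes_per_triple_fixed_predicates = FIXED_PREDICATE_INDEX_GB / TRIPLES_BILLIONS

print(f"current rate: {hours_per_billion} h per billion triples")
print(f"goal for 16B triples: {goal_hours} h")
print(f"index: {bytes_per_triple} bytes per triple")
print(f"after halving: {bytes_per_triple_after_halving} bytes per triple")
print(f"fixed predicates only: {bytes_per_triple_fixed_predicates} bytes per triple")
```

So under these assumptions the current rate is about 1.5 hours per billion triples, and reaching the goal would bring a full Wikidata index build down to roughly 16 hours.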

Thanks for the swift reply. Wow.

BG is totally aced 😅 (times out, so 60+s vs. 1.97s)

I'm feeling more and more motivated to get QLever up and running on Toolforge. Send my thanks to your team for writing this beautiful and fast engine and releasing it as open source software!

Gehel triaged this task as Medium priority. Oct 11 2021, 1:30 PM
Gehel moved this task from Incoming to Scaling on the Wikidata-Query-Service board.

@Hannah_Bast mentioned in the last WDQS scaling meeting that QLever could use two indexes to provide near-real-time queries. See https://github.com/ad-freiburg/qlever/wiki/QLever-support-for-SPARQL-1.1-Update

To add to this, the two-index approach has another rather beautiful property:

  1. It is important to understand that real-time updates have an inherent price. An engine that supports real-time updates can never be as fast as a read-only engine. But with the approach outlined in https://github.com/ad-freiburg/qlever/wiki/QLever-support-for-SPARQL-1.1-Update we kind of get the best of both worlds:
  2. Combining the two indexes gives you full SPARQL 1.1 Update capability. There is an unavoidable penalty in runtime, but if the amount of updates is small relative to the size of the data already there (for Wikidata, we are talking millions of updates in a day vs. billions of triples already in the database), the penalty is relatively small.
  3. But you can also choose to only ask the large index. Then you get results on a snapshot of the data from a certain (known) date that lies up to 24 hours in the past. But you get the result with maximum speed.
  4. Since the approach naturally enables this choice for each individual query, every user can decide on the trade-off for themselves for each query.
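The per-query choice described in the list above can be illustrated with a toy model: a large read-only snapshot plus a small set of inserted and deleted triples. This is only a sketch of the general idea, not QLever's design (the linked wiki page describes that); the class `TwoIndexStore` and its methods are invented for this example, and real indexes would not scan Python sets.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class TwoIndexStore:
    base: FrozenSet[Triple]                            # large, read-only snapshot
    inserted: Set[Triple] = field(default_factory=set)  # small delta: additions
    deleted: Set[Triple] = field(default_factory=set)   # small delta: removals

    def insert(self, triple: Triple) -> None:
        self.deleted.discard(triple)
        if triple not in self.base:
            self.inserted.add(triple)

    def delete(self, triple: Triple) -> None:
        self.inserted.discard(triple)
        if triple in self.base:
            self.deleted.add(triple)

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None, live: bool = True) -> Set[Triple]:
        """Match a triple pattern; None acts as a variable.

        live=False asks only the snapshot (fastest, possibly stale);
        live=True also applies the delta (extra work per query).
        """
        def fits(t: Triple) -> bool:
            return all(want is None or want == got for want, got in zip((s, p, o), t))

        result = {t for t in self.base if fits(t)}
        if live:
            result -= self.deleted
            result |= {t for t in self.inserted if fits(t)}
        return result


if __name__ == "__main__":
    store = TwoIndexStore(base=frozenset({("Q1", "P31", "Q5"), ("Q2", "P31", "Q5")}))
    store.insert(("Q3", "P31", "Q5"))
    store.delete(("Q1", "P31", "Q5"))
    print(sorted(store.match(p="P31", live=False)))  # snapshot only
    print(sorted(store.match(p="P31", live=True)))   # snapshot plus updates
```

The extra set operations in the `live=True` branch stand in for the runtime penalty mentioned above: it grows with the size of the delta, not with the size of the snapshot.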

Most of SPARQL 1.1 is now supported with few exceptions. QLever is looking more promising by the day. Very nice work!
See https://github.com/ad-freiburg/qlever/wiki/Current-deviations-from-the-SPARQL-1.1-standard for details

To add to this, the two-index approach has another rather beautiful property:

  1. It is important to understand that real-time updates have an inherent price. An engine that supports real-time updates can never be as fast as a read-only engine. But with the approach outlined in https://github.com/ad-freiburg/qlever/wiki/QLever-support-for-SPARQL-1.1-Update we kind of get the best of both worlds:
  2. Combining the two indexes gives you full SPARQL 1.1 Update capability. There is an unavoidable penalty in runtime, but if the amount of updates is small relative to the size of the data already there (for Wikidata, we are talking millions of updates in a day vs. billions of triples already in the database), the penalty is relatively small.
  3. But you can also choose to only ask the large index. Then you get results on a snapshot of the data from a certain (known) date that lies up to 24 hours in the past. But you get the result with maximum speed.
  4. Since the approach naturally enables this choice for each individual query, every user can decide on the trade-off for themselves for each query.

I really like this approach! :)