
Evaluate a double backend strategy for WDQS
Open, Low, Public

Description

I suggest the Search Platform team evaluate whether it is feasible to have 2 different backends instead of one.

The rationale is as follows:

  • users could choose between 2 different WDQS endpoints running on different backends
  • it gives WMF the possibility to measure and compare performance over time
  • it makes it possible to later discontinue the less suitable of the 2
  • it makes it less likely that we end up in a Blazegraph situation again
  • we don't have a shortage of computing resources or money, so 2 clusters with 6 machines each is fine if needed
  • users will distribute themselves between the 2 endpoints and help lower the load on each cluster
  • it will be easier to conduct experiments on one cluster at a time as long as the other is up (redundancy)

(added 2021-09-27)
Evaluate whether one of the services could be updated every 48-72 hours and offer other benefits (such as speed and better precomputed optimizations) compared to the "live" one that tracks edits in near-real time.
See https://ad-publications.cs.uni-freiburg.de/ARXIV_sparql_autocompletion_BKKKS_2021.pdf for a paper that mentions optimizations for e.g. wdt:P31/wdt:P279* queries, which are very valuable and common (at least for me), but which often time out on WDQS and will most probably be too expensive on a future Blazegraph replacement as well.
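For illustration, a hypothetical query of this shape (my example, using the scholarly-article class wd:Q13442814 that also appears later in this thread) looks like this:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
# All items that are an instance of (a subclass of) scholarly article.
# The wdt:P279* property path can expand to a very large intermediate result.
SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q13442814 .
}
LIMIT 100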

Event Timeline

Would it be an option that one of these two backends uses a SPARQL engine that does not support the SPARQL Update operation, but instead rebuilds its index periodically, for example, every 24 hours?

The reason I am asking is that this would allow a SPARQL engine architecture that makes feasible many interesting queries which currently time out on Blazegraph or take very long, and which are generally hard for an engine that has to deal with a dynamically changing knowledge graph. In particular, this includes:

  1. Queries where one wants the top items from a long list of items. For example: the top-10 people with the most professions (see the other thread) or the top-10 people with the most sitelinks (a sketch of such a query follows this list).
  2. Queries downloading larger subsets of Wikidata. For example, all movies with their cast and genres.
  3. Autocompletion queries that help the user construct SPARQL queries incrementally.
  4. Other queries that involve large intermediate or final results.
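As a sketch of a query of type 1 (my illustration, assuming the standard Wikidata RDF modelling of sitelinks via schema:about), the top-10 people with the most sitelinks could be asked as:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
# Top-10 humans ranked by number of sitelinks; the engine must aggregate
# over all humans before it can return the top ten rows.
SELECT ?person (COUNT(?article) AS ?sitelinks) WHERE {
  ?person wdt:P31 wd:Q5 .
  ?article schema:about ?person .
}
GROUP BY ?person
ORDER BY DESC(?sitelinks)
LIMIT 10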

I am assuming that many users don't mind getting a result that does not incorporate the changes from the last few hours. Of course, this should be made transparent when using that backend.

One point to consider: Blazegraph provides some features outside the SPARQL standard, and some current queries rely on them.

MPhamWMF moved this task from Incoming to Scaling on the Wikidata-Query-Service board.

@Hannah_Bast, I would like to have the option to use a ~week-old Wikidata if it meant I could run more complex queries.

@Lucas_Werkmeister_WMDE has mentioned Sage as a candidate for a read-only triplestore.

Has anyone tried to use Sage with a recent full Wikidata dump? The Sage website does have a "Wikidata 2017-03-13 release" and it does look like you can query it.

I have already talked about Sage with Lukas last November. I don't think that Sage is an option for Wikidata. The focus of Sage is on the ability to pause and resume SPARQL queries (which is a very useful feature), not on efficiency. For example, if you run the people-professions query from https://phabricator.wikimedia.org/T206560 on their demo instance of Wikidata http://sage.univ-nantes.fr/#query (which has only 2.3B triples), it takes forever. Even simple queries are quite slow. For example, the following query (all humans) produces results at a rate of around a thousand rows per second:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?person WHERE {
  ?person wdt:P31 wd:Q5 .
}

My understanding is that they have not implemented their own SPARQL engine, but that they use a vanilla database engine in the backend. On https://github.com/sage-org/sage-engine they mention SQLite and PostgreSQL.

The focus of Sage is on the ability to pause and resume SPARQL queries

Ah, I see that now.

I was just looking on awesome-semantic-web for other efficient (perhaps read-only) triplestores and I saw a familiar name, QLever, which looks like another candidate. It even has a Wikidata quickstart guide. Could it handle 11+ million SPARQL queries per day?

Yes, QLever is developed in our group at the University of Freiburg. I presented it to the Wikidata team in March. You can try out a demo on the complete Wikidata on https://qlever.cs.uni-freiburg.de/wikidata . You can also select other interesting large knowledge graphs there, for example, the complete OpenStreetMap data.

QLever's focus is on efficiency (also for hard queries) for large knowledge graphs on standard hardware, in particular, without the need for a cluster or an exorbitant amount of RAM. For example, the demo above runs on a standard PC with 128 GB of RAM and it can compute the complete result for the people-profession query from https://phabricator.wikimedia.org/T206560 (6.1 million rows, where one column contains the result of a GROUP_CONCAT) in five seconds: https://qlever.cs.uni-freiburg.de/wikidata/4oNHPq . Another non-trivial query involving many triples is this one (average height by occupation and gender): https://qlever.cs.uni-freiburg.de/wikidata/gVWJ4h . There are many more example queries, ranging from easy to hard, under the drop-down menu "Examples".
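To give a concrete impression of the second query (a rough sketch of mine, not necessarily the exact query behind the link above), the aggregation joins several properties over all humans:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
# Average height (P2048) by occupation (P106) and gender (P21) for humans;
# this joins and aggregates over many millions of triples.
SELECT ?occupation ?gender (AVG(?height) AS ?avgHeight) (COUNT(*) AS ?n) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P106 ?occupation ;
          wdt:P21 ?gender ;
          wdt:P2048 ?height .
}
GROUP BY ?occupation ?gender
ORDER BY DESC(?n)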

The thing I personally love most about QLever is the interactive context-sensitive autocompletion feature. Constructing SPARQL queries is really hard and cumbersome, even if you are an expert. The autocompletion gives you suggestions at every point in the query. The suggestions are context-sensitive in the sense that they are meaningful continuations of the query so far. This is important for large knowledge graphs, where you have millions of entities and guessing the right names is often impossible. Try it for yourself by first typing S for SELECT ... and then starting a query (with a variable or by typing the prefix of an entity name, as you like).
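To give a flavour of what such suggestions can be based on (this is an illustrative sketch of mine, not QLever's internal implementation): after the pattern ?person wdt:P31 wd:Q5, meaningful predicate suggestions could be computed with a query like the following, which ranks the predicates that actually occur for humans by frequency.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
# Predicates that occur for humans, ranked by how often they occur;
# the most frequent ones are sensible continuations of the partial query.
SELECT ?predicate (COUNT(*) AS ?count) WHERE {
  ?person wdt:P31 wd:Q5 .
  ?person ?predicate ?object .
}
GROUP BY ?predicate
ORDER BY DESC(?count)
LIMIT 20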

QLever does not yet have full SPARQL 1.1 support, but we are approaching that and will be there soon. The basic features are all there and what's missing are mostly small things. The one bigger feature that is missing is SPARQL Update operations. We are also thinking about adding that, but that will probably not happen in the next few months. I think that for a read-only large knowledge graph, QLever is a very good option. Indexing is also fast: on a machine like the above, the complete Wikidata can be indexed in under a day. So having a daily fresh index would be no problem.

PS: Note that large query throughputs are not a problem for a SPARQL engine that runs on a single standard PC or server. Depending on the overall demand, you can just run multiple instances on separate machines and trivially distribute the queries. What's more important, I think, is the processing time for individual queries because you cannot easily distribute the processing of an individual query. And it does make quite a difference for the user experience whether a query takes seconds, minutes, or hours. The current SPARQL endpoint for Wikidata (realized using Blazegraph) times out a lot when the queries are a bit harder.

PS: Note that large query throughputs are not a problem for a SPARQL engine that runs on a single standard PC or server. Depending on the overall demand, you can just run multiple instances on separate machines and trivially distribute the queries. What's more important, I think, is the processing time for individual queries because you cannot easily distribute the processing of an individual query. And it does make quite a difference for the user experience whether a query takes seconds, minutes, or hours. The current SPARQL endpoint for Wikidata (realized using Blazegraph) times out a lot when the queries are a bit harder.

+1

Thanks a lot for your insights, Hannah!

@Hannah_Bast I could not find the date of the Wikidata dump used in the service; is that available in the UI?

[attached screenshot: bild.png, 37 KB]

Indexing is also fast: on a machine like the above, the complete Wikidata can be indexed in under a day. So having a daily fresh index would be no problem.

As I wrote in T206560#6608418, I think it currently takes us over a day to produce a full Wikidata RDF dump, so I don’t think fast indexing helps us very much with providing an up-to-date service.

Can you or anyone else explain why the data dump takes so long, Lukas? One would expect that it is much easier to dump a (snapshot of a) dataset than to build a complex data structure from it. Also, dumping and compression are easily parallelized. And the sheer volume isn't that large (< 100 GB compressed).

Wikibase doesn’t store data in RDF, so dumping the data set means parsing the native representation (JSON) and writing it out again as RDF, including some metadata for each page.

@Lucas_Werkmeister_WMDE What is the relationship between Blazegraph and Wikibase? It seems like Blazegraph would have some export functionality.

Wikibase doesn’t store data in RDF, so dumping the data set means parsing the native representation (JSON) and writing it out again as RDF, including some metadata for each page.

That is what I expected. But converting less than 100 GB of data from one format to another should not take that long. Which software are you using for the conversion? For example, Apache Jena has all kinds of tools to convert to and from the various RDF formats. It works, but it's incredibly slow (ten times slower than more efficient converters).

This is getting a bit off-topic now… I think the main point is that we really need live update functionality. I highly doubt reindexing from scratch all the time can give us the short update delays that users expect, no matter how much we optimize the RDF dumps.

can give us the short update delays that users expect

I am a user that rarely needs short update delays.
Didn't we just take a poll about what features of WDQS users prefer/want? Do we have the results of that to see if a double backend strategy would satisfy users?

It's of course up to you (the Wikidata team) to decide this. But I wouldn't dismiss this idea so easily.

There is clearly a group of users who want to query the exact contents of the database at the point in time they are querying it. I assume that this group includes many Wikimedians and all kinds of statistics queries on Wikidata. But I am sure that there is also a large group of users who don't care if the version of Wikidata they are querying is a few hours old, but who care much more about convenience and efficiency (or getting results at all, which is clearly a problem with the current service).

Now this here is a (low-priority) thread about a "double backend strategy for WDQS". If there were an engine that can answer all "reasonable" queries efficiently and that supports SPARQL update operations, there would be no need for this debate. Based on my own experience, I personally think that Virtuoso comes close to being this engine. It is the most mature SPARQL engine on the market when it comes to handling very large datasets with reasonable hardware and it's remarkable how fast it is even for some fairly complex queries.

But there are many reasonable queries which by design are very hard also for Virtuoso (and which indeed time out on Virtuoso's Wikidata SPARQL endpoint). In my experience, there is a clear trade-off between efficiency and the support of live updates. There is just a lot of room for optimization when you have read-only data and you can rebuild the index from scratch periodically.
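As a hypothetical illustration (my own example, not one taken from this thread), a query of the following shape is hard by design for any engine, because its intermediate result is essentially the entire triple set, whereas a read-only engine could in principle precompute such statistics at index-building time:

# Usage count per predicate over the whole graph.
SELECT ?predicate (COUNT(*) AS ?uses) WHERE {
  ?s ?predicate ?o .
}
GROUP BY ?predicate
ORDER BY DESC(?uses)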

It's of course up to you (the Wikidata team) to decide this. But I wouldn't dismiss this idea so easily.

There is clearly a group of users who want to query the exact contents of the database at the point in time they are querying it. I assume that this group includes many Wikimedians and all kinds of statistics queries on Wikidata. But I am sure that there is also a large group of users who don't care if the version of Wikidata they are querying is a few hours old, but who care much more about convenience and efficiency (or getting results at all, which is clearly a problem with the current service).

+1
I wrote a query today with wdt:P31/wdt:P279* that timed out. I implemented some workarounds, though, to get what I wanted and computed it locally. See https://github.com/dpriskorn/ItemSubjector/blob/prepare-batch-improved-structure/fetch_main_subjects.py

Now this here is a (low-priority) thread about a "double backend strategy for WDQS". If there were an engine that can answer all "reasonable" queries efficiently and that supports SPARQL update operations, there would be no need for this debate. Based on my own experience, I personally think that Virtuoso comes close to being this engine. It is the most mature SPARQL engine on the market when it comes to handling very large datasets with reasonable hardware and it's remarkable how fast it is even for some fairly complex queries.

I would very much like to see a comparison of Virtuoso and Rya on complex queries. Rya has some interesting query optimizations that are described a little and linked here: https://phabricator.wikimedia.org/T289561#7321936

But there are many reasonable queries which by design are very hard also for Virtuoso (and which indeed time out on Virtuoso's Wikidata SPARQL endpoint). In my experience, there is a clear trade-off between efficiency and the support of live updates. There is just a lot of room for optimization when you have read-only data and you can rebuild the index from scratch periodically.

Interesting. I edited this task to mention the possibility of a time-lagged, heavily optimized endpoint alongside a real-time endpoint like the one we have today with Blazegraph.

I personally think this could offload a lot of the request pressure we are seeing on WDQS right now. People who are not tech-savvy and/or cannot afford the time or money to set up their own endpoint have few good options for running expensive queries and getting the data they want.

@Hannah_Bast do you know if QLever supports a column-store backend? If we adopt Rya and have a column-store cluster, how could we best handle snapshotting/moving the data (I'm thinking it will be 1 TB in a few years) efficiently to a QLever cluster?
Has anyone done operations like this before?
I found this in a quick search: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html
Of importance here is knowing how long taking a snapshot takes. This answer indicates that it is near-instant, which is very good news: https://stackoverflow.com/questions/25137947/understanding-hadoop-snapshots-functionality

Since Virtuoso, to my knowledge, does not support a column store, it is likely more work to implement a double endpoint strategy if the data has to be moved around than it would be with Rya or a similar engine on top of a column store.

If QLever does not support Hadoop, the precomputation could be done either directly from the snapshot or on a copy of the data stored elsewhere.

can give us the short update delays that users expect

I am a user that rarely needs short update delays.
Didn't we just take a poll about what features of WDQS users prefer/want? Do we have the results of that to see if a double backend strategy would satisfy users?

The results have not been published yet, to my knowledge. But the poll did not mention a double strategy or have any questions related to that. (It did have questions where you got to rank latency, live updates, etc. in order of importance.)

I'm wildly guessing that 80-90% of users would be satisfied with non-realtime data if the lag was shorter than a few days. Most users do not use WDQS to improve the graph based on what is in it right now; they could just as well use a fast endpoint with, say, 48-hour-old data and not lose out on anything.

For many of my tools that rely on editing/completing/adding to Wikidata, live updates are nice and helpful, but I'm probably an outlier in the big picture.

In short, near-real-time is a real luxury, but providing only that comes at a great cost, both in infrastructure (Blazegraph is much slower than QLever and, I presume, consumes more resources) and in users not getting what they want because of timeouts.

Also, a lot of queries can be developed by trial and error on the non-real-time engine and copied to WDQS once they are done, saving a lot of hits on the more expensive and more valuable real-time cluster.

We could even go so far as to require a WMF login (and a stated purpose) for all use of the real-time cluster and divert most of the traffic to the lagging one instead. This would help the 23,000 active editors run longer queries on the current infrastructure (migration away from Blazegraph could take a long time) at very little cost to external non-WMF users who want to query.

@So9q I have commented on your comments concerning Rya in the "Evaluate Apache Rya as alternative to Blazegraph": https://phabricator.wikimedia.org/T289561#7393732

I have commented on your questions concerning QLever in the "Evaluate QLever ..." thread: https://phabricator.wikimedia.org/T291903#7382766 https://phabricator.wikimedia.org/T291903#7393813

Concerning your wdt:P31/wdt:P279* query: Can you provide the original SPARQL query that you wanted to ask?

@So9q I have commented on your comments concerning Rya in the "Evaluate Apache Rya as alternative to Blazegraph": https://phabricator.wikimedia.org/T289561#7393732

I have commented on your questions concerning QLever in the "Evaluate QLever ..." thread: https://phabricator.wikimedia.org/T291903#7382766 https://phabricator.wikimedia.org/T291903#7393813

Concerning your wdt:P31/wdt:P279* query: Can you provide the original SPARQL query that you wanted to ask?

Never mind. Reading my post again, I realize that the query I tried to write actually works fine (and does not have the property path):

SELECT DISTINCT ?subject WHERE {
    hint:Query hint:optimizer "None".
    ?item wdt:P31 wd:Q13442814;
          wdt:P921 ?subject.
  MINUS{
    ?item wdt:P31 wd:Q8054.  # protein
  }
  MINUS{
    ?item wdt:P31 wd:Q7187.  # gene
  }
}
limit 25000

can give us the short update delays that users expect

I am a user that rarely needs short update delays.
Didn't we just take a poll about what features of WDQS users prefer/want? Do we have the results of that to see if a double backend strategy would satisfy users?

I am another user that doesn't need short update delays. Most of the interesting educational uses of Wikidata I can think of don't need those either.
I would love to have a blazing fast SPARQL endpoint for a well indexed + cached snapshot.
I would like but don't need to have a fast endpoint for the latest updates. If there were an option, I might use that for 10% of queries, trending on the simple side. Especially when it's understood that freshness has a speed cost.

@Gehel: I see you are closing all these issues. Did you evaluate the alternatives in the sense that you imported all the data and ran queries over them? I cannot see this evaluation in the link ...