
Migrate Wikidata off of Blazegraph
Open, High, Public

Description

There are already several tasks tracking the evaluation and analysis of Blazegraph replacements, such as:

T206560 - Evaluate alternatives to BG (including lots of subtasks around testing and evaluating alternatives)
T306725 - Decide which BG services to migrate (assuming a migration is bound to happen)

... but no issue for the migration itself. It seems unavoidable and urgent, hence this task.

We should migrate before reloads fail: Blazegraph instability has been slowing down data reloads on WDQS, and may prevent them altogether next time. As the Query Service is the public-facing part of Wikidata in many contexts, this feels like preventing WD itself from being updated.

@Gehel wrote in Feb 2023:

TL;DR: We expect to successfully complete the recent data reload on Wikidata Query Service soon, but we've encountered multiple failures related to the size of the graph, and anticipate that this issue may worsen in the future. Although we succeeded this time, we cannot guarantee that future reload attempts will be successful given the current trend of the data reload process. Thank you for your understanding

Proposal:

  • Migrate WD to a different db backend before we next need to reload the query service. (Even if there is a double-backend solution for a time: T290839)
  • Document the migration process for ourselves and for other wikibase users.

Motivation to do this now:

  1. We need a new production-quality backend. Practicing and testing a migration also rehearses future recovery workflows.
  2. Working through the migration process will bring needed attention to this critical step in WD growth.
  3. Whatever the challenges, waiting until a backend failure happens will be worse.
  4. There is an ongoing tax for delaying migration: more issues are opened every season to address slowness, failures, or other inconsistencies with BG.

Current status:
Periodic updates are posted to
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update#Current_status.

Event Timeline

Sj renamed this task from Migrate off of Blazegraph to Migrate Wikidata off of Blazegraph. Feb 24 2023, 8:07 PM
Sj updated the task description.

Possibly this is already covered elsewhere and this task can be closed or merged, for example into existing discussions about:

  • A rough timeline for migration
  • Current status of decisions made & pending, about where and how to migrate
  • A map of components + services affected, so they can be notified and run their own downstream analysis

Other issues that may depend on the details above:

  • Plans for handling possible obstacles or failures
  • Metrics to run on test datasets + then on the full system post-migration
  • Challenges and stopgaps that this may resolve (technical + social debt)
  • Stalled features or service requests that may become possible

MPhamWMF moved this task from Incoming to Scaling on the Wikidata-Query-Service board.
Krinkle subscribed.

FYI: Periodic updates about this topic appear to be posted to https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update#Current_status, including several updates since the filing of this task, and the last one was a few days ago.

Thanks @Krinkle. The most significant updates this year seem to be the dramatic speedups observed by @Pfps and @Hannah_Bast on this page
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives and the more extensive benchmarking here https://github.com/ad-freiburg/qlever/wiki/QLever-performance-evaluation-and-comparison-to-other-SPARQL-engines

Ideally someone else would replicate the benchmark results on their own 2500€ consumer machine, and then try it on a server more comparable to what we have in production for Blazegraph.

Ideally someone else would replicate the benchmark results

@Sj I don't see any reason to doubt the benchmark results, but I suspect that QLever's lack of an incremental update capability is a showstopper for Wikidata production use.

Given https://github.com/ad-freiburg/qlever/wiki/QLever-support-for-SPARQL-1.1-Update and conversations around that in various forums, 'showstopper' seems too strong.

@tfmorris It was a showstopper for using it as a drop-in replacement for Blazegraph two years ago. SPARQL 1.1 Update was always on QLever's agenda (already two years ago), a first proof of concept was implemented in March 2023, a functional version has been available since May 2024, and we are currently in the process of fully integrating it into the main branch. Unfortunately, Wikidata still does not provide a publicly accessible update stream (this is difficult for a variety of reasons). As soon as that is available, we could provide a SPARQL endpoint that is in sync with the public Wikidata SPARQL endpoint.
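For readers less familiar with what an incremental update looks like at the protocol level, here is a minimal, illustrative sketch of a SPARQL 1.1 Update request. The endpoint URL, entity, and values are placeholders, not QLever's actual configuration.

```
import requests

# Placeholder URL for a local SPARQL 1.1 Update-capable endpoint; adjust to your setup.
UPDATE_ENDPOINT = "http://localhost:7001/update"

# Replace a single statement: delete the old triple, insert the new one.
# Entity and values are purely illustrative.
update = """
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

DELETE DATA { wd:Q64 wdt:P1082 "3600000"^^xsd:integer } ;
INSERT DATA { wd:Q64 wdt:P1082 "3700000"^^xsd:integer }
"""

# Per the SPARQL 1.1 Protocol, the update text goes in the POST body
# with Content-Type application/sparql-update.
resp = requests.post(
    UPDATE_ENDPOINT,
    data=update.encode("utf-8"),
    headers={"Content-Type": "application/sparql-update"},
    timeout=60,
)
resp.raise_for_status()
```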

@Sj Thanks for reminding me of this link. The information on that page was outdated. Prompted by your comment, I just updated it: https://github.com/ad-freiburg/qlever/wiki/QLever-support-for-SPARQL-1.1-Update.

Regarding the benchmark: https://github.com/ad-freiburg/qlever/wiki/QLever-performance-evaluation-and-comparison-to-other-SPARQL-engines provides all the information to replicate the results on any machine easily. More generally, the qlever script (QLever's command-line interface) makes it easy to run an arbitrary given set of queries against an arbitrary SPARQL endpoint.
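For anyone who wants the same idea without the qlever script, here is a rough Python sketch (not taken from the QLever documentation) of running a directory of SELECT queries against an arbitrary SPARQL endpoint and recording timings. The endpoint URL, query directory, and User-Agent string are placeholders.

```
import pathlib
import time
import requests

ENDPOINT = "https://query.wikidata.org/sparql"   # or any other SPARQL endpoint
QUERY_DIR = pathlib.Path("queries")              # hypothetical directory of *.rq files

for path in sorted(QUERY_DIR.glob("*.rq")):
    query = path.read_text()
    start = time.monotonic()
    resp = requests.post(
        ENDPOINT,
        data={"query": query},  # standard form-encoded SPARQL protocol request
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "sparql-benchmark-sketch/0.1 (example)",  # WDQS asks for a descriptive UA
        },
        timeout=300,
    )
    elapsed = time.monotonic() - start
    # Assumes SELECT queries returning SPARQL JSON results.
    rows = len(resp.json()["results"]["bindings"]) if resp.ok else None
    print(f"{path.name}\t{resp.status_code}\t{elapsed:.2f}s\t{rows} rows")
```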

@Hannah_Bast: Appreciate the update, great to see it!

For reference for readers: T294133 (and partial dup T330521) is tracking making an efficient update stream publicly accessible. Hannah: does demonstrating that this update works require real-time sync? If you could get 5000 RC entries at a time, say through a less efficient API call, and apply them as updates rather than reloading, it would demonstrate that incremental updates work even if not fully integrated into the firehose.
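To illustrate the "less efficient API call" idea: a rough sketch of pulling recent-change batches from the MediaWiki Action API is below. The batch size of 500 is the per-request limit for non-bot clients, and the conversion of each change into a SPARQL Update (e.g. by fetching the entity's RDF from Special:EntityData) is only hinted at, since that step is exactly what is under discussion.

```
import requests

API = "https://www.wikidata.org/w/api.php"

# Fetch recent changes in batches via the MediaWiki Action API.
params = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "title|ids|timestamp",
    "rclimit": 500,
    "format": "json",
}
changes = []
while len(changes) < 5000:
    data = requests.get(API, params=params, timeout=60).json()
    changes.extend(data["query"]["recentchanges"])
    if "continue" not in data:
        break
    params.update(data["continue"])  # standard API continuation

# Each change names an entity page; the corresponding RDF could then be fetched
# (e.g. Special:EntityData/<QID>.ttl) and turned into a SPARQL Update against
# the test endpoint -- that conversion is the part being discussed above.
print(len(changes), "changes fetched")
```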

@dcausse , thinking of your comment from office hours this month: this is the sort of hopefully-separable work that I imagine finding a grapher in residence to work on ;)

@Sj Getting the updates in batches would be perfectly fine. But how do you want to verify that it works without having a reference endpoint to compare to?

@dcausse , thinking of your comment from office hours this month: this is the sort of hopefully-separable work that I imagine finding a grapher in residence to work on ;)

Some parts of the work in T294133 are indeed separable enough that a person with enough knowledge of Java and SPARQL/RDF could tackle them autonomously:

  • make the test suite pass with QLever
  • write the generic Java client part
  • write a small QLever wrapper that understands its responses

Exposing the stream itself might require some coordination with other teams at WMF, but having the above would be a great step forward, especially because I believe we might be able to re-use the updater based on RecentChanges on top of a QLever instance. I'm happy to provide more guidance if someone has the time and is interested in tackling this work.
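The actual work described above would live in Java inside the WDQS updater codebase; purely as a language-agnostic illustration of the "small wrapper that understands the engine's responses" idea, here is a minimal Python sketch assuming standard SPARQL 1.1 Protocol behavior (QLever's exact update responses may differ).

```
import requests

class SparqlEndpointClient:
    """Illustrative client only; the real implementation would be a Java client
    inside the WDQS updater, and QLever's update-response format may differ
    from the generic behavior assumed here."""

    def __init__(self, query_url: str, update_url: str):
        self.query_url = query_url
        self.update_url = update_url

    def select(self, query: str) -> list[dict]:
        resp = requests.post(
            self.query_url,
            data={"query": query},
            headers={"Accept": "application/sparql-results+json"},
            timeout=300,
        )
        resp.raise_for_status()
        # Standard SPARQL 1.1 JSON results: {"head": ..., "results": {"bindings": [...]}}
        return resp.json()["results"]["bindings"]

    def update(self, update: str) -> None:
        resp = requests.post(
            self.update_url,
            data=update.encode("utf-8"),
            headers={"Content-Type": "application/sparql-update"},
            timeout=300,
        )
        resp.raise_for_status()
```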

@Sj Getting the updates in batches would be perfectly fine. But how do you want to verify that it works without having a reference endpoint to compare to?

(bearing in mind that the progress on exposing the real-time stream is promising!)

How about: update WD entries affecting query q, run WD:q, wait a time t, run QL:q. Get t as low as possible. If you're batch-updating every 5 minutes and t = 6 minutes, that's a good sign.
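A rough sketch of that measurement loop, assuming both services speak the standard SPARQL protocol. The QLever URL is a placeholder, and comparing raw binding lists only works for queries with a deterministic result order (e.g. an ORDER BY); it also assumes the reference result does not change again during the wait.

```
import time
import requests

WDQS = "https://query.wikidata.org/sparql"
QLEVER = "http://localhost:7001"   # hypothetical test endpoint being updated in batches

def run(endpoint: str, query: str) -> list:
    resp = requests.post(
        endpoint,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

def measure_lag(query: str, poll_seconds: int = 30, max_wait: int = 3600) -> float | None:
    """Return how long (in seconds) until the test endpoint matches the WDQS
    result for this query, or None if it never catches up within max_wait."""
    reference = run(WDQS, query)
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        if run(QLEVER, query) == reference:
            return time.monotonic() - start
        time.sleep(poll_seconds)
    return None
```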

Noting here that @Hannah_Bast further updated the page above, and QLever now has basic support for SPARQL 1.1 Update -- love to see the velocity of improvement!

@Pfps and I have also started an independent benchmark of SPARQL engines, with QLever topping most benchmarks and looking like a promising candidate for migration. Many WD users who need faster performance are already running it to query the dumps on their own systems.

If another backend turns out to be superior, even better! (feedback on the benchmark metrics and queries warmly welcome) But it would be wonderful to start having a practical conversation about how, not whether, a full migration could happen, before spending too much time on other patches like the graph split.

@Sj and @Pfps Thank you very much for conducting this benchmark. The detailed account of how to install and run the various engines is very useful. Here are some comments regarding the current benchmark queries and benchmarking in general:

  1. SPARQL has a wide variety of constructs and functions, and most of these are indeed used in real queries, see https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples. A benchmark should aim to cover a large fraction of this. The current benchmark focuses on queries involving wdt:P31 and wdt:P279, including very typical ones like those in https://github.com/wikius/benchmark-wikidata/tree/main/instance, but then also many (very fascinating but highly specific) ones like those in https://github.com/wikius/benchmark-wikidata/tree/main/order .
  2. It is important to benchmark these features separately. For example, assume that an engine is particularly fast or slow in materializing the final result (as TSV or JSON or whatever). Now if many or most of the benchmark queries have large results, you will always be measuring that effect and maybe only that effect (if it dominates the query time). That being said, a benchmark should, of course, also test features in combination, but that should only be a (clearly distinguishable) part of the benchmark.
  3. For each query, three variants should be benchmarked: (1) a query which just counts the number of results without materializing them; (2) a query with a not so large LIMIT; (3) a query which computes and downloads the full result. The first tests how fast an engine computes a result internally. The second tests how fast an engine can produce partial results. The third also considers the time for materialization (it is enough to execute those for a subset of the queries, otherwise the evaluation might take too long). All three query types occur frequently in practice (sometimes all you need is the count, sometimes a selection of results is fine, and sometimes you need the full result). A minimal sketch of generating these variants follows after this list.
  4. The configuration of the various engines should be comparable. It is easy to make sure that the evaluations are run on the same (kind of) machine. Other aspects are more tricky to make comparable. For example, Virtuoso caches results from previous queries, and there is no easy way to disable that. Another example is whether the vocabulary (the mapping of internal IDs to strings) is compressed or not and resides in RAM or on disk. There are many trade-offs involved when building a triplestore.
  5. I would strongly recommend writing the raw results of the evaluation in a machine-readable format, and then computing the statistics from that format. A simple format would be TSV (with each line corresponding to the running time of a particular query on a particular engine with a particular configuration). Even better, (also) write the results as RDF, then they can be analyzed via SPARQL queries! Whatever intermediate format you choose, make sure that one can compute additional statistics from the raw data without rerunning the complete evaluation (which might be too expensive or not feasible anymore at a later time).
  6. It is important to also check the correctness of the results (the number of results is a good proxy). Sometimes, it is easy to be fast by computing a (slightly or totally) wrong result.
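Following up on points 3 and 5 above, here is a rough, illustrative Python sketch of generating the three variants from a base SELECT query and writing the raw timings to TSV. The endpoint URL, output filename, and example query are placeholders, and the naive wrapping assumes the base query has no PREFIX declarations or existing LIMIT.

```
import csv
import time
import requests

ENDPOINT = "http://localhost:7001"   # placeholder: whichever engine is being benchmarked

def variants(base_query: str) -> dict[str, str]:
    # Naive wrappers; assumes a plain SELECT with no PREFIX declarations or
    # existing LIMIT, so it can be nested as a subquery or extended directly.
    return {
        "count": f"SELECT (COUNT(*) AS ?n) WHERE {{ {{ {base_query} }} }}",
        "limit": f"{base_query}\nLIMIT 100",
        "full": base_query,
    }

def time_query(query: str) -> tuple[float, int]:
    start = time.monotonic()
    resp = requests.post(
        ENDPOINT,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=600,
    )
    elapsed = time.monotonic() - start
    rows = len(resp.json()["results"]["bindings"]) if resp.ok else -1
    return elapsed, rows

# Raw, machine-readable output: one TSV row per (query, variant) measurement,
# so statistics can be recomputed later without rerunning the whole evaluation.
base = ("SELECT ?item WHERE { ?item "
        "<http://www.wikidata.org/prop/direct/P31> "
        "<http://www.wikidata.org/entity/Q5> }")
with open("results.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["query", "variant", "seconds", "rows"])
    for name, q in variants(base).items():
        seconds, rows = time_query(q)
        writer.writerow(["example-query", name, f"{seconds:.3f}", rows])
```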

There are many papers on performance testing. Here is a recent very good one from the makers of DuckDB: https://hannes.muehleisen.org/publications/DBTEST2018-performance-testing.pdf

@Hannah_Bast Thanks for the detailed comments. I have updated the benchmarking code, which does output TSV files that are later analyzed to produce statistics. Many of the benchmarks are run in three variants - as-is, with only counts returned, and with DISTINCT added. The benchmarking code also records a bit of information about the output - counts for multiple results and a single value for single results. The latter provided the first indication that different engines produce different results for numeric and GeoSPARQL values.

@dcausse Quick question regarding the weekly Wikidata dumps on https://dumps.wikimedia.org/wikidatawiki/entities . The last dump of latest-all.ttl.bz2 is from 29.01.2025, that is, over a month ago. Did something go wrong or are these dumps no longer supported?

@Hannah_Bast these dumps are still supported. Something went wrong in the software responsible for generating these dumps. Please see T386401 (and T384625 for the root cause).