
better information on recent benchmarking + some comments
Closed, ResolvedPublic

Description

[This seems like an odd way to request information, but it appears to be what is desired.]

I would like to see the raw results of the benchmarks as they do not align with my benchmarking. I would also like to see how the benchmarking was set up and run.

Some of the conclusions in the benchmarking document do not appear to be correct. In particular, there is no reason to run the load process on the same machines that are used to run queries. This means that a query machine could have less memory. I wouldn't recommend less memory because a large-memory machine should be able to handle multiple query streams in addition to an update stream, but that doesn't mean that a machine with as little as 64GB of memory could not be an effective WDQS server.

None of the SPARQL engines are completely compliant with the SPARQL 1.1 specification. Quantifying non-compliance is thus a useful thing to do.

The query sets evaluated from WDbench are probably the least interesting. The other query sets pose much more of a test for SPARQL engines.

I believe that it is possible to load Wikidata into Blazegraph, provided that one has sufficient patience. The problem is that a complete load takes a long time and that there is a bug in Blazegraph that appears to be related to parallel access to the data structures during load that causes loading to fail sometimes.

The work done to convert SCHOLIA queries to standard SPARQL 1.1 should be useful to convert WDQS to standard SPARQL 1.1 in general.

Event Timeline

Some of the conclusions in the benchmarking document

Which benchmarking document, and where is it?

Hey @Pfps,

Thanks for reaching out. Wikidata.org would be the best place for this kind of inquiry. We had a chance to touch base at the March office hours, and I replied to your follow-up on wiki. Hope this clarifies; happy to discuss further.

But in discussing this, please keep in mind the context of our work. We were very explicit about it in the first paragraph of our documentation, and I want to emphasize it again: the goal of this evaluation was not to identify the single best-performing system, but to gather experience with open source triple store implementations and determine which ones meet minimum criteria as a replacement for Blazegraph. These minimum criteria go beyond performance on synthetic tests; see the qualitative aspects we reported.

I also want to stress again that no conclusions about performance on production WDQS workloads should be drawn from this document, or from the limited set of queries publicly available. The only conclusion we could draw is that both QLever and Virtuoso are valid technologies, provide a viable path forward from Blazegraph (which is now abandonware), and need further hands-on experience and testing (which we are doing this and next quarter). Viable here means that:

  1. we can load Wikidata into them in reasonable time, with ~2x the number of triples of the current main graph.
  2. they provide the real-time index update semantics that are critical for Wikipedia-supporting bots and tools.

Direct comparison of results is hard when base hardware, test conditions, software versions, and datasets (Wikidata snapshots) differ. Also, take into account that we did not tune either database for performance and used relatively vanilla configurations. You can argue that this biases results against databases that require tuning (Virtuoso), but qualitatively this evaluation spike still gives us some level of baseline.

there is no reason to run the load process on the same machines that are used to run queries. This means that a query machine could have less memory.

Are you referring to the real-time index updating process?
We co-host it on wdqs nodes because it makes change propagation on non-replicated dbs easier to manage (conceptually and technically). It's an architecture choice our team inherited, but we find it solid and well engineered. The impact on resources is minimal, and the stream of rdf mutations is relatively low volume even on the main graph. Here's a snapshot from a prod main graph node (128GB ram, 64 cores) where the Java index updater process runs (holler if you need clarifications on the ps output):

gmodena@wdqs1025:~$ ps aux|grep consumer
blazegr+ 3471505  0.3  1.1 10308980 1478412 ?    Ssl   2025 315:08 java -cp /srv/deployment/wdqs/wdqs/lib/streaming-updater-consumer-0.3.157-jar-with-dependencies.jar -Xmx1g -Xloggc:/var/log/query_service/wdqs-streaming-updater_jvm_gc.%p.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC -XX:+PrintGCCause -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=20M -Dlogback.configurationFile=/etc/wdqs/logback-wdqs-updater.xml -XX:+UseNUMA -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=9101:/etc/wdqs/wdqs-updater-prometheus-jmx.yaml org.wikidata.query.rdf.updater.consumer.StreamingUpdate --sparqlUrl http://localhost:9999/bigdata/namespace/wdq/sparql --brokers kafka-main1006.eqiad.wmnet:9092,kafka-main1007.eqiad.wmnet:9092,kafka-main1008.eqiad.wmnet:9092,kafka-main1009.eqiad.wmnet:9092,kafka-main1010.eqiad.wmnet:9092 --consumerGroup wdqs1025 --topic eqiad.rdf-streaming-updater.mutation-main --batchSize 250

The process is long-running, so the JVM is warm and in a steady state (e.g. we don't pay initialization costs other than at startup, which is infrequent). It allocates about 1.5GB of RAM and consumes little CPU (which sits mostly idle). The workload is I/O (network) bound, mainly reading from Kafka. If this process were to become a bottleneck, we would have several ways to address it, but so far it has been serving us well for a couple of years.
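
To make the architecture concrete, here is a minimal conceptual sketch of what an updater loop of this kind does: consume RDF mutation events from Kafka and apply them to the local SPARQL endpoint as SPARQL UPDATE requests. This is not the actual StreamingUpdate implementation; the message format below is an assumption, and the real updater batches, retries, and tracks offsets.

# Conceptual sketch only; the real updater is the Java process
# org.wikidata.query.rdf.updater.consumer.StreamingUpdate shown above.
import requests
from kafka import KafkaConsumer  # pip install kafka-python

SPARQL_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

consumer = KafkaConsumer(
    "eqiad.rdf-streaming-updater.mutation-main",   # topic taken from the ps output above
    bootstrap_servers=["kafka-main1006.eqiad.wmnet:9092"],
    group_id="wdqs-example",
)

for message in consumer:
    # Assumption: each message carries a ready-to-run SPARQL UPDATE
    # (INSERT DATA / DELETE DATA) as UTF-8 text; the real mutation format differs.
    update = message.value.decode("utf-8")
    requests.post(
        SPARQL_ENDPOINT,
        data=update,
        headers={"Content-Type": "application/sparql-update"},
    )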

I wouldn't recommend less memory because a large-memory machine should be able to handle multiple query streams in addition to an update stream, but that doesn't mean that a machine with as little as 64GB of memory could not be an effective WDQS server.

It depends on how you define "effective", the expected access patterns, and target performance envelope.

Hardware specs need to be put in the context of workload, access patterns, concurrency and service level objectives. With 64GB of memory it might be possible to index Wikidata in reasonable time, but memory impacts both ingestion times and query performance, with diminishing returns once the working set fits comfortably in RAM. In general, memory availability has an impact on read latency and throughput (queries/second). A naive interpretation is that with less memory, concurrent queries start evicting each other's pages from cache, we get thrashing, and everyone slows down. Mixed read/write workloads make this worse. But there is a lot more to this than just single numbers.
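
As a purely illustrative back-of-envelope (every number below is made up, not a measurement of WDQS or of any of the evaluated engines), the "working set vs RAM" argument looks like this:

# Hypothetical sizing exercise only; not a recommendation.
triples          = 16e9   # assumed graph size
bytes_per_triple = 50     # assumed average on-disk index cost; varies widely by engine
index_size_gb    = triples * bytes_per_triple / 1e9        # ~800 GB of indexes
hot_fraction     = 0.10   # assumed share of the index touched by typical queries
working_set_gb   = index_size_gb * hot_fraction            # ~80 GB hot working set

for ram_gb in (64, 128, 256):
    # Leave headroom for the OS, JVM/query buffers and the write path.
    fits = working_set_gb <= ram_gb * 0.8
    print(f"{ram_gb} GB RAM: hot working set fits in page cache -> {fits}")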

A more interesting metric might be the cores-to-memory ratio, although I do not yet have a good quantified estimate for that. It may matter less with triplestore implementations than with other database technologies I have worked with. Unfortunately, Blazegraph's instrumentation in this regard is not very informative and is difficult to extract, so we are largely limited to observing behaviour through Grafana. We will likely learn more as we start exercising these systems with production data. Do you happen to have any figures from your experiments?

"WDQS server" is also not an ideal mental model, since right now we have flat fleet setup that serves a mix of access patterns: actors with high latency loads, actors with low latency loads, actors with wide graph traversals, actors with tight SLAs on data freshness, etc. We are investigating (via query log analysis and talking with stakeholders) query profiles to get a better understanding of access patterns and reason about quality of service accordingly. I would love to be able to share more, but unfortunately a lot of this analysis, in its current state, contains PII that would breach Wikimedia's Privacy Policy.

None of the SPARQL engines are completely compliant with the SPARQL 1.1 specification. Quantifying non-compliance is thus a useful thing to do.

Could you help me understand how we could best quantify non-compliance? I would love to have a better understanding of protocol adherence, beyond vendors' documentation. Are there practical pain points in the community that are (or could be) impacted by this?

In the context of the migration, running regression tests is straightforward: query Blazegraph, query another triplestore, compare result sets. This is work we plan to do and automate.
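
A rough sketch of that kind of per-query check is below (endpoint URLs are placeholders, and the result normalization is deliberately naive; a real harness would also have to deal with blank nodes, literal forms, ORDER BY, and timeouts):

# Sketch: run the same SPARQL SELECT against Blazegraph and a candidate
# triplestore, and compare order-insensitive result sets.
import requests

BLAZEGRAPH = "http://localhost:9999/bigdata/namespace/wdq/sparql"
CANDIDATE  = "http://localhost:7001/sparql"   # placeholder: a QLever/Virtuoso endpoint

def select(endpoint, query):
    # SPARQL 1.1 Protocol: GET with ?query= and JSON results.
    resp = requests.get(
        endpoint,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    # Naive normalization: a set of (variable, value) rows, ignoring row order.
    return {frozenset((var, row[var]["value"]) for var in row) for row in bindings}

def same_results(query):
    return select(BLAZEGRAPH, query) == select(CANDIDATE, query)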

But I would like to go beyond that and make sure that the result sets returned are the "right thing" (as per protocol spec). Do you have a sense of how we could test that at scale (e.g. automated on batches of a few million queries, not just the W3C test suite)?

As an operator of the service, I would like WDQS to be orthogonal to the specific triplestore it uses as a backend at a given point in time, so standards adherence and compatibility across vendors would be extremely important (even more so than the absolute performance envelope, within reason) if we think about platform sustainability and evolution.

The query sets evaluated from WDbench are probably the least interesting. The other query sets pose much more of a test for SPARQL engines.

Possibly. In the context of our migration work, any query running for more than 60 seconds starts to lose meaning, as WDQS would time it out. But I appreciate that other parties wanting to reuse our software to spin up their own query service would want higher latency limits.

I would love to get to a point where we could start to consider supporting higher-latency workflows like these. It feeds into a broader consideration of quality of service and SLOs that I'd expect will evolve once we can make a new service available on the public internet.

I believe that it is possible to load Wikidata into Blazegraph, provided that one has sufficient patience. The problem is that a complete load takes a long time and that there is a bug in Blazegraph that appears to be related to parallel access to the data structures during load that causes loading to fail sometimes.

Have you been able to load recent snapshots?
6 out of 6 of my attempts on AWS (r8i.4xlarge) failed after long runtimes (a few days to ~3 weeks). There's a race condition that kicks in after a few billion triples (it varies) and thrashes the journal:

18:16:15.246 [qtp861842890-56015] ERROR c.b.r.sail.webapp.BigdataRDFServlet - cause=java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -332811616993148506: lastRootBlock=rootBlock{ rootBlock=0, challisField=634, version=3, nextOffset=50396166986986826, localTime=1763742764158 [Friday, November 21, 2025 4:32:44 PM UTC], firstCommitTime=1763116489148 [Friday, November 14, 2025 10:34:49 AM UTC], lastCommitTime=1763742750011 [Friday, November 21, 2025 4:32:30 PM UTC], commitCounter=634, commitRecordAddr={off=NATIVE:-123328549,len=422}, commitRecordIndexAddr={off=NATIVE:-129768095,len=220}, blockSequence=32763, quorumToken=-1, metaBitsAddr=50395797631534516, metaStartAddr=11989126, storeType=RW, uuid=21eea3f0-1ae1-4016-9a6d-962d47cdeb72, offsetBits=42, checksum=-189059289, createTime=1763116488499 [Friday, November 14, 2025 10:34:48 AM UTC], closeTime=0}, query=SPARQL-UPDATE: updateStr=LOAD <file:///wdqs/munge/wikidump-000000635.ttl.gz>
java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Problem with entry at -332811616993148506: lastRootBlock=rootBlock{ rootBlock=0, challisField=634, version=3, nextOffset=50396166986986826, localTime=1763742764158 [Friday, November 21, 2025 4:32:44 PM UTC], firstCommitTime=1763116489148 [Friday, November 14, 2025 10:34:49 AM UTC], lastCommitTime=1763742750011 [Friday, November 21, 2025 4:32:30 PM UTC], commitCounter=634, commitRecordAddr={off=NATIVE:-123328549,len=422}, commitRecordIndexAddr={off=NATIVE:-129768095,len=220}, blockSequence=32763, quorumToken=-1, metaBitsAddr=50395797631534516, metaStartAddr=11989126, storeType=RW, uuid=21eea3f0-1ae1-4016-9a6d-962d47cdeb72, offsetBits=42, checksum=-189059289, createTime=1763116488499 [Friday, November 14, 2025 10:34:48 AM UTC], closeTime=0}
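
For context, these runs drove the load as a sequential series of SPARQL UPDATE LOAD statements over the munged dump chunks, roughly like the sketch below (chunk count and driver details are illustrative, not my exact tooling):

# Sketch of the bulk load loop: each munged dump chunk is loaded with a
# SPARQL UPDATE `LOAD` statement against the local Blazegraph namespace.
import requests

SPARQL_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

for i in range(1, 700):   # illustrative chunk count
    chunk = f"file:///wdqs/munge/wikidump-{i:09d}.ttl.gz"
    resp = requests.post(
        SPARQL_ENDPOINT,
        data=f"LOAD <{chunk}>",
        headers={"Content-Type": "application/sparql-update"},
        timeout=None,      # a single chunk can take a long time
    )
    resp.raise_for_status()   # the race condition above surfaces here as a server error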

I timeboxed an exploration of this code path, but there is non-trivial thread locking going on. Colleagues at WMF looked into these code paths as well in the past, but given Blazegraph's state (and the goal of migrating away from it) we decided not to invest in fixing this issue. Ultimately we were able to bootstrap Blazegraph by taking a snapshot of the legacy endpoint's journal.

If you have been able to load a recent entities dump (October 2025 onward), would you mind sharing your setup (RAM, cores, disk IOPS), memory allocation patterns (if known), and total runtimes?

The work done to convert SCHOLIA queries to standard SPARQL 1.1 should be useful to convert WDQS to standard SPARQL 1.1 in general.

We are still in an investigation phase, but our goal is to stick as closely as possible to standards. As we build a better understanding of Blazegraph features, how much they are used in wdqs traffic, and possible paths forward, we should be better informed to engage with community efforts (and know what to ask). Thanks for the input you provided in T414453: [SPIKE] How to handle porting of label and mwapi services to the new backend (and subtasks). It was very valuable. We are iterating on this, but it seems that, at least for wikibase:label, there could be vendor-independent solutions to automate conversion that won't break existing queries and will make the transition to a new backend less painful for users.
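
To make the wikibase:label case concrete, one vendor-independent direction (shown here purely as an illustration, not the exact transformation discussed in T414453, and ignoring the label service's language fallback chain) is rewriting the SERVICE block into plain rdfs:label patterns:

# Illustration only: the Blazegraph-specific label service vs. a standard
# SPARQL 1.1 approximation. Prefixes (wd:, wdt:, rdfs:, wikibase:, bd:) are
# assumed to be pre-declared, as they are on WDQS.

BLAZEGRAPH_STYLE = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

STANDARD_SPARQL_11 = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  OPTIONAL {
    ?item rdfs:label ?itemLabel .
    FILTER(LANG(?itemLabel) = "en")
  }
}
"""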

I'm resolving this task because this type of conversation is better suited for wikidata.org rather than phabricator.

@Pfps if there are follow-ups, please feel free to tag me directly on wiki.

The reason that I opened this ticket is that the team has been poor at responding to comments on the Wikidata wiki. So I looked at https://www.mediawiki.org/wiki/Wikidata_Platform#How_to_contact_us and the only contact methods there say to open Phabricator tickets. So I did.

If the team wants to entertain input from other places, and they should, the direction in the above document should be changed.

But saying to take this discussion to wikidata.org is not effective guidance. First, I can only assume that you mean the Wikidata wiki, i.e., pages under https://www.wikidata.org/wiki. But what page? I would like to have the discussion in some place that is discoverable by other people who are interested in the WDQS, so that seems to rule out any of "my" pages. I suppose I could use one of the pages that the team should be monitoring, but which one? I think that the best idea would be for the team to set up one or more "chat" pages, like https://www.wikidata.org/wiki/Wikidata:Project_chat, where the community can contact the team on general issues, issues where there is no obvious other page, or issues that should be seen by the entire interested community. But I don't think that any such pages have been set up.

So I await information on where to re-initiate this conversation, ideally in a place mentioned as suitable for the purpose in https://www.mediawiki.org/wiki/Wikidata_Platform#How_to_contact_us