
Evaluate Virtuoso as alternative to Blazegraph
Open, LowPublic

Description

The purpose of this task is to evaluate whether Virtuoso can serve as an alternative to Blazegraph as a backend for the Wikidata Query Service. We need to check:

  1. Whether it is feasible to install the Open Source version of Virtuoso in our production environment
  2. Whether it is possible to load the Wikidata SPARQL dump into Virtuoso and run queries efficiently
  3. Which functionality differences exist between the current WDQS implementation and Virtuoso, whether it is possible/feasible to bridge them, and what that would cost

Event Timeline

Smalyshev triaged this task as Medium priority.Oct 9 2018, 6:32 PM
Smalyshev created this task.
Smalyshev moved this task from Backlog to Doing on the User-Smalyshev board.
Smalyshev changed the task status from Open to Stalled.Oct 10 2018, 3:38 PM

Due to operational issues, Virtuoso has been only partially evaluated.

The install went fine, though some small adjustments were required; nothing major. Performance for loading and for some queries looked fine. In fact, parallel loading was faster than in the Blazegraph case, though it required some changes to the data due to Virtuoso's much stricter requirements for geographical data. In general, I think there's a good chance performance is OK.
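For reference, Virtuoso's documented bulk loader is driven through isql via the ld_dir() and rdf_loader_run() stored procedures, and parallelism comes from running several loader sessions at once. A minimal sketch of generating those statements (the paths and graph IRI are made up, not our actual setup):

```python
# Sketch: build the isql statements for Virtuoso's documented bulk loader.
# ld_dir() registers all files matching a pattern for loading; each
# rdf_loader_run() call is one loader worker, so running several of them
# in separate isql sessions gives a parallel load. "checkpoint;" persists
# the loaded data. Paths and graph IRI below are illustrative only.

def bulk_load_statements(data_dir, pattern, graph_iri, parallel_workers=4):
    """Return (register, workers, finish) lists of isql statements."""
    register = [f"ld_dir('{data_dir}', '{pattern}', '{graph_iri}');"]
    workers = ["rdf_loader_run();"] * parallel_workers
    finish = ["checkpoint;"]
    return register, workers, finish

reg, workers, fin = bulk_load_statements(
    "/srv/dumps/wikidata", "*.ttl.gz", "http://www.wikidata.org/")
```

Each entry in workers would be fed to its own isql process, and the checkpoint runs once after all workers have finished.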

Functionality-wise, there's a significant delta between what we provide now and what Virtuoso can support. I have outlined it here: https://docs.google.com/document/d/1PSVIwuKrc1yeQwXgZmKxP6cqhby4Dfnz-JdQs0pkj8M/edit#
Virtuoso supports all standard SPARQL 1.1 syntax, as far as I could see, but beyond that there are of course many differences.
Most of the missing functionality seems possible to fill in (at least in theory), as Virtuoso provides custom types, custom functions and procedures, and a large ecosystem supporting ingestion of data from other sources into RDF, plus integration capabilities.
However, it is not very likely that we would be able to support these capabilities in the same way and with the same syntax as our current solution does, so script migrations would be necessary. Developing these solutions would also require a significant investment of time in any case.

The Virtuoso 7.x Open Source edition has only experimental support on recent Debian platforms, but seems to work fine. The code is implemented in C, so it would be possible for us to contribute to it if necessary.
Clustering and HA setups are not available in the Open Source edition, and neither is data replication. Another capability not supported in the Open Source edition is graph-based access control.

In general, I think Virtuoso could be a viable platform in case things with Blazegraph become unsustainable, and it would support basic functionality (standard SPARQL, updating, etc.) adequately, but extensions and some additional capabilities may require significant effort to develop and would incur some migration pains for users. Unfortunately, since all clustering and replication solutions seem to be non-open-source, we would not be able to use a clustering paradigm other than the one we're using now, at least not without using the commercial version or getting some kind of special solution or special deal.

Further testing would be necessary to evaluate the performance of querying and of the Updater; the latter would probably require some modifications to run against Virtuoso, but that does not seem too hard. Since we do not have a testing platform for it now and there is renewed activity on the Blazegraph side, I am not planning to continue the testing in the short term.

Probably not an issue for Wikimedia, but perhaps for embedded usage: as of Virtuoso 7, only 64-bit platforms are supported (enforced by configure, maybe to avoid spurious segmentation fault bugs).

Aklapper changed the task status from Stalled to Open.Nov 4 2020, 10:42 PM
Aklapper lowered the priority of this task from Medium to Low.

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.

(Smallprint, as general orientation for task management:
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task requires info from the task reporter, then there should be instructions which info is needed.
If this task needs retesting, then the TestMe tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

I just want to note my experience running sparql.uniprot.org, which is currently about 10 times the size of Wikidata in number of triples. An old but still applicable slide set/video is https://www.youtube.com/watch?v=lSfeYdHfCGQ. We also have very large literals that would require BLOBs in most relational databases.

Specifically regarding the feature list: most of these should be implemented in a custom front end, which is what we do, and is something we have started to open-source and contribute to RDF4J (an Eclipse project) as a new Spring Boot module.

We implement read-only users for the SPARQL endpoint (plus double query parsing, again with RDF4J) so that only read SPARQL queries can be submitted. If that were broken, we would still have the user permissions to protect the store (defence in depth).
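The RDF4J-based double parsing mentioned above is the robust way to do this; purely as an illustration of the idea (not the actual UniProt code), a front end can refuse any request whose first operative keyword is not a read form:

```python
import re

# Naive illustration of a read-only gate in a SPARQL front end. Real
# deployments (such as the RDF4J-based one described above) fully parse
# the query; this keyword check is only a sketch and, unlike a parser,
# can be fooled, but it shows the shape of the defence-in-depth layer.

READ_FORMS = {"SELECT", "ASK", "CONSTRUCT", "DESCRIBE"}

def is_read_only(query):
    # Drop comments and the prologue (BASE/PREFIX declarations), then
    # inspect the first remaining keyword.
    text = re.sub(r"#[^\n]*", "", query)
    tokens = text.split()
    i = 0
    while i < len(tokens):
        kw = tokens[i].upper()
        if kw == "BASE":
            i += 2          # BASE <iri>
        elif kw == "PREFIX":
            i += 3          # PREFIX ns: <iri>
        else:
            return kw in READ_FORMS
    return False
```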
The dynamic runtime query time limit is configurable and can be turned off.
We have it on, but with a much higher allowance than the current WDQS settings.
We also have simple abuse protection based on IP and queries per second, again in the middleware.
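Such a per-IP queries-per-second guard can be as small as a token bucket kept in the middleware; a sketch with illustrative numbers:

```python
import time

# Sketch of simple per-IP abuse protection as a token bucket held by the
# front end: each IP gets `burst` tokens, refilled at `rate` tokens per
# second, and a request is allowed only if a token is available.
# The rate and burst values here are illustrative, not real settings.

class IpRateLimiter:
    def __init__(self, rate=5.0, burst=10.0, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets = {}   # ip -> (tokens, last_refill_time)

    def allow(self, ip):
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

The injectable clock makes the limiter easy to test, and a real deployment would also evict idle buckets to bound memory.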
Extension can be done in the middleware (servlet), e.g. by query rewriting, or in Virtuoso with custom functions.
The front end can provide proper HTTP caching support, written so that it understands when the store actually changes.
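One way to make such caching store-aware, sketched below, is to derive the ETag from a store version that the updater bumps on every data change, so that all cached results invalidate together; the version source and the query runner here are hypothetical:

```python
import hashlib

# Sketch of store-aware HTTP cache validation in a SPARQL front end:
# the ETag mixes a store version (bumped whenever the data changes,
# e.g. by the updater) with the query text, so clients can revalidate
# with If-None-Match and get 304 until the store actually changes.

class QueryCache:
    def __init__(self):
        # Hypothetical version marker; a real updater would bump this.
        self.store_version = "dump-2021-08-01"

    def etag(self, query):
        h = hashlib.sha256()
        h.update(self.store_version.encode())
        h.update(query.encode())
        return '"%s"' % h.hexdigest()[:16]

    def respond(self, query, if_none_match, runner):
        tag = self.etag(query)
        if if_none_match == tag:
            return 304, tag, None        # client copy is still valid
        return 200, tag, runner(query)   # runner executes the query
```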

Our solution for clustering is simply to have more copies of the whole database.

By chance, a post on how to load all of Wikidata into a Virtuoso instance appeared today.

@Jerven: these are two great updates, thank you. Is there any more recent presentation you could share?
Perhaps you could give a brown bag talk and record it as one :)

Working with Kingsley + others @ OpenLink on evaluating this makes a lot of sense.

I took a glance at Virtuoso.

I found nothing about scaling Virtuoso to a cluster (which IMO is what WMF needs, because of the growing amount of data and because we are reaching the limits of what one machine can handle)

A snippet from Wikipedia:
"Virtuoso is designed to take advantage of operating system threading support and multiple CPUs. It consists of a single process with an adjustable pool of threads shared between clients. Multiple threads may work on a single index tree with minimal interference with each other. One cache of database pages is shared among all threads and old dirty pages are written back to disk as a background process."

Virtuoso is IMO not the way forward for WMF. We need a distributed graph/column database with SPARQL on top. See https://phabricator.wikimedia.org/T289561 for an application that has exactly that (but it seems abandoned since Dec 2020, unfortunately)

Again, the Virtuoso 7.x Open Source Edition scales up to 80 billion triples, as demonstrated by UniProt's live instance.

You don't need the Virtuoso Cluster Edition until the scalability of the single-server edition is exhausted. Wikidata is a long way from reaching 80 Billion+ triples.

The key factor here is memory, which you get via virtual machines, courtesy of cloud services these days.

Virtuoso has also hosted DBpedia for the last 14 years, i.e., since its inception. Growth hasn't been an issue, and won't be going forward.

I hope that helps.

Kingsley

Unfortunately, since all clustering and replication solutions seem to be non-open-source, we would not be able to use a clustering paradigm other than the one we're using now, at least not without using the commercial version or getting some kind of special solution or special deal.

This might not be true. See https://phabricator.wikimedia.org/T289561

Are you https://www.wikidata.org/wiki/Q6413347 (the CEO of OpenLink Software, which makes Virtuoso)? If so, I suggest you state your bias clearly when writing here.

@KingsleyIdehen maybe you can help shed some light on the questions about Virtuoso here?

Data representation
  - Geographical triples specifying a coordinate system cannot be loaded: "http://www.wikidata.org/entity/Q405 Point(-49.2 -21.5)"^^http://www.opengis.net/ont/geosparql#wktLiteral (see https://github.com/openlink/virtuoso-opensource/issues/455)
  - "Point(-342.2 -52.7)"^^geo:wktLiteral is not accepted by Virtuoso (it is stricter about points than we are)
  - No extended date range support; Virtuoso has custom type support, so this may be fixable
  - IRI inlining seems to be standard database normalization, so we'd probably have to use long (64-bit) IDs, which may lead to larger space requirements. Virtuoso seems to be able to support such a scenario.
SPARQL capabilities
  - No support for custom services:
      - Label service
      - Geo search (Virtuoso has its own geosearch functionality)
      - MWAPI bridge (may be replaceable by Sponger?)
      - GAS algorithms
  - Custom functions unavailable:
      - Distance (may be supported under another name)
      - Coordinate parts
      - Decode URL
  - Date math functions work differently (e.g. diff is not in days but in seconds)
  - Functions need the custom bif: prefix
  - This can be bridged with custom functions and procedures, but the syntax would probably differ from what we're using now.
Service capabilities
  - Dynamic runtime query time limit: not sure whether it's configurable/changeable
  - Throttling/concurrency control: we will probably need to implement this ourselves
  - Read-only querying: Virtuoso has user permissions, so we may be able to get that working
  - Extensibility APIs are unclear: since there is no Servlet API as in Blazegraph, we may need to work with whatever Virtuoso allows.
Service features
  - No namespace support, but graph (quads) support can probably replace it
  - Default prefixes support?
  - Federation whitelist?
  - There seems to be a limit of 10000 SQL lines, which may be a problem for large queries/the Updater
  - LDF endpoint support unclear; may be implemented as a proxy
Operational issues
  - Debian only has 7.x as experimental
  - Virtuoso has a username/password/permission system, which we would need to isolate from the outside and make work with the Updater
  - We will need to isolate other capabilities to reduce security exposure; Virtuoso is a full-scale container, most of whose capabilities we won't be using.
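The stricter geographical validation noted under "Data representation" (e.g. Point(-342.2 -52.7) being rejected) is the kind of issue a small preprocessing pass over the dump could fix. A sketch, assuming plain wktLiteral Point values and that wrapping longitudes into the standard range is acceptable:

```python
import re

# Sketch: normalize longitudes in wktLiteral Point values into
# [-180, 180) so that Virtuoso's stricter GeoSPARQL validation accepts
# them. This only handles simple "Point(lon lat)" forms; anything else
# (and out-of-range latitudes) would need separate review.

POINT_RE = re.compile(r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")

def normalize_point(literal):
    def fix(m):
        lon, lat = float(m.group(1)), float(m.group(2))
        lon = ((lon + 180.0) % 360.0) - 180.0   # wrap into [-180, 180)
        return "Point(%g %g)" % (lon, lat)
    return POINT_RE.sub(fix, literal)
```

For example, the rejected Point(-342.2 -52.7) becomes Point(17.8 -52.7), while already-valid points pass through unchanged.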

I agree with Kingsley that you don't need a distributed SPARQL engine when the knowledge graph fits on a single machine and will continue to do so in the future. This is clearly the case for Wikidata, since it is even the case for the ten times larger UniProt (which at the time of this writing already contains over 90 billion triples).

In fact, I would consider distributing the knowledge graph over multiple machines in such a scenario suboptimal because distributing (as opposed to just replicating) the data incurs a significant performance overhead. You distribute only if you need to.

As Jerven pointed out, if you have a high query load, you can just replicate the knowledge graph on multiple machines (one copy per machine) and distribute the queries over these machines. This is simple and effective.
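A minimal sketch of that setup (the endpoint URLs are illustrative): every replica holds the full graph, so any of them can answer any read query, and the dispatcher simply rotates through them.

```python
import itertools

# Sketch of query distribution over full replicas: because each machine
# holds the entire graph, no query planning across machines is needed;
# a round-robin dispatcher is enough. Endpoint URLs are made up.

class ReplicaPool:
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def pick(self):
        return next(self._cycle)

pool = ReplicaPool([
    "https://sparql-1.example.org/sparql",
    "https://sparql-2.example.org/sparql",
    "https://sparql-3.example.org/sparql",
])
```

A production dispatcher would add health checks and retry on a different replica, but the core of "replicate, don't distribute" is just this rotation.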