Create list of criteria for graph backend candidates for WDQS
Closed, Resolved (Public)
Assigned To

Authored By

	MPhamWMF
	Sep 16 2021, 5:19 PM

Description

As a WDQS maintainer, I want to be able to evaluate graph backend candidates for migrating WDQS off of Blazegraph, so that I can create a ranking/survey of alternatives, and ultimately choose the optimal one.

Prior candidate list and survey from when Blazegraph was chosen: https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing

We will likely not use the same list as before, and will need to create a new list of criteria (and weighting of those criteria). The final list will aim to combine technical scaling considerations, as well as a relatively small finite list of community-sourced criteria (in the case that they do not totally overlap). While there will eventually be a better process for consolidating a final list of community-sourced criteria, (comments in) this ticket can be used to start collecting ideas for criteria.

AC:

a list of criteria to evaluate graph backends for the purpose of scaling WDQS

Related Objects

Status	Assigned	Task
Open	None	T206560 [Epic] Evaluate alternatives to Blazegraph
Resolved	AWesterinen	T275398 Create an updated survey of graph backends for WDQS
Resolved	AWesterinen	T291207 Create list of criteria for graph backend candidates for WDQS

Event Timeline

MPhamWMF created this task. Sep 16 2021, 5:19 PM

MPhamWMF triaged this task as High priority. Sep 20 2021, 2:00 PM

MPhamWMF moved this task from Incoming to Scaling on the Wikidata-Query-Service board.

QLever - https://github.com/ad-freiburg/qlever - https://scholia.toolforge.org/work/Q108730896

The paper reports benchmarks favorable for QLever. I cannot get path queries working on the public endpoint.

(QLever could be a candidate. It is not a criterion.)

Hernández, Daniel & Hogan, A. & Krötzsch, M. (2015). Reifying RDF: What works well with Wikidata? 1457. 32-47.

Abstract: In this paper, we compare various options for reifying RDF triples. We are motivated by the goal of representing Wikidata as RDF, which would allow legacy Semantic Web languages, techniques and tools - for example, SPARQL engines - to be used for Wikidata. However, Wikidata annotates statements with qualifiers and references, which require some notion of reification to model in RDF. We thus investigate four such options: (1) standard reification, (2) n-ary relations, (3) singleton properties, and (4) named graphs. Taking a recent dump of Wikidata, we generate the four RDF datasets pertaining to each model and discuss high-level aspects relating to data sizes, etc. To empirically compare the effect of the different models on query times, we collect a set of benchmark queries with four model-specific versions of each query. We present the results of running these queries against five popular SPARQL implementations: 4store, BlazeGraph, GraphDB, Jena TDB and Virtuoso.

Consider graph databases that support RDF-star and SPARQL-star, such as RDF4J, AnzoGraph and GraphDB. RDF-star and SPARQL-star are proposed extensions to the RDF and SPARQL standards that provide a more convenient way to annotate RDF statements and to query such annotations (Wikidata qualifiers and references), bridging the gap between the RDF world and the Property Graph world.

See W3C Draft Community Group Report 01 July 2021
https://www.w3.org/community/rdf-dev/2021/07/02/new-public-draft-of-the-rdf-star-report/

https://rdf4j.org/documentation/programming/rdfstar/
https://graphdb.ontotext.com/enterprise/devhub/rdf-sparql-star.html
https://cambridgesemantics.com/anzo-platform/
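
To make the difference concrete, here is a minimal sketch in plain Python (no RDF library) contrasting standard reification with an RDF-star-style quoted triple. Triples are modelled as tuples, and all identifiers (`ex:item`, `ex:prop`, `ex:qualifier`, `ex:stmt1`) and both function names are hypothetical placeholders, not real Wikidata entities or any store's API. It only illustrates the shape of the data, not how RDF4J, AnzoGraph or GraphDB store it.

```python
def standard_reification(stmt_id, triple, annotations):
    """Describe `triple` via an rdf:Statement node, then annotate that node."""
    s, p, o = triple
    triples = [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
    ]
    # Each qualifier/reference becomes one more triple about the statement node.
    triples += [(stmt_id, q, v) for q, v in annotations]
    return triples


def quoted_triple_annotation(triple, annotations):
    """RDF-star style: the triple itself appears in the subject position."""
    return [(triple, q, v) for q, v in annotations]


# Hypothetical statement with a single qualifier.
statement = ("ex:item", "ex:prop", "ex:value")
qualifiers = [("ex:qualifier", "2021")]

reified = standard_reification("ex:stmt1", statement, qualifiers)
quoted = quoted_triple_annotation(statement, qualifiers)

print(len(reified), "triples with standard reification")
print(len(quoted), "triple with a quoted triple as subject")
```

In this toy model, one annotated statement needs four bookkeeping triples plus one per annotation under standard reification, and a single triple per annotation with a quoted triple.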

Iamamz3 subscribed. Nov 3 2021, 11:42 AM

Re the prior candidate list:

Prior candidate list and survey from when Blazegraph was chosen: https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing

It would be great to have an annotated scale to help figure out which software looks like the best candidate, and to avoid gut judgment.

For a "multi-operation ACID" criterion, it might look like:

0: No ACID guarantees
1: ACID guarantees for primary representation, but async secondary representations (indices)
2: ACID guarantees for both primary and secondary representations
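
As a rough illustration of how such an annotated scale could feed a weighted ranking, here is a minimal Python sketch. The three ACID levels come from the list above; everything else (the `Criterion` class, the `weighted_score` function, the weights, the second criterion and the candidate's grades) is a hypothetical assumption for illustration, not part of the survey.

```python
from dataclasses import dataclass
from enum import IntEnum


class AcidLevel(IntEnum):
    """Annotated levels for the "multi-operation ACID" criterion."""
    NONE = 0                   # no ACID guarantees
    PRIMARY_ONLY = 1           # ACID for primary representation, async indices
    PRIMARY_AND_SECONDARY = 2  # ACID for primary and secondary representations


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float   # hypothetical relative importance
    max_level: int  # highest level on this criterion's annotated scale


def weighted_score(criteria, levels):
    """Return a 0..1 score: the weight-averaged fraction of each maximum level."""
    total_weight = sum(c.weight for c in criteria)
    achieved = sum(c.weight * levels[c.name] / c.max_level for c in criteria)
    return achieved / total_weight


criteria = [
    Criterion("multi_operation_acid", weight=2.0, max_level=2),
    Criterion("example_other_criterion", weight=1.0, max_level=2),  # hypothetical
]

# Hypothetical grades for one candidate backend.
candidate_a = {
    "multi_operation_acid": AcidLevel.PRIMARY_ONLY,
    "example_other_criterion": 2,
}

print(round(weighted_score(criteria, candidate_a), 3))
```

Normalising each grade by its own maximum level lets criteria with different short scales be combined without stretching them all to 0-10.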

Regarding ACID in particular, there is another lever, the "isolation level"; see https://www.postgresql.org/docs/current/transaction-iso.html.

Also, the current 0-10 scale is way too large: it is too much work to document, for every row, what every number between zero and ten means.

It seems clear that the sheet is just an indicator that only gives clues about what might work best; grading well on it cannot be the primary motivation for picking a solution.

Design for ~10X growth, but plan to rewrite before ~100X

Jeff Dean, “Challenges in Building Large-Scale Information Retrieval Systems,” Google, http://static.googleusercontent.com/media/research.google.com/en//people/jeff/WSDM09-keynote.pdf

YULdigitalpreservation subscribed. Nov 11 2021, 2:14 PM

AndySeaborne subscribed. Nov 20 2021, 8:09 AM

nguyenm9 subscribed. Nov 22 2021, 3:11 PM

Daniel_Mietchen subscribed. Dec 11 2021, 7:46 AM

Jneubert subscribed. Jan 31 2022, 1:59 PM

MPhamWMF moved this task from Scaling to Current work on the Wikidata-Query-Service board. Feb 14 2022, 4:33 PM

MPhamWMF added a project: Discovery-Search (Current work).

MPhamWMF moved this task from Incoming to In Progress on the Discovery-Search (Current work) board. Feb 22 2022, 8:02 PM

Gehel assigned this task to AWesterinen. Feb 28 2022, 4:28 PM

MPhamWMF moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Mar 29 2022, 1:35 PM

The criteria are defined in the paper "WDQS Backend Alternatives", published at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_alternatives.