
Evaluate Apache Rya as alternative to Blazegraph
Open, Medium, Public

Description

Evaluate Rya as an alternative to Blazegraph, with a Hadoop-based columnar store backend that scales easily.

"Apache Rya is a scalable RDF Store that is built on top of a Columnar Index Store (such as Accumulo). It is implemented as an extension to RDF4J to provide easy query mechanisms (SPARQL, SERQL, etc) and Rdf data storage (RDF/XML, NTriples, etc)."

Pro:

  • scales via Accumulo, which is built on a Hadoop cluster
  • has SPARQL endpoint
  • Apache top-level project

Con:

  • no commits since Dec 2020
  • few stars and forks on GitHub


See also https://www.wikidata.org/wiki/Wikidata:WikiProject_Limits_of_Wikidata#Apache_Rya

Event Timeline

So9q updated the task description.
Gehel triaged this task as Medium priority. Aug 30 2021, 3:12 PM
Gehel moved this task from All WDQS-related tasks to Scaling on the Wikidata-Query-Service board.
Gehel added a subscriber: Gehel.

The Search Platform team will dig into this when we start work on evaluating Blazegraph alternatives.

Slides: https://events.static.linuxfound.org/sites/events/files/slides/Rya_ApacheBigData_20170518.pdf
Pros:

  • pre-computed joins
  • indexes based on SPO, OPS, POS
  • lots of query-optimization
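
The SPO/OPS/POS indexing scheme mentioned in the slides can be sketched in a few lines: each index is the same set of triples sorted under a different component order, so any triple pattern with bound components becomes a prefix range scan over one of them. A minimal illustration in Python (hypothetical sketch, not Rya's actual code; the example triples and helper names are made up):

```python
# Hypothetical sketch: answering triple patterns with range scans over
# three sorted permutation indexes (SPO, POS, OPS).  Not Rya's real code.
from bisect import bisect_left, bisect_right

triples = [
    ("wd:Q42", "wdt:P31", "wd:Q5"),
    ("wd:Q42", "wdt:P106", "wd:Q36180"),
    ("wd:Q1", "wdt:P31", "wd:Q1454986"),
]

# The three sorted permutations: subject-predicate-object, etc.
spo = sorted(triples)
pos = sorted((p, o, s) for s, p, o in triples)
ops = sorted((o, p, s) for s, p, o in triples)

def prefix_scan(index, prefix):
    """Return all rows in a sorted index that start with `prefix` (a tuple)."""
    lo = bisect_left(index, prefix)
    hi = bisect_right(index, prefix + ("\uffff",) * (3 - len(prefix)))
    return index[lo:hi]

# Pattern (?s, wdt:P31, wd:Q5): predicate and object bound -> scan POS.
matches = prefix_scan(pos, ("wdt:P31", "wd:Q5"))
subjects = [s for _, _, s in matches]
```

Whichever components of the pattern are bound determine which permutation gives a contiguous scan, which is why three permutations suffice to cover all single-triple patterns.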

The dev mailing list of Rya is unfortunately very quiet. https://www.mail-archive.com/dev@rya.apache.org/maillist.html

For some reason, no one volunteered to host a presentation at ApacheCon this year:
https://www.mail-archive.com/dev@rya.apache.org/msg00132.html

As a result there is no "RDF / Linked data track" this year. :/

Here is the report for 2021:

Description:

The mission of Apache Rya is the creation and maintenance of software related to scalable storage, retrieval, and analysis of RDF data.

Issues:

There are no issues requiring board attention

Membership Data:

Apache Rya was founded 2019-09-17 (2 years ago)
There are currently 12 committers and 11 PMC members in this project.
The Committer-to-PMC ratio is roughly 1:1.

Community changes, past quarter:

  • No new PMC members. Last addition was Adina Crainiceanu on 2019-09-17.
  • No new committers were added.

Project Activity:

Apache Rya 4.0.1 was released on 2020-12-22.
Planning to submit a proposal for ApacheCon RDF/linked data track

Community Health:

Not much activity in the past quarter, after the last release in December

PMC is discussing ways to increase activity and the number of committers
100 subscribers to the dev list
dev@rya.apache.org had a 78% decrease in traffic in the past quarter (12 emails compared to 54)
notificati...@rya.apache.org had a 94% decrease in traffic in the past quarter (5 emails compared to 82)
0 issues opened in JIRA, past quarter (-100% decrease)
0 issues closed in JIRA, past quarter (-100% decrease)
0 commits in the past quarter (-100% decrease)

The streams mentioned in T291089 by Addshore could also be used to populate an Apache Rya backend, probably with little effort, as it is built on Apache Accumulo, which uses Hadoop.

So9q renamed this task from "Evaluate Rya as alternative to Blazegraph" to "Evaluate Apache Rya as alternative to Blazegraph". Mon, Sep 27, 9:56 PM

We looked a bit into Apache Rya. A couple of observations:

  1. The instructions on https://github.com/apache/rya are a mess. Compiling the code requires an old version of the JDK (version 8), which is not documented anywhere and took us some time to find out. Compilation takes forever. The instructions for getting a working Rya server are cryptic, mentioning all kinds of other libraries and projects, but without instructions on how exactly to install them. Loading the data also seems to be non-trivial: you have to write code for this. It's certainly all doable, but this does not look like a well-maintained project.
  2. We had a look at the 2012 paper https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf (which is well cited) and the 2017 slides https://events.static.linuxfound.org/sites/events/files/slides/Rya_ApacheBigData_20170518.pdf . The slides are in sync with what is written in the paper, and they are very instructive in understanding how the engine works. It also looks to me like they describe the current state of Rya (that is, there have not been any major changes to the basic architecture since then).
  3. The underlying data store (Accumulo or MongoDB) is used only for storing the raw data (the triples). The actual operations on this data (like the JOIN operations, which are central for processing SPARQL queries) are done by the Rya code. This makes sense, because a NoSQL store like MongoDB does not support JOIN operations; that's just not what it's made for.
  4. The basic principle of Rya's JOIN operations is explained on slide 15 of the presentation, with variations of it on slides 16, 18, 31, and 32. The idea is to start with the most selective triple, consider the set of matching entities for that triple (which is hopefully small), and then look up each of these (hopefully few) entities in the appropriate index.
  5. This principle is efficient only when you have at least one highly selective triple in your SPARQL query. In the paper mentioned above, Rya is evaluated on the Lehigh University Benchmark (LUBM), which is a well-known but rather old benchmark with rather special queries. Namely, all queries have at least one very selective triple, typically of the kind "variable <type> <some fixed type>". There is not a single query with a triple for the <type> predicate where the object is also a variable.
  6. When you don't have any highly selective triple, Rya is bound to be slow, because it then has to deal with very large sets of entities, which it will look up one by one. Also, Rya is not really made to be particularly efficient on a single machine. Its main purpose is to be efficient when distributed over several machines. We have already discussed that it does not make sense to distribute a moderately sized dataset like Wikidata over several machines when you can easily process it on a single machine. Distributing a dataset always incurs a large performance overhead (because you need to send data back and forth between machines during query processing), and you only do it when you have to.
  7. Rya's performance bottleneck is actually very similar to that of Blazegraph. When you look at the many example queries for the WDQS on https://query.wikidata.org , almost none of them require the computation of a large intermediate result, for the simple reason that such queries don't work well with Blazegraph (they take forever or time out). Large intermediate results occur either when you have no single very selective triple in your query, or when there is no LIMIT, or when the LIMIT is preceded by an ORDER BY or GROUP BY (so that you have to compute a large intermediate result before you can LIMIT it to the top-ranked items).
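
To make the join principle described above concrete, here is a toy index-nested-loop join in Python (hypothetical sketch with invented data and names, not Rya's actual code): the cost of the join is proportional to the number of bindings produced by the first pattern, which is why everything hinges on starting from one selective triple.

```python
# Hypothetical sketch of an index-nested-loop join: evaluate the most
# selective pattern first, then probe an index once per binding.
from collections import defaultdict

triples = [
    ("alice", "type", "Student"),
    ("bob", "type", "Student"),
    ("carol", "type", "Professor"),
    ("alice", "advisor", "carol"),
]

# POS-style index: (predicate, object) -> set of matching subjects.
po_index = defaultdict(set)
for s, p, o in triples:
    po_index[(p, o)].add(s)

# Query: ?y advisor ?x . ?x type Professor
# Step 1: the selective pattern (?x type Professor) yields few bindings.
professors = po_index[("type", "Professor")]

# Step 2: probe the index once per binding of ?x.  Total cost is
# proportional to |professors| -- small only if step 1 was selective.
results = {(y, x) for x in professors for y in po_index[("advisor", x)]}

# If the only starting pattern were (?x type ?t) -- object unbound --
# there would be no selective entry point: the candidate set would be
# essentially the whole store, which is exactly the slow case above.
```

With the toy data, step 1 binds ?x to a single professor, so step 2 does a single index probe; with a non-selective start the same loop degenerates into a full scan with a per-row lookup.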

In summary, Rya does not look like a good choice for several reasons, most notably: not well-maintained, efficient only for quite particular kinds of queries, and similar performance bottlenecks as Blazegraph.

PS: All that being said, I would be thrilled if someone could dig through the cryptic instructions and provide a Rya-based SPARQL endpoint for the complete Wikidata. Then we could confirm (or disprove) the suspicions mentioned above very easily.

🤩 big thanks for sharing this!

We looked a bit into Apache Rya. A couple of observations:

  1. The instructions on https://github.com/apache/rya are a mess. Compiling the code requires an old version of the JDK (version 8), which is not documented anywhere and took us some time to find out. Compilation takes forever. The instructions for getting a working Rya server are cryptic, mentioning all kinds of other libraries and projects, but without instructions on how exactly to install them. Loading the data also seems to be non-trivial: you have to write code for this. It's certainly all doable, but this does not look like a well-maintained project.

I'm sorry to hear that. I wrote the last committer a while back and have yet to receive a response. Not a good sign.

  2. We had a look at the 2012 paper https://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf (which is well cited) and the 2017 slides https://events.static.linuxfound.org/sites/events/files/slides/Rya_ApacheBigData_20170518.pdf . The slides are in sync with what is written in the paper, and they are very instructive in understanding how the engine works. It also looks to me like they describe the current state of Rya (that is, there have not been any major changes to the basic architecture since then).
  3. The underlying data store (Accumulo or MongoDB) is used only for storing the raw data (the triples). The actual operations on this data (like the JOIN operations, which are central for processing SPARQL queries) are done by the Rya code. This makes sense, because a NoSQL store like MongoDB does not support JOIN operations; that's just not what it's made for.
  4. The basic principle of Rya's JOIN operations is explained on slide 15 of the presentation, with variations of it on slides 16, 18, 31, and 32. The idea is to start with the most selective triple, consider the set of matching entities for that triple (which is hopefully small), and then look up each of these (hopefully few) entities in the appropriate index.
  5. This principle is efficient only when you have at least one highly selective triple in your SPARQL query. In the paper mentioned above, Rya is evaluated on the Lehigh University Benchmark (LUBM), which is a well-known but rather old benchmark with rather special queries. Namely, all queries have at least one very selective triple, typically of the kind "variable <type> <some fixed type>". There is not a single query with a triple for the <type> predicate where the object is also a variable.
  6. When you don't have any highly selective triple, Rya is bound to be slow, because it then has to deal with very large sets of entities, which it will look up one by one. Also, Rya is not really made to be particularly efficient on a single machine. Its main purpose is to be efficient when distributed over several machines. We have already discussed that it does not make sense to distribute a moderately sized dataset like Wikidata over several machines when you can easily process it on a single machine. Distributing a dataset always incurs a large performance overhead (because you need to send data back and forth between machines during query processing), and you only do it when you have to.

Interesting, I thought Wikidata was getting too big for one machine, but I might have misunderstood the WMF operations team and the statements in the tickets surrounding BG.

Wikidata could easily triple its number of triples within a year if all horses are let loose and people start importing all scientific papers, books, and chemicals in Wikipedia, and all the authors associated with those.

  7. Rya's performance bottleneck is actually very similar to that of Blazegraph. When you look at the many example queries for the WDQS on https://query.wikidata.org , almost none of them require the computation of a large intermediate result, for the simple reason that such queries don't work well with Blazegraph (they take forever or time out). Large intermediate results occur either when you have no single very selective triple in your query, or when there is no LIMIT, or when the LIMIT is preceded by an ORDER BY or GROUP BY (so that you have to compute a large intermediate result before you can LIMIT it to the top-ranked items).

Interesting! I was unaware of this, but it makes sense from my interactions with BG.

In summary, Rya does not look like a good choice for several reasons, most notably: not well-maintained, efficient only for quite particular kinds of queries, and similar performance bottlenecks as Blazegraph.

Big thanks for taking the time to look into this. Rya was the least bad choice IMO until I read your insights.