
Evaluate Halyard as alternative to Blazegraph
Closed, Declined · Public

Description

An HBase/RDF4J-based, horizontally scaling SPARQL service originally developed by a team at Merck.
https://github.com/Merck/Halyard

Related Objects

Event Timeline

Gehel triaged this task as Medium priority. Aug 30 2021, 3:13 PM
Gehel moved this task from Incoming to Scaling on the Wikidata-Query-Service board.

The Search Platform team will dig into this when we start work on evaluating Blazegraph alternatives.

I researched this solution a little:
https://merck.github.io/Halyard/img/architecture.png


The mapping from statement patterns to HBase scan patterns is crucial to performance. It would be interesting to test a system like this with Wikidata and run some test queries.

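As a rough illustration of the idea (an assumed simplification, not Halyard's actual key encoding or index layout), a triple pattern can be answered by picking the index whose leading key components are bound and scanning the matching key-prefix range:

```java
// Illustrative sketch only: choosing an index and a scan prefix for a triple
// pattern. Bound components are concatenated into a key prefix so the store
// can answer the pattern with a single HBase-style range scan.
class ScanPlanner {

    enum Index { SPO, POS, OSP }

    static class Pattern {
        final String s, p, o; // null means an unbound variable
        Pattern(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    // Pick the index whose leading key component is bound in the pattern.
    static Index chooseIndex(Pattern t) {
        if (t.s != null) return Index.SPO;  // subject bound: scan SPO
        if (t.p != null) return Index.POS;  // predicate bound: scan POS
        if (t.o != null) return Index.OSP;  // object bound: scan OSP
        return Index.SPO;                   // nothing bound: full scan
    }

    // Build the longest bound prefix for the chosen index's key order.
    static String scanPrefix(Pattern t) {
        switch (chooseIndex(t)) {
            case POS: return join(t.p, t.o, t.s);
            case OSP: return join(t.o, t.s, t.p);
            default:  return join(t.s, t.p, t.o);
        }
    }

    private static String join(String a, String b, String c) {
        StringBuilder sb = new StringBuilder();
        for (String part : new String[]{a, b, c}) {
            if (part == null) break;        // stop at the first unbound component
            sb.append(part).append('\u0000');
        }
        return sb.toString();
    }
}
```

The point of keeping several key orderings is that every pattern shape, e.g. `?person wdt:P31 wd:Q5`, becomes a contiguous range scan on some index rather than a filtered full scan.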

Here is a thread about triple load performance: https://groups.google.com/g/halyard-users/c/-owRnSjd26U

Unfortunately, the software does not seem to be maintained, nor does it have a large user base, judging by the Google Groups activity seen here:


That's a bit disappointing, because it does look like it can scale and has been put through its paces. https://www.linkedin.com/pulse/halyard-tipstricks-trillion-statements-challenge-adam-sotona/

I'm actually trying to get this to compile with the latest library versions, but a few things have changed since then, so it's a bit of a slog.

Here is their SPARQL evaluation strategy:

The actual Halyard evaluation strategy turns the previous model inside out. I call it the "push model". The SPARQL query is transformed into a chain (or tree) of pipes (BindingSetPipe), which is then asynchronously filled with data. An army of worker threads periodically takes the highest-priority requests from a priority queue and performs them (usually by requesting data from the underlying store and processing it through the pipes). Each worker thread can serve its own synchronous requests to the underlying storage system, or process data through the system, almost independently of the others.

There are two parts of the model implementation that are critical to making it work. One hard part is synchronisation of the joins, where bad synchronisation leads to data corruption. The second (equally important) is balancing the worker threads' jobs: it was critical to design the system so that worker threads do not block each other. When most of the worker threads are blocked, performance degrades to something similar to the previous model. The Halyard strategy handles the worker jobs in a priority queue, where the priority is determined by the position in the parsed SPARQL query tree. Pipe iterations and active pumps are further methods that connect the Halyard strategy model with the original RDF4J API (or, in some unfinished cases, with Iterations implemented in the original model).

For example, suppose you have a SPARQL query containing an inner join. The request for data from the left part of the join is enqueued with priority N. A worker thread that asynchronously delivers that data to the left pipe of the join also enqueues a request for the relevant data from the right part of the join (with priority N+1). The higher priority of the right part is important: it reflects the fact that once you have the right-hand data, you can finish the join, "reduce" the cached load, and proceed down the pipes. Meanwhile (based on the priority queue) the other worker threads can simultaneously prefetch more data for the left part of the join. In the ideal situation you see a continuous CPU load across all worker threads in a connected Java profiler.

I should mention some numbers here. In my experiments the Halyard strategy was approximately 250 times faster with 50 worker threads on a SPARQL query containing 26 various joins. The effectiveness of the Halyard strategy increases with more joins and unions. However, feel free to compare my experimental measurements with your own. The strategy can be selected individually for each Halyard repository, so for an experiment you can set up two repositories (both pointing to the same data) with different SPARQL evaluation strategies.

source: https://www.linkedin.com/pulse/inside-halyard-2-when-one-working-thread-enough-push-versus-sotona
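The push model described above can be sketched roughly as follows (assumed structure with hypothetical names; Halyard's real BindingSetPipe machinery is considerably more involved). The key detail from the article is that the right side of a join is enqueued at priority N+1, so joins can complete and release buffered bindings early:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch of the "push model": work items carry a priority derived
// from their position in the query tree, and free workers always take the
// highest-priority item next.
class PushModelSketch {

    static class Task implements Comparable<Task> {
        final int priority;       // deeper in the query tree = higher priority
        final String description;
        final Runnable work;
        Task(int priority, String description, Runnable work) {
            this.priority = priority; this.description = description; this.work = work;
        }
        // Higher priority values are polled first.
        public int compareTo(Task other) { return Integer.compare(other.priority, priority); }
    }

    final PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>();
    final List<String> executionOrder = new ArrayList<>();

    void submitJoin(int basePriority) {
        // Left side of the join at priority N...
        queue.add(new Task(basePriority, "left", () -> {}));
        // ...right side at N+1, so arriving right-hand data drains the join first.
        queue.add(new Task(basePriority + 1, "right", () -> {}));
    }

    // Single-threaded drain for illustration; the real model uses many workers
    // pulling from the shared queue concurrently.
    void drain() {
        Task t;
        while ((t = queue.poll()) != null) {
            t.work.run();
            executionOrder.add(t.description);
        }
    }
}
```

Draining the queue shows the right-hand request of a join executing before further left-hand prefetching, which is exactly the ordering the article argues keeps cached load low.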

@Hannah_Bast maybe this interests you? Do you think this system would perform well considering the load on WDQS and the type of queries we have?

> That's a bit disappointing, because it does look like it can scale and has been put through its paces. https://www.linkedin.com/pulse/halyard-tipstricks-trillion-statements-challenge-adam-sotona/
>
> I'm actually trying to get this to compile with the latest library versions, but a few things have changed since then, so it's a bit of a slog.

Interesting! I have never worked with Java software myself; good luck!
I looked at the source code, and it seems well structured, well written, and clear, so Adam seems like a competent developer.

Maybe WMF can hire him for consulting work if they choose to go this way.

> That's a bit disappointing, because it does look like it can scale and has been put through its paces. https://www.linkedin.com/pulse/halyard-tipstricks-trillion-statements-challenge-adam-sotona/

I found this interesting in the article above:

> However, requesting a limited number of results from ordered data requires Halyard to calculate all results, order them, and cut at the limit (questions like "give me the first ten results sorted by label"). This is a brief explanation of why I used modified BSBM queries (with ORDER BY removed), and why the results cannot be directly compared to the original BSBM benchmarks.
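The cost quoted above can be reduced somewhat on the engine side. A minimal sketch (plain Java, not Halyard code) of streaming top-k with a bounded heap: all n results still have to be produced, but only k are held in memory and the full O(n log n) sort is avoided.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of why "first ten sorted by X" is expensive without a sort index:
// every result must still be streamed through, but a size-k min-heap keeps
// only the current top-k instead of materialising and sorting everything.
class TopK {
    static List<Integer> topK(Iterable<Integer> results, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap of the k largest seen
        for (int r : results) {
            heap.add(r);
            if (heap.size() > k) heap.poll();                // evict the smallest
        }
        List<Integer> out = new ArrayList<>(heap);
        out.sort(Comparator.reverseOrder());                 // final order of just k items
        return out;
    }
}
```

This does not remove the need to enumerate all results (the fundamental limitation the article describes), only the memory and sorting overhead of the LIMIT cut.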

A dual-backend strategy could be used: QLever for queries like these, at which it is very good, and the Halyard endpoint for the other types of queries to which it is better suited.

A bonus would be a single user interface for both QLever and Halyard that helps the user optimize the query and choose the better-suited endpoint.
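A minimal sketch of that routing idea (hypothetical endpoint URLs; a real router would parse the query with an actual SPARQL parser rather than match keywords):

```java
// Hypothetical dual-backend router: send a query to QLever when it contains
// constructs Halyard handles poorly (here, just ORDER BY combined with LIMIT),
// otherwise to Halyard. Keyword matching is a deliberate simplification.
class QueryRouter {
    static final String QLEVER_ENDPOINT  = "https://qlever.example/sparql";   // placeholder URL
    static final String HALYARD_ENDPOINT = "https://halyard.example/sparql";  // placeholder URL

    static String chooseEndpoint(String sparql) {
        String q = sparql.toUpperCase();
        boolean orderedWithLimit = q.contains("ORDER BY") && q.contains("LIMIT");
        return orderedWithLimit ? QLEVER_ENDPOINT : HALYARD_ENDPOINT;
    }
}
```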

The project was mainly driven by Merck, and according to one of the people (at least formerly) involved in the project, it is no longer developed there: https://twitter.com/jindrichmynarz/status/1424976369495199744

I wouldn't discount this project just yet, as Merck may have decided to move away from RDF altogether. It is a bit of a commitment, and for-profit companies need to juggle costs. I also say that because I have been able to get it to compile against fairly recent libraries, and preliminary experimentation looks promising. Functionally, there's quite a bit there that I think would be useful here (bulk load, bulk export, updates, etc., all through MapReduce). It will be several weeks before I can pass further judgement, but I have seen enough to say let's keep this door open and encourage more eyes on it.

Hi, I would like to suggest my Halyard fork, Halyard* (supports RDF*), https://github.com/pulquero/Halyard, for consideration. It contains numerous non-trivial changes beyond the original to handle high-volume transactional queries. More crucially, it contains fixes to pass some previously skipped SPARQL test suite cases.

I'm very familiar with the internal workings, and I'm happy to offer any help with the evaluation. IMO, Halyard does a lot of things right, and I'm interested in keeping the codebase going. It would be a shame for it to disappear into obscurity; to my mind there is no worthy open-source alternative. I spent a lot of time looking for one (a big-data RDF store) before discovering Halyard.

@Pulquero Thank you for this interesting piece of information. I have a few questions:

  1. Do you have a running SPARQL endpoint for Halyard* on Wikidata for us to play around with?
  2. If not, how hard would it be for you to set one up?
  3. Is it a necessity to use Hadoop or can Halyard* also be used on a single machine? I am asking this because one machine suffices to serve the complete Wikidata. Distributing the data over multiple machines incurs overhead, which is usually very significant.
  4. Is there any way to contact you outside of this thread? I didn't find any contact information on your GitHub profile.

> @Hannah_Bast maybe this interests you? Do you think this system would perform well considering the load on WDQS and the type of queries we have?

@So9q Sorry, Dennis, I forgot to answer this question of yours. In general, I think that any system that is discussed in this or the other threads should at least come with a running instance on a reasonably recent dump of the complete Wikidata, together with a brief specification of the machine on which that instance is running. Hence my question to @Pulquero . Then one can very quickly assess with a few SPARQL queries whether that particular system is an interesting option or not. It's of course fine if a system is still being developed, but a working instance that allows at least a reasonable subset of SPARQL is really the minimum. Otherwise, discussions become academic very quickly.

  1. No.
  2. I'm happy to try to work with someone here to make that happen. I've already been in discussions with @nguyenm9.
  3. You need at least a single HBase node. It can all be run on a single machine. What is the spec of the current machine you are using?
  4. mj hale at yahoo com (use dots instead of spaces)

A general question, is there a list of technical shortcomings of the existing blazegraph solution?

@Pulquero AFAIK the two main problems with Blazegraph are:

  1. The project is not really active anymore: https://github.com/blazegraph/database . The reason is that the Blazegraph team was acqui-hired by Amazon a few years ago, and Blazegraph essentially became Neptune, Amazon's proprietary graph database https://en.wikipedia.org/wiki/Blazegraph
  2. Blazegraph has several performance issues. The Wikidata Query Service at https://query.wikidata.org uses a variety of hacks to make Blazegraph usable for typical queries. For example, there is a dedicated SERVICE to handle the huge rdfs:label predicate, and the original Wikidata is pre-processed in various ways to make it more palatable for Blazegraph. But it's still rather slow compared to other engines, and queries with large result sets always time out.

In T289621#7685585, @Pulquero wrote:

> Hi, I would like to suggest my Halyard fork, Halyard* (supports RDF*), https://github.com/pulquero/Halyard, for consideration.

Note that RDF* is no longer being developed, having been dropped in favor of RDF-star. Though these are pronounced the same, they have very different meanings.

Neither RDF* nor RDF-star has been published as a ratified specification. RDF* was an exploratory and now obsolete paper some years ago. RDF-star is currently being incubated in a focus group of the W3C RDF-DEV Community Group, and a charter for a W3C RDF-star Working Group is currently being drafted.

The current RDF-star draft differs significantly from the original RDF* concept paper, and implementations of either may not be interoperable with other implementations of either.

MPhamWMF lowered the priority of this task from Medium to Low. Mar 29 2022, 1:33 PM

That is a shame; I've been making improvements to really give the other stores a run for their money. Well, for reference, the results so far: in a single-box configuration, the current dump requires <1.5 TB of disk space once loaded. A preliminary run of the difficult query below, without full optimisations enabled, was taking on the order of minutes to execute (I'm in the process of testing it properly).

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name (COUNT(?profession) AS ?count) WHERE {
  ?person wdt:P31 wd:Q5 .
  ?person wdt:P106 ?profession .
  ?person rdfs:label ?name .
  FILTER (LANG(?name) = "en")
}
GROUP BY ?person ?name
ORDER BY DESC(?count)
LIMIT 100