
Evaluate creating LDF server for WDQS
Closed, Resolved · Public

Description

Blazegraph has a Triple Pattern Fragments (Linked Data Fragments) server implementation: https://github.com/hartig/BlazegraphBasedTPFServer

We should look into implementing it to support heavy queries that may produce a lot of results.
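
For a sense of the interface: a TPF server exposes a single HTTP endpoint, a client binds any subset of subject/predicate/object per request, and the server answers with one page of matching triples as plain RDF. Below is a minimal sketch of such a request; the endpoint URL is a placeholder, and the parameter names are the typical ones a TPF server advertises via its hypermedia controls, not necessarily what our deployment would use.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TpfRequestSketch {
    public static void main(String[] args) throws Exception {
        // Ask for the pattern ( ?s wdt:P31 wd:Q5 ): subject unbound,
        // predicate and object bound. The endpoint is a placeholder.
        String query = "?predicate=" + URLEncoder.encode(
                "http://www.wikidata.org/prop/direct/P31", StandardCharsets.UTF_8)
            + "&object=" + URLEncoder.encode(
                "http://www.wikidata.org/entity/Q5", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://ldf.example.org/dataset" + query))
            .header("Accept", "text/turtle") // a fragment is plain RDF plus paging metadata
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}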

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.
Smalyshev triaged this task as Medium priority. Sep 12 2016, 10:44 PM

Change 317114 had a related patch set uploaded (by Smalyshev):
[WIP] Implement LDF server for Blazegraph

https://gerrit.wikimedia.org/r/317114

Change 317114 merged by jenkins-bot:
Implement LDF server for Blazegraph

https://gerrit.wikimedia.org/r/317114

One thing about LDF that isn't really clear to me is who controls the available triple patterns. Our RDF mapping is quite complex, so we would need to support a large number of patterns to allow efficient queries via LDF.

Would it be up to us to decide which patterns to support? Or do clients just request any patterns they are interested in, and we generate and cache the response on the fly?

My concern is that we would need to cache a huge number of fragments to allow efficient queries.

I also don't quite see how fragments can be paged in the absence of a unique key. The spec does not say anything about this, as far as I can see.

Or do clients just request any patterns they are interested in, and we generate and cache the response on the fly

Yes. Just as with SPARQL queries, clients request whatever patterns they're interested in - only triple pattern queries are really basic.
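
A minimal sketch of what "basic" means here (a hypothetical class, not the server's actual API): a triple pattern is just three optional bindings, so a TPF server only ever answers eight shapes of query, as opposed to arbitrary SPARQL.

import java.util.Optional;

public class TriplePattern {
    // A triple pattern is just three optional bindings; a TPF server only
    // ever answers these eight shapes, unlike arbitrary SPARQL.
    final Optional<String> subject, predicate, object;

    TriplePattern(String s, String p, String o) {
        subject = Optional.ofNullable(s);
        predicate = Optional.ofNullable(p);
        object = Optional.ofNullable(o);
    }

    @Override
    public String toString() {
        return subject.orElse("?s") + " " + predicate.orElse("?p") + " " + object.orElse("?o");
    }

    public static void main(String[] args) {
        // "Everything that is an instance of (P31) human (Q5)":
        System.out.println(new TriplePattern(null, "wdt:P31", "wd:Q5"));
    }
}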

we would need to cache a huge number of fragments to allow efficient queries.

Triple pattern queries should be very fast since they go directly against indexes and aren't supposed to require any calculations. We'll see of course if any performance issues arise.

I also don't quite see how fragments can be paged in the absence of a unique key

I'm not sure how paging is implemented internally. But again, since triple patterns are pretty much reading the index, I don't think it should be too problematic.

Triple pattern queries should be very fast since they go directly against indexes and aren't supposed to require any calculations. We'll see of course if any performance issues arise.

The queries are trivial, but the result sets are potentially very very large. That's what worries me.

If I understand correctly, we'll be generating and caching each triple in Blazegraph multiple times - once for every triple pattern that matches it.

Adding Darian and Faidon, since this has potential ops impact due to large amounts of data to be cached and served.

The queries are trivial, but the result sets are potentially very very large.

True, but that's also true for SPARQL queries. People do million-item queries right now. With LDF they at least get proper paging and hopefully won't bring down the server while doing it.

If I understand correctly, we'll be generating and caching each triple in Blazegraph multiple times

Not sure what you mean by this. The query result will of course be generated anew for each different query - that's true for any query. It will also be cached by Varnish - also true for any query, and configurable. Since the query result is naturally paged, how much data is cached will depend on the query (just as in the SPARQL case) and on the client actually consuming the data. I imagine Varnish should be able to handle such cases, but if not, we can change the caching parameters to make things easier for Varnish.

Since the query result is naturally paged, how much data is cached will depend on the query (just as in the SPARQL case) and on the client actually consuming the data.

I think we have different understandings of how the paging works. It would be good to have clarity on this.

As far as I understand, for a pattern like ( ? P31 Q5 ), the *entire* result set would be generated before paging applies, and all pages of the set would be cached preemptively. That's a lot of data to generate and store, especially since there will be a lot of such P31 datasets.

Maybe I'm wrong, and the query is run with a LIMIT and ORDER, and a continuation based on the subject ID. That would of course be much better. Can you confirm that this is actually the case?

the *entire* result set would be generated before paging applies,

Looking at the code, I don't think this is what happens, at least not if I understand "generated" correctly.
See the implementation at:
https://github.com/wikimedia/wikidata-query-rdf/tree/master/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/ldf

Specifically BlazegraphBasedTPF and the Blazegraph iterators it uses. The code there is kind of complex, but it doesn't look like it does what you suggest, at least if I understand it right.

and all pages of the set would be cached preemptively

Not sure which cache you mean here.

and the query is run with a LIMIT and ORDER,

It's not a SPARQL query, so it's not run this way, but the iterator it uses does use limit and offset. I'm not sure exactly what happens, I can ask, but it doesn't look like it produces the whole data set in any sense I can think of.
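
For illustration, roughly what limit/offset iteration over an index scan looks like, with a lazy Java stream standing in for the actual Blazegraph iterator: only offset + limit entries are ever touched, so no full result set is materialized.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PagedScanSketch {
    // Slice one page out of a lazy scan; nothing past offset + limit is pulled.
    static List<String> page(Stream<String> scan, long offset, long limit) {
        return scan.skip(offset).limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for an index scan over matches of ( ?s P31 Q5 ).
        Stream<String> matches = Stream.iterate(1, i -> i + 1)
            .map(i -> "wd:Q" + i + " wdt:P31 wd:Q5 .");
        // Page 3 at 100 results per page: only 300 entries are ever read.
        page(matches, 200, 100).forEach(System.out::println);
    }
}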

@Smalyshev Thanks for checking! I'm satisfied as long as it doesn't naively dump the whole result somewhere and then chop it into pages.

Change 317282 had a related patch set uploaded (by Smalyshev):
Add configs for LDF server

https://gerrit.wikimedia.org/r/317282

Change 317282 merged by Gehel:
Add configs for LDF server

https://gerrit.wikimedia.org/r/317282

Awesome! I'm excited to see how this turns out!