Blazegraph has Triple Pattern Fragment (Linked Data Fragment) server implementation: https://github.com/hartig/BlazegraphBasedTPFServer
We should look into implementing it to support heavy queries that may produce a lot of results.
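For context, a Triple Pattern Fragments (TPF) server answers only single triple-pattern queries, returning matches one fixed-size page at a time. A minimal sketch of those semantics follows; the data, names, and page size are illustrative, not taken from Blazegraph or WDQS:

```python
# Minimal sketch of Triple Pattern Fragments semantics: the server answers
# only single triple-pattern queries and returns one fixed-size page of
# matches at a time. All data and names here are illustrative.
PAGE_SIZE = 2

TRIPLES = [
    ("Q42", "P31", "Q5"),
    ("Q64", "P31", "Q515"),
    ("Q80", "P31", "Q5"),
    ("Q42", "P69", "Q691283"),
]

def fragment(s=None, p=None, o=None, page=1):
    """Return one page of triples matching the pattern (None = wildcard)."""
    matches = [t for t in TRIPLES
               if (s is None or t[0] == s)
               and (p is None or t[1] == p)
               and (o is None or t[2] == o)]
    start = (page - 1) * PAGE_SIZE
    return {"total": len(matches),
            "page": matches[start:start + PAGE_SIZE]}

# All "instance of human" (P31 Q5) triples, first page:
print(fragment(p="P31", o="Q5", page=1))
# {'total': 2, 'page': [('Q42', 'P31', 'Q5'), ('Q80', 'P31', 'Q5')]}
```

Real TPF responses also carry the total match count and hypermedia controls for the next page, but the pattern-then-page shape is the core of it.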
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add configs for LDF server | operations/puppet | production | +5 -1
Implement LDF server for Blazegraph | wikidata/query/rdf | master | +881 -8
Status | Assigned | Task
---|---|---
Resolved | Smalyshev | T136358 Evaluate creating LDF server for WDQS
Declined | Smalyshev | T91602 Search templates for linked data
Change 317114 had a related patch set uploaded (by Smalyshev):
[WIP] Implement LDF server for Blazegraph
One thing about LDF that isn't really clear to me is who controls the available triple patterns. Our RDF mapping is quite complex, we would need to support a large number of patterns to allow efficient queries via LDF.
Would it be up to us to decide which patterns to support? Or do clients just request any patterns they are interested in, and we generate and cache the response on the fly?
My concern is that we would either need to cache a huge number of fragments, or generate responses on the fly for every request, to allow efficient queries.
I also don't quite see how fragments can be paged in the absence of a unique key. The spec does not say anything about this, as far as I can see.
> Or do clients just request any patterns they are interested in, and we generate and cache the response on the fly

Yes, just as with SPARQL queries; the only difference is that queries are restricted to triple patterns, which are really basic.
> we would either need to cache a huge number of fragments to allow efficient queries.
Triple pattern queries should be very fast since they go directly against indexes and aren't supposed to require any calculations. We'll see of course if any performance issues arise.
> I also don't quite see how fragments can be paged in the absence of a unique key
I'm not sure how paging is implemented internally. But again, since triple patterns are pretty much reading the index, I don't think it should be too problematic.
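The "reading the index" point can be sketched concretely. Triple stores typically keep several sorted indexes (e.g. SPO, POS, OSP), so a pattern with bound predicate and object is a contiguous key range in a POS-ordered index, and a page is just an offset/limit slice of that range. This is a hedged sketch of the general technique, not Blazegraph's actual code:

```python
# Sketch of why triple-pattern paging can read straight from an index:
# a pattern with bound predicate and object is a contiguous key range in
# a POS-sorted index, so a page is a slice of that range. Data is illustrative.
import bisect

# POS index: triples ordered by (predicate, object, subject).
POS = sorted([
    ("P31", "Q5", "Q42"),
    ("P31", "Q5", "Q80"),
    ("P31", "Q515", "Q64"),
    ("P69", "Q691283", "Q42"),
])

def page_po(p, o, offset, limit):
    """One page of triples matching (?, p, o), read from the POS index."""
    lo = bisect.bisect_left(POS, (p, o, ""))         # start of the (p, o) range
    hi = bisect.bisect_right(POS, (p, o, "\uffff"))  # end of the (p, o) range
    return POS[lo:hi][offset:offset + limit]

print(page_po("P31", "Q5", 0, 1))  # [('P31', 'Q5', 'Q42')]
print(page_po("P31", "Q5", 1, 1))  # [('P31', 'Q5', 'Q80')]
```

No intermediate result set is materialized; each page is located by binary search and sliced directly out of the sorted index.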
The queries are trivial, but the result sets are potentially very very large. That's what worries me.
If I understand correctly, we'll be generating and caching each triple in Blazegraph multiple times - once for every triple pattern that matches it.
Adding Darian and Faidon, since this has potential ops impact due to large amounts of data to be cached and served.
> The queries are trivial, but the result sets are potentially very very large.

True, but that is also true for SPARQL queries. People do million-item queries right now. With LDF they at least get proper paging and, hopefully, won't bring down the server while doing it.
> If I understand correctly, we'll be generating and caching each triple in blazegraph multiple times

Not sure what you mean by this. The query result will of course be generated anew for each different query - this is true for any query. It also would be cached by Varnish - this is also true for any query, and can be configured. Since the query result is naturally paged, how much data is cached will depend on the query (just as in the SPARQL case) and on how much of it the client actually consumes. I imagine Varnish should be able to handle such cases, but if not, we can change the caching parameters to make it easier for Varnish.
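To illustrate the Varnish point: each fragment page has its own stable URL (pattern plus page number), so an HTTP cache can store pages independently, and a TTL bounds how long stale pages are kept. The URL shape and TTL below are illustrative assumptions, not WDQS's actual interface:

```python
# Sketch: each fragment page gets its own stable URL, so an HTTP cache
# such as Varnish can cache pages independently. The URL shape, parameter
# names, and TTL here are illustrative assumptions.
from urllib.parse import urlencode

def fragment_url(base, s=None, p=None, o=None, page=1):
    """Build a cacheable per-page fragment URL (hypothetical scheme)."""
    params = {k: v for k, v in (("subject", s), ("predicate", p),
                                ("object", o), ("page", page)) if v}
    return base + "?" + urlencode(params)

url = fragment_url("https://query.example.org/ldf", p="P31", o="Q5", page=3)
print(url)  # https://query.example.org/ldf?predicate=P31&object=Q5&page=3

# A response header like this would let the cache expire pages on its own:
headers = {"Cache-Control": "public, max-age=3600"}  # illustrative TTL
```

Only pages that clients actually request end up in the cache, which is why the amount of cached data scales with consumption rather than with result-set size.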
I think we have different understandings of how the paging works. It would be good to have clarity on this.
As far as I understand, for a pattern like ( ? P31 Q5 ), the *entire* result set would be generated before paging applies, and all pages of the set would be cached preemptively. That's a lot of data to generate and store, especially since there will be a lot of such P31 datasets.
Maybe I'm wrong, and the query is run with a LIMIT and ORDER, and a continuation based on the subject ID. That would of course be much better. Can you confirm that this is actually the case?
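The two paging strategies being contrasted here can be sketched side by side: offset-based paging slices a sorted result at a numeric position, while keyset ("continuation") paging resumes strictly after the last subject ID already returned. Data and names are illustrative:

```python
# Sketch contrasting offset paging with keyset ("continuation") paging over
# a sorted index, as discussed above. Data and names are illustrative.
import bisect

SUBJECTS = ["Q1", "Q42", "Q64", "Q80", "Q90"]  # matches, sorted by subject ID

def page_by_offset(offset, limit):
    """Offset paging: slice the sorted matches at a numeric position."""
    return SUBJECTS[offset:offset + limit]

def page_after(last_seen, limit):
    """Keyset paging: resume strictly after the last subject already seen."""
    start = bisect.bisect_right(SUBJECTS, last_seen) if last_seen else 0
    return SUBJECTS[start:start + limit]

print(page_by_offset(2, 2))  # ['Q64', 'Q80']
print(page_after("Q42", 2))  # ['Q64', 'Q80']
```

Both return the same page here, but keyset paging stays correct and cheap when the data changes between requests or the offset grows large, which is why it is the usual choice for a continuation token.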
> the *entire* result set would be generated before paging applies,

Looking at the code, I don't think that is what happens, at least not if I understand "generated" correctly.
See the implementation at:
https://github.com/wikimedia/wikidata-query-rdf/tree/master/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/ldf
Specifically BlazegraphBasedTPF and the Blazegraph iterators it uses. The code there is rather complex, but it doesn't look like it does what you suggest, at least if I understand it right.
> and all pages of the set would be cached preemptively

I'm not sure which cache you mean here.
> and the query is run with a LIMIT and ORDER,

It's not a SPARQL query, so it's not run that way, but the iterator it uses does use limit and offset. I'm not sure exactly what happens internally (I can ask), but it doesn't look like it produces the whole data set in any sense I can think of.
@Smalyshev Thanks for checking! I'm satisfied as long as it doesn't naively dump the whole result somewhere to then chop it into pages.
Change 317282 had a related patch set uploaded (by Smalyshev):
Add configs for LDF server