
Evaluate creating LDF server for WDQS
Closed, Resolved · Public

Description

Blazegraph has a Triple Pattern Fragments (Linked Data Fragments) server implementation: https://github.com/hartig/BlazegraphBasedTPFServer

We should look into implementing it to support heavy queries that may produce a lot of results.
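
For a sense of the interface: a TPF server exposes a single HTTP endpoint, a client binds any subset of subject/predicate/object per request, and the server answers with one page of matching triples as plain RDF. Below is a minimal sketch of such a request; the endpoint URL is a placeholder, and the parameter names are the typical ones a TPF server advertises via its hypermedia controls, not necessarily what our deployment would use.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TpfRequestSketch {
    public static void main(String[] args) throws Exception {
        // Ask for the pattern ( ?s wdt:P31 wd:Q5 ): subject unbound,
        // predicate and object bound. The endpoint is a placeholder.
        String query = "?predicate=" + URLEncoder.encode(
                "http://www.wikidata.org/prop/direct/P31", StandardCharsets.UTF_8)
            + "&object=" + URLEncoder.encode(
                "http://www.wikidata.org/entity/Q5", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://ldf.example.org/dataset" + query))
            .header("Accept", "text/turtle") // a fragment is plain RDF plus paging metadata
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}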

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.
Smalyshev triaged this task as Medium priority. Sep 12 2016, 10:44 PM

Change 317114 had a related patch set uploaded (by Smalyshev):
[WIP] Implement LDF server for Blazegraph

https://gerrit.wikimedia.org/r/317114

Change 317114 merged by jenkins-bot:
Implement LDF server for Blazegraph

https://gerrit.wikimedia.org/r/317114

One thing about LDF that isn't really clear to me is who controls the available triple patterns. Our RDF mapping is quite complex, so we would need to support a large number of patterns to allow efficient queries via LDF.

Would it be up to us to decide which patterns to support? Or do clients just request any patterns they are interested in, and we generate and cache the response on the fly?

My concern is that we would need to cache a huge number of fragments to allow efficient queries.

I also don't quite see how fragments can be paged in the absence of a unique key. The spec does not say anything about this, as far as I can see.

Or do clients just request any patterns they are interested in, and we generate and cache the response on the fly

Yes. Just as with SPARQL queries, clients request whatever patterns they're interested in - only triple pattern queries are really basic.
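
A minimal sketch of what "basic" means here (a hypothetical class, not the server's actual API): a triple pattern is just three optional bindings, so a TPF server only ever answers eight shapes of query, as opposed to arbitrary SPARQL.

import java.util.Optional;

public class TriplePattern {
    // A triple pattern is just three optional bindings; a TPF server only
    // ever answers these eight shapes, unlike arbitrary SPARQL.
    final Optional<String> subject, predicate, object;

    TriplePattern(String s, String p, String o) {
        subject = Optional.ofNullable(s);
        predicate = Optional.ofNullable(p);
        object = Optional.ofNullable(o);
    }

    @Override
    public String toString() {
        return subject.orElse("?s") + " " + predicate.orElse("?p") + " " + object.orElse("?o");
    }

    public static void main(String[] args) {
        // "Everything that is an instance of (P31) human (Q5)":
        System.out.println(new TriplePattern(null, "wdt:P31", "wd:Q5"));
    }
}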

we would need to cache a huge number of fragments to allow efficient queries.

Triple pattern queries should be very fast since they go directly against indexes and aren't supposed to require any calculations. We'll see of course if any performance issues arise.

I also don't quite see how fragments can be paged in the absence of a unique key

I'm not sure how paging is implemented internally. But again, since triple patterns are pretty much reading the index, I don't think it should be too problematic.

Triple pattern queries should be very fast since they go directly against indexes and aren't supposed to require any calculations. We'll see of course if any performance issues arise.

The queries are trivial, but the result sets are potentially very very large. That's what worries me.

If I understand correctly, we'll be generating and caching each triple in Blazegraph multiple times - once for every triple pattern that matches it.

Adding Darian and Faidon, since this has potential ops impact due to large amounts of data to be cached and served.

The queries are trivial, but the result sets are potentially very very large.

True, but that's also true for SPARQL queries. People do million-item queries right now. With LDF they at least get proper paging and hopefully won't bring down the server while doing it.

If I understand correctly, we'll be generating and caching each triple in Blazegraph multiple times

Not sure what you mean by this. The query result will of course be generated anew for each different query - that's true for any query. It will also be cached by Varnish - also true for any query, and configurable. Since the query result is naturally paged, how much data is cached will depend on the query (just as in the SPARQL case) and on the client actually consuming the data. I imagine Varnish should be able to handle such cases, but if not, we can change the caching parameters to make things easier for Varnish.

Since the query result is naturally paged, how much data is cached will depend on the query (just as in the SPARQL case) and on the client actually consuming the data.

I think we have different understandings of how the paging works. It would be good to have clarity on this.

As far as I understand, for a pattern like ( ? P31 Q5 ), the *entire* result set would be generated before paging applies, and all pages of the set would be cached preemptively. That's a lot of data to generate and store, especially since there will be a lot of such P31 datasets.

Maybe I'm wrong, and the query is run with a LIMIT and ORDER, and a continuation based on the subject ID. That would of course be much better. Can you confirm that this is actually the case?

the *entire* result set would be generated before paging applies,

Looking at the code, I don't think this is what happens, at least not if I understand "generated" correctly.
See the implementation at:
https://github.com/wikimedia/wikidata-query-rdf/tree/master/blazegraph/src/main/java/org/wikidata/query/rdf/blazegraph/ldf

Specifically BlazegraphBasedTPF and the Blazegraph iterators it uses. The code there is kind of complex, but it doesn't look like it does what you suggest, at least if I understand it right.

and all pages of the set would be cached preemptively

Not sure which cache you mean here.

and the query is run with a LIMIT and ORDER,

It's not a SPARQL query, so it's not run this way, but the iterator it uses does use limit and offset. I'm not sure exactly what happens, I can ask, but it doesn't look like it produces the whole data set in any sense I can think of.
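
For illustration, roughly what limit/offset iteration over an index scan looks like, with a lazy Java stream standing in for the actual Blazegraph iterator: only offset + limit entries are ever touched, so no full result set is materialized.

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PagedScanSketch {
    // Slice one page out of a lazy scan; nothing past offset + limit is pulled.
    static List<String> page(Stream<String> scan, long offset, long limit) {
        return scan.skip(offset).limit(limit).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for an index scan over matches of ( ?s P31 Q5 ).
        Stream<String> matches = Stream.iterate(1, i -> i + 1)
            .map(i -> "wd:Q" + i + " wdt:P31 wd:Q5 .");
        // Page 3 at 100 results per page: only 300 entries are ever read.
        page(matches, 200, 100).forEach(System.out::println);
    }
}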

@Smalyshev Thanks for checking! I'm satisfied as long as it doesn't naively dump the whole result somewhere and then chop it into pages.

Change 317282 had a related patch set uploaded (by Smalyshev):
Add configs for LDF server

https://gerrit.wikimedia.org/r/317282

Change 317282 merged by Gehel:
Add configs for LDF server

https://gerrit.wikimedia.org/r/317282

Awesome! I'm excited to see how this turns out!