
Investigate BlazeGraph aka BigData for WDQ
Closed, Resolved · Public

Description

Blazegraph (née Bigdata) looks to implement an old version of TinkerPop. Maybe we can use it?

Event Timeline

Manybubbles raised the priority of this task from to Needs Triage.
Manybubbles updated the task description. (Show Details)
Manybubbles moved this task to Backlog on the MediaWiki-Core-Team board.
Manybubbles subscribed.
Manybubbles set Security to None.
Manybubbles renamed this task from Investigate BigData to Investigate BigData for WDQ. Feb 5 2015, 7:28 PM

Thanks to Nik for chatting today. Here are a few of the key items we discussed. The slides are also attached.

I will email you a much more detailed technical overview presentation from the Semantic Technologies conference in August 2014, updated for the 1.4 release made this fall. It includes HA, GPU, Blueprints, RDF Gather-Apply-Scatter (GAS), etc. Unfortunately, it is more than 10MB...

HA Overview: http://www.bigdata.com/whitepapers/HA-Replication-Cluster.pdf
HA Architecture: http://wiki.bigdata.com/wiki/index.php/HAJournalServer
See also.

The use of Reification Done Right (RDR) as an extension for arbitrary statements about properties: http://wiki.bigdata.com/wiki/index.php/Reification_Done_Right.
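
For reference, RDR lets a query address a statement itself and attach or read arbitrary properties on it. A minimal sketch using the << >> syntax from that page (the ex: names are placeholders, not a real vocabulary):

# A statement-about-a-statement: find sources attached to a particular claim.
PREFIX ex: <http://example.org/>

SELECT ?source
WHERE {
  << ex:someItem ex:someProperty ex:someValue >> ex:statedIn ?source .
}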

> Thanks to Nik for chatting today. Here are a few of the key items we discussed. The slides are also attached.

Thanks for all of your time!

> I will email you a much more detailed technical overview presentation from the Semantic Technologies conference in August 2014, updated for the 1.4 release made this fall. It includes HA, GPU, Blueprints, RDF Gather-Apply-Scatter (GAS), etc. Unfortunately, it is more than 10MB...

Ah!

> HA Architecture: http://wiki.bigdata.com/wiki/index.php/HAJournalServer
> See also.

Do you have a link handy for the sharding solution or is that also in HAJournalServer? I'm not sure we'll need it right away but it'll be good to know it's there.

> The use of Reification Done Right (RDR) as an extension for arbitrary statements about properties: http://wiki.bigdata.com/wiki/index.php/Reification_Done_Right.

For those playing along at home: this extension to RDF should allow us to fully represent Wikidata in RDF (well, with the extension) and it'd support SPARQL. I'm not really a fan of SPARQL as a language _but_ it has the advantage, at least in the case of BigData, of having a nice optimizer and exposing several extension points. Those extension points would allow us to make a language-dependent property just like we did with Gremlin (the label one) and would allow us to rewrite properties like country to perform the appropriate traversals. Theoretically, at least.
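
To make the rewrite idea concrete, here is a purely hypothetical sketch (not anything BigData ships): a user writes a friendly pattern such as ?city ex:inCountry wd:Q183, and an extension point expands it into a traversal over the underlying properties. The wd: prefix and the P131/P17 IDs are assumptions for illustration.

# Hypothetical expansion produced by a query-rewrite extension point (sketch only).
# Assumed IDs: P131 = "located in the administrative territorial entity", P17 = "country".
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?city
WHERE {
  ?city wd:P131* ?adminUnit .   # walk the administrative hierarchy upward
  ?adminUnit wd:P17 wd:Q183 .   # ...to something whose country is Q183
}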

Here is the wikidata RDF demo that Peter Haase shared.

Here again is the link to the showcase based on the workbench and bigdata:
http://grapphs.com:8087/
login: guest/guest

As I said, this was something put together rather quickly, but we are planning to extend it (also based on the discussion we had yesterday).
Then we can make the system openly accessible.
Your input regarding further use cases / what you would like to see and do would be much appreciated.

Per Nik's email. Here is the information on the scale out architecture.

http://wiki.bigdata.com/wiki/index.php/ClusterGuide

See "Scale-out Cluster" on http://wiki.bigdata.com/

@Beebs.systap this looks pretty good. How is it done - i.e. what is used to create the triples, how they are imported, etc.? Is this code available?

Also, I assume we'd eventually want to support qualifiers/references, i.e. queries like "countries listed by population, from largest to smallest", taking into account that the US has a number of population figures and we'd have to take the latest. Or "female mayors" - we may have to account for the fact that some mayorships could be in the past. Berlin (https://www.wikidata.org/wiki/Q64) has had a lot of mayors, but only one (Michael Müller) is the current mayor, so we'd have to be able to support that. Fortunately, this particular mayor is also marked with a "preferred" flag - which we may need to support too - but not all data has preferred flags, so we may need to rely on time qualifiers. The next step would be the same at any point in time (i.e. "female mayors in the 20th century").

For references, the query may be "give me all data about Douglas Adams (Q42) that come from Encyclopædia Britannica Online (Q5375741)".
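
As a rough sketch, that reference query might take this shape against the statement-node style of the Wikidata RDF exports. The entity: prefix and the prov:wasDerivedFrom linkage are assumptions here, so treat this as the shape of the join rather than a working query.

# Sketch: claims of Q42 whose references mention Britannica Online (Q5375741).
PREFIX entity: <http://www.wikidata.org/entity/>
PREFIX prov:   <http://www.w3.org/ns/prov#>

SELECT ?statementProperty ?value
WHERE {
  entity:Q42 ?statementProperty ?statement .       # claim -> statement node
  ?statement ?valueProperty ?value .               # statement node -> value
  ?statement prov:wasDerivedFrom ?reference .      # statement node -> reference
  ?reference ?referenceProperty entity:Q5375741 .  # reference mentions Britannica Online
}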

Hey! Does BigData support GeoSpatial queries? I see https://github.com/varunshaji/bigdata-geosparql but I'm not sure how well supported it is.

> Hey! Does BigData support GeoSpatial queries? I see https://github.com/varunshaji/bigdata-geosparql but I'm not sure how well supported it is.

We have users in the USG with this use case. Check out: http://www.blazegraph.com/whitepapers/bigdata_geospatial.pdf

> Hey! Does BigData support GeoSpatial queries? I see https://github.com/varunshaji/bigdata-geosparql but I'm not sure how well supported it is.
>
> We have users in the USG with this use case. Check out: http://www.blazegraph.com/whitepapers/bigdata_geospatial.pdf

Cool! Is that the linked github project?

Actually, that GitHub project was the first we'd heard of it! That was for another project.

> @Beebs.systap this looks pretty good. How is it done - i.e. what is used to create the triples, how they are imported, etc.? Is this code available?
>
> Also, I assume we'd eventually want to support qualifiers/references, i.e. queries like "countries listed by population, from largest to smallest", taking into account that the US has a number of population figures and we'd have to take the latest. Or "female mayors" - we may have to account for the fact that some mayorships could be in the past. Berlin (https://www.wikidata.org/wiki/Q64) has had a lot of mayors, but only one (Michael Müller) is the current mayor, so we'd have to be able to support that. Fortunately, this particular mayor is also marked with a "preferred" flag - which we may need to support too - but not all data has preferred flags, so we may need to rely on time qualifiers. The next step would be the same at any point in time (i.e. "female mayors in the 20th century").
>
> For references, the query may be "give me all data about Douglas Adams (Q42) that come from Encyclopædia Britannica Online (Q5375741)".

Peter Haase has been working on it. Here are his comments.

"For this showcase, the simplified RDF dump from http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/ (i.e. the dump without statement qualifiers) has been loaded into bigdata,
Bigdata is also able to handle the qualifiers available in the complete dump, either based on the original reification model used in the Wikidata RDF dump as described here http://meta.wikimedia.org/wiki/Wikidata/Development/RDF#Statements_with_qualifiers, or based on Bigdata’s own native reification model RDR as described here: http://www.bigdata.com/rdr
We can extend the showcase in this regard."

He is still working on the reified version using our RDR support.

So I had a look at http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/wikidata-statements.nt.gz and I actually understand it now. I'm starting a write-up of my RDF learning notes. At first glance I really didn't like SPARQL but I'm coming around to it.

Anyway, it looks like that wikidata-statements file implements the RDF paper. In fact I think it's made by the same folks.

So, getting from what is in wikidata-statements to the kind of simple queries you'd want to use in most cases seems like it's really the problem with using RDF - and making those simple queries efficient. I want to be able to write:

SELECT ?politician ?spouse
WHERE {
  ?politician wd:P106 wd:Q82955 ;
              wd:P26 ?spouse ;
              wd:P509 wd:Q356405 .
}

and get all the politicians who died of bloodletting and all of their spouses. wikidata-statements goes through many more steps and wikidata-simplified-statements doesn't typically have data about spouses because spouses are qualified with start and end times.

We can obviously hack on the code that makes wikidata-statements. We could use RDR in some or all of these cases. We could use something like singleton properties instead, even.
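
As a sketch of the RDR route for the qualified spouse case, using the << >> syntax from the RDR wiki page (the wd: prefix and the P580/P582 start/end-time qualifiers are assumptions carried over from the example above):

# The spouse statement is directly addressable, so its qualifiers can be read off it.
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?politician ?spouse ?start
WHERE {
  ?politician wd:P106 wd:Q82955 .
  << ?politician wd:P26 ?spouse >> wd:P580 ?start .
  OPTIONAL { << ?politician wd:P26 ?spouse >> wd:P582 ?end . }
}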

Is there a way to do traversals and aggregation in SPARQL? E.g., Cypher examples:

This is a list of all professions. Note the '*' there:

MATCH (v:item {wikibaseId: 'Q28640'})<-[:claim|P279|P31*]-(v2:item) RETURN v2.wikibaseId, v2.labelEn;

This is a list of countries by latest population data:

MATCH (v:item)-[:claim]->(c:claim:P31 {value: "Q6256"}) 
	MATCH (v)-[:claim]->(c2:claim:P1082) WHERE has(c2.value) 
	WITH v as v, max(c2.P585q) as latest
	MATCH (v)-[:claim]->(cv:claim:P1082)
	WHERE cv.P585q = latest
	RETURN v.wikibaseId, v.labelEn, cv.value, cv.P585q
	ORDER BY cv.value DESC

I wonder what these would look like in SPARQL.

> Is there a way to do traversals and aggregation in SPARQL? E.g., Cypher examples: [...]

From Blazegraph/Bigdata developer Mike Personick:

Traversals = Property Paths

http://www.w3.org/TR/sparql11-query/#propertypaths

Lets you do * (zero or more traversals) and + (one or more).

Also we have GAS-based BFS.

Aggregation: yes, there's lots of support for aggregation in SPARQL 1.1:

http://www.w3.org/TR/sparql11-query/#aggregates

You can do all sorts of things - MIN, MAX, AVG, COUNT, ...
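
For comparison, rough SPARQL analogues of the two Cypher queries above. The wd: prefix and the flat triple layout are assumptions for illustration; against the real exports you would need statement nodes or RDR to pick the population value with the latest P585 qualifier, so the aggregation below only shows the GROUP BY/MAX machinery.

# All professions: everything reachable from Q28640 over P31/P279 edges (the '*').
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?profession
WHERE {
  ?profession (wd:P31|wd:P279)* wd:Q28640 .
}

# Aggregation: largest population figure (P1082) recorded per country (P31 = Q6256).
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?country (MAX(?population) AS ?maxPopulation)
WHERE {
  ?country wd:P31   wd:Q6256 ;
           wd:P1082 ?population .
}
GROUP BY ?country
ORDER BY DESC(?maxPopulation)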

Thanks! We found the answer about an hour ago and didn't update the ticket in time. Thanks for the reference pages!

It's also worth checking out the RDF GAS API that Mike references. You can execute graph analytics within the SPARQL queries. It's bundled with BFS, SSP, Connected Components, and Page Rank, but they can also be extended.

http://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API
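
As an illustration of the GAS service shape (adapted from that wiki page; the start-vertex IRI is a placeholder and the exact predicate names should be checked against the page):

# BFS from a start vertex via the GAS service, binding visited vertices and their depth.
PREFIX gas: <http://www.bigdata.com/rdf/gas#>

SELECT ?out ?depth
WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
    gas:program gas:in <http://example.org/startVertex> .
    gas:program gas:out ?out .      # visited vertex
    gas:program gas:out1 ?depth .   # BFS depth of that vertex
  }
}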

Is there a standard way to load a huge amount of RDF data into Bigdata? I tried the following (with a 3GB gzipped .nt file), but it very quickly blew the heap:

Repository repo = BigdataSailFactory.connect("localhost", 9999);
RepositoryConnection con = repo.getConnection();
File file = new File("/home/james/dumps/wikidata-statements.nt.gz");
FileInputStream fileInputStream = new FileInputStream(file);
GZIPInputStream gzipInputStream = new GZIPInputStream(fileInputStream);
con.add(gzipInputStream, null, RDFFormat.N3);

EDIT: I cranked up the heap, and ran into the max array length limitation:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.util.Arrays.copyOf(Arrays.java:2271)
	at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
	at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
	at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
	at info.aduna.io.IOUtil.transfer(IOUtil.java:494)
	at info.aduna.io.IOUtil.readBytes(IOUtil.java:210)
	at com.bigdata.rdf.sail.webapp.client.RemoteRepository$AddOp.prepareForWire(RemoteRepository.java:1492)
	at com.bigdata.rdf.sail.webapp.client.RemoteRepository$AddOp.access$000(RemoteRepository.java:1436)
	at com.bigdata.rdf.sail.webapp.client.RemoteRepository.add(RemoteRepository.java:890)
	at com.bigdata.rdf.sail.remote.BigdataSailRemoteRepositoryConnection.add(BigdataSailRemoteRepositoryConnection.java:663)
	at com.bigdata.rdf.sail.remote.BigdataSailRemoteRepositoryConnection.add(BigdataSailRemoteRepositoryConnection.java:648)
	at example.bigdata_client.App.update(App.java:33)
	at example.bigdata_client.App.main(App.java:23)

These errors are all client-side -- the servlet appears to be humming along. Is there a preferred streaming API to use?

Looks like we're going to have trouble with some dates too. xsd:dateTime supports 13798 million years BCE, but I think BigData will have trouble with it, what with this comment from DateTimeExtension:

/**
 * This implementation of {@link IExtension} implements inlining for literals
 * that represent xsd:dateTime literals.  These literals will be stored as time 
 * in milliseconds since the epoch.  The milliseconds are encoded as an inline 
 * long.
 */

Not that I've had a chance to test it yet. It'd be one of the first things imported during a full import of the statements. I see in the RDF dump on labs it's actually an xsd:gYear type though.

> Is there a standard way to load a huge amount of RDF data into Bigdata? I tried the following (with a 3GB gzipped .nt file), but it very quickly blew the heap: [...]
>
> These errors are all client-side -- the servlet appears to be humming along. Is there a preferred streaming API to use?

The way to do it is to use SPARQL LOAD and provide the URI for the RDF data file.

For the REST API with URIs see http://wiki.blazegraph.com/wiki/index.php/NanoSparqlServer#INSERT_RDF_.28POST_with_URLs.29.

Through the workbench or via SPARQL you could do something like:

LOAD <file:///home/james/dumps/wikidata-statements.nt.gz>

@Beebs.systap, is this still true (from old blog post):
For example, this can happen if your query has a large result set and uses ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total materialization of the query result set even if you use OFFSET/LIMIT.

It'd be wonderful if the optimizer had the option of walking an index so that materialization isn't required - assuming that is actually more efficient. Is there a way to limit the number of results that are materialized before any actual order/limit/offset operation? In our case we'd probably want to just tell folks that their query isn't selective enough to allow order/limit/offset rather than keep working on a very slow query.

@Beebs.systap, I can't log in to your wiki with Google. It says "OpenID auth request contains an unregistered domain: http://wiki.blazegraph.com/wiki". I imagine that has something to do with the new domain name.

> @Beebs.systap, I can't log in to your wiki with Google. It says "OpenID auth request contains an unregistered domain: http://wiki.blazegraph.com/wiki". I imagine that has something to do with the new domain name.

Yes, working on that now. The other one is still up: wiki.bigdata.com. Try that for now.

> Looks like we're going to have trouble with some dates too. xsd:dateTime supports 13798 million years BCE, but I think BigData will have trouble with it, what with this comment from DateTimeExtension: [...]

From Bryan Thompson.

The native xsd:dateTime support is based on an int64 value. It does support negative int64 values, which is what a date before the epoch is translated into. When using the xsd:dateTime inlining, what happens is that dates are not entered into the dictionary. They appear as inline values within the statement indices instead. This avoids a dictionary lookup for date materialization. It also lets us use the OSP index for key-range scans on xsd:dateTime values.

If you need to go beyond an int64 value, then the graph database can also inline xsd:integer values (BigInteger). This would allow general cosmology dates.

One limitation of this approach (which is completely optional and which can be disabled using AbstractTripleStore.Options.INLINE_DATE_TIMES) is that the dateTime is converted to a point value - the timezone information needs to be normalized. This would also be true of any custom inlining scheme developed for xsd:integer rather than xsd:long. The general problem is that xsd:dateTime specifies two dimensions (a point in time and a timezone) and allows the timezone to be optional. Oops. There is no way to translate that into a single point, which is what you need to be able to compare values in an index, do key-range scans, etc.

You have some options. You can disable dateTime inlining. This will preserve all information. You can store both the non-inline version (with timezone information) and the inline version (by using different predicates). This would preserve the opportunity for key range scans on date while also preserving timezone information. And if necessary you could use an alternative inlining scheme for dates in the extreme past or future.

Bryan
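
A minimal sketch of the "store both versions under different predicates" option Bryan mentions - the ex: predicate names are invented for illustration, not part of BigData:

# Hypothetical modeling sketch: keep an inlined, key-range-scannable value under one
# predicate and the full lexical form (with its timezone) under another.
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

INSERT DATA {
  ex:event1 ex:dateInline  "2015-02-05T19:28:00Z"^^xsd:dateTime ;
            ex:dateLexical "2015-02-05T14:28:00-05:00"^^xsd:dateTime .
}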

@Beebs.systap thanks! It's working with:

LOAD <file:///home/james/dumps/wikidata-statements.nt.gz>

> From Bryan Thompson.
>
> The native xsd:dateTime support is based on an int64 value. It does support negative int64 values, which is what a date before the epoch is translated into. [...]

Thanks! I imagine we'll think of something. Our data types actually have a bunch of data: https://www.wikidata.org/wiki/Special:ListDatatypes
Our dateTimes actually have precision and, optionally, error bounds. Quantities are pretty similar.

It might make sense for us to develop something to allow those to be inlined. Right now the Wikidata Toolkit dumps them as several triples (one for value, one for precision, one for positive error and one for negative error).
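
For flavor, the "several triples per value" shape looks roughly like this - the ex: names below are placeholders, not the Wikidata Toolkit's actual vocabulary:

# Hypothetical illustration of a time value with precision and error bounds.
PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

INSERT DATA {
  ex:statement42 ex:timeValue ex:value42 .
  ex:value42 ex:time             "1952-03-11T00:00:00Z"^^xsd:dateTime ;
             ex:timePrecision    11 ;   # e.g. day-level precision
             ex:uncertaintyPlus  0 ;
             ex:uncertaintyMinus 0 .
}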

Any chance of a link to how to use geo stuff? http://www.blazegraph.com/whitepapers/bigdata_geospatial.pdf is nice but I'd be more interested in something that goes into the kind of depth your wiki goes into on other subjects. If you don't have an article (I couldn't find one by searching) then I'm happy to read code if you want to point me to a file.

Also, I can't edit the wiki. I found a spelling mistake and was trying to help!

> @Beebs.systap, is this still true (from old blog post):
> For example, this can happen if your query has a large result set and uses ORDER BY or DISTINCT. Both ORDER BY and DISTINCT force total materialization of the query result set even if you use OFFSET/LIMIT.
>
> It'd be wonderful if the optimizer had the option of walking an index so that materialization isn't required - assuming that is actually more efficient. Is there a way to limit the number of results that are materialized before any actual order/limit/offset operation? In our case we'd probably want to just tell folks that their query isn't selective enough to allow order/limit/offset rather than keep working on a very slow query.

From Bryan T.

DISTINCT does NOT force total materialization. That is an error. DISTINCT is a streaming operator.

ORDER BY generally does force total materialization since the order needs to be respected. You have a few options here.

  1. We have (and can) optimize things when the ORDER BY corresponds to a natural order of an index. For example, ORDER BY ?date when ?date is an xsd:dateTime and is inlined into the OSP index. General approaches to rewrites of such queries are certainly possible, but not all queries are amenable to such a rewrite.
  2. If you do not need total order, then you can push the SELECT down into a sub-SELECT and put a LIMIT on that sub-SELECT. You can then put an ORDER BY in the outer SELECT. This approach lets you sort the first N solutions. For example, you might limit yourselves to sorting the first 1000 solutions. However, there might be many more than N solutions, in which case this begins to look like a random sampling.
  3. You can simply not use ORDER BY when you want to avoid total materialization. ORDER BY really requests materialization. If you don't want it, don't ask for it.

In general, the query plan generator tries to produce query plans that are left-to-right and that do not require any intermediate result sets to be materialized in hash joins. If you have queries that are simple conjunctive join plans, then you get your solutions in some unspecified (and unstable) order and the time to the first solution is very fast.

If you are going to run queries with really large intermediate result sets, then you should use the analytic query mode (just check the box under advanced options or specify the URL Query parameter - see the NanoSparqlServer wiki page). This will put the hash indices for any solution set joins onto the native C process heap rather than the managed java object heap. This is supported for nearly all operators - ORDER BY is actually the exception. However, a native heap ORDER BY could be implemented. The trick is to partition the value space of the ORDER BY based on the different data types and their ordering semantics. This can then be turned into an external memory sort for each data type and a merge sort across all of those external memory sorted subsets.
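
A sketch of option 2, reusing the hypothetical politician/spouse pattern from earlier in this thread (prefix and property IDs are still assumptions): only the first 1000 solutions are materialized and sorted.

# Sub-SELECT with LIMIT inside, ORDER BY outside: sorts at most the first N solutions.
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?politician ?spouse
WHERE {
  {
    SELECT ?politician ?spouse
    WHERE {
      ?politician wd:P106 wd:Q82955 ;
                  wd:P26  ?spouse .
    }
    LIMIT 1000
  }
}
ORDER BY ?politician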

Update from SYSTAP/Metaphacts.

We are reloading the wikidata into the demo application shown earlier to include provenance data. This should be ready in a day or so.

> @Beebs.systap, is this still true (from old blog post): [...]
>
> DISTINCT does NOT force total materialization. That is an error. DISTINCT is a streaming operator. ORDER BY generally does force total materialization since the order needs to be respected. [...]

Thanks again! Wonderful. We're evaluating exposing some kind of query directly over a public API. For a while it looked like SPARQL was safe enough that we could simply slap a timeout on it and pass it through. In this case, at least, it looks like we'd have to do some munging (pushing stuff to a subquery with a limit, probably). Looks like we could do that munging on the server side with an ASTOptimizer.

Thanks for http://wiki.bigdata.com/wiki/index.php/QueryEvaluation, btw. It was a good read.

Speaking of dates, I see this from the server:

[java] ERROR: LexiconConfiguration.java:618: "1895-02-29" is not a valid representation of an XML Gregorian Calendar value.: value=1895-02-29

Does this smell of a mischaracterization in the RDF I'm importing?

Also, thanks for documenting your code so well. It's really a nice read.

Can SPARQL take gzipped files? I tried giving a gzipped URL in the bigdata workbench and it didn't work.

@Jdouglas wikidata has lots of broken dates. February 29 is not the worst - at least it could have happened. We also have ones with February 31.

@Beebs.systap, can the query planner use two indexes at once to identify vertexes if it decides that that would be more efficient than using a single one? Say I have
SELECT ?x
WHERE {
  ?x a ex:event .
  ?x ex:startTime ?startTime .
  ?x ex:endTime ?endTime .
  FILTER (xsd:dateTime("1999-12-31T00:00:00Z") < ?startTime && ?endTime < xsd:dateTime("2000-01-02T00:00:00Z"))
}

Does the concept of vertex-centric indexes even apply to RDF? I suppose it would if the triple store kept some kind of adjacency list of properties and edges, but I don't _think_ that is a thing with BigData, right?

Another question for @Beebs.systap: is there a way to iterate _all_ results for a large analytic query? Is there a way to return part of a large result set and then continue where you left off? Does that require some kind of server-side resource like a cursor?

Essentially, we have two requirements. The immediate-term one is to be able to dump result lists of tens of thousands of results to some file somewhere. Later on we'd like to be able to allow users to write queries and iterate across them slowly - see the MediaWiki continue API for an example. Are those things possible?
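
The thread doesn't answer this directly; in plain SPARQL the usual fallbacks are LIMIT/OFFSET over a stable ORDER BY, or a keyset-style continuation that filters on the last value already returned. A sketch, reusing the hypothetical wd: prefix from above (neither is a Blazegraph-specific cursor API):

# Option A: fixed-size pages over a stable ordering (the engine may redo work per page).
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?politician
WHERE { ?politician wd:P106 wd:Q82955 . }
ORDER BY ?politician
LIMIT 1000 OFFSET 2000

# Option B: keyset continuation - the client passes back the last IRI it received.
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?politician
WHERE {
  ?politician wd:P106 wd:Q82955 .
  FILTER (STR(?politician) > "http://www.wikidata.org/entity/Q4242")  # hypothetical last-seen value
}
ORDER BY ?politician
LIMIT 1000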

> Can SPARQL take gzipped files? I tried giving a gzipped URL in the bigdata workbench and it didn't work.

What issue did you encounter and what command did you run? I ran it via the Workbench just now and it seemed OK.

LOAD <file:///Users/beebs/Documents/systap/demo-data/as-2014/as_2014_ytd.rdf.ttl.gz>

Is there a temporary issue with the LDAP authentication for mediawiki accounts? Bryan created an account, but the login isn't working. I checked and it wasn't working for me either (on a new session).

On Tuesday I had shown you some sample queries and visualizations (http://www.grapphs.com:8087) based on the unqualified/simple RDF statements of the Wikidata RDF export.
The question came up how to query the qualified statements.
I have also loaded the qualified statements from http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/
As an example, please see http://www.grapphs.com:8087/resource/Q42
that answers the question from above "give me all data about Douglas Adams (Q42) that came from Encyclopædia Britannica Online (Q5375741)".

Note that the join conditions become rather difficult because of the peculiar reification model chosen by the wikidata RDF export.
This can be done more compactly using RDR.

> Note that the join conditions become rather difficult because of the peculiar reification model chosen by the wikidata RDF export.
> This can be done more compactly using RDR.

I'd love to chat about that model and what would actually make more sense from a BigData perspective at some point. I'm certainly not experienced enough to know that their reification model is peculiar. It's certainly expansive, but I don't know any better.

I'd be happy to have that discussion. We can setup another web conference if you like.

Btw, by "peculiar" I did not mean bad, or that I would know better. The wikidata reification model is likely the best possible solution within the RDF data model.
RDR did not exist at the time. I will work on implementing RDR as a serialization format for Wikidata.
I have good contacts with the project leader of the Wikidata Toolkit (he was a colleague of mine); I can work with him.

Oh cool! We're also willing to do that. It might make sense to have the hangout with us and the Wikidata Toolkit leader. I'm on Central Europe time this week, in Berlin.

Nik, which dates/times would be good for you to talk?
I can arrange a web conference.

@Haasepeter - Stas and I are in Berlin now and should be able to talk pretty much any time during the day there. Next week I'm pretty free as well, just send me/us an invite. If you don't have Stas' contact info I'll forward it to him as well.

@Beebs.systap - was reviewing code and saw some documentation typos. What is your process for submitting patches?

> @Haasepeter - Stas and I are in Berlin now and should be able to talk pretty much any time during the day there. Next week I'm pretty free as well, just send me/us an invite. If you don't have Stas' contact info I'll forward it to him as well.
>
> @Beebs.systap - was reviewing code and saw some documentation typos. What is your process for submitting patches?

Best way is to get a CLA in place: http://www.systap.com/contribute. We can then set you up with access in Git.

Regarding the RDR/Wikidata Toolkit discussion: Markus is not available this week; he will propose a time for a call next week.

Out-of-band note: I've updated the docs in the inference-rules example to cover my entailment/inference/truth-maintenance findings so far.

Manybubbles renamed this task from Investigate BigData for WDQ to Investigate BlazeGraph aka BigData for WDQ. Feb 20 2015, 10:28 AM
Manybubbles moved this task from In Dev/Progress to Incoming on the Wikidata-Query-Service board.

> @Beebs.systap - was reviewing code and saw some documentation typos. What is your process for submitting patches?
>
> Best way is to get a CLA in place: http://www.systap.com/contribute. We can then set you up with access in Git.

For those playing along at home: this process is almost done. Legal approved the corporate CLA and told me I could sign, and I've done so. Now we're waiting for legal to send it in.

Also! I'm closing this task in favor of spinning up a new one with subtasks. We're well and truly out of the investigation phase and into the confirmation phase of this choice. Here is the task for confirmation: https://phabricator.wikimedia.org/T90101 . Yeah, it's silly to have a different task for confirmation than for investigation, but I am silly.

I've received the signed CLA from corporate. Once I get Nik's SF account I will set him up as a developer.

> I've received the signed CLA from corporate. Once I get Nik's SF account I will set him up as a developer.

This is done. I now have to recover from travel sickness sufficiently to submit a patch.

I've resolved this task in favor of getting answers to the issues grouped under T90101.