BlazeGraph Finalization: Pluggable inline values
Closed, Resolved (Public)

Description

We expect we'll have to write new inline values. We should validate that this works.

Event Timeline

Manybubbles raised the priority of this task from to Medium.
Manybubbles updated the task description.

Inline values are not necessary. They represent a tradeoff between dictionary encoding values and their direct representation as inline values within the statement indices. There is a simple example ColorsEnumExtension in com.bigdata.rdf.internal that illustrates how to do this for enumerated values. Similar approaches can be used for other things. I would also suggest looking at the DateTimeExtension class.
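
To make the tradeoff concrete, here is a minimal sketch in Python. It does not use Blazegraph's API: `DictionaryStore`, `Color`, `inline_encode` and `inline_decode` are hypothetical names, and the one-byte layout is an assumption chosen only for illustration.

```python
from enum import IntEnum


class Color(IntEnum):
    """Hypothetical enumerated value set (illustration only)."""
    RED = 0
    GREEN = 1
    BLUE = 2


class DictionaryStore:
    """Dictionary encoding: a term becomes an opaque id, decoding needs a lookup."""

    def __init__(self):
        self._term_to_id = {}
        self._id_to_term = []

    def encode(self, term: str) -> int:
        # Assign the next free id the first time a term is seen.
        if term not in self._term_to_id:
            self._term_to_id[term] = len(self._id_to_term)
            self._id_to_term.append(term)
        return self._term_to_id[term]

    def decode(self, term_id: int) -> str:
        # Recovering the value costs a second lookup.
        return self._id_to_term[term_id]


def inline_encode(color: Color) -> bytes:
    """Inline encoding: the value itself is the single byte stored in the key."""
    return bytes([int(color)])


def inline_decode(key: bytes) -> Color:
    """No dictionary involved: the byte is the value."""
    return Color(key[0])
```

The dictionary path handles arbitrary terms but needs an extra lookup to get the value back. The inline path only works for values with a compact fixed encoding, but the value can be read straight out of the index entry.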

Thanks! That is where I got the idea of potentially doing an inline value. It's not that we'll _need_ it, but I want to know how well it works. I figure since dateTimes use this mechanism they are likely to work quite well. This is _somewhat_ wrapped up in our discussions of how to efficiently represent Wikidata's values.

I agree that this is related to how you choose to represent, index, and
query values with additional annotations (error bounds, uncertainty,
different values at different points in time, etc.). This is not a simple
issue. One idea that I have seen is that the preferred values could be
indexed as ground statements (much as in the original data set that Peter
loaded). Those could be searched quite efficiently. But this becomes
problematic I think if you have multiple preferred values unless you then
hit the database again to pull out the metadata about those values.

We may need a custom dateTime to represent that dreaded https://www.wikidata.org/wiki/Q1#P580

Yep, it looks like this value doesn't currently export to RDF correctly:

sparql
prefix wdq: <http://www.wikidata.org/entity/>
select ?x WHERE {
  wdq:Q1 wdq:P580s ?x
}

This means there exists a statement:

wdq:Q1 wdq:P580s wdq:Q1S789eef0c-4108-cdda-1a63-505cdd324564

Let's find out about wdq:Q1S789eef0c-4108-cdda-1a63-505cdd324564:

sparql
prefix wdq: <http://www.wikidata.org/entity/>
select ?x ?y WHERE {
  wdq:Q1S789eef0c-4108-cdda-1a63-505cdd324564 ?x ?y
}

That value for P580v looks interesting. Let's check it out:

sparql
prefix wdq: <http://www.wikidata.org/entity/>
select ?x ?y WHERE {
  wdq:VT392fa31586a0bde63ee928c91b586004 ?x ?y
}

It appears the Universe began in 1196 AD. That doesn't seem right, considering it's listed as 13798 million years BCE.

If we look into the Wikidata RDF dump, we see:

$ grep VT392fa31586a0bde63ee928c91b586004 wikidata-statements.nt 
<http://www.wikidata.org/entity/Q1S789eef0c-4108-cdda-1a63-505cdd324564> <http://www.wikidata.org/entity/P580v> <http://www.wikidata.org/entity/VT392fa31586a0bde63ee928c91b586004> .
<http://www.wikidata.org/entity/VT392fa31586a0bde63ee928c91b586004> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.wikidata.org/ontology#TimeValue> .
<http://www.wikidata.org/entity/VT392fa31586a0bde63ee928c91b586004> <http://www.wikidata.org/ontology#time> "-13800000000"^^<http://www.w3.org/2001/XMLSchema#gYear> .
<http://www.wikidata.org/entity/VT392fa31586a0bde63ee928c91b586004> <http://www.wikidata.org/ontology#timePrecision> "1"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://www.wikidata.org/entity/VT392fa31586a0bde63ee928c91b586004> <http://www.wikidata.org/ontology#preferredCalendar> <http://www.wikidata.org/entity/Q1985727> .

The statement indicates the date as the year -13,800,000,000. This seems reasonable, so let's investigate the data type, [XMLSchema#gYear](http://www.w3.org/TR/xmlschema-2/#gYear).

We see that the "value space of gYear is the set of Gregorian calendar years as defined in § 5.2.1 of ISO 8601", so let's check out ISO 8601.

It appears that ISO 8601 years are restricted to four-digit years from 0000 to 9999. The standard allows for expansion outside this range, but it must be agreed upon by both the producer and the consumer of the data, so in this case it looks like Blazegraph is not expecting the actual value.
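
A small sketch of that distinction, taking the reading of ISO 8601 above at face value: a plain four-digit year is the basic form, and anything signed or longer needs the expanded form that both sides must agree on. `classify_year` and its two labels are hypothetical names, not part of any library.

```python
import re

# Basic form as described above: exactly four digits, 0000 to 9999.
_FOUR_DIGIT_YEAR = re.compile(r"\d{4}")
# Anything else that still looks like a year: optional sign, four or more digits.
_EXPANDED_YEAR = re.compile(r"[+-]?\d{4,}")


def classify_year(lexical: str) -> str:
    """Return "basic" for a four-digit year, "expanded" for a signed or longer one."""
    if _FOUR_DIGIT_YEAR.fullmatch(lexical):
        return "basic"
    if _EXPANDED_YEAR.fullmatch(lexical):
        return "expanded"
    raise ValueError(f"not a year: {lexical!r}")
```

With this check, `classify_year("-13800000000")` falls into the expanded case, which is the one a consumer has to opt in to.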

So I tried a quick hacky patch to com.bigdata.rdf.internal.impl.extensions.DateTimeExtension to make it use our previous model (date as long seconds). With the patch, it seems to properly understand the value, at least when expressed as:

wdt:P580 "-13800000000-01-01T00:00:00Z"^^xsd:dateTime ;
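
For reference, the "date as long seconds" idea can be sketched outside Blazegraph. This is not the patch itself: `days_from_civil` and `to_epoch_seconds` are hypothetical helpers, and the sketch assumes a proleptic Gregorian calendar with astronomical year numbering (a year 0 exists) and UTC-only input. Python's own `datetime` cannot hold this value because its minimum year is 1, so the arithmetic is done by hand.

```python
import re

SECONDS_PER_DAY = 86_400
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

# Assumed lexical form: optional minus sign, 4+ digit year, UTC ("Z") only.
_DATETIME = re.compile(r"(-?\d{4,})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z")


def days_from_civil(year: int, month: int, day: int) -> int:
    """Days since 1970-01-01, proleptic Gregorian, astronomical year numbering."""
    y = year - 1 if month <= 2 else year      # treat Jan/Feb as end of previous year
    era = y // 400                            # floor division handles negative years
    year_of_era = y - era * 400               # 0..399
    shifted_month = (month + 9) % 12          # March = 0 ... February = 11
    day_of_year = (153 * shifted_month + 2) // 5 + day - 1
    day_of_era = year_of_era * 365 + year_of_era // 4 - year_of_era // 100 + day_of_year
    return era * 146_097 + day_of_era - 719_468


def to_epoch_seconds(lexical: str) -> int:
    """Convert a UTC dateTime string to seconds since the Unix epoch.

    Month, day and time fields are not range-checked in this sketch.
    """
    match = _DATETIME.fullmatch(lexical)
    if match is None:
        raise ValueError(f"unsupported dateTime: {lexical!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups())
    seconds = (days_from_civil(year, month, day) * SECONDS_PER_DAY
               + hour * 3600 + minute * 60 + second)
    if not INT64_MIN <= seconds <= INT64_MAX:
        raise OverflowError("value does not fit in a signed 64-bit long")
    return seconds
```

The year -13800000000 comes out on the order of -4.4e17 seconds, well inside the signed 64-bit range of about ±9.2e18, so a long has plenty of room for it.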

A query like:

prefix entity: <http://wikidata-wdq.testme.wmflabs.org/entity/>
prefix wdt: <http://wikidata-wdq.testme.wmflabs.org/entity/assert/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

select ?x WHERE {
  entity:Q1 wdt:P580 ?x
  FILTER (?x < "0001-01-01T00:00:00Z"^^xsd:dateTime)
}

Produces:

x
-13800000000-01-01T00:00:00Z
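
Once the value is a signed long, a `<` filter like the one above reduces to an integer comparison. If such longs are stored inline in index keys, one common general technique is to flip the sign bit so that byte-wise key order matches numeric order. The sketch below shows only that technique. Whether DateTimeExtension lays out its keys this way is not something this thread establishes, and `sortable_key`, `from_sortable_key` and `in_range_before` are hypothetical names.

```python
import struct

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def sortable_key(epoch_seconds: int) -> bytes:
    """Encode a signed 64-bit value so unsigned byte order equals numeric order."""
    if not INT64_MIN <= epoch_seconds <= INT64_MAX:
        raise OverflowError("value does not fit in a signed 64-bit long")
    # Adding 2**63 flips the sign bit: [-2**63, 2**63) maps onto [0, 2**64).
    return struct.pack(">Q", epoch_seconds + 2**63)


def from_sortable_key(key: bytes) -> int:
    """Invert sortable_key."""
    (unsigned,) = struct.unpack(">Q", key)
    return unsigned - 2**63


def in_range_before(keys, upper_bound_seconds):
    """Decoded values whose key sorts strictly before the bound's key."""
    bound = sortable_key(upper_bound_seconds)
    return [from_sortable_key(k) for k in sorted(keys) if k < bound]
```

With keys built this way, a filter such as `?x < "0001-01-01T00:00:00Z"` becomes a scan over every key that sorts before the key for -62135596800 seconds.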

So two things are left:

  1. Find a proper (non-hacky) way of plugging it in.
  2. (?) Support other xsd: time types, such as gYear.

Stas proved this works.