
BlazeGraph uses old xsd:dateTime standard
Closed, ResolvedPublic

Description

Between XSD 1.0 and XSD 1.1 standards, the meaning of dates with year 0 and negative years changed. In XSD 1.0, year 0 is invalid, and year -1 is 1 BCE. In XSD 1.1, following ISO 8601:2000, year 0 is valid and means 1 BCE, year -1 is 2 BCE.
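The difference in year numbering can be sketched in a few lines of Python. This is only an illustration of the rule described above; the function name `era_label` and the version strings are made up for this example and are not part of any XSD library.

```python
def era_label(year: int, xsd_version: str) -> str:
    """Return a human-readable era label for the year field of an xsd:dateTime.

    `year` is the signed integer exactly as written in the literal.
    `xsd_version` is "1.0" or "1.1" (hypothetical switch for this sketch).
    """
    if xsd_version not in ("1.0", "1.1"):
        raise ValueError(f"unknown XSD version: {xsd_version}")
    if year > 0:
        # Positive years mean the same thing in both versions.
        return f"{year} CE"
    if xsd_version == "1.0":
        if year == 0:
            raise ValueError("year 0000 is not valid in XSD 1.0")
        # XSD 1.0: -1 is 1 BCE, -2 is 2 BCE, ...
        return f"{-year} BCE"
    # XSD 1.1 (astronomical numbering): 0 is 1 BCE, -1 is 2 BCE, ...
    return f"{1 - year} BCE"


for y in (1, 0, -1):
    for version in ("1.0", "1.1"):
        try:
            print(y, version, era_label(y, version))
        except ValueError as exc:
            print(y, version, "error:", exc)
```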

Judging from these tests:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?date
WHERE {
  BIND ( year("0000-01-01T00:00:00"^^xsd:dateTime) AS ?date)
}

MalformedQueryException: "0000-01-01T00:00:00" is not a valid representation of an XML Gregorian Calendar value.

prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?date
WHERE {
  BIND ( "0001-01-01T00:00:00"^^xsd:dateTime - "-0001-01-01T00:00:00"^^xsd:dateTime AS ?date)
}

366.0

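Both results show that Blazegraph follows XSD 1.0: year 0000 is rejected, and the span from -0001 to 0001 is a single (leap) year of 366 days. The current RDF specification references XSD 1.1, so we probably need an option to support XSD 1.1.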
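To see why 366 days points at XSD 1.0, here is a rough Python sketch that counts days in the proleptic Gregorian calendar under both readings of the year field. The helper names are invented for this example, and the day-count routine is a standard civil-calendar algorithm, not Blazegraph's code.

```python
def astronomical_year(lexical_year: int, xsd_version: str) -> int:
    """Map the year as written in the literal to astronomical numbering (0 = 1 BCE)."""
    if xsd_version == "1.1":
        return lexical_year
    if xsd_version == "1.0":
        if lexical_year == 0:
            raise ValueError("year 0000 is not valid in XSD 1.0")
        # In XSD 1.0, -1 means 1 BCE, which is astronomical year 0.
        return lexical_year + 1 if lexical_year < 0 else lexical_year
    raise ValueError(f"unknown XSD version: {xsd_version}")


def days_from_civil(y: int, m: int, d: int) -> int:
    """Days since 1970-01-01, proleptic Gregorian, with y in astronomical numbering."""
    if m <= 2:
        y -= 1                      # treat Jan/Feb as the end of the previous year
    era = y // 400                  # floor division also works for negative years
    yoe = y - era * 400             # year of era, 0..399
    doy = (153 * (m - 3 if m > 2 else m + 9) + 2) // 5 + d - 1
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468


def day_difference(a, b, xsd_version: str) -> int:
    """Days from b to a; a and b are (year, month, day) with the year as written."""
    ya, ma, da = a
    yb, mb, db = b
    return (days_from_civil(astronomical_year(ya, xsd_version), ma, da)
            - days_from_civil(astronomical_year(yb, xsd_version), mb, db))


print(day_difference((1, 1, 1), (-1, 1, 1), "1.0"))  # 366: only 1 BCE lies in between
print(day_difference((1, 1, 1), (-1, 1, 1), "1.1"))  # 731: 2 BCE and 1 BCE lie in between
```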

See also discussion in T94064.

Event Timeline

Smalyshev raised the priority of this task from to High.
Smalyshev updated the task description.

Peter comments that he has also run into this just recently.

I think that we should create two tickets

  1. The xsd date time specification change. This will need to get documented at the data migration page. We should also modify the service description output to indicate what version of RDF, XSD, etc. are supported.
  2. A ticket for RDF 1.1 compliance. Openrdf 2.8 targets RDF 1.1, but Blazegraph is still on RDF 1.0. We should pull together a list of all of the changes associated with RDF 1.1 and see if we can get there in a single release.

Question: is it appropriate to support XSD 1.1 while we are on RDF 1.0?

Thanks,
Bryan

As far as I can tell, Wikidata's statement to data contributors and users about the meaning of dates stored in Wikidata is at [[mw:Wikibase/DataModel/JSON]]. The earliest version, dated 29 April 2014, states "time: Date and time in ISO notation". The 2004 version of ISO 8601 defined 0000 as 1 BCE, contrary to XSD 1.0. Furthermore, [[en:astronomical year numbering]], which was probably the most common use of negative years before the late 20th century proliferation of computer standards, treats 0 as 1 BCE.

I believe we must presume that most data contributors have not read all the code that implements Wikibase, and simply trusted those statements and the general public understanding of what the year 0 is. More importantly, they would understand year (-n) to be equal to year (n+1) BCE. What got stored in Wikibase would depend on how it was entered: the user interface would have changed an input of "5 BC" to -5, whereas a value entered through the API, such as "-00000000004-12-31T00:00:00Z" (which the data contributor would have regarded as equal to 31 December 5 BC), would have been stored as-is.

This means that every year <= 1 AD in the database is suspect and must be reviewed for correctness.

@Jc3s5h @Smalyshev: I have created a separate ticket for figuring out negative years (and year zero) in our data model: T99674: Decide on internal representation of (Gregorian and Julian) dates with negative years

@daniel cool. Given that we now have our own date handler in Blazegraph, we can implement practically any decision, we just need to know what to do.

@Smalyshev is it correct to say that at the moment, Julian dates are converted to XSD 1.1 dates, but Gregorian dates stay XSD 1.0 in our output? I think we should have an option for switching the XSD version of the output, defaulting to 1.1.

Obviously I'm not Smalyshev, but the only distinction discussed among these related Phabricator tasks is XSD 1.0 representing 1 BCE as -0001 while XSD 1.1 represents it as 0000. It would be helpful if someone would post links to the exact specifications for XSD 1.0 and 1.1 they are thinking of, but from what I found, XSD follows the Gregorian calendar because their notation is inspired by ISO 8601.

Leaving ISO 8601 and its spawn aside, I am not aware of any consistent notational differences between the (possibly proleptic) Julian calendar and the Gregorian calendar, other than the leap days that only occur in the Julian. If a date were stored in Wikidata with the calendarmodel field set to Julian, I can't imagine any difference in output format one might want compared to the calendarmodel field being set to Gregorian.

An aside, just in case anyone decides to read a book I mentioned earlier, Dershowitz and Reingold's Calendrical Calculations: that book uses the convention that Julian 1 BCE is -1 while Gregorian 1 BCE is 0. This is the only written work I have ever seen that follows this convention.

@Jc3s5h XSD is always Gregorian, but the question is which year numbering should be used in the output. This is regardless of whether the date is Gregorian or Julian internally. It's purely a question of which version of the XSD spec we want to follow in our output, and yes, the only difference is a one year offset.

I don't know if we need to write carefully in this forum, but when the results of the discussion are presented to the community, it will be important (if we allow the output of Julian dates) to say that we produce an output format inspired by XSD 1.x but extended to allow Julian dates.

That is incorrect for the RDF output. It would be correct for the JSON representation used for internal storage, the API, and JSON dumps.

But JSON is not what this ticket is about. This ticket is about the representation in RDF, which explicitly uses the xsd:dateTime type, which in turn is defined to be an ISO timestamp, and thus always Gregorian. The problem described in the ticket is that the RDF store we use, Blazegraph, uses XSD 1.0 internally, while we (probably) want to use XSD 1.1. Or at least have an option to switch RDF output between 1.0 and 1.1.

Am I correct to think that this ticket is only about how to represent 1 BCE in RDF? Thus, the fact that RDF is strictly (possibly proleptic) Gregorian while Wikidata also supports Julian is outside the scope of this ticket, so conversion between Gregorian and Julian is a problem for a different ticket?

I think this particular ticket is rather about how our Blazegraph instance understands the xsd:dateTime type. Right now it understands it strictly as XSD 1.0. We may want to have an option for it to also accept XSD 1.1. There are several aspects to this:

  1. How the data is stored in wikidata (T99674)
  2. How the data look in RDF - e.g. does 2 BCE look like "-0001-01-01T00:00:00"^^xsd:dateTime (XSD 1.1) or "-0002-01-01T00:00:00"^^xsd:dateTime (XSD 1.0) (T99795)
  3. How the SPARQL engine understands the data - i.e., how many days are between "0001-01-01T00:00:00"^^xsd:dateTime and "-0001-01-01T00:00:00"^^xsd:dateTime.

These three are distinct questions. I think this ticket is about the third one - though of course the previous two are important too and need to be handled.
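For the second question, the one-year offset between the two lexical forms can be sketched as a small string rewrite. This is an assumption-laden illustration in Python, not the Wikibase exporter: the function name `xsd10_to_xsd11` and the simplified regular expression are invented for this example.

```python
import re

# Simplified pattern: optional sign, at least four year digits, then the rest
# of the dateTime. It does not validate months, days, times or time zones.
_DATETIME = re.compile(r"^(-?)(\d{4,})(-\d{2}-\d{2}T.*)$")


def xsd10_to_xsd11(literal: str) -> str:
    """Rewrite the year of an XSD 1.0 dateTime string into XSD 1.1 numbering."""
    match = _DATETIME.match(literal)
    if match is None:
        raise ValueError(f"not a dateTime lexical form: {literal!r}")
    sign, digits, rest = match.groups()
    year = int(digits)
    if year == 0:
        raise ValueError("year 0000 is not valid in XSD 1.0")
    if sign == "-":
        # n BCE is written -n in XSD 1.0 and -(n - 1) in XSD 1.1.
        year -= 1
    new_sign = "-" if sign == "-" and year > 0 else ""
    return f"{new_sign}{year:04d}{rest}"


print(xsd10_to_xsd11("-0002-01-01T00:00:00"))  # 2 BCE becomes -0001
print(xsd10_to_xsd11("-0001-01-01T00:00:00"))  # 1 BCE becomes 0000
print(xsd10_to_xsd11("0001-01-01T00:00:00"))   # CE years are unchanged
```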

Looks like this one is trickier than we thought. While Blazegraph allows overriding the storage for dates, it still insists on using Calendar for math. So if we do this:

prefix schema: <http://schema.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

INSERT {
  <http://test.com/a> schema:lastModified "0001-01-01T00:00:00"^^xsd:dateTime .
  <http://test.com/a> schema:lastModified "-0001-01-01T00:00:00"^^xsd:dateTime .
  <http://test.com/a> schema:lastModified "-13798000000-01-01T00:00:00"^^xsd:dateTime .
  <http://test.com/b> schema:lastModified "0000-01-01T00:00:00"^^xsd:dateTime .
} WHERE {}

everything works, and querying it with:

prefix schema: <http://schema.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?a ?b 
WHERE {
  <http://test.com/a> schema:lastModified ?a .
  <http://test.com/b> schema:lastModified ?b .

}

works fine. This however does not:

prefix schema: <http://schema.org/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?a ?b (?a - ?b as ?diff)
WHERE {
  <http://test.com/a> schema:lastModified ?a .
  <http://test.com/b> schema:lastModified ?b .

}

since arithmetic operations are done with the standard XML classes (see com.bigdata.rdf.internal.constraints.DateTimeUtility).

FWIW, comparison ops work fine.

it still insists on using Calendar for math

Oh my.

On a related note - is this as important as monitoring/puppetization/comparing to magnus wdq work? Basically, should we lower this in priority?

We can live with it, but arithmetic with negative dates won't work :( And dates like -13798000000 would crash if used in an expression like ?date1 - ?date2, since XML can't parse them. I'd like to fix it, but it requires some very deep Blazegraph hacking, so I haven't figured out how to do it yet.
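As a rough idea of what doing the date math ourselves could look like, here is a Python sketch that parses the year as an arbitrary-size integer and counts days directly. It assumes XSD 1.1 (astronomical) year numbering and the proleptic Gregorian calendar; `day_number` and the other names are hypothetical and this is not the code from any Blazegraph patch.

```python
import re

# Only the date part is parsed; time of day and time zone are ignored here.
_LITERAL = re.compile(r"^(-?\d{4,})-(\d{2})-(\d{2})T")
_DAYS_BEFORE_MONTH = (0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334)


def is_leap(year: int) -> bool:
    """Gregorian leap-year rule; also valid for zero and negative years."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def day_number(literal: str) -> int:
    """Days from 0001-01-01 to the literal's date (negative for earlier dates)."""
    match = _LITERAL.match(literal)
    if match is None:
        raise ValueError(f"not a dateTime lexical form: {literal!r}")
    year, month, day = (int(group) for group in match.groups())
    prev = year - 1
    # Whole years before `year`; floor division keeps this right for negatives.
    days = 365 * prev + prev // 4 - prev // 100 + prev // 400
    days += _DAYS_BEFORE_MONTH[month - 1] + (day - 1)
    if month > 2 and is_leap(year):
        days += 1
    return days


def diff_days(a: str, b: str) -> int:
    """Equivalent of ?a - ?b in whole days."""
    return day_number(a) - day_number(b)


print(diff_days("0001-01-01T00:00:00", "0000-01-01T00:00:00"))
print(diff_days("0001-01-01T00:00:00", "-13798000000-01-01T00:00:00"))
```

Because Python integers are unbounded, the very large year does not need special handling here; an implementation in Java would have to choose a wide enough integer type for the day count.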

Change 219237 had a related patch set uploaded (by Smalyshev):
T94539: Implement our own math for dates

https://gerrit.wikimedia.org/r/219237

Change 219237 merged by jenkins-bot:
T94539: Implement our own math for dates

https://gerrit.wikimedia.org/r/219237