
Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph
Open, Normal, Public

Description

A date of +0000-01-01 is allowed in wikibase but has no meaning.
This is a follow up on T92006.

Details:

XSD 1.0: 1 BC is year -1
"[ISO 8601] makes no mention of the year 0; in [ISO 8601:1998 Draft Revision] the form '0000' was disallowed and this recommendation disallows it as well. However, [ISO 8601:2000 Second Edition], which became available just as we were completing version 1.0, allows the form '0000', representing the year 1 BCE."
[ISO 8601] refers to the one from 1988-06-15. The ISO references are non-normative.

XSD 1.1: 1 BC is year 0
"is consistent with the current edition of [ISO 8601]."
[ISO 8601] here also links to the one from 1988-06-15. But the references also list ISO 8601:2000 Second Edition which is never linked to in the rest of the spec. The ISO references are non-normative.

Both have the same namespace.

https://docs.oracle.com/javase/8/docs/api/index.html?javax/xml/datatype/XMLGregorianCalendar.html says it follows XSD 1.0 and ISO 8601:1988 and allows year 0. It allows conversion to GregorianCalendar but doesn't explicitly say how negative years are converted.
https://docs.oracle.com/javase/8/docs/api/java/util/GregorianCalendar.html: 1 BC is year 0
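The net effect of the two conventions is a constant one-year shift for BCE dates: XSD 1.1 and ISO 8601:2000 use astronomical numbering (year 0 = 1 BCE), while XSD 1.0 and ISO 8601:1988 skip year 0. A minimal sketch of the mapping as plain arithmetic, not tied to any particular library (the function names are illustrative):

```python
def xsd10_to_astronomical(year):
    """Map an XSD 1.0 year (no year 0, so -1 means 1 BCE) to the
    astronomical numbering used by XSD 1.1 and ISO 8601:2000 (0 means 1 BCE)."""
    if year == 0:
        raise ValueError("year 0 does not exist in XSD 1.0")
    return year + 1 if year < 0 else year

def astronomical_to_xsd10(year):
    """Inverse mapping: astronomical year back to an XSD 1.0 year."""
    return year - 1 if year <= 0 else year

print(xsd10_to_astronomical(-1))  # -> 0 (1 BCE)
print(astronomical_to_xsd10(0))   # -> -1 (1 BCE)
print(xsd10_to_astronomical(-2))  # -> -1 (2 BCE)
```

CE years are unaffected; only the BCE branch shifts by one.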

SPARQL 1.1 has XSD 1.0 as a normative reference.

It seems the only way to know whether this works is to actually test each implementation, as the specs and documentation are too unreliable to depend on.

Tests on Blazegraph:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?date
WHERE {
  BIND ( year("0000-01-01T00:00:00"^^xsd:dateTime) AS ?date)
}
MalformedQueryException: "0000-01-01T00:00:00" is not a valid representation of an XML Gregorian Calendar value.
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?date
WHERE {
  BIND ( "0001-01-01T00:00:00"^^xsd:dateTime - "-0001-01-01T00:00:00"^^xsd:dateTime AS ?date)
}
366.0
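The 366.0 result is consistent with XSD 1.0 semantics: -0001 denotes 1 BCE, which is astronomical year 0 in the proleptic Gregorian calendar, and that year is a leap year, so exactly 366 days separate the two instants. A quick sanity check of the arithmetic (a sketch, independent of Blazegraph):

```python
def is_leap(year):
    # Proleptic Gregorian leap rule, astronomical numbering (year 0 exists).
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Under XSD 1.0, lexical -0001 means 1 BCE, which is astronomical year 0.
# The span from -0001-01-01 to 0001-01-01 therefore covers exactly that
# one year, and year 0 is a leap year (divisible by 400).
days = 366 if is_leap(0) else 365
print(days)  # -> 366, matching Blazegraph's 366.0 (a day count)
```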

It seems Blazegraph does not support subtracting a duration from a dateTime:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?date
WHERE {
  BIND ( "0001-01-01T00:00:00"^^xsd:dateTime - "P1Y"^^xsd:duration AS ?date)
}
Cannot add process datatype literals:"0001-01-01T00:00:00"^^:"P1Y"^^

Event Timeline

JanZerebecki raised the priority of this task from to Needs Triage.
JanZerebecki updated the task description. (Show Details)
JanZerebecki added a subscriber: JanZerebecki.
Restricted Application added a subscriber: Aklapper. Mar 26 2015, 6:53 PM

OK, so that means we cannot have year 0. The question is: what should we do with it? We have quite a lot of such items:

http://milenio.dcc.uchile.cl/sparql?default-graph-uri=&query=PREFIX+%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E+SELECT++%3Fsdate+WHERE+%7B+%3Fv+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fontology%23time%3E+%3Fdate+.+%0D%0ABIND+%28str%28%3Fdate%29+as+%3Fsdate%29%0D%0AFILTER+regex%28%3Fsdate%2C+%270000-%27%29%0D%0A%7D+LIMIT+100&format=text%2Fhtml&timeout=0&debug=on

Looks like they are used when year is missing, e.g.:
https://www.wikidata.org/wiki/Q6765647
https://en.wikipedia.org/wiki/Marissa_Stott

Or:
https://www.wikidata.org/wiki/Q276378
https://en.wikipedia.org/wiki/Makoto_Tateno

We may need to develop some way to express this. This will be useless for range searches but may be useful for something like "birthday on this day" type of searches (discussed also on the list).

Smalyshev renamed this task from date of +0000-01-01 is allowed in wikibase but has no meaning to Date of +0000-01-01 is allowed in wikibase but has no meaning as xsd:dateTime. Mar 26 2015, 7:19 PM
Smalyshev claimed this task.
Smalyshev triaged this task as Normal priority.
Smalyshev set Security to None.

Note that all current data representation formats assume that "0000-01-01T00:00:00" is a valid representation.

Moreover, XML Schema 1.1 argues that this change was made "in order to agree with existing usage", and I would agree there: many existing documents used the ISO interpretation of years even before it became official. In other words, if we want to export data to RDF, we should definitely conform with current usage and standards. I imagine that it would be easy for BlazeGraph to use either semantics if we asked for support there.

Regarding the intention of SPARQL 1.1, I now have sent an enquiry to the former SPARQL WG:
http://lists.w3.org/Archives/Public/public-sparql-dev/2015JanMar/0031.html
which will hopefully lead to further clarification on this matter.

JanZerebecki renamed this task from Date of +0000-01-01 is allowed in wikibase but has no meaning as xsd:dateTime to Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph. Mar 27 2015, 2:47 PM

The lexical representation where the year fraction is 0 has undefined meaning in Wikibase, so we cannot assume it is a month-day without a year. I think the easiest way is to represent it as a string, i.e. a date with undefined meaning, like we should additionally do for values we currently munge to nearby meaningful dates, such as Feb 31. Then we can go and define a month-day without a year in Wikibase.

About the representation of years BCE: thank you for starting that mailing list thread. It seems to tend towards saying that the normative reference for SPARQL 1.1 is XSD 1.1, with the reasoning that XSD 1.1 retroactively updates anything that refers to XSD 1.0, as otherwise the SPARQL 1.1 Query Language and Entailment Regimes specifications would disagree. If that view holds, the question is what to do about implementation reality, which because of Java might lean towards XSD 1.0. Implementations of XPath functions (like libxslt) and XSD validators (like libxml) are probably also relevant.

Virtuoso seems to implement XSD 1.0:

prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?date
WHERE {
  BIND ( "0001-01-01T00:00:00"^^xsd:dateTime - "-0002-01-01T00:00:00"^^xsd:dateTime AS ?date)
}
<res:value datatype="http://www.w3.org/2001/XMLSchema#integer">63158400</res:value>

= (365+366)*24*60*60
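This matches the XSD 1.0 reading: -0002 denotes 2 BCE, i.e. astronomical year -1, so the span from -0002-01-01 to 0001-01-01 covers astronomical years -1 (365 days) and 0 (a leap year, 366 days). A small check of that arithmetic (a sketch, independent of Virtuoso):

```python
def is_leap(year):
    # Proleptic Gregorian leap rule, astronomical numbering (year 0 exists).
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def days_in_year(year):
    return 366 if is_leap(year) else 365

# XSD 1.0: lexical -0002 is 2 BCE, i.e. astronomical year -1. Subtracting
# -0002-01-01 from 0001-01-01 spans astronomical years -1 and 0.
seconds = sum(days_in_year(y) for y in (-1, 0)) * 24 * 60 * 60
print(seconds)  # -> 63158400, matching Virtuoso's result
```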

Virtuoso 22007 Error DT006: Cannot convert 0000-01-01T00:00:00-03:00 to datetime : Incorrect year value

SPARQL query:
define sql:big-data-const 0 
#output-format:application/rdf+xml
prefix xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?date
WHERE {
  BIND ( "-0001-01-01T00:00:00"^^xsd:dateTime AS ?date)
}

Maybe we should wait to act on this until Java's XMLGregorianCalendar and libxml/libxslt are changed?

The lexical representation where the year fraction is 0 has undefined meaning in Wikibase,

True. That's the complication: it can be used both for "year before 1 AD" and for "we don't know what year it was, but it was July 4th".

As we will probably implement a custom date parser in BlazeGraph, we can have special handling for year 0. The question is *what* that handling should be, especially given the above. We don't want a person "born on July 4th of an unknown year" to show up in a list of "persons born in 1 BCE".

Yes, the discussion on SPARQL has converged surprisingly quickly to the view that XSD 1.1 is both normative and intended in SPARQL 1.1. (By the way, I can only recommend this list if you have SPARQL questions, or the analogous list for RDF; people are usually very quick and helpful in answering queries, especially if you say why you need it.)

Your findings about Virtuoso and BlazeGraph show that it might be hard to find a conforming processor right now. However, I would still hope that it can be done, since the transformation of the values is quite easy after all. In fact, I think that neither of these projects is very likely to have customers who have cared about BCE years so far ;-).

Technically, it should not be too hard: even if you use an XSD 1.0 library in most places, you could surely find an XSD 1.1 library to use for the date functions, or you could transform dates internally before passing them to the XSD 1.0 operator functions. If such transformation is not efficient enough, one could also convert all input to XSD 1.0 dates on loading (or before) and then merely translate dates in queries and results accordingly (this should not be a big performance issue since these datasets are rather small).

However, I think it should be easy to take the few affected XPath functions from another library or to implement them directly. Julian day calculation is a very simple algorithm (https://en.wikipedia.org/wiki/Julian_day#Calculation), and this is all you need for date comparisons, calendar conversion, and time intervals. One way or another, implementations will most likely have to do some custom extension of internal date handling, unless the standard XSD libraries can cope with the age of the universe.
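The Julian day calculation mentioned above really is short. A sketch of the standard algorithm in floor-division form (assuming proleptic Gregorian input with astronomical year numbering, valid for years later than about -4800):

```python
def julian_day_number(year, month, day):
    """Julian day number of a proleptic Gregorian date, astronomical
    year numbering (1 BCE = 0). Floor-division form of the standard
    algorithm; valid for years later than about -4800."""
    a = (14 - month) // 12        # 1 for January/February, else 0
    y = year + 4800 - a           # shift so y stays positive
    m = month + 12 * a - 3        # March = 0 ... February = 11
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)

print(julian_day_number(2000, 1, 1))                            # -> 2451545
# One proleptic Gregorian leap year between 0000-01-01 and 0001-01-01:
print(julian_day_number(1, 1, 1) - julian_day_number(0, 1, 1))  # -> 366
```

Differences of Julian day numbers give day counts directly, which is all that date comparison and interval arithmetic need.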

Data publication is another issue. It's clear that we need to use XSD 1.1 when we publish RDF online, since this is what the current RDF specification requires. Applications that find our data on the web cannot know what we discussed and there is no way of telling them. They can only assume that we are using the current standard.

For JSON and Wikidata internally, the main reference is probably ISO 8601, not XSD (I don't actually know what JSON says here, but it usually says nothing about anything other than primitive JavaScript types). I'd find it hard to explain why we would choose to deviate from that. Year 0000 was legal in ISO for many years before Wikidata was even started. @Lydia_Pintscher recently triggered an action to review the dates stored internally in Wikidata, so this issue should probably be part of it (esp. since all BCE dates that are exact to the year should use the Julian calendar). As you said, we already have year 0000 in values, and it is likely that we have other negative year numbers entered by the same bots (assuming ISO semantics). So we need to change many dates one way or the other in any case. I think most technical consumers will appreciate it if we stick to ISO, since it is easier to do calculations with.

Moreover, now that every standard has agreed to use the same format, time is working for those who go with it.

"we don't know what year it was but it was July 4th"

Ouch. Where has this been designed? Can you point to the specification of this?

@Denny, is this intended? Dates without a year are extremely hard to handle in queries and don't work at all like the normal dates we have. This should be a different datatype.

@mkroetzsch I don't think it is specified anywhere, it's just what people do. See https://en.wikipedia.org/wiki/Marissa_Stott and its wikidata item, and another one above. If you just look for items with date 0 in RDF, I'm sure many of them are exactly that. I don't like it but that's what we have in the data, so we need to decide what to do with it.

OTOH, while I agree we should eventually be RDF 1.1 compatible, it does not mean we're obliged to represent all dates in the DB as xsd:dateTime, and I think we are allowed to tweak xsd:dateTime interpretation somewhat. And we don't want to get wrong query results.

For now, until we figured out the whole "how dates are represented internally" thing, I think we should take this road for simple values (deep values always have original string):

  1. AD dates go to xsd:dateTime; if there's an invalid date like February 31, we make it the last day of February that year, and so on. That would allow us to range-search it.
  2. Year 0 is an invalid date for now, until we decide what to do with it.
  3. Negative years are translated into xsd:dateTime as is, i.e. year -1 in the data is year -1 in RDF's xsd:dateTime.

The last one may be dangerous if the loading tool follows RDF 1.1, since 1 BCE rendered as -0001 would then be read as 2 BCE. We could in theory dump 1 BCE as 0000 instead, but then most RDF tools wouldn't load negative dates correctly.
No idea currently what to do with this.

@Smalyshev @Lydia_Pintscher Dates without years should not be allowed by the time datatype. They are impossible to order, almost impossible to query, and they do not have any meaning whatsoever in combination with a preferred calendar model. All the arguments @Denny has already given elsewhere for why we should unify dates to Proleptic Gregorian internally apply here too. My suspicion is that the existing dates of this form are simply a glitch in the UI, where users got the impression that dates without years are recognized, and pressing "save" silently set the year to zero without them seeing the change in meaning. If this is an important use case, then we should develop a day-of-year datatype that supports this, or suggest that the community use dedicated properties/qualifiers to encode this. However, other datatype extensions would be much more important than this rare case (e.g., units of measurement).

The above proposal of @Smalyshev is simply to use RDF 1.0 for export and to assume XSD 1.0 (non-ISO) dates to be used in Wikidata. After all the discussion here, I am completely baffled by this proposal. It goes against all current standards, and against the view of the SPARQL working group. The additional proposal to revert to the "dates are just strings" view for deep values ignores the original design and documentation, and dismisses the recommendations that Denny and I have been making via email. It seems we have reached an impasse here.

I suggest to freeze the RDF-time encoding discussions now until we have established a joint understanding what dates in Wikidata mean. As soon as we export dates to RDF, we are defining their meaning indirectly via the RDF semantics, and this bug report is not the right place for doing this.

@mkroetzsch Do you know of some widely used software that implements XSD 1.1 handling of BCE dates?

Dates without years should not be allowed by the time datatype.

what dates in Wikidata mean

I think the best way forward is to leave a lexical 0 in the year fraction as undefined in Wikidata. Yes, it's used, but AFAIK it was always undefined.

If this is an important use case, then we should develop a day-of-year datatype that supports this, or suggest that the community use dedicated properties/qualifiers to encode this. However, other datatype extensions would be much more important than this rare case (e.g., units of measurement).

I agree.

@mkroetzsch Do you know of some widely used software that implements XSD 1.1 handling of BCE dates?

Many applications that process dates are based on ISO rather than on XSD. Java's SimpleDateFormat class, for example, is based on ISO and thus interprets year numbers like XSD 1.1. I would assume that most time-processing applications, e.g., JavaScript timelines, do the same. Only XSD-based implementations tend to have legacy handling. For many RDF tools it is really hard to tell without digging into their code (usually they don't document this detail, and they use their own implementations rather than relying on any XSD library). But I think it is fair to assume that ISO has a much larger market share, and that XSD 1.0 implementations will be updated at some point.

I think the best way forward is to leave a lexical 0 in the year fraction as undefined in Wikidata. Yes, it's used, but AFAIK it was always undefined.

Our original specification of Wikidata said: "The calendar model used for saving the data is always the proleptic Gregorian calendar according to ISO 8601". This is the specification that Denny and I support, but there have been changes recently. WMDE is currently in the process of reviewing these changes to gauge the impact they have had in the data over time, and to come up with ideas how to recover to a consistent state. We have to await their report and suggestions before deciding what to do in RDF.

All the software I checked that handles XSD types implement XSD 1.0 BCE years. Together we know of nothing that implements XSD 1.1 BCE years. I'd suggest we produce output for the RDF tools that exist.

"The calendar model used for saving the data is always the proleptic Gregorian calendar according to ISO 8601".

And that does not refer to the 2000 version, so 0000 is an invalid year. I'm not aware of any changes there. It is just not rejected during validation, like a Feb 31.

We have to await their report and suggestions before deciding what to do in RDF.

The only review I'm aware of is related to dates in Julian calendar only.

I agree that we should gather more data before continuing the discussion about the interpretation of dates stored in wikidata.

@mkroetzsch Do you know of some widely used software that implements XSD 1.1 handling of BCE dates?

Many applications that process dates are based on ISO rather than on XSD.

Which version of ISO 8601, though?

From Wikipedia: ISO 8601:2004 (and previously ISO 8601:2000, but not ISO 8601:1988) explicitly uses astronomical year numbering in its date reference systems. https://en.wikipedia.org/wiki/0_%28year%29#ISO_8601

ISO doesn't save us, since they made the same breaking change, just a few years earlier. If a spec or a library's docs don't explicitly say whether they use ISO 8601:2000 or later, we might have to assume they use ISO 8601:1988, which does the same as XSD 1.0: it does not allow year zero, and counts -1 as 1 BC.

Dates without years should not be allowed by the time datatype

That's fine, but they are already there, so I'm not sure how saying "should not be allowed" helps. We have to do something when we encounter them. What is that something?

The additional proposal to revert to the "dates are just strings" view for deep values ignores the original design and documentation, and dismisses the recommendations that Denny and I have been making via email.

It does not dismiss anything. Given the current state of the data (invalid dates, zero dates, etc.), I do not see how we can faithfully represent the data in the deep value as anything else. If you have a better idea, please propose it. If/when we get a guarantee that the date in the source value is a valid, representable Gregorian date, we can type it as xsd:dateTime, but before that I don't see how we can do that.

suggest to freeze the RDF-time encoding discussions now until we have established a joint understanding

Establishing understanding is fine, but the code which produces RDF has to produce something when it encounters a time value. We cannot just have all the work wait for an indefinite time until we reach an understanding. So what should this code produce now?

As soon as we export dates to RDF, we are defining their meaning indirectly via the RDF semantics, and this bug report is not the right place for doing this.

We can open another task if needed, though I don't see why this one is particularly unsuitable. In any case, I am not proposing anything that has to be enshrined as a standard forever; we are not even releasing the dump as an internal standard, let alone something we publicly promise never to change. But we need to have something so that we can get it working and use it for the query engine work.

We have to await their report and suggestions before deciding what to do in RDF.

I don't think halting all work on the query engine until we reach full consensus on this point is realistic. I also don't think that data not including dates is of much use, even in beta status. So that means we have to export dates somehow, even knowing that we may change the representation before we get out of beta status. The question is what that representation should be. I proposed what I see as a good solution given the current state of affairs. If that's not good, fine; I'm completely open to hearing other proposals.

@Smalyshev

Re "halting the work on the query engine"/"produce code now": The WDTK RDF exports are generated based on the original specification. There is no technical issue with this and it does not block development to do just this. The reason we are in a blocker situation is that you want to move forward with an implementation that is different from the RDF model we proposed and that goes against our original specification, and Denny and I fundamentally disagree with your design. If you want to return to the original plan, please do it and move on. If not, then better wait until Lydia has a conclusion for what to do with dates, rather than implementing your point of view without consensus. For me, this is a benchmark of whether or not our current discussion setup is working.

Here is why I am optimistic that we can align with RDF 1.1 and ISO 8601:2000 before the query engine would even go live: Basically all calendar-accurate BCE dates will be revised and many of them will be changed because of the ongoing date review. We can well fix the year zero issue at the same time. Thus we can as well work on the hypothesis that dates are in ISO 8601:2000 as originally intended. From the feedback we got from the SPARQL group, it seems that this would be preferable, if we can make it work technically. The date review is a great opportunity to get the whole internal representation back on track.

Re deep value model: the core of the issue is that you propose to represent dates as the "original" string. Denny and I have clarified that we don't find this an acceptable representation for dates. As opposed to the XSD 1.0 issue, this proposal leads to a completely different structure in RDF and queries. There is no upgrade path from this implementation to the one we actually want. If we can agree on getting rid of this first, this would be a good start to move on. Changing from XSD 1.0 to XSD 1.1 is a minor issue in comparison, and one which can be deferred in implementation until we have BlazeGraph support for this.

@daniel @JanZerebecki

Feel free to post a list of the RDF tools that you found to implement RDF 1.0 rather than RDF 1.1 in terms of dates.

We wrote the specification in 2012, when ISO 8601:2000 had long been the established standard that people were using, so it is surprising to us that you thought that we would mean a standard from 1988 when not specifying further details. Anyway, it's good that the confusion has been discovered now, just in time to get everything fixed to the state we actually want. The issue is more important now than it was in 2012, since all major W3C standards now rely on this interpretation.

Is there any issue with, for now, just not including those dates in the RDF export? That'd allow @Smalyshev to continue working on queries while we figure out the rest. It also shouldn't block us in whichever way we go forward later.

And I agree that using year 0 to indicate an unknown year is very bad. We need to find a better solution for that usecase.

@mkroetzsch I already listed a few of the tools that implement XSD 1.0 style BCE years and I read your answer as to say that you know of no tools that implement XSD 1.1 style BCE years.

@mkroetzsch I already listed a few of the tools that implement XSD 1.0 style BCE years and I read your answer as to say that you know of no tools that implement XSD 1.1 style BCE years.

Then you misread my answer. Almost all tools that exist today use the 2000 version of the ISO standard. A prominent example is ECMAScript, and thus all JavaScript implementations, and virtually every JavaScript timeline implementation. See http://www.ecma-international.org/ecma-262/5.1/#sec-15.9.1.15 and the examples in the following section to see this. RDF tools are an understandable exception, not because people there think one should cling to the old standard, but because RDF 1.1 was only standardised in 2014. It is natural that existing RDF implementations have more pressing upgrade work to do than to fix BCE date handling. I am sure they will all move to the new standard in due course.

It is worth noting that all of the ECMAScript documents use "ISO 8601" to refer to "ISO 8601:2000", just like we did in the Wikidata data model specification. It seems that most people are not confused by this.

Having dug up ECMA from the Web, I can now also safely say that JSON exports should definitely use ISO 8601:2000 dates.

The new situation therefore is: ISO, W3C, and all JavaScript implementations vs. a subset of the developers in the WMDE office. I am very unhappy about the amount of my time I have to put into digging up for you what the rest of the world is thinking. It's a nice position to put yourself in, asking others to find specific arguments against your position and assuming you are right if they don't have the time or knowledge to do it. Now I am myself far from being an expert in JavaScript or even in all details of SPARQL 1.1, but if I don't know something I try to find out before taking part in discussions like this.

The WDTK RDF exports are generated based on the original specification. There is no technical issue with this and it does not block development to do just this.

If by original specification you mean the assumption "all data is proleptic Gregorian", then it does not match the current data. I.e. if I just make the code assume that, it will generate a great number of broken dates which will not be interpreted properly by the query engine. In fact, almost all Julian dates will be wrong, and many others will be broken too. I'm not sure how useful it would be to take this road: why have broken data in our dump?

If not, then better wait until Lydia has a conclusion for what to do with dates, rather than implementing your point of view without consensus.

I'm not sure how that's better. If a decision is made, we can always change the code, but just sitting with our hands folded and doing nothing doesn't look like a good idea.

Re deep value model: the core of the issue is that you propose to represent dates as the "original" string. Denny and I have clarified that we don't find this an acceptable representation for dates.

OK, but what I still miss is what you consider an acceptable representation of the current data. If we have a date of 0000-02-31 in the data, what do you propose the RDF data contain? If it's marked as a Julian date, what should the data contain? What if it is marked with a calendar that is neither Gregorian nor Julian?

There is no upgrade path from this implementation to the one we actually want.

Why not? If the format changes, you can update your data fairly easily by removing old value nodes/triples and replacing them with new ones. That is, provided somebody actually uses our beta data and depends on it deeply enough for this to become a problem in the time it takes us to make a decision. Which, if that is going to take a long time, is yet another argument for not blocking on it.

Thus we can as well work on the hypothesis that dates are in ISO 8601:2000 as originally intended.

I understand that neither BlazeGraph nor Virtuoso actually interprets dates as ISO 8601:2000. We need them to understand our dates. I'm not sure how you propose to solve this? Or am I mistaken in interpreting Jan's conclusions, and they do use ISO 8601:2000?

@Lydia_Pintscher could you clarify what you mean by "those dates"? We want to represent all dates, I think, so are you proposing to just ignore the triples with "weird" dates? That would mean the data would look as if these dates do not exist, which may confuse some queries (e.g. a person with no date of death is considered alive, but that's not the same as somebody having a date of death of April 31st, 4 BCE).

Yes I am proposing to drop dates that have year set to 0 for now from the RDF export. We can communicate that and need to do this anyway as it is really bad practice.

@Lydia_Pintscher this means that in the dump there will be no difference between "has a property value with a date containing year 0" and "has no property value". I'm not sure that is a good idea. Having no property value is meaningful (e.g. having a position start/end date, a death date, an organization creation/dissolution date, etc.), so it may lead to wrong query results. We could use a string or Somevalue, but dropping existing data seems too radical to me.

It is the only way I see forward right now that will let you continue working on queries quickly (anything else needs considerably more discussion/investigation, as this ticket shows), and I consider it acceptable in this case. How many cases are we talking about?

I'm not sure why it is the only way. Certainly using Somevalue or a string or any other placeholder value is a way too. They may be worse ways, if you see arguments against them, but why are they not ways at all? Why can these options not be considered? Also, we're not talking only about year 0; there are many non-Gregorian dates there.

Judging by this:
http://milenio.dcc.uchile.cl/sparql?default-graph-uri=&query=PREFIX+%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E+SELECT+count%28%3Ftime%29+WHERE+%7B+%0D%0A++%3Fx+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fontology%23time%3E+%3Ftime+.+%0D%0A++FILTER+regex%28str%28%3Ftime%29%2C+%22%5E0000%22%29+.%0D%0A%7D+&format=text%2Fhtml&timeout=0&debug=on

We have 51 entries with year 0 (if the data there is full and up-to-date) and 3541 negative dates. I'm not sure how to count invalid dates, but there is a date like "1966-02-31"^^http://www.w3.org/2001/XMLSchema#date in the DB there, which doesn't look like a valid one. It looks like Virtuoso is more lenient in allowing invalid data than BlazeGraph.

We also have 12806 entries marked as Julian.

@Smalyshev We really want the same thing: move on with minimal disturbance as quickly as possible. As you rightly say, the data we generate right now is not meant for production use but for testing. We must make sure that our production environment will understand dates properly, but it's still some time before that. Here is my proposal summed up:

  1. Implement RDF export now as if Wikidata would encode all dates in ISO 8601:2000 (proleptic Gregorian, with year 0 encoding 1 BCE)
  2. Have a switch in the RDF export code that allows us to export to RDF 1.1 or to RDF 1.0.

Item 1 will ensure that we can work with the dates as they will most likely be when we enter production. With the discovery that all of JavaScript relies on ISO 8601:2000, there is not much of a question that we will have this corrected in the end. It would be a waste of programming time to work around issues that others are already trying to fix as we speak. We can still implement reinterpretations of the internal dates when we find that the internal format is still broken when we want to release this (I hope this won't happen).

Item 2 is a compromise. It will ensure that we can use BlazeGraph even before the xsd:date bug is fixed. Has anyone reported the issue to them yet? It might well be that they are quicker fixing this than we are finishing this discussion. I am sure they would also like to conform to SPARQL 1.1.
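The switch in item 2 could come down to a single year-formatting decision at serialization time. A hypothetical sketch (the function name and flag are illustrative, not from any existing codebase), assuming the internal value uses astronomical year numbering:

```python
def format_xsd_year(astro_year, xsd11=True):
    """Format an internal astronomical year (0 = 1 BCE) as an XSD
    lexical year of at least four digits.

    xsd11=True  -> XSD 1.1 / ISO 8601:2000 style (year 0 allowed)
    xsd11=False -> XSD 1.0 style (no year 0; 1 BCE becomes -0001)
    """
    if xsd11:
        y = astro_year
    else:
        y = astro_year - 1 if astro_year <= 0 else astro_year
    sign = "-" if y < 0 else ""
    return f"{sign}{abs(y):04d}"

print(format_xsd_year(0, xsd11=True))    # -> 0000
print(format_xsd_year(0, xsd11=False))   # -> -0001
print(format_xsd_year(-1, xsd11=True))   # -> -0001
```

CE years come out identically in both modes, so flipping the switch only rewrites BCE years.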

I agree with you that some dates will not be interpreted as intended, but this is unavoidable (whatever rule we pick, we will always have some dates that are not as intended, already because of the calendar model mix-up). We have to rely on the ongoing review to get this fixed. This should not worry us right now, as it affects everyone (including actual production uses of Wikidata data from the API or JSON dumps).

@Smalyshev P.S. Your finding of "0000" years in our Virtuoso instance is quite peculiar given that this endpoint is based on RDF 1.0 dumps as they are currently generated in WDTK using this code: https://github.com/Wikidata/Wikidata-Toolkit/blob/a9f676bfbc2df545d386bfa72e5130fa280521a9/wdtk-rdf/src/main/java/org/wikidata/wdtk/rdf/values/TimeValueConverter.java#L112-L117

Item 1 will ensure that we can work with the dates as they will most likely be when we enter production.

I'm not sure what the basis of this assertion is. I didn't see any plans for BlazeGraph to move to the new standard in the near term, and the same for Virtuoso. I will create an issue with Blazegraph, but even if/when we convince them to move (since this would mean everybody that uses the older 1.0 format would no longer receive correct results, I'm not sure they'd be eager to switch), it may take time, and more realistically we'd probably have to rely on our own date handling eventually. And this is only BlazeGraph; loading the same data into Virtuoso would still produce broken results for any BCE date.

More importantly, do we have a commitment that the Wikidata data format will be changed in the very near future so that all stored dates are actually valid proleptic Gregorian? We initiated this discussion very recently, and I have not seen any definite resolution yet, much less a commitment for when the data would actually be like this (or how Julian dates would be represented in that case). Do you propose that until this happens, our date values in the RDF dump for every Julian date, every BCE date, and all dates like yyyy-02-31 should be unusable? I'm not sure how that improves anything.

It would be a waste of programming time to work around issues that others are already trying to fix as we speak.

There is no programming time to waste: the current model is already implemented (and Julian handling too). Of course, it's not a problem to change it, but for that we need to know what we are changing it to and how it would work. Assuming that everything is ISO 8601:2000 when in fact it is not does not seem like the correct way. BlazeGraph would just reject invalid dates (so they would be useless for queries), and, even worse, it would consume current BCE dates not as ISO 8601:2000 but in its own understanding of dates, which will make all queries touching 1 BCE and below invalid, at least until we implement custom date handling. I am not sure why this is the best option, unless we know that all internal dates will move to ISO 8601:2000 very soon.
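The off-by-one between the two conventions is small but silently corrupts every BCE date if applied inconsistently. A minimal sketch of the year-number mapping (in Python, for illustration only; the function names are hypothetical, not WDTK or BlazeGraph code):

```python
def xsd10_to_xsd11_year(y10: int) -> int:
    """Map an XSD 1.0 year number to XSD 1.1 / ISO 8601:2000 numbering.

    XSD 1.0 has no year 0: year -1 denotes 1 BCE.
    XSD 1.1 follows ISO 8601:2000: year 0 denotes 1 BCE, -1 denotes 2 BCE.
    """
    if y10 == 0:
        raise ValueError("0000 is not a valid year in XSD 1.0")
    return y10 + 1 if y10 < 0 else y10


def xsd11_to_xsd10_year(y11: int) -> int:
    """Inverse mapping: XSD 1.1 year number back to XSD 1.0 numbering."""
    return y11 - 1 if y11 <= 0 else y11
```

Under these rules, a literal such as "-0001-01-01" means 1 BCE to an XSD 1.0 processor but 2 BCE to an XSD 1.1 processor, which is exactly the kind of silent misinterpretation at issue here.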

I agree with you that some dates will not be interpreted as intended, but this is unavoidable

It is true that there will be invalid dates. But you seem to be proposing to just ignore this fact and put them in the data, with full knowledge that they are invalid and either cannot be consumed by the query engine or, even worse, will be interpreted incorrectly by it. I am proposing to try to fix those that we can, so that the query engine can make sense of most of them. For the others, we should provide just a string representation and not claim that some random assembly of characters that we cannot validate is actually an xsd:dateTime.
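A sketch of that fallback policy (a hypothetical helper, not actual exporter code): validate the lexical form as a proleptic Gregorian date and emit an xsd:dateTime literal only if it passes; otherwise emit a plain string literal so the query engine cannot misinterpret it.

```python
import re
from datetime import date

_DATETIME_RE = re.compile(r"^(-?\d{4,})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})$")

def _is_leap(year: int) -> bool:
    # Assumes astronomical year numbering for BCE years.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def rdf_time_literal(lexical: str) -> str:
    """Return an xsd:dateTime literal if `lexical` is a valid proleptic
    Gregorian date-time; otherwise fall back to a plain string literal."""
    m = _DATETIME_RE.match(lexical)
    if m:
        year, month, day = int(m.group(1)), int(m.group(2)), int(m.group(3))
        hh, mm, ss = int(m.group(4)), int(m.group(5)), int(m.group(6))
        try:
            # datetime.date only covers years 1..9999; for other years,
            # validate month/day against a stand-in year of matching leapness.
            probe = year if 1 <= year <= 9999 else (2000 if _is_leap(year) else 1999)
            date(probe, month, day)
            if hh < 24 and mm < 60 and ss < 60:
                return f'"{lexical}"^^xsd:dateTime'
        except ValueError:
            pass
    return f'"{lexical}"'  # plain string: no claim it is a valid dateTime
```

Dates like yyyy-02-31 then survive in the dump as queryable strings instead of being rejected or misread by the engine.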

Your finding of "0000" years in our Virtuoso instance is quite peculiar

Virtuoso seems to be able to import invalid dates. I'm not sure whether it can actually index them (probably not, but I can check). Other tools, however, may reject them or even fail the whole import.

Virtuoso seems to be pretty odd. E.g. this query:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX : <http://www.wikidata.org/entity/>
SELECT ?x ?time WHERE {
  ?x <http://www.wikidata.org/ontology#time> ?time .
  FILTER (?time < "0002-01-01"^^xsd:date)
} LIMIT 100

returns only a handful of items with year 1 but not any dates with BCE. Same when using "0002"^^xsd:gYear. Moreover, this one:

PREFIX : <http://www.wikidata.org/entity/> 
SELECT count(?time) WHERE { 
  ?x <http://www.wikidata.org/ontology#time> ?time . 
  FILTER regex(str(?time), "^-") .
  FILTER (?time > "0001"^^<http://www.w3.org/2001/XMLSchema#gYear>) .
} LIMIT 100

produces 3541 - i.e., Virtuoso thinks all negative years are actually bigger than year 1. No idea what's going on there but I suspect it's not what we want.

@Smalyshev You comment on my Item 1 by referring to BlazeGraph and Virtuoso. However, my Item 1 is about reading Wikidata, not about exporting to RDF. Your concerns about BlazeGraph compatibility are addressed by my item 2. I hope this clarifies this part.

As for the wrong dates, I simply say that we do not know how to fix them, since the errors are not sufficiently systematic. At best we can replace one kind of error with another kind of error. I agree with you that wrong dates are a bad thing, but a bad thing that is beyond our power to fix. We should focus on the RDF export and rely on others to do their work, so that everything will run smoothly in the end. At the current stage of the RDF work, the issue is of relatively minor relevance compared to the problems it is causing elsewhere. All of our current RDF exports and several applications that people are using suffer from the same errors in Wikidata. We need to fix it at the root, not in each consumer.

Manybubbles moved this task from Needs triage to WDQS on the Discovery board. May 7 2015, 7:50 PM
In T94064#1158423, @mkroetzsch wrote in part:

Julian day calculation is a very simple algorithm (https://en.wikipedia.org/wiki/Julian_day#Calculation), and this is all you need for date comparisons, calendar conversion, and time intervals.

For an environment that allows dates from the beginning of the universe to the estimated destruction of the Solar System, Julian date conversion is not so simple; the algorithm in the English Wikipedia article, copied from a reliable source, fails for Julian day numbers slightly less than zero (that is, dates before approximately 4713 BCE). The widest-range algorithms I know of are in Dershowitz and Reingold's ''Calendrical Calculations''; I recall they tested them for 10,000 years before and after the present (but I can't recall which page this claim is on).

@Jc3s5h You are right that date conversion only makes sense within a certain range. I think the software should disallow day-precision dates in prehistoric eras (certainly everything before -10000). There are no records that could possibly justify this precision, and the question of calendar conversion becomes moot. Do you think 4713 BCE would already be enough, or could there be a reason to find more complex algorithms that extend calendar support further into the past?

daniel added a comment. Edited May 19 2015, 4:09 PM

@mkroetzsch precise dates for prehistoric times may be useful for astronomical events. These could/should use a different calendar model though, such as https://en.wikipedia.org/wiki/Julian_day (Q14267), see T59704: Support Julian Date (astronomy)

Jc3s5h added a comment. Edited May 19 2015, 4:50 PM

@mkroetzsch precise dates for prehistoric times may be useful for astronomical events. These could/should use a different calendar model though, such as https://en.wikipedia.org/wiki/Julian_day (Q14267), see T59704: Support Julian Date (astronomy)

To support dates of prehistoric astronomical events, the issue isn't whether to use the Julian day, the proleptic Julian calendar, or the proleptic Gregorian calendar. The issue is that the theories of motion of the Solar System use time scales such as Terrestrial Time or Barycentric Coordinate Time. These time scales use seconds of equal length, very similar to the seconds produced by atomic clocks. But calendars conventionally count actual solar days. Because the rate of rotation of the Earth is steadily decreasing, January 3, 10,000 BC in a proleptic Julian calendar modified to observe Terrestrial Time would correspond to approximately January 1, 10,000 BC in the proleptic Julian calendar reckoned by actual solar days.

The actual rotation rate of the Earth is not known well enough to create a definitive conversion between calendar dates and Terrestrial Time (and other similar timescales). So support for prehistoric astronomical events would require support for multiple time scales: one or more counting actual observed solar days, and others where the length of a day is close to 86,400 atomic seconds.
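To give a sense of the scale involved, a commonly cited long-term fit for ΔT (the difference TT − UT) is the parabola from Morrison and Stephenson (2004). Extrapolating it to prehistoric epochs is itself an assumption, since it is fitted to historical eclipse records, but it illustrates how the divergence grows to days:

```python
def delta_t_seconds(year: float) -> float:
    """Rough long-term estimate of Delta T = TT - UT in seconds, using
    the parabolic fit Delta T = -20 + 32 * t**2 (Morrison & Stephenson
    2004), with t in centuries from 1820. Extrapolation far outside the
    historical record is speculative."""
    t = (year - 1820) / 100.0
    return -20.0 + 32.0 * t * t
```

At year -10000 this gives roughly 4.5 × 10⁵ seconds, around five days, the same order of magnitude as the two-day figure above.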

ksmith moved this task from WDQS to On Sprint Board on the Discovery board. Aug 27 2015, 8:26 PM

As far as this concerns Blazegraph and RDF, I think everything that needed to be done is done. We support XSD 1.1 now. So I am removing the Wikidata Query parts from it.

I created T112703: "Fix display of dates in user interface". This bug covers how a date is stored in wikibase and handled by XSD, but not how it is displayed in the user interface. The current display in the user interface is wrong.