
Illegal dates in date type
Open, High, Public

Description

There seems to be an issue where a year could be recorded as either time="+0000000YYYY-00-00T00:00:00Z" or time="+0000000YYYY-01-01T00:00:00Z" (with precision=9).

These display the same but if you compare the claims they show up as different. I also believe only the latter is correct.

To further complicate things it is not possible to manually change a value from the first to the second in Wikidata without creating a new claim and deleting the old one.

Is there a way of checking how many of the first kind there are, and possibly converting these to the latter? It would also be interesting to prevent one of the two forms from being recorded at all (including via the API).
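A minimal sketch of such a check, assuming the values have already been extracted as (time, precision) pairs. The helper names `has_zero_month_or_day`, `year_key` and `same_year_value` are hypothetical and not part of Wikibase or pywikibot; the sample data is made up for illustration.

```python
import re

# Timestamp shape used in this task: sign, year digits, then -MM-DDThh:mm:ssZ.
TIME_RE = re.compile(r"^([+-])(\d+)-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})Z$")

PRECISION_YEAR = 9  # precision value quoted in the description


def _parse(time_str):
    match = TIME_RE.match(time_str)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {time_str!r}")
    return match


def has_zero_month_or_day(time_str):
    """Return True if the month or the day field of the timestamp is 00."""
    match = _parse(time_str)
    return match.group(3) == "00" or match.group(4) == "00"


def year_key(time_str):
    """Return the signed year as an int, ignoring month, day and time of day."""
    match = _parse(time_str)
    sign = -1 if match.group(1) == "-" else 1
    return sign * int(match.group(2))


def same_year_value(a, b):
    """Compare two (time, precision) pairs that both have year precision."""
    (time_a, prec_a), (time_b, prec_b) = a, b
    if prec_a != PRECISION_YEAR or prec_b != PRECISION_YEAR:
        raise ValueError("this sketch only handles year precision")
    return year_key(time_a) == year_key(time_b)


# Made-up sample values covering both spellings of the same year.
values = [
    ("+00000002015-00-00T00:00:00Z", 9),
    ("+00000002015-01-01T00:00:00Z", 9),
    ("+00000001999-00-00T00:00:00Z", 9),
]
zero_count = sum(has_zero_month_or_day(t) for t, _ in values)
```

Here `zero_count` counts the values written with a 00 month or day, and `same_year_value(values[0], values[1])` treats the two spellings of 2015 as equal even though the raw strings differ.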

See also T103378: [Bug] Dates with “month” precision are offset by one month on Wikipedia

Event Timeline

Lokal_Profil raised the priority of this task from to Needs Triage.
Lokal_Profil updated the task description.
Lokal_Profil added a project: Wikidata.
Lokal_Profil subscribed.

From this edit it looks like one of the Widar-enabled tools adds such dates.

What does "illegal" refer to?

You can also store +2015-06-15T00:00:00Z with precision "1 year". The API does not throw the additional bits of information away. Why should it? You can store +2015-06-15T00:00:00Z with precision "15 days", which means you are describing a known uncertainty resulting in a timespan from 2015-06-01 to 2015-06-30. Most of the simple formatters we currently use will render it as "2015-06-15". When you enter "2015" there is no month and no day and the parser detects this as precision "1 year" with no month and no day (...-00-00T00:00:00Z) set.

So I have to ask: What is the issue, the actual and the expected behavior?

I meant illegal as in not following ISO 8601. Upon a closer reading I however see that it says "timestamp in a format resembling ISO 8601" so I guess illegal is the wrong choice of words.

The difference between storing +2015-06-15T00:00:00Z, +2015-01-01T00:00:00Z, +2015-00-00T00:00:00Z is that the third one makes only slightly more sense than +2015-99-99T00:00:00Z which is disallowed.

If, however, "2015" (with no month and no day) should always be stored as 2015-00-00T00:00:00Z, then I'll happily abide by and adapt to that. It should then just be made clear, so that downstream usage (most notably pywikibot) follows the same convention.

In general I would say storing +2015-06-15T00:00:00Z with precision 1 year is also problematic since (in the frontend) saving this then changing the precision to 1 month doesn't work. Thus to the frontend user the 06-15 bit is unrecoverable and largely undetectable. In fact if I edit the value in the frontend and again set the precision to "year" it will simply delete the month, day information (example). But that is a different issue from the current one.

As a result of this, entering 2015-00-01 doesn't trigger an error (diff), whereas e.g. 2015-13-01 does.

I would suggest simply removing days if they are insignificant, and removing months too if they are insignificant: for the year 2015 (precision 9), use "2015", no more, no less.

We should discuss the implications of the different options (use 00, use 01, omit, or use ** or something). We should consider these options for internal and external JSON.

RDF representation should also be considered, but is different in that we want to be compatible with XSD data types there, and also have to consider calendar conversion.

daniel updated the task description.

Note that this ticket deals with the representation in JSON, while T103378 deals with the representation in Lua. They are related, but have different goals/perspectives.

As far as I understand the actual issue here is not that the format allows 2015-00-00 with month and day set to zero, but:

  1. Naive string comparison fails for TimeValues that are "identical" from a user's perspective. I do not think this is an issue we can ever fix with anything we do. It will always be possible to express the same thing with different values. For example, "2015-01-15T00:00:00, precision=DAY, after=1" and "2015-01-15T00:00:00, precision=HOUR, after=24" will be logically the same but different internally. Or there can be two timestamps in Gregorian and Julian describing the same thing but being different internally because of the calendar model. And so on. Naive string comparison is doing it wrong and cannot be fixed by disallowing month and day being "00".
  2. Some parser functions are built on top of PHP's date parsing and turn "2015-00-00" into 2014-12-30. I hope most people agree with me that this is broken and should be fixed in PHP, but it probably can't be because it always was like that and people started building apps on top of this broken behavior and will complain if it changes. However, the solution is pretty simple: when the precision is YEAR, do not use these parser functions but instead just split the year off the timestamp and show it as it is. I really, really do not think that a bug in PHP's date parser should motivate a workaround in the contents of the knowledge base we are building.

I suggest changing the ticket's title so it turns into something we can act on, or closing it.
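
The approach in point 2 can be sketched as follows: for year precision, read the year field straight from the timestamp string instead of handing it to a date parser. This is only an illustration in Python, not the Wikibase formatter; `format_year` is a hypothetical helper, and rendering negative years with a leading minus sign is an arbitrary choice made for the sketch.

```python
def format_year(time_str):
    """Render a year-precision timestamp by reading the year field directly.

    Month and day are never interpreted, so "-00-00" and "-01-01" give
    the same result.
    """
    sign = time_str[:1]
    if sign not in ("+", "-"):
        raise ValueError(f"timestamp must start with a sign: {time_str!r}")
    # Everything between the sign and the first "-" separator is the year.
    year_digits = time_str[1:].split("-", 1)[0]
    if not year_digits.isdigit():
        raise ValueError(f"no year found in: {time_str!r}")
    year = int(year_digits)  # int() drops any leading zeros
    return f"-{year}" if sign == "-" else str(year)


with_zeros = format_year("+00000002015-00-00T00:00:00Z")
with_ones = format_year("+00000002015-01-01T00:00:00Z")
```

Both calls produce the same string, because the month and day fields are never passed to a date parser.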

daniel triaged this task as High priority. Sep 10 2015, 3:49 PM

We urgently need a decision on this, since a) the status quo causes issues and b) any change to this will be a breaking change. So we should make a decision ASAP.

Note that we have different use cases to consider:

  1. date representation in Lua
  2. date representation in API JSON
  3. internal representation of dates
  4. mapping dates to RDF

As far as I understand, this ticket is aimed at the JSON representation (2), and T103378 is aimed at Lua (1), but when discussing this, we should consider all use cases.

Please see T89246. As a non-developer who has not been coding any of this stuff, I was astonished to learn that after = 0 and after = 1 mean the same thing. So the following two TimeValues mean the same thing (assuming unmentioned parameters have the same value):

time = +2013-01-01T00:00:00Z
precision = 11
before = 0
after = 0
timezone = 0
calendarmodel = https://www.wikidata.org/wiki/Q12138

time = +2013-01-01T00:00:00Z
precision = 11
before = 0
after = 1
timezone = 0
calendarmodel = https://www.wikidata.org/wiki/Q12138

That is, they both declare that an event occurred at a point in time that isn't exactly known, but falls between 00:00 hours 1 January 2013 and 00:00 hours 2 January 2013 Coordinated Universal Time.

If the before value were set to 1 in either case, I imagine the event would have occurred between 00:00 hours 31 December 2012 and 00:00 hours 2 January 2013 Coordinated Universal Time; have I got that right?

I think the concept of after = 0 and after = 1 meaning the same thing is so astonishing that it needs to be advertised far and wide at every opportunity.
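
The reading described in this comment can be written out as a small calculation. This sketch assumes that reading is correct (the comment itself asks for confirmation), handles day precision only, and uses a hypothetical helper `day_interval` that is not part of any Wikibase API.

```python
from datetime import datetime, timedelta, timezone


def day_interval(time_str, before, after):
    """Return (start, end) for a day-precision value, per the reading above.

    Assumption: after = 0 is treated like after = 1, so the interval always
    extends at least one day past the stated time.
    """
    if not time_str.startswith("+"):
        raise ValueError("this sketch only handles positive years")
    point = datetime.strptime(time_str[1:], "%Y-%m-%dT%H:%M:%SZ")
    point = point.replace(tzinfo=timezone.utc)
    unit = timedelta(days=1)  # one unit of day precision
    start = point - before * unit
    end = point + max(after, 1) * unit
    return start, end


interval_after_0 = day_interval("+2013-01-01T00:00:00Z", before=0, after=0)
interval_after_1 = day_interval("+2013-01-01T00:00:00Z", before=0, after=1)
interval_before_1 = day_interval("+2013-01-01T00:00:00Z", before=1, after=0)
```

Under this assumption `interval_after_0` and `interval_after_1` are identical, and setting `before=1` moves only the start of the interval back to 31 December 2012.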