spec of how timevalue works and is supposed to work
Closed, Resolved · Public

Event Timeline

Lydia_Pintscher raised the priority of this task to Medium.
Lydia_Pintscher updated the task description.
Restricted Application added a subscriber: Aklapper. · Feb 3 2015, 4:09 PM
JanZerebecki set Security to None.
Jc3s5h added a subscriber: Jc3s5h. · Feb 10 2015, 4:44 PM
Jc3s5h added a comment. · Edited · Feb 10 2015, 5:05 PM

I note thiemowmde's patch for review at Feb 6, 16:32 says what we are doing now is:

  • URI identifying the calendar model. The actual time value should be in this calendar model, but note that there is nothing this class can do to enforce this convention.

That seems to be more-or-less true, but some entries follow the documentation and put the date in the Gregorian calendar no matter what the date.

Using the convention stated above may mislead data users, since, so far, the date is stored in the ISO 8601 format (with certain errors). The format itself is a declaration that it is a Gregorian date. There is no standard that resembles ISO 8601 but which supports Julian dates, so a standard would have to be created.

Storing the date in the calendar indicated by the URI identifying the calendar model creates a many-to-many conversion problem if Wikidata ever supports calendars in addition to Gregorian and Julian. Data users will have to be able to convert from any Wikidata-supported calendar to any calendar the user desires. This is much more difficult than being able to convert from the Gregorian calendar to the calendar of the user's choice.

Note that, in effect, the URI identifying the calendar model really only gives the name of the calendar, not the format in which the date is stored. It wouldn't be too hard to extend ISO 8601 to support Julian calendar dates (although it would be hard to notify everyone we have done so). But that format would totally break down with some of the calendars out there that we might support. For example, Dershowitz and Reingold in Calendrical Calculations, 3rd ed., describe how to convert to/from Julian Day, Modified Julian Day, Julian calendar, Egyptian calendar, Armenian calendar, Coptic calendar, and 24 more.
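To illustrate the conversion problem discussed above: with a common intermediate value such as the Julian Day Number, each calendar needs only one converter in each direction rather than one per pair of calendars. The following is a minimal Python sketch under that assumption. The names CALENDARS and convert, and the keys "gregorian" and "julian", are hypothetical and are not part of Wikibase. The formulas are the usual integer algorithms for the proleptic Gregorian and proleptic Julian calendars and are only intended for dates with a non-negative Julian Day Number.

```python
def gregorian_to_jdn(year, month, day):
    """Proleptic Gregorian date -> Julian Day Number (integer, noon-based)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045


def julian_to_jdn(year, month, day):
    """Proleptic Julian date -> Julian Day Number."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def _jdn_to_ymd(f):
    # Shared tail of the two inverse conversions.
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return (year, month, day)


def jdn_to_gregorian(jdn):
    """Julian Day Number -> proleptic Gregorian (year, month, day)."""
    return _jdn_to_ymd(jdn + 1401 + (((4 * jdn + 274277) // 146097) * 3) // 4 - 38)


def jdn_to_julian(jdn):
    """Julian Day Number -> proleptic Julian (year, month, day)."""
    return _jdn_to_ymd(jdn + 1401)


# Hypothetical registry: one (to_jdn, from_jdn) pair per supported calendar.
CALENDARS = {
    "gregorian": (gregorian_to_jdn, jdn_to_gregorian),
    "julian": (julian_to_jdn, jdn_to_julian),
}


def convert(date, source, target):
    """Convert a (year, month, day) tuple between two registered calendars."""
    to_jdn, _ = CALENDARS[source]
    _, from_jdn = CALENDARS[target]
    return from_jdn(to_jdn(*date))


# The day the Gregorian calendar was introduced, written in both calendars.
print(convert((1582, 10, 5), "julian", "gregorian"))  # (1582, 10, 15)
```

Adding a further calendar to such a registry means writing two functions, whatever the number of calendars already present. This only works for calendars that have an exact day-level conversion.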

I have created a proposal on Wikidata at https://www.wikidata.org/wiki/User:Jc3s5h/ISO_8601_profile_for_Wikidata

Unless I forgot something, I believe it covers the points that need to be specified in order to understand a date outside the context of a particular article or book. Of course one might argue that some of my choices are wrong, or could be expressed better, but unless my points are addressed I believe a date is ambiguous.

Change 190257 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Do not call TimeValue string "ISO"

https://gerrit.wikimedia.org/r/190257

Patch-For-Review

I found a comment by Daniel Werner in TimeValue.js:

"Since the data value saves the time as Gregorian, we first have to transform that back into Julian.
NOTE: As of May 15 2013 we decided that this is nonsense and that we should always store the time in its native format dependent on the calendar model."

Unfortunately, all relevant Gerrit changes (especially Ic43e1c56845a39324b020f9f4ef527ff22568ffb) are inaccessible, and I have no idea why. @Tobi_WMDE_SW?

Releasing fixes ASAP before deciding how dates should be stored seems like putting the cart before the horse, unless you plan to release an additional patch later after we decide how to store dates.

The horse and the cart are running way ahead, I'm trying hard to catch up. The decision how non-Gregorian dates are to be stored was made in May 2013, see comment T88438#1043068 above. It's just labeled wrong in some less-visible places, e.g. in diffs.

Jc3s5h added a comment. · Edited · Feb 19 2015, 6:53 PM

The code comment stated above does not constitute a decision. I don't know if this project allows important decisions to be made without discussion and hidden in code comments. Even if that is allowed, there is still no decision. There is only a general concept of storing dates in the calendar indicated by the URI of the calendar model. There is still no agreement on the exact format for the date. I am not aware of any widely accepted standard format for writing Julian calendar dates that would correspond to ISO 8601 (which is only for Gregorian dates). The issue of how many zeros to pad Gregorian years with also remains unresolved. Also, dates are stored with zeros for unspecified information, in violation of ISO 8601. For example, this year would be incorrectly stored as +00000002015-00-00T00:00Z.

  1. The timestamp is stored in the calendar model indicated by the URI and not converted to Gregorian. This decision was made in May 2013 and finally realized in the UI in 2014. The tragedy is: Most edits in Wikidata are done via the API and we can't say if a Julian date is correct by looking at the time it was entered. It could be that it was entered incorrectly before the switch and is correct now. On the other hand there are bots and scripts out there that still incorrectly convert Julian dates to Gregorian. This can only be fixed gradually by telling all bot and script developers and by re-checking all Julian dates.
  2. The timestamp is not ISO compliant. This was not the intention. The intention was to have a commonly understood, both human- and machine-readable YMD-ordered format that's independent from the language. We will avoid the word "ISO" in the future. I already started submitting patches for that (see https://gerrit.wikimedia.org/r/#/c/190257/).
  3. Gregorian and Julian dates use the same format. This allows the same formatters, parsers and validators to be reused. This must change in the future when we introduce other calendar models.
  4. The year is padded to have between 1 and 16 digits. There is no further guarantee. Even my suggested minimum of 4 digits (see https://github.com/DataValues/Time/pull/33) does not change that because it does not touch existing dates in the database.
  5. Month and day being zero indicate that they are unknown/undefined/not set. This makes it possible to roundtrip timestamps like "2015-00-00". Users can copy-paste this and the parser will correctly detect "2015" with the precision set to "1 year". The alternative is to store "2015-01-01", which will be incorrectly detected as "1 January 2015" with a precision of "1 day". The disadvantage of storing "2015-00-00" is that it's not ISO compliant and can confuse external parsers, e.g. PHP's parser will convert this to "2014-11-30". Neither disadvantage is relevant. Relying solely on PHP's parser is not possible anyway because it cannot deal with, for example, 5-digit years.
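For data users who need to read such timestamps, the points above can be sketched as a small parser. This is a minimal Python illustration of the format as described in this comment, not the Wikibase parser. The names parse_timestamp and ParsedTime are hypothetical. Treating the sign and the seconds field as optional is an assumption made so that the example strings quoted in this task are accepted.

```python
import re
from typing import NamedTuple, Optional

# Sign, a year of 1 to 16 digits, then two-digit month and day (00 allowed),
# then a time of day ending in "Z". Optional sign/seconds are assumptions.
_TIMESTAMP = re.compile(
    r"(?P<sign>[+-]?)(?P<year>\d{1,16})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T\d{2}:\d{2}(?::\d{2})?Z"
)


class ParsedTime(NamedTuple):
    year: int              # signed, may have more than 4 digits
    month: Optional[int]   # None when stored as 00 (not set)
    day: Optional[int]     # None when stored as 00 (not set)


def parse_timestamp(text: str) -> ParsedTime:
    """Parse a YMD-ordered timestamp; the calendar model is NOT encoded here."""
    match = _TIMESTAMP.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised timestamp: {text!r}")
    year = int(match["year"])
    if match["sign"] == "-":
        year = -year
    month = int(match["month"]) or None  # 0 becomes None
    day = int(match["day"]) or None      # 0 becomes None
    if month is not None and month > 12:
        raise ValueError(f"month out of range in {text!r}")
    if day is not None and day > 31:
        raise ValueError(f"day out of range in {text!r}")
    return ParsedTime(year, month, day)


print(parse_timestamp("+00000002015-00-00T00:00:00Z"))
# ParsedTime(year=2015, month=None, day=None)
```

The same function serves Gregorian and Julian values, which matches point 3: the string alone does not say which calendar it is in, so the calendar model URI has to be read alongside it.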

thiemowmde's comment of Monday, Feb. 23, 11:07 seems to only consider parsers and validators known to the development team. Maybe it's a fact that no one has ever written a good ISO 8601 parser or validator that can process years with more than 4 digits; I have never found one. If so, there wouldn't be any real disadvantage to Wikidata creating its own format.

Allowing 0 valued months and days does not fully "make possible to roundtrip timestamps like '2015-00-00'". Saying that '2015-00-00' must be allowed to achieve roundtrip timestamps implies the precision value is being thrown away during the roundtrip. But many precisions are allowed that will not be preserved during a roundtrip. For example, if the precision is set to 7, century precision, and the precision is thrown away during a roundtrip, it will be impossible to tell if "2000-00-00" is precise to a year, decade, century, or millennium. So the input characters could be reconstructed but the meaning could not be fully reconstructed.
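The ambiguity can be shown with a few lines of Python. This is a rough sketch only. The function year_range is hypothetical. Precision 7 meaning century is taken from the paragraph above; the other codes (6 for millennium, 8 for decade, 9 for year) are assumed from the data model documentation. The bucketing is deliberately naive: it handles positive years only and ignores the convention that centuries begin in a year ending in 01.

```python
# Assumed precision codes; only 7 (century) is stated in the discussion.
PRECISION_NAMES = {6: "millennium", 7: "century", 8: "decade", 9: "year"}


def year_range(year: int, precision: int) -> tuple:
    """Inclusive range of years that a stored year could denote."""
    span = 10 ** (9 - precision)   # 1000, 100, 10 or 1 years
    start = year - year % span
    return (start, start + span - 1)


# The stored string is identical in all four cases: "2000-00-00".
for precision, name in sorted(PRECISION_NAMES.items()):
    print(name, year_range(2000, precision))
```

All four calls start from the same stored year but yield four different ranges, so a roundtrip that keeps the string and drops the precision cannot recover which one was meant.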

only consider parsers and validators known to the development team.

I'm describing what the development team decided, what's currently in the database and what all users of the data need to know to be able to parse them correctly. It would be great if you could help updating https://www.mediawiki.org/wiki/Wikibase/DataModel#Dates_and_times. You are the most experienced user in this realm so far, your help is very much appreciated.

does not fully "make possible to roundtrip [...]

I did not use the word "fully". This only applies to "day vs. year" (which is by far the most relevant use case) and "month vs. year" precision.

I have made changes to https://www.mediawiki.org/wiki/Wikibase/DataModel#Dates_and_times. Perhaps you could review my changes to see if you agree. One element of the TimeValue structure, calendartime, seems to be redundant so long as only the Gregorian and Julian calendars are supported. If we followed the original plan of using ISO 8601, this element would have been a useful place to store the converted value, but now I don't know what it's for. Perhaps if we supported the Roman calendar (which preceded the Julian calendar) we could use it to preserve an exact date such as "ninth day before the Kalends of October in the consulship of Marcus Tullius Cicero and Gaius Antonius" for the birth date of Augustus; since exact conversions of Roman calendar dates are not possible, the time element could be used to indicate the date in a specified proleptic calendar with an appropriate precision.

In any case, I think a better description of the community's vision for the use of calendartime should be given so people don't stuff the database full of their own private idea of what ought to go there.

Perfect, thank you very much!

The calendartime field is, as far as I can tell, a planned extension that was never implemented and became obsolete with the decision made in May 2013. Such stuff should be marked as "obsolete" in the documentation.

In response to thiemowmde's post of Mon., Feb. 23, 17:21, I suggest that rather than marking calendartime obsolete, we mark it reserved for support of calendars other than proleptic Gregorian and proleptic Julian, with no usage rules decided upon yet.

@Jc3s5h: Just a note about Julian dates in ISO format. Wikipedia says:

ISO 8601 fixes a reference calendar date to the Gregorian calendar of 20 May 1875 as the date the Convention du Mètre (Metre
Convention) was signed in Paris. However, ISO calendar dates before the Convention are still compatible with the Gregorian
calendar all the way back to the official introduction of the Gregorian calendar on 1582-10-15. Earlier dates, in the proleptic
Gregorian calendar, may be used by mutual agreement of the partners exchanging information. The standard states that every date
must be consecutive, so usage of the Julian calendar would be contrary to the standard (because at the switchover date, the dates
would not be consecutive). -- https://en.wikipedia.org/wiki/ISO_8601

So, data exchange partners are free to agree to use ISO 8601, at least for dates before 1582-10-15, if they make sure there is no discontinuity at the switchover date (this would be achieved by explicitly identifying the calendar model).

Even if this was not the case, it would be simple enough to just define a new standard that uses the ISO formatting but re-defines the interpretation to be using the Julian calendar.

Regarding what we want to store: The decision to store dates as entered, without any conversion (however in a normalized format), is only consistent with what we do for all other kinds of values in Wikibase/Wikidata. We may very well be adding more calendar models that have no direct conversion to Gregorian, so storing them converted would constitute data loss.

Sadly, there was some confusion about this at some point, leading to some code performing conversions. No matter which interpretation we now choose for the old data, some of it will be wrong, and it will be hard to find out which entries are correct, and which are not.

As to how we store it: ISO-ish form seems like a good choice for Julian. Why not use it?

Change 190257 merged by jenkins-bot:
Do not call TimeValue string "ISO"

https://gerrit.wikimedia.org/r/190257

Tobi_WMDE_SW closed this task as Resolved. · Mar 5 2015, 8:34 AM
Tobi_WMDE_SW moved this task from Review to Done on the § Wikidata-Sprint-2015-02-25 board.