Time data-type inconsistently zero-pads year value (dates earlier than year 1000?)
Closed, ResolvedPublic

Description

REPRODUCTION

q937 +00000001955-04-18T00:00:00Z
q44269 +000000000343-12-06T00:00:00Z

  • Note that the date with year 343 has an extra leading 0

RELEVANCE

Module code that extracts the date by fixed substring position ends up with a date of "00343-12-0", which triggers the year 10,000 bug.

  • As a result, Nicolau de Mira shows up as having died in 2003 instead of 343.

OTHER NOTES


Version: unspecified
Severity: major
Whiteboard: u=dev c=backend p=0
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=26181
https://bugzilla.wikimedia.org/show_bug.cgi?id=60999

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 3:16 AM)
bzimport set Reference to bz64084.
bzimport added a subscriber: Unknown Object (MLST).

The years are currently always padded by the TimeValue when being stored.
As time has passed the level of padding has changed.

Can you not just trim the 0s from the left of the string extracted?

This is a sensible suggestion, but the script was not written by me. I only happened to come across this issue on Catalan Wikipedia while investigating the Year 10,000 bug. See: https://bugzilla.wikimedia.org/show_bug.cgi?id=30148#c11

I would leave a note on the talk page, but it is a non-English wiki, and I generally don't know what the policy is for feedback in other languages (do I translate the note using an online translator?)

Also, multiple wikis would probably need to be changed. Besides Catalan, four other wikis use similar logic in their main Wikidata module. See the list below.

Hope this helps.

affected: uses "9" as substring for date

unaffected: uses a regex

currently unaffected: no code handling time data-type

currently unaffected: Module:Wikidata doesn't exist

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

ISO 8601:2000 contains the following requirement:

"If, by agreement, expanded representations are used, the formats shall be as specified below. The
interchange parties shall agree the additional number of digits in the time element year. In the examples below
it has been agreed to expand the time element year with two digits.
"a) A specific day
"Basic format: ±YYYYYMMDD Example: +0019850412
"Extended format: ±YYYYY-MM-DD Example: +001985-04-12"

Above Addshore added a comment "As time has passed the level of padding has changed." Unless a bot went through and changed all the existing stored dates to have the same number of digits as the new amount of padding, this change has thrown Wikidata out of compliance with ISO 8601. There are no ISO 8601 compliant dates in Wikidata. Every date in Wikidata is invalid.

So:

Thus:

  • The documentation in the comments at the top of TimeValue should be fixed as it is incorrect! ( https://github.com/DataValues/Time/issues/28 )
  • Old information and old revisions in the database cannot be changed and should not be changed.
  • The number of zeros in a year for an item can be changed by fixing that claim and making a new revision of the item.
  • The constructor for TimeValue needs to continue to allow many different lengths of year due to legacy data, we need to stay backwards compatible.

All new dates added through the UI should be 8601 compliant, with 16 digits in the year.
We need to check the API again and see what can be added manually; I suspect a year of any length can be added :/

In light of Addshore's comment of January 26, 2015, 23:12, I would say that a bot must be created to change every date so the year contains exactly 16 digits. Also, since ISO 8601 requires agreement among the data exchange partners as to the number of digits in a year, until the documentation is fixed, there are no valid dates in Wikidata. After the documentation is fixed, all dates with other than 16 digits are invalid.
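A minimal sketch of the per-value step such a bot might perform, assuming timestamps of the form shown in this task (sign, year digits, then "-MM-DDT..."). The name `pad_year` and the regular expression are illustrative assumptions, not Wikibase code.

```python
import re

# Sign, one or more year digits, then the rest starting at the first "-".
_TIME_RE = re.compile(r"^([-+])(\d+)(-.*)$")


def pad_year(timestamp: str, digits: int = 16) -> str:
    """Return the timestamp with its year zero-padded to exactly `digits` digits."""
    match = _TIME_RE.match(timestamp)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {timestamp!r}")
    sign, year, rest = match.groups()
    # Drop whatever padding is present, keeping a single "0" for year zero.
    year = year.lstrip("0") or "0"
    if len(year) > digits:
        raise ValueError(f"year does not fit in {digits} digits: {timestamp!r}")
    return f"{sign}{year.zfill(digits)}{rest}"


print(pad_year("+00000001955-04-18T00:00:00Z"))
print(pad_year("+000000000343-12-06T00:00:00Z"))
```

Both values from the reproduction come out with the same year width, whatever padding they were stored with.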

I am unable to find any tool that promises to return the data in exactly the form it is stored in the Wikidata database. I did find one method of displaying the contents of an item; an example of the url is

https://www.wikidata.org/wiki/Special:EntityData/Q692.json

This causes the data for William Shakespeare to be displayed in a browser window in json format. The relevant data for the death date appears to be:

{"time":"+00000001616-05-03T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"},"type":"time"}},"type":"statement","rank":"normal"

Notice the year, 1616, has 11 digits. I'm not prepared to say if that is what is actually stored, or if some bit of software has formatted it for presentation in the json format.
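As a rough sketch, the year width can be checked mechanically instead of by eye. The snippet below parses only the value object quoted above; the helper name `year_digit_count` is hypothetical, and it measures the emitted JSON, not what is stored in the database.

```python
import json

# The time value object as quoted from Special:EntityData/Q692.json above.
raw = (
    '{"time":"+00000001616-05-03T00:00:00Z","timezone":0,"before":0,'
    '"after":0,"precision":11,'
    '"calendarmodel":"http://www.wikidata.org/entity/Q1985727"}'
)


def year_digit_count(time_value: dict) -> int:
    """Count the digits between the sign and the first '-' separator."""
    timestamp = time_value["time"]
    year_field = timestamp[1:].split("-", 1)[0]
    return len(year_field)


value = json.loads(raw)
print(year_digit_count(value))  # 11
```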

Uh, wait. What do you guys want to "fix" here? There is only one real bug here: Code that does weird substring calls with fixed positions. Why would anybody do that? Why do you think padding is required to be ISO compliant? It doesn't say anything about that, as far as I can see.

We had some legacy code (not necessarily PHP, also JavaScript) that padded the year to 11 digits. Other code pads the year to 16 digits. Note that such padding is not part of the ISO standard. Therefore my proposal is to drop the padding completely, not even 4 digits (which is what some people prefer).

In reply to thiemowmde: P570 (Date of death; the same argument applies to P569, Date of birth) has datatype TimeValue. The description of the datatype is at https://www.mediawiki.org/wiki/Wikibase/DataModel. The first element of the TimeValue structure is described as follows:

time (isotime): point in time, represented per ISO8601, they year always having 11 digits, the date always be signed, in the format +00000002013-01-01T00:00:00Z

I would take this to mean that there are methods for providers of data to present a time stamp in this format to the database to be stored, and there are methods for consumers of data to extract the timestamp in this format from the database, and the documentation serves as a specification so developers will know what they must accept and what they must provide. Invocation of ISO 8601 is a commitment to providers and consumers of data that the number of digits in the year will always be exactly 11 digits. If the year length is to be variable, "ISO 8601", "ISO", and similar words or abbreviations should be expunged entirely from Wikidata and all its code and documentation.

Any routines that handle these values internally can do what they want, but the interface should be clear.

I agree with the above.

Either we fix it so all dates presented by Wikibase are actually ISO 8601 compliant or just stop saying that it's 8601 compliant!

Uh, what? What about fixing the documentation? It's marked as "This document is a draft, and should not be assumed to represent the ultimate structure" anyway.

My original question is, unfortunately, not mentioned in your responses: What makes you think ISO strictly requires a fixed number of digits? Even if it suggests something like that, who says such a restriction needs to be uniform across all of Wikidata and can't be different for, for example, different items? I think it's our responsibility to define the borders of such a restriction and not something an ISO standard dictates, especially since it can't dictate the number of digits anyway.

I answered your question on Jan 26, 22:38, UTC. That comment contains a quote from ISO 8601. I suggest that any developer who is consuming or creating data that purports to comply with ISO 8601 should be in possession of that specification and have read it. There is currently a copy available at https://archive.org/details/pdfy-9p5vLWOVotIh-lDV but I'm not sure if that was made available in compliance with copyright laws, so it would be better to purchase a copy from ISO or one of their affiliates in your country. It costs over $100.

No, the question where a fixed number of digits is required is not answered.

Quote: "calendar year is, unless specified otherwise, represented by four digits". Great. We do specify otherwise: Between 1 and 16 digits (additionally, we may prefer padding to 4 digits for convenience). Done. This doesn't mean we can't call it an ISO time stamp.

ISO 8601 says "The interchange parties shall agree the additional number of digits in the time element year." I believe thiemowmde is incorrect in claiming that this can mean the data exchange partners can agree to a variable number of digits.

Other parts of the standard make it clear that the agreement is to the additional digits beyond 4 that are to be specified. Years with 1 to 3 digits are absolutely non-compliant in every case.

To understand where the standard is coming from, you should understand that it supports both a basic and an extended format. An example of a basic format with more than 4 digits for the year is given on page 27 of the standard: +0119850412. An example of an extended format for the same date is +011985-04-12.
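The two formats can be sketched as a small formatter, assuming the parties have agreed on a number of extra year digits beyond four. The function name and parameters are illustrative only and are not taken from the standard or from Wikibase.

```python
def format_expanded_date(year: int, month: int, day: int,
                         extra_digits: int = 2, extended: bool = True) -> str:
    """Format a date with a signed year of exactly 4 + extra_digits digits."""
    sign = "-" if year < 0 else "+"
    width = 4 + extra_digits
    year_part = f"{abs(year):0{width}d}"
    if len(year_part) != width:
        raise ValueError("year does not fit in the agreed number of digits")
    # The basic format has no separators; the extended format uses "-".
    sep = "-" if extended else ""
    return f"{sign}{year_part}{sep}{month:02d}{sep}{day:02d}"


print(format_expanded_date(11985, 4, 12, extended=False))  # +0119850412
print(format_expanded_date(11985, 4, 12))                  # +011985-04-12
print(format_expanded_date(1985, 4, 12))                   # +001985-04-12
```

The basic format has no separators at all, so only the agreed year width tells a reader where the year ends and the month begins.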

Page 14 makes it clear why it is absolutely mandatory to specify a fixed number of digits. In the examples on that page, it has been agreed to provide two additional digits, or six digits altogether, for the year. The year 1985 may be represented +001985. The century of the 1900s may be represented +0019; since it is agreed there are six digits in a year, the standard demands the recipient interpret +0019 as a case where the exact year is unknown, or a case where it is sufficient to know the century and the exact year is a don't care. If you don't know how many digits the year must contain, you can't tell the difference between a year and a century.
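The ambiguity can be illustrated with a small sketch, assuming (as in the page 14 examples) that a century is written with two digits fewer than the agreed year width. `interpret_expanded` is a hypothetical helper, not part of any ISO 8601 library.

```python
def interpret_expanded(value: str, year_digits: int):
    """Classify a signed digit string as a year or a century,
    given the agreed number of digits in a year."""
    sign, digits = value[0], value[1:]
    if sign not in ("+", "-") or not digits.isdigit():
        raise ValueError(f"not a signed digit string: {value!r}")
    number = int(digits) if sign == "+" else -int(digits)
    if len(digits) == year_digits:
        return ("year", number)
    if len(digits) == year_digits - 2:
        return ("century", number)
    raise ValueError(f"{value!r} fits neither a year nor a century")


print(interpret_expanded("+001985", 6))  # ('year', 1985)
print(interpret_expanded("+0019", 6))    # ('century', 19)
print(interpret_expanded("+0019", 4))    # ('year', 19)
```

The same string +0019 is a century under a six-digit agreement and the year 19 under a four-digit one, so the agreed width has to be known before parsing.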

Keep in mind that the whole point of using a standard for information interchange is to allow the use of any parser that correctly parses ISO 8601. Requiring the data consumer to write a new parser that parses a quasi-ISO 8601 Wikidata proprietary format defeats the whole idea of following a standard.

The +0019 example is irrelevant for what we do.

Code that does string.sub(d, 9, 18) is just wrong, no matter how you look at it, and can't be of any relevance for what we do. Think about it. What does it do if it processes the value -00000042000-01-01T00:00:00Z. It extracts "2000-01-01". Fail.
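The failure is easy to reproduce. The sketch below mirrors Lua's 1-based, inclusive string.sub(d, 9, 18) with the Python slice d[8:18] and applies it to the three values quoted in this task; `extract_fixed` is a hypothetical name for that call.

```python
def extract_fixed(d: str) -> str:
    """Python equivalent of Lua's string.sub(d, 9, 18)."""
    return d[8:18]


samples = [
    "+00000001955-04-18T00:00:00Z",   # 11-digit year: happens to work
    "+000000000343-12-06T00:00:00Z",  # 12-digit year: shifted by one
    "-00000042000-01-01T00:00:00Z",   # 5 significant digits: year truncated
]
for value in samples:
    print(value, "->", extract_fixed(value))
# +00000001955-04-18T00:00:00Z -> 1955-04-18
# +000000000343-12-06T00:00:00Z -> 00343-12-0
# -00000042000-01-01T00:00:00Z -> 2000-01-01
```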

Writing a proper parser is as trivial as it can be: /^[-+](\d+)-(\d+)-(\d+)T(\d+):(\d+):(\d+)/. We guarantee there will always be a sign character, we guarantee there will always be separation characters (-, T and :). We do not guarantee the time zone "Z" will always be there. We may support different time zones in the future. And we do not guarantee the year will always have the same fixed number of digits. We currently try to pad everything to 16 digits but we can't guarantee. This is not how Wikibase works.
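A sketch of that parser in Python, using the regular expression above with one adaptation: the sign is captured so that negative years can be returned. The name `parse_time` and the returned dictionary layout are assumptions made for illustration.

```python
import re

# Sign, year of any length, then month, day, hour, minute, second.
# Anything after the seconds (such as "Z") is deliberately ignored.
TIME_RE = re.compile(r"^([-+])(\d+)-(\d+)-(\d+)T(\d+):(\d+):(\d+)")


def parse_time(value: str) -> dict:
    """Parse a Wikibase-style timestamp without assuming a year width."""
    match = TIME_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {value!r}")
    sign = -1 if match.group(1) == "-" else 1
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[1:])
    return {
        "year": sign * year,
        "month": month,
        "day": day,
        "hour": hour,
        "minute": minute,
        "second": second,
    }


print(parse_time("+000000000343-12-06T00:00:00Z")["year"])  # 343
print(parse_time("-00000042000-01-01T00:00:00Z")["year"])   # -42000
```

Because the year is matched with \d+, the result does not depend on how many zeros the value was padded with.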

As for

The +0019 example is irrelevant for what we do.

It is somewhat relevant because the data consumer may be using a pre-written ISO 8601 parser that supports +0019 as a century for year length 6, and must know the year length in order to do so. If we emit data that can't be read by a standard-compliant parser we are non-compliant.

It seems abundantly clear we will not accept every possible ISO 8601 input. We should create a formal profile of ISO 8601 showing what we will accept. Our profile could be more relaxed on input, but not on output.

As for not making guarantees, if we violate our own documentation, the message to data consumers is we will emit something that vaguely resembles a date, you figure it out on a case-by-case basis. In other words, the only way to read a date is human inspection.

Please see my proposed ISO 8601 profile for Wikidata. I believe the points in that user page need to be addressed, even if we decide on a different result than I propose.

daniel subscribed.

All linked pull requests seem to be merged

Change 200845 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Update Special:ListDatatypes for TimeValue

https://gerrit.wikimedia.org/r/200845

Change 200845 merged by jenkins-bot:
Update Special:ListDatatypes for TimeValue

https://gerrit.wikimedia.org/r/200845