Implement the Extended Date/Time Format Specification
Open, Needs Triage, Public

Assigned To
None
Authored By
Pigsonthewing
Oct 22 2018, 9:22 PM

Description

We should adopt the "Extended Date/Time Format Specification" (EDTF) profile [1] for ISO 8601, which allows, for example, uncertain and vague dates, for use in properties with a "Point in time" datatype.

ISO 8601-2019, due in the middle of that year, is expected to support all of the features of EDTF.

Disclosure: I contributed, partly as a Wikimedian, to the draft of this specification.

Assigning to Lydia initially, for delegation as appropriate ;-)

[1] http://www.loc.gov/standards/datetime/edtf.html

Event Timeline

This specification is a non-starter because Wikidata supports both the Gregorian and Julian calendar while this spec only supports the Gregorian calendar.

I'm not sure why that would be a problem - maybe there's a problem with what is understood by "support"? Adding support for particular Gregorian dates formatted in a particular way wouldn't affect the ability to set Julian dates as well.

That said, even level 0 support requires time interval support. When I add a date as 1964/2008, the date field rejects it as malformed. Wikidata expects you to use two separate properties to add a date range, i.e. start time and end time. Overhauling this seems a very big design decision. The easiest way to accomplish this would be to support it in the front end with a Wikidata-specific gadget, but that's less than ideal.
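
To make the mismatch concrete, here is a minimal sketch of what such a front-end gadget would have to do with a level 0 interval: split the single EDTF string into the two values Wikidata currently stores separately. The function name and the dictionary keys are illustrative only, not part of any existing gadget or API.

```python
def split_edtf_interval(value: str) -> dict[str, str]:
    """Split an EDTF level 0 interval such as '1964/2008' into two endpoints.

    Hypothetical helper: the keys name the two Wikidata properties mentioned
    above ('start time' and 'end time'); nothing here talks to Wikidata.
    """
    parts = value.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"not a closed EDTF interval: {value!r}")
    start, end = parts
    return {"start time": start, "end time": end}


print(split_edtf_interval("1964/2008"))
```

Going the other way (joining the two stored values back into one string) is just as easy; the hard part discussed here is where that logic should live.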

This is also true for things like quarters and seasons where you're currently expected to add the qualifier refine date. Rolling both of those features back into the date type in the back end seems difficult.

However, adding support for it in the backend would definitely make things like adding bibliographic data easier, particularly with any APIs that implement this format. We considered implementing an earlier version for citoid but got pushback from citation template owners about supporting it, and it has been stalled since. If Wikidata supported it, that would be a big impetus to use it.

@Mvolz I'm not sure what you mean by front end and back end. In any case, there are several methods that may be used to put information into the database, and several that may be used to extract information from the database. Some of these methods like RDF follow standards set by external organizations and we lack the ability to change them.

Jc3s5h: This is now the third venue in which I have told you that there is absolutely no reason why we couldn't apply this only to Gregorian dates, as intended.

And I will hunt down every place you have proposed this and make sure this serious disadvantage is mentioned. EDTF is human-readable. Humans have been using the same notation, in many formats and languages for Julian and Gregorian dates for 400 years, with ISO 8601 being one of the rare exceptions. And many ISO 8601 users are ignorant of the no-Julian restriction and therefore make mistakes. This human factors situation is an excellent reason to reject EDTF.

ISO 8601 being one of the rare exceptions

As noted above: "ISO 8601-2019, due in the middle of that year, is expected to support all of the features of EDTF."

Furthermore, as you can read at the linked page: "the EDTF specification is included as a profile of ISO 8601."

My meaning was that nearly all notations, such as "July 1, 2018", "1 July 2018", "1 VII 2018" or "Solis Kandis Juliis MMXVIII", can refer to either the Gregorian or Julian calendar; this has been so for 400 years and that is what people are used to. ISO 8601 is one of the few notations that isn't supposed to be used with Julian dates; EDTF is another.

I accidentally stumbled upon this; I wasn't aware this had been proposed. Although I'm one of Wikidata's greatest fans, I've always found its current date/time system very limited, especially when it comes to modeling intervals and uncertainties in an easily machine-readable way. The current 'solution' with qualifiers is, IMO, less than ideal.

This proposal looks like a really good step in the right direction. Although I definitely see how updating the massive amounts of date/time data already in Wikidata to a newer and more refined format will be a PITA, I think it's a better solution for the long run.

A largely complete PHP library (limitations as specified) and Wikibase data type (MediaWiki extension), initially funded by the Luxembourg Ministry of Culture, is now available, with development led by Professional.wiki.

I've been trying it out on WBStack, where it was installed last night. Alas, the spec only includes dates for intervals, not times of day; and there are some issues to address (e.g. large ranges) - some of which have been fixed in a newer version of the software not yet on the site - but overall it looks promising.

The task above could be updated: ISO 8601-2:2019 effectively supported EDTF (https://en.wikipedia.org/wiki/ISO_8601).

I wonder if Epidosis meant "supplanted" rather than "supported". The two standards are similar in spirit, but different.

I oppose ISO 8601 because ISO charges a lot of money for copies of the standard, so many volunteer developers would attempt to use it without buying it, and instead rely on poor quality summaries, and get everything wrong.

I oppose ISO 8601

Your proposed alternative is..?

Perhaps we could find a standards organization that is more interested in having their work used correctly than raking in the bucks. Maybe the Internet Engineering Task Force.

From what I can see with a cursory search, the IETF seem happy to declare extensions to the standard rather than seeking to replace it. Time in a general sense is not their focus; timestamps for Internet protocols (which do not need to express the same vagaries as EDTF) and methods of setting time over a network are.

Regardless, the focus of this issue is to implement a specific format, which is both of interest to some institutions, and also offers features not available in the existing datatype.

There is a reasonable concern that not everyone will be able to contribute fully to development - I had my own problems when filing issues - but it probably does not pose the same issues as, say, integrating a proprietary database engine.

As a practical matter, I suspect those who care the most about correctness with respect to the more obscure parts of the standard are the most likely to have a copy of it, or resources to fund access to it.

As noted by @GreenReaper above, the Wikibase_EDTF wikibase extension should now give a solid basis for building EDTF support on wikibase, allowing EDTF strings to be input, validated, and rendered by the wikibase GUI, if we want to add properties with an EDTF datatype to Wikibase.

A separate but related issue is what adaptations could or should be made to WDQS to support EDTF. (And could this help with issue T159160 "Take account of date precision when displaying dates in WDQS GUI").
Here are some possible steps towards adding EDTF awareness to WDQS:

  1. Extend the GUI output code to recognise objects with rdf type ^^http://id.loc.gov/datatypes/edtf/EDTF (equivalent to ^^edtf:EDTF for short) as representing EDTF dates, and display columns of them in the output in an appropriately readable way for human consumption ("humanization"). Some of the internationalisation developed for the Commons Other date template and underlying Complex date Lua module, the Wikibase EDTF project, or other EDTF implementations may help with translations. The standard defines different levels of EDTF compliance; it might be reasonable to support only a subset of the standard initially. Functionality could be tested by building suitable strings in SPARQL queries, using the strdt() function to cast them to type edtf:EDTF.
  2. Once it is possible for the GUI to interpret and display EDTF dates, add EDTF-valued triples to the SPARQL triplestore and the RDF dump with new prefixes (perhaps wdtn: and psn:), for all existing statements involving dates, eg wd:Q692 wdtn:P569 [1564-04..1564-04-26]^^edtf:EDTF, with updates fed to the wdqs updater when the underlying wikibase statements are edited. The ability to use these forms in queries should substantially address the T159160 problem, without affecting existing queries.
  3. The new triples should not replace the existing wdt: and ps: triples, nor existing psv: triples with their wikibase:timevalued nodes, but exist alongside them.

    The EDTF format, even if normalised through eg the edtf.js javascript package used by Zotero, IMO contains too many ways to express more-or-less the same thing, which counts against efficient indexing or retrieval. Its ability to represent complex and approximate dates does not lend itself to the kind of range-safe fast indexing that can be used to retrieve exact dates (represented internally by Blazegraph as exact microseconds since a particular moment). Similarly, if one wants to find dates with a year-precision or a month-precision, the existing wikibase:timevalued nodes express that information directly as RDF statements which are all indexed, whereas to extract corresponding EDTF statements would require much slower assembling of the whole dataset then filtering with string operations and/or regexes. Even edtf shops are trying to find good ways to model the format in RDF (example). Given that we already now have a quite developed model for complex dates as statements, IMO it would not make sense to give it up (contra @Spinster?). Also I suspect there are areas where it achieves more nuance and exactness than the current version of EDTF.

    Instead I would suggest that we keep the existing data model on wikidata as the primary way of representing complex dates. I would suggest that the only EDTF-valued property we should have would be a single "EDTF date stated as". Input should be allowed eg of the form P571 inception = somevalue, qualifier: EDTF date stated as = ... Bots should then translate this into wikidata dates and qualifiers, moving the EDTF date stated as = ... qualifier to the reference when this is done. EDTF-valued wdtn: and psn: triples should be constructed from the wikidata statement and its qualifiers, to be accessible via SPARQL, LDF, rdf dumps, or an API request. This would allow data to be edited and round-tripped back to institutions, and input to be compared with output, while maintaining a single preferred unified model in wikidata for representing complex dates.
  4. At the present time, with our use of Blazegraph perhaps approaching end-of-life, an aversion to implementing new services or functions specific to that platform is understandable. However I believe an exception should be made for a pair of functions to return as an xsd:dateTime the minimum and the maximum date that a given edtf:EDTF value could represent (similar to equivalent functionality found in edtf.js, and I think also a number of other edtf implementations). This would allow the results of a query outputting an edtf:EDTF value to be easily ORDERed, by the earliest possible date, or the latest possible date, or perhaps the midpoint of the two, allowing values with eg 'century' precision to be returned in their proper place as desired, without having to implement non-trivial < and > comparisons between edtf values. It would also make it easy for queries to ask whether the ranges of two edtf values overlap (and are therefore compatible); or to filter for an edtf range including a particular date. It would also be useful to be able to cast an edtf:EDTF value to an xsd:dateTime (i.e. overload the xsd:dateTime() function), if it represents a particular day (or return an unbound value otherwise). This should require no more than just padding such an edtf:EDTF value with a few more zeros.
  5. One further issue, as @Jc3s5h notes above, is that by definition edtf:EDTF values represent Gregorian dates, and only ever Gregorian dates (or ranges of them). But we may want to represent Julian dates. For example, the typically given date for Shakespeare's birth that I used as an example above is actually a Julian date, so his correct edtf date is actually different to what I quoted above. Already with xsd:dateTime dates this gives rise to T246731 "WDQS date handling produces errors for Julian dates", significant confusion, and difficulty with input-output comparison and data round-tripping.

    For the new EDTF date triples there may be a workable way forward on this, if for date statements with a wikibase:timeCalendarModel = Julian in their wikibase:time node(s) we were to give the new wdtn: and psn: rdf triples a datatype ^^wb:EDTF-J rather than ^^edtf:EDTF. The string parts of the two values would be identical for a particular date, representing a Gregorian complex date, and could be acted on in exactly the same way by eg the maximum and minimum functions, so they would sort together appropriately. But the ^^wb:EDTF-J datatype could be picked up by the query-service GUI as a request to convert and format the date as a Julian date for display (marked as such), allowing Julian dates to be entered, displayed on wikibase, and displayed on WDQS all as Julian dates, without the confusion discussed in T246731. Some issues might remain: for example, representing Shakespeare's birthday month of (Julian) April 1564 as (Gregorian) April 1564 with a flag to "please convert to Julian" is not a perfect solution, and may cause anomalies when it comes to calculated inclusions and overlaps -- but it would be a close mirror of how such Julian dates are currently represented on Wikidata. Also query writers would find that dates in downloadable TSV files etc would still be Gregorian (presumably), albeit nicely displayed as Julian on screen. A partial workaround for that might be to offer a julian() function that query-writers could call, that would be able to translate a Gregorian xsd:dateTime into a Julian string suitable for their spreadsheets, onward analysis etc. An algorithm for such a converter can be found here; and also presumably (perhaps with more generality) within the existing Wikibase code; making it available from SPARQL would be useful.
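
As a rough illustration of the minimum/maximum functions proposed in step 4, the sketch below computes the earliest and latest day covered by a simple EDTF value, then uses those bounds for ordering and overlap tests. It is Python rather than a Blazegraph extension, it handles only plain YYYY, YYYY-MM and YYYY-MM-DD values and "/" intervals between them, and the names edtf_bounds and edtf_overlap are made up for this example; they are not the API of edtf.js or of any existing implementation.

```python
import calendar
import re
from datetime import date

# Plain year, year-month, or year-month-day only (an assumed, deliberately small subset).
_SIMPLE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def _bounds_single(value: str) -> tuple[date, date]:
    """Earliest and latest day covered by one simple date string."""
    match = _SIMPLE.match(value)
    if match is None:
        raise ValueError(f"unsupported EDTF value in this sketch: {value!r}")
    year = int(match.group(1))
    if match.group(2) is None:  # year precision
        return date(year, 1, 1), date(year, 12, 31)
    month = int(match.group(2))
    if match.group(3) is None:  # month precision
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    day = date(year, month, int(match.group(3)))  # day precision
    return day, day


def edtf_bounds(value: str) -> tuple[date, date]:
    """Minimum and maximum Gregorian date a value (or 'a/b' interval) could represent."""
    parts = value.split("/")
    if len(parts) > 2:
        raise ValueError(f"too many '/' in {value!r}")
    earliest = _bounds_single(parts[0])[0]
    latest = _bounds_single(parts[-1])[1]
    if earliest > latest:
        raise ValueError(f"interval runs backwards: {value!r}")
    return earliest, latest


def edtf_overlap(a: str, b: str) -> bool:
    """True if the date ranges of two values share at least one day."""
    a_min, a_max = edtf_bounds(a)
    b_min, b_max = edtf_bounds(b)
    return a_min <= b_max and b_min <= a_max


# Order mixed-precision values by their earliest possible date.
values = ["1600", "1564-04-26", "1564"]
print(sorted(values, key=lambda v: edtf_bounds(v)[0]))
```

Sorting by the second element of the tuple instead gives the "latest possible date" ordering. Approximate, uncertain, and masked values are deliberately left out of this sketch.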

With EDTF now increasingly in use in the wild -- eg in the cataloguing data of GLAM institutions, especially library systems; in applications like eg Zotero; and eg in the Citation Style Language community -- and also with the availability of the Wikibase_EDTF wikibase extension on the wikibase side to build on, an ability for wikidata to be able to ingest, store, display, output, and round-trip EDTF dates would now seem to be very timely. If it could also help us get round our (IMO acute) T159160 date precision and T246731 Julian date WDQS display difficulties that could be a most excellent bonus. The steps above I believe are realistic and achievable, but I think could make a lot of difference. Given that external support made the Wikibase EDTF extension possible, it is not impossible that support might be discoverable on the wdqs side too.

RE:request for feedback on #3:

In the typical case, I would want the user interface to show the EDTF value and make it easy to enter dates in that format. In most cases, a user should just be able to input dates without worrying about qualifiers. The user should not have to worry about Wikidata having its own time system that differs from the ISO-supported EDTF.

As far as what happens on the backend, I can see that it might sometimes make sense to store two values. For example, if we have dates in the Julian calendar, storing each date both in the Julian calendar and in Gregorian EDTF would make it easier to sort dates while still being able to display the Julian date.
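
Keeping a Julian value alongside a Gregorian one depends on a reliable conversion between the two. A minimal sketch follows, using the standard integer arithmetic via the Julian Day Number. The function names are made up for this example, and this is not the converter used by Wikibase.

```python
def _to_jdn(year: int, month: int, day: int, julian: bool) -> int:
    """Julian Day Number for a date in the Julian or (proleptic) Gregorian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4
    if julian:
        return jdn - 32083
    return jdn - y // 100 + y // 400 - 32045


def _from_jdn(jdn: int, julian: bool) -> tuple[int, int, int]:
    """Inverse of _to_jdn: (year, month, day) in the requested calendar."""
    if julian:
        b = 0
        c = jdn + 32082
    else:
        a = jdn + 32044
        b = (4 * a + 3) // 146097
        c = a - (146097 * b) // 4
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day


def julian_to_gregorian(year: int, month: int, day: int) -> tuple[int, int, int]:
    return _from_jdn(_to_jdn(year, month, day, julian=True), julian=False)


def gregorian_to_julian(year: int, month: int, day: int) -> tuple[int, int, int]:
    return _from_jdn(_to_jdn(year, month, day, julian=False), julian=True)


# The Julian date 1564-04-26 used as an example earlier in the thread:
print(julian_to_gregorian(1564, 4, 26))  # (1564, 5, 6)
```

With a pair of functions like these, the Gregorian value can drive sorting while the Julian value is regenerated for display.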

I believe that the longer Wikidata goes without fixing this issue, the more errors people will make when entering dates, because they are not thinking about how Wikidata's idea of time differs from the ISO standard.

I believe an exception should be made for a pair of functions to return as an xsd:dateTime the minimum and the maximum date that a given edtf:EDTF value could represent (similar to equivalent functionality found in edtf.js,and I think also a number of other edtf implementations). This would allow the results of a query outputting an edtf:EDTF value to be easily ORDERed, by the earliest possible date, or the latest possible date, or perhaps the midpoint of the two...

I'd like to highlight this particular function as I also believe it to be useful. There was discussion in the extension itself about adding an xsd:date for the start and/or end of an interval, but others felt that this would not be correct as it is not actually a point in time. Still, it would be very useful for things like event schedules (I am not sure that middle would be suitable, since it is not a time specified by the interval at all, but others might disagree; perhaps it could be achieved easily by adding both together and dividing by two).

Thanks everyone and especially @Jheald for the valuable info.

But what EDTF levels are we talking here? Because higher EDTF levels do not translate to intervals. When you start masking individual digits (or yyyy-mm-dd components), it becomes discontinuous.

So "minimum and the maximum date that a given edtf:EDTF value could represent" becomes only a rough guide, and then you need to pore over the range and check painstakingly ...
At what granularity? When you combine with "exponential years", it becomes very difficult.
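
The discontinuity is easy to demonstrate. The sketch below assumes "X" masks a single digit in a full YYYY-MM-DD pattern and enumerates the days that actually match, alongside the span between the minimum and maximum. The helper name is made up, and brute-force enumeration like this is only practical for a handful of masked digits.

```python
from datetime import date
from itertools import product


def expand_masked(pattern: str) -> list[date]:
    """Every valid date matching a YYYY-MM-DD pattern where 'X' is any digit.

    Illustrative brute force: the number of candidates grows tenfold per 'X'.
    """
    slots = ["0123456789" if ch == "X" else ch for ch in pattern]
    matches = []
    for combo in product(*slots):
        try:
            matches.append(date.fromisoformat("".join(combo)))
        except ValueError:
            continue  # e.g. month 13 or day 31 in a 30-day month
    return matches


matches = expand_masked("156X-12-25")
span_days = (matches[-1] - matches[0]).days + 1
print(len(matches), "matching days within a span of", span_days, "days")
```

Here the minimum and maximum bracket several thousand days, yet only ten of them satisfy the pattern, so an overlap test based on the bounds alone would report false positives for almost every date in between.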