WDQS date handling produces errors for Julian dates
Open, Medium, Public

Description

WDQS holds dates as xsd:dateTime using the proleptic Gregorian calendar. - https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Time

Julian calendar dates entered into Wikidata are stored as such in JSON.

They are converted to proleptic Gregorian calendar dates for WDQS.

However, WDQS stores the original (Julian) calendar as the value for wikibase:timeCalendarModel.

It follows that a Julian date entered and displayed as expected in Wikidata - such as https://www.wikidata.org/wiki/Q16931292#P571 - 22 June 1498 - is in WDQS represented by the following value set:

simplevalue (ps:P571) - 1 July 1498
value (psv:P571/wikibase:timeValue) - 1 July 1498
calendar (psv:P571/wikibase:timeCalendarModel) - wd:Q1985786 (proleptic Julian calendar)

By any reasonable definition, this is an error. WDQS is representing the value as 1 July 1498 Julian, when, at best, it should be 1 July 1498 Gregorian, and ideally should be 22 June 1498 Julian.
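
To make the arithmetic behind this example concrete, here is a minimal Python sketch of the Julian to proleptic Gregorian conversion. It uses the standard integer formula for the Julian Day Number; the function names are illustrative only, and this is not claimed to be the code that Wikibase or WDQS actually runs. Python's `datetime.date` only covers years 1 to 9999, so the sketch does not handle BCE dates.

```python
from datetime import date

# Offset between a Julian Day Number (JDN) and Python's proleptic Gregorian
# ordinal: date(1, 1, 1).toordinal() == 1 corresponds to JDN 1721426.
JDN_MINUS_ORDINAL = 1721425


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a date given in the (proleptic) Julian calendar."""
    a = (14 - month) // 12          # 1 for January/February, otherwise 0
    y = year + 4800 - a
    m = month + 12 * a - 3          # March = 0 ... February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a Julian calendar date to the proleptic Gregorian calendar."""
    return date.fromordinal(julian_to_jdn(year, month, day) - JDN_MINUS_ORDINAL)


# The date from Q16931292#P571, entered in Wikidata as 22 June 1498 (Julian).
print(julian_to_gregorian(1498, 6, 22).isoformat())  # 1498-07-01
```

The result matches the value WDQS reports, which is why the calendar label attached to it matters so much for interpretation.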

I think this date handling needs a rethink, perhaps along the lines of:

BDD
given: Julian date in wikidata
when: WDQS reports on the date
then:

  • ps:Pnnn value should be the Julian date - 22 June 1498
  • psv:Pnnn/wikibase:timeValue should be the Julian date - 22 June 1498
  • psv:Pnnn/wikibase:timeCalendarModel should be wd:Q1985786 (proleptic Julian calendar)
  • psn:Pnnn/wikibase:timeValue should be the Gregorian date - 1 July 1498
  • psn:Pnnn/wikibase:timeCalendarModel should be wd:Q1985727 (proleptic Gregorian calendar)

Event Timeline

I don’t see the problem. The exact value is "1498-07-01T00:00:00Z"^^xsd:dateTime (query; the datatype is crucial). XSD date and time values follow ISO 8601, which requires the Gregorian or proleptic Gregorian calendar. So the datatype already tells you how to interpret the value, and the wikibase:timeCalendarModel is therefore the calendar model of the value before it was converted to xsd:dateTime. When the RDF exporter encounters a date that it can’t convert to xsd:dateTime, it emits a plain string instead; then, and only then, is the wikibase:timeCalendarModel also the calendar model of that string.

Wikibase or the query service aren’t “representing the value as 1 July 1498 Julian” – they’re just representing the value as some triples. It’s your interpretation of those triples that’s flawed, as far as I understand: “1 July 1498 (proleptic Gregorian), originally specified as Julian” would be a closer English rendition of them, I believe.

It might still be useful to also include the original, unconverted value in the RDF export (as a string, not an xsd:dateTime). But I don’t think there’s anything wrong with the current representation.

Hello. I just ran into the same issue.

For comparison, look at JSON instead: http://www.wikidata.org/entity/Q16931292.json
There, the "value" object contains the following fields:

time: "+1498-06-22T00:00:00Z"
calendarmodel: "http://www.wikidata.org/entity/Q1985786"

This means that JSON contains the original value, together with the calendar model to which the value corresponds.

I do understand the logic provided by @Lucas_Werkmeister_WMDE. However, as a developer, I missed the fact that in RDF, the wikibase:timeValue field contains the converted-to-Gregorian value instead of the value corresponding to the calendar - contrary to what JSON does.

As a developer, in certain situations, I'd like access to the original - unconverted - value, to faithfully extract the value entered in the user interface. This does not seem possible in RDF at the moment.
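
For illustration, here is a rough Python sketch of how a client could read the original value and its calendar from the JSON fields quoted above. The helper name `read_original_date` and the regular expression are my own assumptions; the two entity URIs are the ones mentioned in this task. It only looks at the date part of the string and does not interpret precision, so lower-precision values (where month or day may be 00) are returned as-is.

```python
import re

# Calendar model URIs mentioned in this task.
CALENDAR_MODELS = {
    "http://www.wikidata.org/entity/Q1985786": "proleptic Julian",
    "http://www.wikidata.org/entity/Q1985727": "proleptic Gregorian",
}

# Sign, year (any number of digits), month and day, up to the "T" separator.
TIME_PATTERN = re.compile(r"^([+-])(\d+)-(\d{2})-(\d{2})T")


def read_original_date(value: dict) -> tuple:
    """Return (year, month, day, calendar name) exactly as stored in the JSON."""
    match = TIME_PATTERN.match(value["time"])
    if match is None:
        raise ValueError(f"unrecognised time string: {value['time']!r}")
    sign, year, month, day = match.groups()
    signed_year = int(year) if sign == "+" else -int(year)
    calendar = CALENDAR_MODELS.get(value["calendarmodel"], "unknown")
    return signed_year, int(month), int(day), calendar


value = {
    "time": "+1498-06-22T00:00:00Z",
    "calendarmodel": "http://www.wikidata.org/entity/Q1985786",
}
print(read_original_date(value))  # (1498, 6, 22, 'proleptic Julian')
```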

I propose to add a field wikibase:timeValueCorrespondingToOriginalCalendar (better name to be discussed).

That addition would make sense to me too (maybe we should edit the task title and description). We should probably only define it for time values where that string differs from the string of the wikibase:timeValue, to save space; people who always wanted the original string could then use a snippet like this:

?timeValue wikibase:timeValue ?converted.
OPTIONAL { ?timeValue wikibase:timeValueOriginal ?original_. }
BIND(COALESCE(?original_, STR(?converted)) AS ?original)

That said, currently about 20% of time values are non-proleptic Gregorian – 71095 out of 339705 (query), a larger proportion than I expected – so maybe we should just add the triple to all time values and it wouldn’t take up that much extra space? @Gehel any thoughts on this?

A way to get access in WDQS to "original calendar" values would really be helpful.

At the moment, we're in an odd situation where the dates in WDQS are technically correct by ISO 8601 (I think), but most users interested in looking at historic data won't be familiar with the convention of ISO 8601, and will either use them without realising they're not being displayed in the calendar they expect, or spot that they're all wrong and get confused/irritated/etc. (Worse, people may change dates to be incorrectly marked as Gregorian in order to make them "show up" correctly. I haven't seen much of this, thankfully, but I am sure it does happen).

I don't think defaulting to Gregorian is a problem in and of itself, but we do need a way to bypass it. Ideally, I think, what WDQS would be able to do is:

  • if asked, display a date in its original calendar schema, and tell you what that calendar is (either as an additional value or with something like the WD superscript)
  • if asked, render a date in a specified calendar schema, either Julian or Gregorian, and tell you which one is being displayed (either as an additional value or with something like the WD superscript)

The first of these is the key problem, as it affects every pre-1582 date; for these, WDQS cannot currently display a human-readable date that is what a human would expect.

The second would be really helpful for periods when some countries used the Gregorian calendar and some did not, as there will be queries where you're legitimately expecting a mix of calendars in the responses and (depending on context) may want to standardise a timeline on Julian rather than Gregorian. But it's not as vital.

Gehel triaged this task as Medium priority. Sep 15 2020, 8:04 AM

After looking at this afresh, I think part of my first suggestion is moot: it is possible to get an indication of the calendar model in use for a given statement by using wikibase:timeCalendarModel (see e.g. https://w.wiki/5MPi for the "point in time" of the October Revolution, which currently has both Julian and Gregorian dates). I hadn't properly understood this.

However, the actual value given for both statements remains proleptic Gregorian - so you can't see the date as originally rendered for the user, and if you wanted to work out and display the Julian date for a user you would have to do something fancy like https://w.wiki/5MRj (hardcoding the offset based on the year). It would be great if this could be done server-side, though. I like the idea of wikibase:timeValueCorrespondingToOriginalCalendar (maybe just wikibase:timeValueJulian, since it is the only one we are likely to support for the foreseeable future?)
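
As a client-side stopgap, the reverse conversion can be done without a hardcoded per-year offset by going through the Julian Day Number. The following Python sketch is only an illustration under that assumption (the function name is made up, and it is not what the linked query does); it takes the proleptic Gregorian date that WDQS returns and recovers the Julian calendar date.

```python
from datetime import date

# date(1, 1, 1).toordinal() == 1 corresponds to Julian Day Number 1721426.
JDN_MINUS_ORDINAL = 1721425


def gregorian_to_julian(d: date) -> tuple:
    """Return (year, month, day) in the Julian calendar for a Gregorian date."""
    jdn = d.toordinal() + JDN_MINUS_ORDINAL
    c = jdn + 32082
    y = (4 * c + 3) // 1461         # index of the March-based Julian year
    e = c - (1461 * y) // 4         # day offset within that March-based year
    m = (5 * e + 2) // 153          # month index, March = 0 ... February = 11
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = y - 4800 + m // 10
    return year, month, day


# October Revolution: 7 November 1917 (Gregorian) is 25 October 1917 (Julian).
print(gregorian_to_julian(date(1917, 11, 7)))  # (1917, 10, 25)
# The example from the task description.
print(gregorian_to_julian(date(1498, 7, 1)))   # (1498, 6, 22)
```

A server-side property would still be preferable, since every consumer otherwise has to repeat this step and remember to check wikibase:timeCalendarModel first.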

This task was discussed in the Bug Triage Hour at the Wikidata Data Quality Days 2022:

  • Julian dates reflected in a very confusing way on WDQS.
  • Considered harmful to data-checking, data re-use, data de-duplication, and data round-tripping.

The task was also raised on Project Chat last month.

Suggestion from today's bug triage hour: create a new SERVICE that makes the date formatting and conversions easier by handling precision and calendar model instead of having to do it by hand in the query.

Also raised in the triage-hour discussion was ticket: T207705 "Implement the Extended Date/Time Format Specification" (EDTF)

EDTF (info) is an extension to ISO 8601 (specifically, part of ISO 8601-2:2019), developed by the Library of Congress with other bibliographic institutions, which defines a format for serialising imprecise or complex dates into strings.
It is now increasingly in use in the wild -- for example in the cataloguing data of GLAM institutions, especially library systems; in applications like Zotero; in communities such as the Citation Style Language community; and elsewhere. Giving Wikidata the facility to ingest, store, display, output, and round-trip EDTF dates would be of significant value in itself.

On the face of it (as @Jc3s5h has repeatedly noted on T207705), implementing EDTF doesn't necessarily help with the Julian/Gregorian difficulty. EDTF is an extension of xsd:dateTime, and like xsd:dateTime an edtf:EDTF date by both construction and definition represents a Gregorian date (or a more complex entity built from Gregorian dates).

However, as I suggest in this contribution to that ticket (Jul 20, 2022), there may be a way forward. As well as implementing dates with an RDF datatype ^^edtf:EDTF, we could instead give appropriate dates an alternate RDF datatype ^^wb:EDTF-J. These dates would be almost identical to the ^^edtf:EDTF dates -- in fact the string parts would be exactly identical, representing the same Gregorian day or same range of Gregorian days -- but the different ^^wb:EDTF-J datatype would represent a request, which the WDQS onscreen rendering could pick up on, to translate the date to the corresponding date in the Julian calendar and display that if possible. (This is similar to the meaning of wikibase:timeCalendarModel = Julian in a wikibase:time node, but the WDQS GUI will not usually have access to that.)

It occurs to me that the same approach could be used for xsd:dateTime dates too, changing the RDF dump so that eg wdt:P569 statements were written with an RDF type ^^wb:dateTime-J if one wanted to attach a request that they should be rendered as their Julian equivalent.

I think this might be slightly more involved to implement (though I could be wrong), as I think one would want to make sure that the ^^wb:dateTime-J dates were treated by Blazegraph internally in the same way as ^^xsd:dateTime dates (i.e. translated internally into milliseconds, and with functions like day(), month() and year(), subtraction, <, and > all returning the same results). With the magic of subclassing in Java it might all be possible without too much pain, so I think it could be worth investigating.
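
To illustrate the subclassing idea only, here is a small Python analogy (not Blazegraph or Java code; the class name `JulianRenderedDate` and its `render` method are hypothetical). The subclass stores and compares exactly like an ordinary proleptic Gregorian date, and differs only in how it asks to be displayed.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


class JulianRenderedDate(date):
    """A proleptic Gregorian date that asks to be displayed as Julian."""

    def render(self) -> str:
        # Gregorian ordinal -> Julian Day Number -> Julian calendar date.
        c = self.toordinal() + 1721425 + 32082
        y = (4 * c + 3) // 1461
        e = c - (1461 * y) // 4
        m = (5 * e + 2) // 153
        day = e - (153 * m + 2) // 5 + 1
        month = m + 3 - 12 * (m // 10)
        year = y - 4800 + m // 10
        return f"{day} {MONTHS[month - 1]} {year} (Julian)"


founded = JulianRenderedDate(1498, 7, 1)
print(founded == date(1498, 7, 1))                # True: same stored value
print(founded < date(1500, 1, 1))                 # True: ordering unchanged
print(founded.year, founded.month, founded.day)   # 1498 7 1
print(founded.render())                           # 22 June 1498 (Julian)
```

Whether a custom datatype can be made to behave this transparently inside Blazegraph is exactly the open question raised above.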

Otherwise, failing that, if we did implement ^^edtf:EDTF and ^^wb:EDTF-J as per T207705, that could give a way to allow WDQS to correctly render and properly indicate Gregorian and Julian dates, at least for statements with triples with those datatypes.