WDQS date handling produces errors for Julian dates
Open, Needs Triage · Public

Description

WDQS holds dates as xsd:dateTime using the proleptic Gregorian calendar. - https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Time

Julian calendar dates entered into Wikidata are stored as such in JSON.

They are converted to proleptic Gregorian calendar dates for WDQS.

However, WDQS stores the original (Julian) calendar as the value of wikibase:timeCalendarModel.

It follows that a Julian date entered and displayed as expected in Wikidata - such as https://www.wikidata.org/wiki/Q16931292#P571 - 22 June 1498 - is in WDQS represented by the following value set:

simplevalue (ps:P571) - 1 July 1498
value (psv:P571/wikibase:timeValue) - 1 July 1498
calendar (psv:P571/wikibase:timeCalendarModel) - wd:Q1985786 (proleptic Julian calendar)

By any reasonable definition, this is an error. WDQS is representing the value as 1 July 1498 Julian, when, at best, it should be 1 July 1498 Gregorian, and ideally should be 22 June 1498 Julian.
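The conversion described above can be reproduced outside WDQS. The following is a minimal sketch, not the code Wikibase itself uses: it goes through the Julian Day Number with a standard integer formula and relies on Python's `datetime.date`, which is proleptic Gregorian. The function names are illustrative assumptions.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal (0001-01-01 is day 1)
# and the Julian Day Number of the same calendar day.
ORDINAL_TO_JDN = 1721425


def julian_to_jdn(year, month, day):
    """Julian Day Number of a date given in the (proleptic) Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to a proleptic Gregorian date (year >= 1)."""
    return date.fromordinal(julian_to_jdn(year, month, day) - ORDINAL_TO_JDN)


# The inception date from Q16931292#P571, entered as 22 June 1498 (Julian).
print(julian_to_gregorian(1498, 6, 22))  # 1498-07-01
```

This matches the value set listed above: the Julian date 22 June 1498 corresponds to 1 July 1498 in the proleptic Gregorian calendar.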

I think this date handling needs a rethink, perhaps along the lines of:

BDD
given: Julian date in wikidata
when: WDQS reports on the date
then:

  • ps:Pnnn value should be the Julian date - 22 June 1498
  • psv:Pnnn/wikibase:timeValue should be the Julian date - 22 June 1498
  • psv:Pnnn/wikibase:timeCalendarModel should be wd:Q1985786 (proleptic Julian calendar)
  • psn:Pnnn/wikibase:timeValue should be the Gregorian date - 1 July 1498
  • psn:Pnnn/wikibase:timeCalendarModel should be wd:Q1985727 (proleptic Gregorian calendar)

Event Timeline

Restricted Application added a subscriber: Aklapper.

I don’t see the problem. The exact value is "1498-07-01T00:00:00Z"^^xsd:dateTime (query – the datatype is crucial). XSD date and time values follow ISO 8601, which requires the Gregorian or proleptic Gregorian calendar. So the datatype already tells you how to interpret the value, and the wikibase:timeCalendarModel is therefore the calendar model of the value before it was converted to xsd:dateTime. When the RDF exporter encounters a date that it can’t convert to xsd:dateTime, it emits a plain string instead – then, and only then, is the wikibase:timeCalendarModel also the calendar model of that string.

Wikibase or the query service aren’t “representing the value as 1 July 1498 Julian” – they’re just representing the value as some triples. It’s your interpretation of those triples that’s flawed, as far as I understand: “1 July 1498 (proleptic Gregorian), originally specified as Julian” would be a closer English rendition of them, I believe.

It might still be useful to also include the original, unconverted value in the RDF export (as a string, not an xsd:dateTime). But I don’t think there’s anything wrong with the current representation.
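The reading rule from the comment above can be written down as a small function. This is only a sketch of that interpretation, with a hypothetical function name and a hand-picked label table; it does not query WDQS or reflect any Wikibase API.

```python
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"

# The two calendar model items mentioned in this task.
CALENDAR_LABELS = {
    "http://www.wikidata.org/entity/Q1985786": "proleptic Julian",
    "http://www.wikidata.org/entity/Q1985727": "proleptic Gregorian",
}


def describe_time_value(lexical, datatype, calendar_model):
    """Render a (timeValue, timeCalendarModel) pair as an English phrase."""
    original = CALENDAR_LABELS.get(calendar_model, calendar_model)
    if datatype == XSD_DATETIME:
        # An xsd:dateTime literal is always proleptic Gregorian; the calendar
        # model only records how the value was originally specified.
        return f"{lexical} (proleptic Gregorian), originally specified as {original}"
    # A plain string was not converted, so the calendar model applies to it.
    return f"{lexical} ({original}, unconverted)"


print(describe_time_value(
    "1498-07-01T00:00:00Z",
    XSD_DATETIME,
    "http://www.wikidata.org/entity/Q1985786",
))
```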

Hello. I just ran into the same issue.

For comparison, look at JSON instead: http://www.wikidata.org/entity/Q16931292.json
There, the "value" object contains the following fields:

time: "+1498-06-22T00:00:00Z"
calendarmodel: "http://www.wikidata.org/entity/Q1985786"

This means that JSON contains the original value, together with the calendar model to which the value corresponds.
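As a rough illustration of reading those two JSON fields, here is a minimal sketch. It assumes only the `time` and `calendarmodel` keys shown above; the function name is hypothetical, and the timestamp is parsed with a regular expression because of the leading sign.

```python
import re

# Signed year of any length, then month and day, as in "+1498-06-22T00:00:00Z".
TIME_RE = re.compile(r"^([+-]\d+)-(\d{2})-(\d{2})T")


def parse_json_time(value):
    """Return (year, month, day, calendar QID) from a JSON time value object."""
    match = TIME_RE.match(value["time"])
    if match is None:
        raise ValueError(f"unexpected time format: {value['time']!r}")
    year, month, day = (int(group) for group in match.groups())
    calendar_qid = value["calendarmodel"].rsplit("/", 1)[-1]
    return year, month, day, calendar_qid


value = {
    "time": "+1498-06-22T00:00:00Z",
    "calendarmodel": "http://www.wikidata.org/entity/Q1985786",
}
print(parse_json_time(value))  # (1498, 6, 22, 'Q1985786')
```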

I do understand the logic provided by @Lucas_Werkmeister_WMDE. However, as a developer, I missed the fact that in RDF, the wikibase:timeValue field contains the converted-to-Gregorian value instead of the value corresponding to the calendar - contrary to what JSON does.

As a developer, in certain situations, I'd like access to the original - unconverted - value, to faithfully extract the value entered in the user interface. This does not seem possible in RDF at the moment.

I propose to add a field wikibase:timeValueCorrespondingToOriginalCalendar (better name to be discussed).

That addition would make sense to me too (maybe we should edit the task title and description). We should probably only define it for time values where that string differs from the string of the wikibase:timeValue, to save space; people who always wanted the original string could then use a snippet like this:

?timeValue wikibase:timeValue ?converted.
OPTIONAL { ?timeValue wikibase:timeValueOriginal ?original_. }
BIND(COALESCE(?original_, STR(?converted)) AS ?original)

That said, currently about 20% of time values are non-proleptic Gregorian – 71095 out of 339705 (query), a larger proportion than I expected – so maybe we should just add the triple to all time values and it wouldn’t take up that much extra space? @Gehel any thoughts on this?

A way to get access in WDQS to "original calendar" values would really be helpful.

At the moment, we're in an odd situation where the dates in WDQS are technically correct by ISO 8601 (I think), but most users interested in looking at historic data won't be familiar with the convention of ISO 8601, and will either use them without realising they're not being displayed in the calendar they expect, or spot that they're all wrong and get confused/irritated/etc. (Worse, people may change dates to be incorrectly marked as Gregorian in order to make them "show up" correctly. I haven't seen much of this, thankfully, but I am sure it does happen).

I don't think defaulting to Gregorian is a problem in and of itself, but we do need a way to bypass it. Ideally, I think, what WDQS would be able to do is:

  • if asked, display a date in its original calendar schema, and tell you what that calendar is (either as an additional value or with something like the WD superscript)
  • if asked, render a date in a specified calendar schema, either Julian or Gregorian, and tell you which one is being displayed (either as an additional value or with something like the WD superscript)

The first of these is the key problem, as it affects every pre-1582 date; for these, WDQS cannot currently display a human-readable date that is what a human would expect.

The second would be really helpful for periods where some countries used the Gregorian calendar and some didn't, as there will be queries where you're legitimately expecting a mix of calendars in the responses and (depending on context) may want to standardise a timeline on Julian rather than Gregorian. But it's not as vital.

Gehel triaged this task as Medium priority.Sep 15 2020, 8:04 AM

After looking at this afresh I think part of my first suggestion is moot: it is possible to get an indication of the calendar model in use for a given statement by using wikibase:timeCalendarModel (see eg https://w.wiki/5MPi for the "point in time" of the October Revolution, which currently has both Julian and Gregorian dates). I hadn't properly understood this.

However, the actual value given for both statements remains proleptic Gregorian - so you can't see the date as originally rendered for the user, and if you wanted to work out and display the Julian date for a user you would have to do something fancy like https://w.wiki/5MRj (hardcoding the offset based on the year). It would be great if this could be done server-side, though. I like the idea of wikibase:timeValueCorrespondingToOriginalCalendar (maybe just wikibase:timeValueJulian since it is the only one we are likely to support for the foreseeable future?)
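For readers doing this client-side, the "offset based on the year" idea can be sketched as below. This is not the linked query, just an illustration with hypothetical function names. The century formula is the usual one for the Julian/Gregorian difference; this simple version is only reliable away from the first weeks of century years such as 1700, 1800 and 1900, where the offset changes part-way through the year.

```python
from datetime import date, timedelta


def julian_offset_days(gregorian_year):
    """Days by which the Gregorian calendar runs ahead of the Julian one."""
    return gregorian_year // 100 - gregorian_year // 400 - 2


def gregorian_to_julian_approx(gregorian):
    """Approximate Julian (year, month, day) for a proleptic Gregorian date.

    Assumption: the date is not near the start of a century year, where the
    offset switches and a day-count-based conversion is needed instead.
    """
    shifted = gregorian - timedelta(days=julian_offset_days(gregorian.year))
    return shifted.year, shifted.month, shifted.day


# October Revolution: 7 November 1917 (Gregorian) is 25 October 1917 (Julian).
print(gregorian_to_julian_approx(date(1917, 11, 7)))  # (1917, 10, 25)
```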

This task was discussed in the Bug Triage Hour at the Wikidata Data Quality Days 2022:

  • Julian dates reflected in a very confusing way on WDQS.
  • Considered harmful to data-checking, data re-use, data de-duplication, and data round-tripping.

The task was also raised on Project Chat last month:

Suggestion from today's bug triage hour: create a new SERVICE that makes the date formatting and conversions easier by handling precision and calendar model instead of having to do it by hand in the query.

Also raised in the triage-hour discussion was ticket: T207705 "Implement the Extended Date/Time Format Specification" (EDTF)

EDTF (info) is an extension to ISO 8601 (specifically, part of ISO 8601-2:2019), developed by the Library of Congress with other bibliographic institutions, which defines a format for serialising imprecise or complex dates into strings.
It is now increasingly in use in the wild -- for example in the cataloguing data of GLAM institutions, especially library systems; in applications like Zotero; in communities such as the Citation Style Language community; and elsewhere. Giving wikidata the facility to be able to ingest, store, display, output, and round-trip EDTF dates would be of significant value in itself.

On the face of it (as @Jc3s5h has repeatedly noted on T207705), implementing EDTF doesn't necessarily help with the Julian/Gregorian difficulty. EDTF is an extension of xsd:dateTime, and like xsd:dateTime an edtf:EDTF date by both construction and definition represents a Gregorian date (or a more complex entity built from Gregorian dates).

However, as I suggest in this contribution to that ticket (Jul 20, 2022), there may be a way forward. As well as implementing dates with an RDF datatype ^^edtf:EDTF, we could also instead give appropriate dates an alternate RDF datatype ^^wb:EDTF-J. These dates would be almost identical to the ^^edtf:EDTF dates -- in fact the string parts would be exactly identical, representing the same Gregorian day or same range of Gregorian days -- but the different ^^wb:EDTF-J datatype would represent a request, that the wdqs onscreen rendering could pick up on, to translate the date to the corresponding date in the Julian calendar and display that if possible. (Similar to the meaning of wikibase:timeCalendarModel = Julian in a wikibase:time node; but the wdqs gui will not usually have access to that).

It occurs to me that the same approach could be used for xsd:dateTime dates too, changing the RDF dump so that eg wdt:P569 statements were written with an RDF type ^^wb:dateTime-J if one wanted to attach a request that they should be rendered as their Julian equivalent.

I think this might be slightly more involved to implement (though I could be wrong), as I think one would want to make sure that the ^^wb:dateTime-J dates were treated by Blazegraph internally in the same way as ^^xsd:dateTime dates (ie translated internally into milliseconds, and with functions like day(), month() and year(), subtraction, <, and > all returning the same results). But I could be wrong, and with the magic of subclassing in Java it might all be possible without too much pain, so I think it could be worth investigating.
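To make the subclassing idea concrete, here is a small sketch in Python rather than Java. It has nothing to do with Blazegraph's actual internals; the class name is hypothetical. The stored value stays proleptic Gregorian, so comparison, subtraction and the year/month/day accessors behave exactly as for a plain date, and only the string rendering switches to the Julian calendar.

```python
from datetime import date


class JulianRenderedDate(date):
    """A proleptic Gregorian date that asks to be displayed as Julian."""

    def julian_ymd(self):
        # Julian Day Number of this day, then a standard integer algorithm
        # for turning a day number into a Julian calendar date.
        jdn = self.toordinal() + 1721425
        e = 4 * (jdn + 1401) + 3
        h = 5 * ((e % 1461) // 4) + 2
        day = (h % 153) // 5 + 1
        month = (h // 153 + 2) % 12 + 1
        year = e // 1461 - 4716 + (14 - month) // 12
        return year, month, day

    def __str__(self):
        year, month, day = self.julian_ymd()
        return f"{year:04d}-{month:02d}-{day:02d} (Julian)"


d = JulianRenderedDate(1498, 7, 1)
print(d)                       # 1498-06-22 (Julian)
print(d.isoformat())           # 1498-07-01
print(d == date(1498, 7, 1))   # True
```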

Otherwise, failing that, if we did implement ^^edtf:EDTF and ^^wb:EDTF-J as per T207705, that could give a way to allow WDQS to correctly render and properly indicate Gregorian and Julian dates, at least for statements with triples with those datatypes.

In my opinion this is a serious problem, since it falsifies a significant share of the dates extracted from Wikidata items through queries. Native Julian dates cannot be extracted; they are available only as converted Gregorian dates (and the conversion, being automatic, can sometimes be faulty). The proposed solution, viz. keeping the Julian date in psv: and adopting psn: for the automatic Gregorian conversion, seems very good to me.
If there is no objection, I think this ticket should be raised to high priority.

Epidosis raised the priority of this task from Medium to High.Jul 12 2023, 1:05 PM
Delane13 raised the priority of this task from High to Needs Triage.Aug 8 2023, 1:38 PM

When running a query that collects dates, such as this one: https://w.wiki/6rSi (it collects the dates of death of accused witches), date values that use the Julian calendar in Wikidata are converted to the Gregorian calendar in the Wikidata Query Service results. This conversion adds 10 days in the query results: for example, if the date of death is 4 January 1647 (Julian calendar) on the Wikidata item (Q43395584), it is displayed in the query results as 1647-01-14T00:00:00Z.

This means lots of historical dates are being misrepresented through queries, and a modern calendar is being retrofitted onto historic temporal data. There is no way to extract Julian dates from Wikidata.

I am working on this website https://witches.is.ed.ac.uk/ which visualises data from the Survey of Scottish Witchcraft that has been added to Wikidata. It uses queries to extract information from Wikidata that is then visualised on our site. When this conversion takes place, the dates are misunderstood by our users. For example, the accused witch Isobel Gowdie lived and died under the Julian calendar, not the Gregorian calendar.

I have also been going through the process of checking the data against the original Survey of Scottish Witchcraft, and this has added a lot of difficulty. I'm sure that would be the case for lots of people working on similar projects and on data checking in general.

I think this is quite a serious problem for these reasons so I have changed the priority from Medium to High.

I won't take issue with Delane13 changing the priority. I do think the general case for visualizing data in time is that the data about the events was originally recorded in a mixture of calendars, and usually there will be a desire to retrieve the calendar information so all the results of a query are in the same calendar so that they can be compared to each other. I think it is commonplace among researchers to have to convert dates themselves into their desired calendar and format. When large amounts of data are being handled, I think researchers will need automated tools for calendar conversions.

In our case, all our dates are from the Julian calendar. This means that when our site pulls the dates of Scottish witchcraft investigations from Wikidata, they are automatically converted to Gregorian and displayed as such, even though they are recorded in the Julian calendar. We could write a script to convert them back to the Julian calendar, but this seems a) an unnecessary extra step and b) additionally complicated, given that the automatic conversion only seems to apply to dates with day precision (DD-MM-YY), and not to those with month precision (MM-YY, which is still returned as a full date, e.g. 1644-07-01T00:00:00Z) or year precision (YY, which is returned in the format 1662-01-01T00:00:00Z). So the script would also need to handle precision, as the thousands of dates we are displaying have different precisions (with historical dates such as these, we sometimes only have a year to go by, or a month and a year, though happily many also have full day-month-year precision). The format that the queries return gives no indication of the precision, so it is hard to distinguish when the precision is MM-YY and when the date is actually the first of the month (01-MM-YY). This makes it quite hard for a script to handle. Any input on a way forward would be welcome, as at the moment we are trying to work out how to undo the automatic conversion (and handle all the precision exceptions) and display our dates in the Julian calendar as recorded.

@Delane13 It is possible to do this conversion within the SPARQL service - approximately, you would want something like this query, tweaking the calendar offsets as needed. It is a mess, but it does seem to work. Note that it's possible to retrieve the precision, it's just not default-displayed (much like calendars...)
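Once the precision is retrieved alongside the value (in the RDF export it is exposed as wikibase:timePrecision), telling a month-precision date from a genuine first-of-the-month date becomes a simple lookup. The sketch below uses the Wikibase precision codes 9 (year), 10 (month) and 11 (day); the function name is hypothetical, and precisions coarser than a year are simply shown as the year.

```python
PRECISION_YEAR, PRECISION_MONTH, PRECISION_DAY = 9, 10, 11


def format_for_precision(timestamp, precision):
    """Trim a timestamp such as '1644-07-01T00:00:00Z' to its real precision."""
    date_part = timestamp.split("T", 1)[0]
    # rsplit keeps a leading minus sign (BCE years) attached to the year.
    year, month, day = date_part.rsplit("-", 2)
    if precision >= PRECISION_DAY:
        return f"{year}-{month}-{day}"
    if precision == PRECISION_MONTH:
        return f"{year}-{month}"
    # Year precision, or anything coarser (decade, century, ...).
    return year


print(format_for_precision("1644-07-01T00:00:00Z", PRECISION_MONTH))  # 1644-07
print(format_for_precision("1662-01-01T00:00:00Z", PRECISION_YEAR))   # 1662
print(format_for_precision("1647-01-14T00:00:00Z", PRECISION_DAY))    # 1647-01-14
```

Going by the observation in the comment above, a conversion back to Julian would then only be attempted for values with day precision.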

Update on these numbers, as the topic came up again during the Wikidata meetup at Wikimania 2024: there are now 432783 time values in total, 124252 of which (28.7%) use a calendar other than proleptic Gregorian. (Same query as above.) Compared to the total number of triples, 16200365876 as of this writing, either number is IMHO completely insignificant (0.00267%). So I would be inclined to just add the original time stamp string to all time values, as it makes queries simpler.

(Note that all the above analysis counts time stamps or calendar models of distinct time values, not time-valued statements / qualifiers / references. Many of these time values are going to be used more than once, with different statements pointing to the same full value node. However, since the extra triple would only be added once per time value, even if the time value is used more than once, I think the above numbers are the ones we should worry about, and counting how many times each value is used would be a distraction.)
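The proportions quoted in this thread can be checked with a few lines of arithmetic. The figures are the ones given in the comments; only the variable names are made up.

```python
# Earlier count of distinct time values in this thread.
earlier_total, earlier_non_gregorian = 339_705, 71_095

# Updated count (Wikimania 2024 comment).
total_values, non_gregorian = 432_783, 124_252
total_triples = 16_200_365_876

earlier_share = earlier_non_gregorian / earlier_total
current_share = non_gregorian / total_values
# One extra triple per distinct time value, relative to all triples.
extra_triple_share = total_values / total_triples

print(f"{earlier_share:.1%}")       # 20.9%
print(f"{current_share:.1%}")       # 28.7%
print(f"{extra_triple_share:.5%}")  # 0.00267%
```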

I'm having trouble deciphering the query results posted by @Lucas_Werkmeister_WMDE above. The value 432,783 seems low; is that just dates before 15 October 1582? Is 71,095 the number of Julian calendar dates? (I follow the US convention of a comma separating groups of three digits and a dot [.] being a decimal point.)

I think it seems low because it’s the number of distinct time values. The time value 2024-08-08 alone is used as the “retrieved” date of almost ten thousand references (and a couple dozen points in time, etc.), but they all share the same value node (this one), so if we add a wikibase:timeValueOriginal triple to this node, that’s still only one more triple.

As a very rough estimate, there have been 365.2425×2000=730485 (730k) days in the past 2000 years, which is in the same ballpark. (Wikidata covers more than the past 2000 years, but not all dates have had a notable event either, so to me it seems plausible that the number above is somewhat below 730k.)

That's very interesting - there have only been about 162k days since the Gregorian calendar came in (62k days of which were in the "dual calendar" period) which even allowing for some year/month precision and some future dates, strongly suggests a pretty significant chunk of those Gregorian-marked dates are going to be during the Julian era. Will be quite the cleanup challenge at some point!
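The day counts used in these estimates can be reproduced roughly as follows. Two inputs are assumptions not stated in the thread: the "as of" date (8 August 2024, borrowed from the example value mentioned above) and the end of the "dual calendar" period (taken here as the British changeover on 14 September 1752).

```python
from datetime import date

GREGORIAN_START = date(1582, 10, 15)   # first day of the Gregorian calendar
DUAL_PERIOD_END = date(1752, 9, 14)    # assumption: British adoption
AS_OF = date(2024, 8, 8)               # assumption: date of the discussion

days_in_2000_years = 365.2425 * 2000
days_since_start = (AS_OF - GREGORIAN_START).days
dual_period_days = (DUAL_PERIOD_END - GREGORIAN_START).days

print(round(days_in_2000_years))  # 730485
print(days_since_start)
print(dual_period_days)
```

Under these assumptions the two differences come out at a little over 161 thousand and 62 thousand days, in line with the rough figures quoted.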

I really like the suggested solution of making "original" and "standardised" values available as triples through the query service, if that would not be unreasonably complicated.