[Bug] RDF export misses extreme values with day precision
Closed, Resolved · Public

Description

Because the Julian to Gregorian conversion we do is currently based on PHP's limited cal_to_jd and jdtogregorian functions, many dates the Wikibase data model supports are currently not accessible from the RDF export and the QueryService.

Trivial solution: Based on the assumption that a date value at, let's say, 10,000 BCE with day precision is pointless anyway, we can export just the year. This is already much better than no export at all.
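
The fallback described above could look roughly like the following Python sketch. It is not the Wikibase implementation (which is PHP): `export_julian_date` and the `convert` callback are hypothetical names, and the callback is assumed to return `None` whenever the underlying conversion cannot handle the date.

```python
from typing import Callable, Optional, Tuple

Date = Tuple[int, int, int]  # (year, month, day)


def export_julian_date(
    date: Date,
    convert: Callable[[Date], Optional[Date]],
) -> str:
    """Return a full converted date if possible, otherwise only the year."""
    year, _month, _day = date
    converted = convert(date)
    if converted is None:
        # Conversion is not possible: keep the year, drop month and day.
        return f"{year:+d}"
    g_year, g_month, g_day = converted
    return f"{g_year:+d}-{g_month:02d}-{g_day:02d}"


if __name__ == "__main__":
    # Stand-in converter that gives up on every date (assumption for the demo).
    print(export_julian_date((-10000, 5, 4), lambda d: None))
```

A year-only result is less precise than the input, but the value stays visible to consumers instead of disappearing from the export.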

Reported here: https://www.mediawiki.org/wiki/User_talk:Thiemo_Mättig_(WMDE)#RDF_Julian_Gregorian_conversion

Event Timeline

Change 312214 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Cleanup DateTimeValueCleanerTest

https://gerrit.wikimedia.org/r/312214

Change 312215 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Fix (Julian)DateTimeValueCleaner for extreme years

https://gerrit.wikimedia.org/r/312215

thiemowmde moved this task from Proposed to Review on the Wikidata-Sprint-2016-09-21 board.
thiemowmde moved this task from Needs triage to WDQS on the Discovery-ARCHIVED board.
thiemowmde moved this task from incoming to in progress on the Wikidata board.
thiemowmde moved this task from Incoming to SDAW on the Wikidata-Query-Service board.

I think this idea makes a lot of sense. All dates in the distant past that have day precision are most probably data errors (you can't really have May 4th in 200000 BC, at least not using Earthly calendars :) ), but handling these errors in a saner manner is good. However, I think we then need to change the precision: if we operate under year precision, we can't just take the year and then claim day precision. That would be claiming false data.

A person using the RDF may just want the best idea of the date that the RDF processing can give. But the person might also be performing quality control on the underlying data that went into the conversion from the internal representation to RDF. It would be nice to somehow indicate that the date had to be approximated due to limitations of the RDF conversion process. (Perhaps leave it for the data consumer to figure out for him/her/itself that May 4th in 200000BC is silly?)

> Perhaps leave it for the data consumer to figure out for him/her/itself that May 4th in 200000BC is silly

It is an option, and here we have a tradeoff between:

A. Try to represent the wrong data in a useful way, preserving the useful part and dropping the bad part, or
B. Don't try to fix the data and present it as is, even if it happens to be broken.

Depending on your goals, you may want either of them, or maybe both - i.e. have both original and normalized values. We may try doing that too.
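
One way to picture the "both original and normalized values" idea is the Python sketch below. It is only an illustration, not the Wikibase data model: the class names, `normalize`, and the `max_abs_year` threshold are hypothetical. The precision codes 9 (year) and 11 (day) are the ones Wikibase uses for time values.

```python
from dataclasses import dataclass

PRECISION_YEAR = 9   # Wikibase precision code for "year"
PRECISION_DAY = 11   # Wikibase precision code for "day"


@dataclass(frozen=True)
class TimeValue:
    year: int
    month: int
    day: int
    precision: int


@dataclass(frozen=True)
class ExportedTime:
    original: TimeValue    # option B: the data as entered
    normalized: TimeValue  # option A: the useful part only


def normalize(value: TimeValue, max_abs_year: int) -> ExportedTime:
    """Keep the original value and add a normalized copy next to it."""
    if value.precision > PRECISION_YEAR and abs(value.year) > max_abs_year:
        # Keep the year but stop claiming month or day precision.
        # Month and day are set to 1 purely as placeholders.
        normalized = TimeValue(value.year, 1, 1, PRECISION_YEAR)
    else:
        normalized = value
    return ExportedTime(original=value, normalized=normalized)


if __name__ == "__main__":
    print(normalize(TimeValue(-200000, 5, 4, PRECISION_DAY), max_abs_year=10000))
```

Consumers doing quality control can compare `original` with `normalized` to see where the export had to approximate.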

Change 312214 merged by jenkins-bot:
Cleanup DateTimeValueCleanerTest

https://gerrit.wikimedia.org/r/312214

Change 313984 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Add extreme Julian date test to TimeRdfBuilderTest

https://gerrit.wikimedia.org/r/313984

thiemowmde moved this task from Review to Done on the Wikidata-Sprint-2016-09-21 board.

Change 312215 merged by jenkins-bot:
Fix (Julian)DateTimeValueCleaner for extreme years

https://gerrit.wikimedia.org/r/312215

Change 314240 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Cleaner JulianDateTimeValueCleaner implementation

https://gerrit.wikimedia.org/r/314240

Change 313984 merged by jenkins-bot:
Skip Julian to Gregorian date conversion for extreme, unsupported dates

https://gerrit.wikimedia.org/r/313984

Change 314240 merged by jenkins-bot:
Cleaner JulianDateTimeValueCleaner implementation

https://gerrit.wikimedia.org/r/314240

I have been experimenting with various extreme values of the Julian year, as great as positive 10 billion, and am finding erratic results for the value of the Julian day number ($jd) and the Gregorian date ($gregorian). I found that for values that are too large, $jd could be 0, negative, or of the same order of magnitude as the year (when it should be about 365 times the year). Also, for values that are too large, $gregorian might or might not be "0/0/0".

I should add that I don't have another tool at my disposal to independently convert Julian to Gregorian for enormous dates, so I can't say for sure whether the conversion in PHP is correct for years between 101,994 and 1,465,071, only that there are no obvious problems.

I found the largest Julian calendar date that does not produce any obvious problem is y = 1465072, m = 9, d = 17. This results in $jd = 536,838,867. The value of $gregorian is "10/17/1,465,102" (thousand separators added).

A simple solution, if you don't mind not converting less than a year's worth of dates in the distant future, is to add a test that the Julian year must be < 1465072.
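
A minimal Python sketch of such a year test, combined with a conversion that does not depend on PHP, might look like this. It uses widely published integer Julian Day Number formulas rather than PHP's `cal_to_jd` and `jdtogregorian`. All function names are hypothetical, and only positive years are handled, to sidestep year-zero conventions.

```python
from typing import Optional, Tuple

# Exclusive upper bound for the Julian year, as proposed in the comment above.
MAX_JULIAN_YEAR = 1465072


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian calendar date -> Julian Day Number (integer arithmetic only)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn: int) -> Tuple[int, int, int]:
    """Julian Day Number -> proleptic Gregorian (year, month, day)."""
    f = jdn + 1401 + ((4 * jdn + 274277) // 146097 * 3) // 4 - 38
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def guarded_julian_to_gregorian(
    year: int, month: int, day: int
) -> Optional[Tuple[int, int, int]]:
    """Convert only when the Julian year passes the proposed range test."""
    if not 1 <= year < MAX_JULIAN_YEAR:
        return None  # caller falls back, e.g. to a year-only export
    return jdn_to_gregorian(julian_to_jdn(year, month, day))


if __name__ == "__main__":
    print(guarded_julian_to_gregorian(1582, 10, 5))
    print(jdn_to_gregorian(julian_to_jdn(1465072, 9, 17)))
```

Python integers do not overflow, so the unguarded functions can also serve as an independent cross-check for large years: for the Julian date 1465072-09-17 they yield the Gregorian date 1465102-10-17, the same Gregorian date reported above.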

> A simple solution, if you don't mind not converting less than a year's worth of dates in the distant future, is to add a test that the Julian year must be < 1465072.

We already do this, see https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/35e6a6bbcd1c94565443404a04d60ad355eb9704/repo/includes/Rdf/JulianDateTimeValueCleaner.php#L62 (since d6f5080ca43a4a7064c7a7e7eb558141a73bee54).

Here is a query to try the effect of this change: https://query.wikidata.org/#SELECT%20%3Fitem%20%3Ftime%7B%0A%3Fitem%20wdt%3AP31%20wd%3AQ36507.%0A%3Fitem%20wdt%3AP585%20%3Ftime%0A%7D%0AORDER%20BY%20%3Ftime%0ALIMIT%2020 It should now find Items with both Julian and Gregorian dates. Previously, dates at (roughly) 5000 BC and earlier could not be found or ordered by date.
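
The query in that link is percent-encoded. To read it as plain SPARQL, it can be decoded with Python's standard library, as in this small sketch (`decode_query` is a hypothetical helper name; the encoded string is copied from the link above):

```python
from urllib.parse import unquote

# URL fragment copied verbatim from the query.wikidata.org link above.
ENCODED = (
    "SELECT%20%3Fitem%20%3Ftime%7B%0A%3Fitem%20wdt%3AP31%20wd%3AQ36507.%0A"
    "%3Fitem%20wdt%3AP585%20%3Ftime%0A%7D%0AORDER%20BY%20%3Ftime%0ALIMIT%2020"
)


def decode_query(fragment: str) -> str:
    """Turn the percent-encoded URL fragment back into readable SPARQL."""
    return unquote(fragment)


if __name__ == "__main__":
    print(decode_query(ENCODED))
```

The decoded query selects items via `wdt:P31 wd:Q36507`, reads their `wdt:P585` value as `?time`, orders by `?time`, and limits the result to 20 rows.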