
RdfBuilder should have an option to switch between XSD 1.1 and XSD 1.0 style dates in the output.
Closed, Resolved | Public | 1 Estimated Story Points

Description

Depending on what platform is going to be used to process the RDF generated by RdfBuilder, we'd want XSD 1.1 dates (using astronomical year numbering) or XSD 1.0 (using traditional year numbering).

In order to implement this switch, we have to be certain about which numbering the internal representation of Gregorian and Julian dates uses. The current assumption is that these use traditional numbering (-0001 means 1 BCE, 0000 is invalid).

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description.
daniel added a subscriber: daniel.

So I took a look at this one, and we have a weird situation here - probably since T99674 is undecided, but it's more about external representation than internal one. In Wikidata/Wikibase, 1 BCE is stored as -0001. In XSD 1.1 it should be 0000. OK, so should we just add 1 to all negative years? Well, but what about Q1 with "13798 million years BCE" - it shouldn't be suddenly -13797999999, right? That would be weird. Maybe just add 1 to dates that have precision of "years" and below? But that may lead to some weird things too.

Now, we also have possible year "0". Which is not the same as 1BCE and stored as 0000. Should we ignore such year or make it 1BCE too?

Also, when we translate Julian to Gregorian, should "1 BCE (Julian)" - with no days - become "-0001-12-30" ISO Gregorian or "0000-01-01" ISO Gregorian? Should we do Gregorian->Julian conversion for dates without days at all or should we keep the year and just put 01-01 on it? Note that since Gregorian and Julian years are not in perfect sync, we can't really know in which Gregorian year the event with non-dated Julian year is.
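To make the Julian question concrete, here is a minimal sketch of a Julian to proleptic Gregorian conversion via Julian day numbers. The function names are hypothetical and are not taken from RdfBuilder or Wikibase; years are assumed to use astronomical numbering (0 is 1 BCE), and the integer formulas are the commonly published ones, valid only for non-negative day numbers.

```python
def julian_to_jdn(year, month, day):
    """Julian day number of a Julian calendar date (astronomical year)."""
    a = (14 - month) // 12          # 1 for January and February, else 0
    y = year + 4800 - a
    m = month + 12 * a - 3          # March becomes month 0
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn):
    """Proleptic Gregorian (year, month, day) for a non-negative day number."""
    if jdn < 0:
        raise ValueError("day number out of range for this sketch")
    f = jdn + 1401 + (((4 * jdn + 274277) // 146097) * 3) // 4 - 38
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date to the proleptic Gregorian calendar."""
    return jdn_to_gregorian(julian_to_jdn(year, month, day))


print(julian_to_gregorian(0, 1, 1))      # 1 Jan 1 BCE Julian  -> (-1, 12, 30)
print(julian_to_gregorian(0, 1, 3))      # 3 Jan 1 BCE Julian  -> (0, 1, 1)
print(julian_to_gregorian(1582, 10, 5))  # 5 Oct 1582 Julian   -> (1582, 10, 15)
```

Under these assumptions the first two days of Julian 1 BCE fall in the previous Gregorian year, which is why a year-precision Julian value cannot be pinned to a single Gregorian year.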

In reply to the comment of Smalyshev on Wed, Jun 24, 23:47 UT.

As far as I know, there is no proposal to use automation to fix the bad entries in the database. So what they get changed to depends on the good judgement of the editor who has looked up the source that was cited to support the claim, or who has found a new source to support the claim.

As for translating between Julian and Gregorian, it could depend on what software is doing the translation and why. In general, I think we want to ensure successful round trips, so a conversion Julian --> Gregorian --> Julian results in an unchanged value. Since we have decided to support storing both calendars, I don't see why Wikidata would be doing any conversions in the user interface; the editor decides which version to enter. When viewing an entry with the user interface, only the calendar that was entered will be shown. If Editor A enters 1 January 1 BCE, precision = day, before = 0, after = 0, then something like "-0001-01-01" together with the calendar and precision information gets stored. When Editor B views it later, she sees pretty much what Editor A entered, although it might be in the date format customary in some non-English language. If Editor B wants to know what that is in Gregorian, it is up to Editor B to convert it herself.

If Editor C wants to enter an event that occurred in the Julian year 1 BCE, Editor C must enter a combination of appropriate precision, before, and after values. precision = year, before = 0, after = 0 is wrong! Editor C could enter 1 Jan -1 BCE, precision = year, before = 0, after = 1. Or he could enter 2 Jul -1 BCE, precision = day, before = 183, after = 183, calendar = Julian, which would be just as valid.

So this brings up the question of whether allowing the editors to select any equivalent combination of precision, before, and after they please is a good idea, or should we decide upon a canonical form?

So I think we should adopt the following approach for getting from Wikidata datetime to ISO/XSD datetime:

  1. If date is Gregorian and positive - check it for ISO format breakage (0 day/month, 31 February, etc.), fix it by replacing 0 with 1 and overflowing day/month with largest allowed value. Year 0 is treated as broken data and aborts conversion.
  2. If date is Gregorian and negative, do the above and then if the precision is year or higher, then for XSD 1.1 mode add 1 to the year (i.e. move the year 1 closer towards zero).
  3. If the date is Julian, check the precision. If the precision is month or lower, treat the date as Gregorian above.
  4. If the precision is day or higher, do the cleanups as in (1) and then convert the date to Gregorian. If conversion fails, this is bad data. Then process the negative years as described in (2).

This should give an expected result on most data, and should not do anything weird except in weird cases - like day-precision date with 10 billion years ago (which makes no sense).
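The four steps above could be sketched roughly as follows. This is an illustration, not the code from the patch: `to_xsd`, `days_in_month` and the injected `julian_to_gregorian` callable are hypothetical names, stored years are assumed to use traditional numbering (-1 is 1 BCE), precision is assumed to follow the Wikibase codes (9 = year, 10 = month, 11 = day, smaller means coarser), and Julian leap years are assumed when cleaning up day-precision Julian dates.

```python
GREGORIAN, JULIAN = "gregorian", "julian"
# Assumed Wikibase precision codes; values below 9 are decades and coarser.
PRECISION_YEAR, PRECISION_MONTH, PRECISION_DAY = 9, 10, 11


def days_in_month(astro_year, month, julian=False):
    """Length of a month; astro_year uses astronomical numbering (0 = 1 BCE)."""
    if month == 2:
        leap = astro_year % 4 == 0 and (
            julian or astro_year % 100 != 0 or astro_year % 400 == 0
        )
        return 29 if leap else 28
    return 30 if month in (4, 6, 9, 11) else 31


def to_xsd(year, month, day, precision, calendar_model,
           xsd11=True, julian_to_gregorian=None):
    """Render a stored date as an xsd:dateTime string (sketch only)."""
    if year == 0:
        raise ValueError("year 0 is treated as broken data")
    # Steps 3 and 4: only Julian dates with day precision or finer are converted.
    convert = calendar_model == JULIAN and precision >= PRECISION_DAY
    if convert and julian_to_gregorian is None:
        raise ValueError("a Julian-to-Gregorian converter is required")

    # Calendar arithmetic is done with astronomical years.
    astro = year + 1 if year < 0 else year
    # Step 1: replace 0 with 1 and clamp overflow to the largest allowed value.
    month = min(max(month, 1), 12)
    day = min(max(day, 1), days_in_month(astro, month, julian=convert))

    if convert:
        # The converter is expected to raise on bad data.
        astro, month, day = julian_to_gregorian(astro, month, day)

    # Step 2: keep the shifted year only in XSD 1.1 mode and only for
    # year precision or finer; otherwise go back to traditional numbering.
    if astro > 0 or (xsd11 and precision >= PRECISION_YEAR):
        out_year = astro
    else:
        out_year = astro - 1

    sign = "-" if out_year < 0 else ""
    return f"{sign}{abs(out_year):04d}-{month:02d}-{day:02d}T00:00:00Z"


print(to_xsd(-1, 0, 0, PRECISION_YEAR, GREGORIAN))               # 0000-01-01T00:00:00Z
print(to_xsd(-1, 0, 0, PRECISION_YEAR, GREGORIAN, xsd11=False))  # -0001-01-01T00:00:00Z
print(to_xsd(2015, 2, 31, PRECISION_DAY, GREGORIAN))             # 2015-02-28T00:00:00Z
```

With a precision coarser than year, such as the "13798 million years BCE" value, the year is left unchanged even in XSD 1.1 mode.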

Change 222215 had a related patch set uploaded (by Smalyshev):
T99795: Improve handling of calendar dates and introduce XSD 1.1

https://gerrit.wikimedia.org/r/222215

So I think we should adopt the following approach for getting from Wikidata datetime to ISO/XSD datetime:

  1. If date is Gregorian and positive - check it for ISO format breakage (0 day/month, 31 February, etc.), fix it by replacing 0 with 1 and overflowing day/month with largest allowed value. Year 0 is treated as broken data and aborts conversion.
  2. If date is Gregorian and negative, do the above and then if the precision is year or higher, then for XSD 1.1 mode add 1 to the year (i.e. move the year 1 closer towards zero).

These two steps assume Wikidata is using the convention that there is no year zero. This may have been true in the past, but in the future year zero should be allowed. So this approach will break when Wikidata is fixed.

  3. If the date is Julian, check the precision. If the precision is month or lower, treat the date as Gregorian above.
  4. If the precision is day or higher, do the cleanups as in (1) and then convert the date to Gregorian. If conversion fails, this is bad data. Then process the negative years as described in (2).

This approach smacks of general sloppy speech. If we are doing stuff like sorting, we would have to pick some point in the uncertain period to compare to other values, maybe the beginning of the uncertain period. It is a fact that the beginning of October 1582 Gregorian was earlier than the beginning of October 1582 Julian, and thus should sort earlier.

This should give an expected result on most data, and should not do anything weird except in weird cases - like day-precision date with 10 billion years ago (which makes no sense).

Year zero doesn't make sense for Wikidata because 1BCE is stored as -1, so what would 0 be? I don't see any actual year to be represented by it. If storage format changes then we'll need to change it, but then we'd also have to change all BCE dates.

It is a fact that the beginning of October 1582 Gregorian was earlier than the beginning of October 1582 Julian, and thus should sort earlier

If you need beginning of the October, then you need day precision. If you say month precision, you don't know if it's beginning, end or middle of October, thus you can not compare it with, say, October 1 or October 15.

Year zero doesn't make sense for Wikidata because 1BCE is stored as -1, so what would 0 be? I don't see any actual year to be represented by it. If storage format changes then we'll need to change it, but then we'd also have to change all BCE dates.

The data model says 1 BCE is indeed stored as 0. The user interface, in defiance of the data model, would store a date entered as 1 BCE as -1. But an entry made with the API would not necessarily have done so. The current contents of the database are mixed, unreliable, and are unfit to be converted. Any approach that depends on converting anything should be postponed until the contents of the database are scrubbed.

If you need beginning of the October, then you need day precision. If you say month precision, you don't know if it's beginning, end or middle of October, thus you can not compare it with, say, October 1 or October 15.

I don't think people would want to take the approach that "15 November 1582 Gregorian" and "November 1582 Gregorian" are entirely different kinds of things that cannot be compared, just like the proverbial apples and oranges cannot be compared. If I make a query asking for everyone born between March and December 1582 Gregorian, should the query ignore a guy born on 15 November 1582 Gregorian because that date is day precision, and my query is month precision, and day precision cannot be compared to month precision?

Handling queries on dates that include precision expressions requires some careful thought and a lot of arithmetic. Essentially, you have to find the earliest and latest instant for the query, then the earliest and latest instant for the items, and see if any part of the range for the item falls within the range for the query.
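A minimal sketch of that range comparison, assuming Gregorian dates only and using Python's proleptic Gregorian `datetime.date` (years 1 to 9999). The helper names are hypothetical, and the before and after fields are left out to keep it short.

```python
import calendar
from datetime import date


def day_range(year, month, day):
    """Earliest and latest day covered by a day-precision value."""
    d = date(year, month, day)
    return d, d


def month_range(year, month):
    """Earliest and latest day covered by a month-precision value."""
    last = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last)


def overlaps(item, query):
    """True if any part of the item's range falls within the query's range."""
    return item[0] <= query[1] and query[0] <= item[1]


# Query: everyone born between March and December 1582 Gregorian.
query = (date(1582, 3, 1), date(1582, 12, 31))

print(overlaps(day_range(1582, 11, 15), query))  # True
print(overlaps(month_range(1582, 11), query))    # True
print(overlaps(month_range(1582, 2), query))     # False
```

Both the day-precision and the month-precision November values match the query, because each is reduced to an earliest and latest day before comparing.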

The user interface, in defiance of the data model, would store a date entered as 1 BCE as -1. But an entry made with the API would not necessarily have done so

It'd be a problem if an API client doesn't follow the convention that people entering data into Wikidata follow. But since in both PHP and Java 1 BCE is -1, this is a natural way for internal representation, so since I have to assume one of these options, I assume this one.

I don't think people would want to take the approach that "15 November 1582 Gregorian" and "November 1582 Gregorian" are entirely different kinds of things that cannot be compared,

That's effectively what it is. We could make some effort to make it a bit easier (like saying "wink-wink, November 1582 is actually 1 November when comparing to dates") but then you can't demand both rigor and convenience. If you compare the two, I don't see how you can define the comparison (in terms of more/less/equal) that would be unique and make sense.

If I make a query asking for everyone born between March and December 1582 Gregorian, should the query ignore a guy born on 15 November 1582 Gregorian

It won't, unless you specifically write the query that way. Dates and precisions are stored separately, and most people would query against dates directly (it also would be much faster, probably). Which means people expect an ISO date representation of each date that makes sense for such queries.

However, if you try to compare "November 1582 Gregorian" and "November 1582 Julian", you can't really expect any defined result. Of course, the actual store will return some result - right now with this patch it would say they are equal, even though they are not actually same stretches of time.

Handling queries on dates that include precision expressions requires some careful thought and a lot of arithmetic.

Unfortunately, implementing this in a real indexed database would be quite a challenge. Currently I'm pretty sure Blazegraph doesn't support anything like that OOTB and it's not likely any other triple store would support such thing OOTB. It can be done, but that's not a concern for this ticket - the concern for this ticket is to figure out the way to export the data into RDF that would make sense at least for the majority of the prospective use cases.

If you're going to do comparisons and queries on the basis of the stored date, and ignore precision, before, and after, then the results are going to be rough. Given that comparisons and queries are inevitably inaccurate, why not obey the data model and convert year 0 in the database to year 0 in XSD 1.1? Sure, it's wrong for many of the entries currently in the database, but the comparisons are inaccurate anyway. I like the idea of making it as painfully obvious to everyone that much of the data in the database is wrong. Keep screaming LIAR LIAR PANTS ON FIRE! at every opportunity until it gets fixed.

Change 222215 merged by jenkins-bot:
T99795: Improve handling of calendar dates and introduce XSD 1.1

https://gerrit.wikimedia.org/r/222215

Not sure if anything is left to be done here?

Lydia_Pintscher claimed this task.
Lydia_Pintscher added a subscriber: Lydia_Pintscher.

Closing for lack of response to Stas' question.