Represent normalized unit values in full values RDF
Closed, Resolved · Public

Description

Values can be represented in a variety of units. However, to make queries against the data, we need to bring the compatible values - such as distance expressed in meters, kilometers or miles - to a common normalized value. Thus, we need a data item that expresses the normalized value of the claim.

To do that, we will attach an additional value carrying the normalized form to each statement, qualifier and reference. That is, in addition to psv:, pqv: and prv:, we will also have psn:, pqn: and prn: respectively.

These values will be generated for now for values of type datetime and quantity only, as only these types need to be normalized.

The value will contain the "normalized" value, as follows:

  • For quantity, it will be the main unit value, converted according to conversion rules in configuration (TBD)
    • Upper/lower bound values are converted too
  • For datetime, it will be the value converted to a Gregorian ISO value, while the main value is converted to a string value
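The quantity rule above can be sketched as follows. This is only an illustration: the conversion rules are still TBD in the description, so the `TO_BASE` table, the `Quantity` class and the `normalize` function are hypothetical names, and the factors are ordinary metric/imperial definitions rather than anything taken from the Wikibase configuration.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical conversion table: unit -> (base unit, multiplication factor).
# The real rules are "TBD" in the task description; this is an assumption.
TO_BASE = {
    "metre": ("metre", Decimal("1")),
    "kilometre": ("metre", Decimal("1000")),
    "mile": ("metre", Decimal("1609.344")),
}


@dataclass(frozen=True)
class Quantity:
    amount: Decimal
    lower: Decimal
    upper: Decimal
    unit: str


def normalize(q: Quantity) -> Quantity:
    """Convert the main value and both bounds to the base unit."""
    base_unit, factor = TO_BASE[q.unit]
    return Quantity(q.amount * factor, q.lower * factor, q.upper * factor, base_unit)


original = Quantity(Decimal("2"), Decimal("1.5"), Decimal("2.5"), "mile")
normalized = normalize(original)
print(original)    # what psv:/pqv:/prv: would point to
print(normalized)  # what psn:/pqn:/prn: would point to
```

Using `Decimal` rather than floats keeps the converted bounds exact for factors that are exact by definition, such as the mile.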
Smalyshev updated the task description.
Smalyshev raised the priority of this task to Normal.
Smalyshev added subscribers: Smalyshev, daniel, aude.
Restricted Application added a project: Discovery. Oct 29 2015, 10:26 AM
Restricted Application added a subscriber: Aklapper.
Deskana moved this task from Needs triage to WDQS on the Discovery board. Oct 29 2015, 12:19 PM
Jc3s5h added a subscriber: Jc3s5h. Oct 29 2015, 3:08 PM
Jc3s5h added a comment. Edited Oct 29 2015, 3:45 PM

The phrase "Gregorian ISO value" requires further consideration. For example, does it mean an ISO 8601 compliant value, or the strings currently in use which resemble ISO 8601 but permit a number of violations, such as "2015-00-00T00:00:00Z"?

Also ISO 8601 requires the Gregorian calendar, which is a solar calendar based on counting actual sunrises and sunsets. For events in the distant past, most astronomy and other science is based on constant-length seconds such as produced by an atomic clock (see the Wikipedia article "Delta T"). The difference between these two approaches grows to a full day about the year 3400 BC. Thus ISO 8601 is not fit for use in prehistoric times. It is equally unfit for use in the distant future.
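As a rough check of the one-day figure, a commonly quoted long-term parabolic approximation for Delta T (attributed to Morrison and Stephenson) is -20 + 32·u² seconds, with u = (year - 1820) / 100. The sketch below simply evaluates it; the function name is illustrative, and the approximation itself carries large uncertainty this far back.

```python
SECONDS_PER_DAY = 86400


def delta_t_seconds(year: int) -> float:
    """Long-term parabolic approximation of Delta T, in seconds.

    Assumption: the widely used fit -20 + 32 * u**2 with
    u = (year - 1820) / 100; only a rough estimate for remote epochs.
    """
    u = (year - 1820) / 100
    return -20 + 32 * u * u


# Around 3400 BC the accumulated difference is close to one full day.
print(delta_t_seconds(-3400) / SECONDS_PER_DAY)
```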

An additional issue is finding a conversion algorithm to convert between Julian and Gregorian dates. Algorithms that I'm aware of have been tested no further than 8000 BC to AD 12000. If a new method for expressing dates in the distant past or future is to be devised, a method to test any proposed conversion algorithm will have to be devised.

It means an ISO 8601 compliant value.

"The difference between these two approaches grows to a full day about the year 3400 BC. Thus ISO 8601 is not fit for use in prehistoric times."

Do we have any dates around or before the year 3400 BC that are known to an accuracy of a day or better? In fact, for years 3000 BC and earlier, there are only two dates with a precision of 9 (year) and one with a precision of decade, and none with more precise dates. I do not think it is possible to have dates with higher precision for such years in either of the calendars we are currently converting (Gregorian and Julian), since I cannot see where such precise dates would come from.

So I think we'll be fine.

We have "universe" (Q1), which displays the start date as "13798 million years BCE Gregorian" ("Gregorian" is a superscript). But the furthest back one could possibly extrapolate the proleptic Gregorian calendar is the first time the Earth rotated on its axis, which was much later (around 4,540 million years ago, measured in constant-length seconds). So the date stated for the start of the universe in Wikidata is false, because the calendar was undefined and beyond any plausible extrapolation procedure at the time of the event.

@Smalyshev I only see one potential use case for precise dates in the distant past: astronomical events. But it seems unlikely we'll ever have items about individual eclipses or conjunctions of planets in the far past.

@Jc3s5h The ISO date would be used as the "normalized" value, intended for indexing/searching. Even if it does not quite work for distant dates, it should do the trick in almost all use cases we currently have. The only alternative I see would be to use an integer to represent the time, e.g. a 64-bit Unix epoch, which would be good for a range of half a trillion years. But it would make it hard to compare these values with dates from other sources.
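The range estimate for a 64-bit epoch can be checked with a little arithmetic. This sketch assumes a signed 64-bit count of seconds and the mean Gregorian year of 365.2425 days; both are assumptions, since no exact encoding was specified.

```python
SECONDS_PER_DAY = 86400
MEAN_GREGORIAN_YEAR_DAYS = 365.2425  # assumption: mean Gregorian year

# A signed 64-bit integer reaches 2**63 seconds on either side of the epoch.
max_seconds = 2 ** 63
years_each_way = max_seconds / (SECONDS_PER_DAY * MEAN_GREGORIAN_YEAR_DAYS)
total_span_years = 2 * years_each_way

print(f"{years_each_way:.3e} years before or after the epoch")
print(f"{total_span_years:.3e} years in total")
```

That comes to roughly 292 billion years in each direction, so a bit more than half a trillion years overall, comfortably covering the 13.8 billion years needed for Q1.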

If our documentation or source code comments contain falsehoods then the Wikidata team is a pack of liars. ISO requires the Gregorian calendar. If you use an ISO-like notation to represent something other than the Gregorian calendar we are liars and Wikidata would deserve to be defunded. STOP WRITING ISO UNLESS YOU REALLY MEAN IT AND INTEND TO ABIDE BY EVERY SINGLE RULE IN THE SPEC.

@Jc3s5h Please calm down. Documentation is often outdated or inconsistent or unclear. We can work to improve it, but it will never be perfect. But that is not the issue here, since I did mean ISO - well, technically, I mean xsd:dateTime. I did not mean "iso-ish string maybe using some other calendar".

We would use an xsd date (ISO) for the *normalized* value, for indexing and comparison. Internally, the value would be stored in a calendar-specific way: for Gregorian and Julian, in something that uses a syntax similar to ISO. That "original" value would also be visible in JSON output, along with the normalized form (currently, the normalized form is not there yet). In RDF, it's still unclear whether we'll include the original value at all, or only use the normalized form.

The same kind of normalization would be applied to make sure that all values of the "length" property are comparable, by converting them all to meters (while still making the original value available in RDF). You would have two values associated with the statement, one "original" (in miles or whatever), one "normalized" (in meters).

@Jc3s5h: Please read the Phabricator etiquette. Thanks for your understanding and your help to keep this discussion technical instead of personal.

xsd:dateTime reiterates the necessity of using the Gregorian calendar. The Gregorian calendar cannot possibly exist before the Earth, and Wikidata has a need to express dates that occurred before the Earth was formed. So either the TimeValue data type must be redefined to limit values to those that fall within the range of validity of the Gregorian calendar (the exact limits are not obvious) and a different data type created for dates in the far distant past or future, or we should stop naming or implying any external standard and acknowledge it's something we made up ourselves, unrelated to ISO or the World Wide Web Consortium. It is not honest to knowingly ascribe to others statements that we know they did not make, such as the possibility of expressing the age of the universe in a notation that conforms to an ISO or W3C standard.

I would add that the question of whether, in the future, the transition from one day to the next should be determined by the rotation of the earth, or by atomic clocks which disregard the earth's rotation, will be the subject of (according to Rachel Courtland) a "fierce debate" at the World Radiocommunication Conference in November. Treating an international organization's standard as atomic-based when it is really rotation-based becomes more serious during a time when a "fierce debate" is raging.

An additional issue with borrowing the format from ISO 8601 (without using the definitions from ISO 8601) is that that standard requires a fixed number of digits for the years. So if the age of the universe is to be representable, today would have to be written 00000002016-02-26. Doing otherwise would force data consumers to check their ISO 8601 parsing software to see if it can handle our violations.

@Jc3s5h Basic ISO 8601 is not perfect for our goals, since we need to represent dates that the non-extended ISO format cannot. Thus, we are extending it, in accordance with W3C guidelines:

"To accommodate year values greater than 9999 additional digits can be added to the left of this representation."

Note that this standard does not require fixed digits.
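As an illustration of what "no fixed digits" means in practice, here is a minimal sketch of formatting a date with an arbitrarily large year in the xsd:dateTime style: at least four year digits, extra digits only when needed, and an optional leading minus sign. The function name is hypothetical, and the time part is fixed to midnight UTC purely for simplicity.

```python
def xsd_datetime(year: int, month: int, day: int) -> str:
    """Format a date as an xsd:dateTime-style string at midnight UTC.

    The year is zero-padded to at least four digits and simply grows
    beyond that; negative years get a leading minus sign.
    """
    sign = "-" if year < 0 else ""
    return f"{sign}{abs(year):04d}-{month:02d}-{day:02d}T00:00:00Z"


print(xsd_datetime(2016, 2, 26))
print(xsd_datetime(12000, 1, 1))
print(xsd_datetime(-13798000000, 1, 1))
```

Note that plain strings are used here on purpose: Python's own `datetime` type only covers years 1 through 9999, which is the same kind of limitation some consumers will hit.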

Now, we know that some data stores are unable to accommodate extended date ranges, especially ones that are implemented in Java and take the shortcut of representing xsd:dateTime as Java calendar/datetime values (which cannot accommodate the full range of dates Wikidata has). These solutions will have to be either modified or be unable to process certain data. There's no way around it: if they cannot read our dates directly, they'd have to do something extra to read them. So far we've chosen the way that would be as close to the W3C standard as possible and not require too big a departure from ISO.

Jc3s5h added a comment. Edited Feb 20 2016, 12:12 PM

Let's discuss which version of the schema to use as a starting point. You suggest https://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#dateTime . But the linked version of the schema does not allow year zero, and it purports to support leap seconds.

A later version, https://www.w3.org/TR/xmlschema11-2/#d-t-values defines 0000 as 1 BC and does not support leap seconds (but nevertheless purports to use UTC).

I don't think we can proceed unless we decide which version of the schema to use as a starting point.
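To make the year-numbering difference between the two schema versions concrete, here is a small sketch that maps a BCE year to the signed year each version would use. The function name and the `xsd_version` parameter are hypothetical; the mapping reflects the rule that the 2001 version has no year 0000 (so 1 BCE is written -0001), while version 1.1 defines 0000 as 1 BCE.

```python
def bce_to_xsd_year(year_bce: int, xsd_version: str = "1.1") -> int:
    """Map a BCE year (1 = 1 BCE) to the signed year used in xsd:dateTime."""
    if year_bce < 1:
        raise ValueError("year_bce must be 1 or greater")
    if xsd_version == "1.1":
        return 1 - year_bce   # 1 BCE -> 0, 2 BCE -> -1
    if xsd_version == "1.0":
        return -year_bce      # 1 BCE -> -1; year 0 does not exist
    raise ValueError(f"unknown XSD version: {xsd_version}")


print(bce_to_xsd_year(1, "1.0"), bce_to_xsd_year(1, "1.1"))
```

The same lexical year therefore denotes dates one year apart depending on the version, which is why the choice has to be made explicitly.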

I also think we need to specify the context of this proposal. Wikibase/DataModel/JSON, in particular the time section, specifies the flavor of JSON that is used to represent Wikibase entities in the API, JSON dumps, and Special:EntityData when JSON output is specified. It is not the internal representation of Wikidata entities.

It appears to me there is no formal connection between the W3C XML schema (any version) and JSON. So using something from there would just be an informal borrowing. Have I got that right?

Jc3s5h added a comment. Edited Feb 20 2016, 6:35 PM

For the name RDF Dump Format I note that time already has a simple value and a full value. The simple value is already specified to be in the XSD 1.1 format if the full value can be converted to Gregorian. Would this simple value be sufficient?

We are currently using XSD 1.1 format in both cases, and in both cases the date is (proleptic) Gregorian.

"We are currently using XSD 1.1 format in both cases, and in both cases the date is (proleptic) Gregorian."

I think Smalyshev didn't use the word "currently" clearly in this sentence. Maybe he means "the proposal as currently envisioned". But he can't mean the current version of Wikidata, because the current version has no Julian to Gregorian conversion software, and quite a few dates in the database are currently stored in the Julian calendar.

A different question: what programming language would the future Julian to Gregorian routine be written in? Maybe I can find something, or translate something.
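For reference, one well-known way to do this conversion goes through the Julian Day Number (JDN): turn the Julian-calendar date into a day count, then turn the day count into a proleptic Gregorian date. The sketch below uses the standard integer formulas for both steps. The function names are illustrative, years use astronomical numbering (0 = 1 BC), and it has only been checked here against a couple of historical dates, so it says nothing about behavior in the distant past or future.

```python
def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian-calendar date -> Julian Day Number."""
    a = (14 - month) // 12      # 1 for January/February, otherwise 0
    y = year + 4800 - a
    m = month + 12 * a - 3      # March = 0 ... February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn: int) -> tuple:
    """Julian Day Number -> (year, month, day), proleptic Gregorian."""
    f = jdn + 1401 + (((4 * jdn + 274277) // 146097) * 3) // 4 - 38
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day


def julian_to_gregorian(year: int, month: int, day: int) -> tuple:
    """Convert a Julian-calendar date to the proleptic Gregorian calendar."""
    return jdn_to_gregorian(julian_to_jdn(year, month, day))


# The day the Gregorian reform took effect: Julian 5 October 1582.
print(julian_to_gregorian(1582, 10, 5))  # -> (1582, 10, 15)
```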

Change 296962 had a related patch set uploaded (by Smalyshev):
Unit conversion support for full statements

https://gerrit.wikimedia.org/r/296962

Change 296962 merged by jenkins-bot:
Unit conversion support for full statements

https://gerrit.wikimedia.org/r/296962

daniel added a comment. Sep 6 2016, 4:46 PM

The above patch implements this for quantities. Datetime is already always normalized to ISO (Gregorian).

Shall we close this, or wait until the subtasks are closed?

Smalyshev closed this task as Resolved.
Smalyshev claimed this task.

I think this one is done now.