
[RFC] Human-readable serialization of TimeValue precisions in RDF
Open, LowPublic

Description

The precision of a TimeValue is currently represented as a nonNegativeInteger in RDF. See docs/ontology.owl, recently updated in https://gerrit.wikimedia.org/r/#/c/207632. But these numbers carry no meaning on their own; they are just constants for internal use in the Wikibase code base. In RDF it probably makes much more sense to represent these precision values as URIs, each with a well-defined meaning.

See the definition of the precisions here: https://www.mediawiki.org/wiki/Wikibase/DataModel#Dates_and_times
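For illustration, the current triple shape and the proposed direction might look like this (the wikibase:Day URI is hypothetical; the predicate and value-node hash are taken from examples later in this discussion):

```turtle
# Current form: precision is an opaque internal constant (11 = day).
wdv:8000170412b9aeb739d076fed903a0ff wikibase:timePrecision "11"^^xsd:integer .

# Proposed direction: a URI with a well-defined meaning (name hypothetical).
wdv:8000170412b9aeb739d076fed903a0ff wikibase:timePrecision wikibase:Day .
```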

Event Timeline

thiemowmde raised the priority of this task from to Needs Triage.
thiemowmde updated the task description. (Show Details)
Restricted Application added a subscriber: Aklapper. May 21 2015, 2:42 PM

A big advantage of the numbers is that you can search for values where the precision is at least a certain value (e.g., dates with precision day or above). This would be lost when using URIs.
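A sketch of such a range query under the integer encoding, using P569 (date of birth) as an example property and the wikibase:timePrecision predicate; with plain URIs there would be no comparable numeric FILTER:

```sparql
# Find items whose date of birth is known to day precision or better.
# 11 = day in the current integer encoding; higher numbers are more precise.
SELECT ?item WHERE {
  ?item p:P569/psv:P569 [
    wikibase:timeValue ?date;
    wikibase:timePrecision ?precision
  ].
  FILTER(?precision >= 11)
}
LIMIT 10
```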

@mkroetzsch Good argument, I forgot that. Thanks. However, I think the disadvantage of not being readable without referring to external documentation is more relevant. Perhaps we can solve this in another way; I'm open to suggestions. Is there an obvious way to describe the range and meanings of these constants in the OWL documentation?

@thiemowmde One could have documentation as a text that is added as a description of the property used for precision. However, most users would more likely read a web page than look up the description stored in an OWL file. In the end, when you type in a SPARQL query, there is not much documentation directly available to you, even if it is stored in the RDF database somewhere.

Since our main use of RDF is query answering, I would not see readability as its main requirement, but of course it would be nice to retain this too. One could replace the numbers by strings of the form "08=decade" and "11=day" that would sort as expected while still having some readability. It should be checked if this has a negative impact on the store performance, but in general this should be workable.

On the other hand, maybe we are worrying too much about readability here. All of our RDF uses opaque IDs like "P345" and "Q123456" in combination with several URI prefixes that have specific (not self-explaining) meanings. SPARQL queries based on this vocabulary in general are not readable to uninitiated users. Making the rarely used precision constants readable among all the other unreadable IDs might not add much to usability overall. Maybe it would be more promising to focus on suitable query building interfaces that show human readable labels instead of ids. This would also be much more useful internationally, because English labels hardcoded in URIs would not always be helpful.

I'm sorry to say that, but I don't see how the fact that more tickets exist makes a specific ticket less relevant. You could say that about every ticket. I don't think this is helpful.

The argument that motivates this ticket is simple: 11 is almost completely meaningless. The only relevant information you can get from that is the order, as you pointed out correctly. day is almost self-explaining. You can look up what it means basically everywhere, even if you don't understand English.

Here is another idea for string constants that retain the order (padded to 16 digits for the year to be able to add more precisions later):

YYYYYYY = 0
YYYYYYYY = 1
YYYYYYYYY = 2
YYYYYYYYYY = 3
YYYYYYYYYYY = 4
YYYYYYYYYYYY = 5
YYYYYYYYYYYYY = 6
YYYYYYYYYYYYYY = 7
YYYYYYYYYYYYYYY = 8
YYYYYYYYYYYYYYYY = 9
YYYYYYYYYYYYYYYYM = 10
YYYYYYYYYYYYYYYYMD = 11
YYYYYYYYYYYYYYYYMDH = 12
YYYYYYYYYYYYYYYYMDHM = 13
YYYYYYYYYYYYYYYYMDHMS = 14
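Under this scheme, a day-precision value would carry a string constant that still sorts lexicographically after coarser ones; a hypothetical serialization (value node reused from a later example):

```turtle
# Day precision (11) as a sortable string constant: 16 Ys for the padded
# year digits, then M and D. "YYYYYYYYYYYYYYYY" (year, 9) sorts before it.
wdv:8000170412b9aeb739d076fed903a0ff wikibase:timePrecision "YYYYYYYYYYYYYYYYMD" .
```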

@thiemowmde I don't know what you mean by the multiple tickets you refer to. I am not aware of other tickets related to readability. I was just saying that the requirement you are trying to address will never be addressed even halfway. It's still nice to improve readability a bit if it is possible without much pain and without any other disadvantages, but I don't think that this is the case here.

The proposal to use constants like "YYYYYYYYYYYYYY" does not seem very practical to me. The constant "7" is arbitrary, but at least it's possible to remember it ;-). Hardly any of the YYY-constants is self-explaining without additional documentation (if you see all of them together, you can get the idea, but in a single query with only one of them present, it's not clear). No, I really don't think that this would improve anything.

Oh, I kind of like that idea... though counting all the Ys can get annoying when you just get YYYYYYYYYYYYY somewhere. Adding 0s at the end would also make it clearer how YYYYYYY is less precise (implying larger years) than YYYYYYYYYYYYYYY.

I'm not quite sold on this, but it's intriguing...

@mkroetzsch To me, this isn't really about readability, but about conceptual clarity: modeling the precision as an xsd:int using our internal constants seems bad. We should either model them as resources with URIs, or use a *meaningful* number, such as the number of digits in the ISO representation, or the number of seconds, or some such.

The current form is fine for sorting and filtering. But for someone looking at the JSON or RDF it's totally unclear what the number means, and it's also unclear how to find out. And to interpret the number correctly, e.g. for formatting, you need to build a big switch statement implementing the somewhat random spec.

I like the idea of a visual pattern, but the difference between YYYYYYYYYYYYYYY and YYYYYYYYYYYYY is not comprehensible to humans and will only lead to errors and frustration. I can easily tell 2 from 6, but beyond about 5 identical objects, pattern recognition for most people returns "a lot of objects", and you cannot work with that. So if we found a better pattern-like representation for 0 to 9, I think it might be workable. Maybe something like 9Y, 8Y, ..., 2Y, Y, YM, etc.? This however does not sort, but the idea is to get something that humans can distinguish.

Y07 = 0
Y08 = 1
Y09 = 2
Y10 = 3
Y11 = 4
Y12 = 5
Y13 = 6
Y14 = 7
Y15 = 8
Y16 = 9
Y16M = 10
Y16MD = 11
Y16MDH = 12
Y16MDHM = 13
Y16MDHMS = 14

... the number describing the relevant digits of the year when it's padded to the maximum of 16 digits.

the number describing the relevant digits of the year when it's padded to the maximum of 16 digits.

This sounds kind of artificial, to be frank. I.e. can you tell which one is "millennia"? (I guess it's Y13, but I'm not sure.) Why is "year" Y16? I don't think we've ever had a year in our DB with 16 actual significant digits (we have either 4-digit years or huge years like 13 billion, where those zeroes are just scale). So the 16 there is a technical number which doesn't make a lot of sense to people. I think we need more ideas on this.

daniel added a comment. Edited May 21 2015, 8:05 PM

I find it odd to base the numbering on an artificially fixed limit like 16. I also find it odd that the "roughest" precision has the lowest value. It makes sense if you interpret it as "significant number of digits", but that only works with a fixed width.

So I propose to reverse the order: the smallest ID should refer to seconds, then minutes, and so on, with the scale being open-ended. Not sure what the code for that should look like, though.

A very simple alternative: give the precision as a float (or decimal), in years. Anything smaller than that number, in years, is insignificant. The values for "hour" etc. will not be pretty, but they would have meaning, and arithmetic would work nicely.

Or use seconds with scientific notation - RDF supports that, right?
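A rough sketch of what these two numeric variants might look like for a day-precision value (literal values approximate; predicate and value node reused from examples elsewhere in this thread):

```turtle
# Precision "day" as a fraction of a year: 1/365.25 ≈ 0.00274 years.
wdv:8000170412b9aeb739d076fed903a0ff wikibase:timePrecision "0.00274"^^xsd:decimal .

# The same precision as seconds in scientific notation (xsd:double
# literals allow an exponent): one day = 86400 s = 8.64e4.
wdv:8000170412b9aeb739d076fed903a0ff wikibase:timePrecision "8.64e4"^^xsd:double .
```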

Words like "millennium", "million" and "billion" tend to be confusing ("billion" actually does have two meanings, right?) and are, in my opinion, not useful for what we are discussing here.

I'm not sure what the definition of "significant" digits should be for what we are discussing here. "1 Million BC" and "1 after Christ" both have 1 significant digit but different precisions. Turning the precision into significant digits only makes sense with padding, which is exactly what I did. There is nothing wrong with that.

The problem with a reversed order is that it does not sort correctly, as this lexicographically sorted list shows (the precision numbers are no longer monotonic):

Ye0 = 9
Ye0M = 10
Ye0MD = 11
Ye0MDH = 12
Ye0MDHM = 13
Ye0MDHMS = 14
Ye1 = 8
Ye2 = 7
Ye3 = 6
Ye4 = 5
Ye5 = 4
Ye6 = 3
Ye7 = 2
Ye8 = 1
Ye9 = 0

Calculating fractions of a year, or turning all these different precisions into seconds, is not going to work, because it would depend on the month and on whether it's a leap year, wouldn't it?

daniel triaged this task as Normal priority. Sep 10 2015, 3:31 PM

This doesn't break anything, but we should really use semantically sensible identifiers for precision; and when we change it, it's a breaking change to our ontology. So we should ideally do this before our rdf mapping goes out of beta. Bumping to high because of this.

thiemowmde renamed this task from Human-readable serialization of TimeValue precisions in RDF to [RFC] Human-readable serialization of TimeValue precisions in RDF. Sep 10 2015, 4:46 PM
thiemowmde raised the priority of this task from Normal to High.
thiemowmde set Security to None.

Speaking of which, I created T112127 to track moving ontology from beta to release. Please assign all changes that need to be done before the move to it as blockers.

Smalyshev updated the task description. (Show Details) Nov 2 2015, 10:50 PM

Another crazy idea: why don't we just create wikibase:BillionYears, ..., wikibase:Year, ..., wikibase:Second and use those as values for a human-readable precision? Yes, that means 16 or so new individuals, but it shouldn't be that big of a deal, I think, and we can deal with storage efficiency in the Blazegraph libraries.

I do like strings such as wikibase:Second. We could also use the URIs http://www.wikidata.org/entity/Q11574 for second, http://www.wikidata.org/entity/Q573 for day, and so on. But note that both have the same problem: not easily sortable.
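The two variants side by side, for a day-precision value (the wikibase:Day individual is hypothetical; Q573 is the existing Wikidata item for "day"):

```turtle
# Variant 1: a dedicated vocabulary individual per precision level.
wdv:8000170412b9aeb739d076fed903a0ff wikibase:timePrecision wikibase:Day .

# Variant 2: reuse existing Wikidata items as precision URIs.
wdv:8000170412b9aeb739d076fed903a0ff wikibase:timePrecision wd:Q573 .
```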

thiemowmde lowered the priority of this task from High to Low. Feb 28 2018, 4:38 PM

I don’t think the following variant has been proposed yet (apologies if I missed it in the discussion above):

wdv:8000170412b9aeb739d076fed903a0ff wikibase:precision "11"^^xsd:integer. # no change

wikibase:Day wikibase:precisionValue "11"^^xsd:integer. # new!

This would be fully backwards compatible. You could continue to write your query like this –

SELECT ?item WHERE {
  ?item p:P569/psv:P569 [
    wikibase:timeValue ?decade;
    wikibase:timePrecision "8"^^xsd:integer
  ].
}
LIMIT 10

– or make it more readable like so:

SELECT ?item WHERE {
  ?item p:P569/psv:P569 [
    wikibase:timeValue ?decade;
    wikibase:timePrecision/^wikibase:precisionValue wikibase:Decade
  ].
}
LIMIT 10

This uses the somewhat obscure caret (inverse path) operator, but with some good query examples to demonstrate it that shouldn’t be a big problem. It also maintains the sortability of precisions, though it’s slightly annoying because the caret operator isn’t available in expressions:

wikibase:Decade wikibase:precisionValue ?minPrecision.
FILTER(?precision > ?minPrecision) # or just keep hard-coding "8"^^xsd:decimal, in this case that might be better, not sure

(The way the precisions are ordered, “greater than” means “more precise”, which I think is the more intuitive order, so that works out nicely.)

@Smalyshev I assume Blazegraph should be able to optimize the /^wikibase:precisionValue wikibase:Decade construct, but you’re the expert – does this sound good to you?