[Story] Decide how to represent quantities with units in the "truthy" RDF mapping
Open, Stalled, Medium, Public

Description

In the "truthy" RDF mapping, DataValues are represented as plain RDF literals (or resources), associated directly with a property (as the predicate) and an item (as the subject), e.g.

Q23 P1477 "Douglas Noël Adams"@en

How shall we represent a quantity with unit here? We could use plain strings with unit identifiers, and invent a type for those:

Q3375 P.... "2962m"^^"wikibase:value-with-unit"

But no triple store will support this out of the box. Especially not if values for the same property use different units.
Units can be normalized by converting either (a) to the unit's base unit or (b) to the property's "standard" unit (if we decide to define such a thing). Option (b) would allow us to use the plain number:

Q3375 P.... "2962"^^"xsd:decimal"

Any values that cannot be converted to the property's standard unit would be omitted.
This would work nicely with indexing and querying. But it's semantically awkward: one would have to look up the definition of the property to know "2962 of what".
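
To make option (b) concrete, here is a minimal sketch of the normalization step. Everything in it is an assumption for illustration: the conversion table, the choice of metre as the "standard" unit, and the helper name `truthy_literal` are hypothetical and not part of Wikibase.

```python
from decimal import Decimal

# Hypothetical conversion factors into a hypothetical "standard" unit (metre).
# Neither the table nor the unit labels come from Wikibase.
TO_METRE = {
    "m": Decimal("1"),
    "km": Decimal("1000"),
    "ft": Decimal("0.3048"),
}


def truthy_literal(amount, unit, factors=TO_METRE):
    """Return an xsd:decimal literal in the standard unit, or None.

    None means the unit cannot be converted, so the value would be
    omitted from the truthy mapping, as described for option (b).
    """
    factor = factors.get(unit)
    if factor is None:
        return None
    value = Decimal(amount) * factor
    return f'"{value}"^^xsd:decimal'


# Three statements for the same property, in different units.
statements = [("2962", "m"), ("2.962", "km"), ("3", "furlong")]

# Only convertible values survive; the furlong statement is dropped.
literals = [
    lit
    for lit in (truthy_literal(a, u) for a, u in statements)
    if lit is not None
]
```

The resulting literals carry no unit at all, which is exactly the semantic awkwardness noted above: a consumer has to know out of band that the property's standard unit is the metre.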

Is there a way in RDF to declare that all values of a given predicate are to be interpreted as using a specific unit of measurement?

(Side note: all this doesn't touch upon representing upper and lower bounds. These would probably use separate (derived) predicates.)

Event Timeline

daniel raised the priority of this task from to Needs Triage.
daniel updated the task description.
daniel subscribed.

Note that this discussion is no longer just about the wdt property values (called "truthy" above). Simple values are now used on several levels in the RDF encoding.

In general, the same argument as for coordinates applies: if we cannot do it right, then better not do it at all (i.e., use a bnode until we have a format). This might always be necessary in some cases (e.g., even if we convert units, there might be cases where conversion is not possible).

I agree with the advantages and disadvantages of using a custom datatype. Without BlazeGraph support for this, one would not be able to do range queries over such data, which would make it pretty useless. We might as well use strings in this case.

The normalisation of units by converting them to a base unit would still leave important problems. If there were a community-controlled way to define conversions, the "main" unit that the RDF data is normalised to might change. This would change the content and meaning of simple values even though the actual property values have not changed. Declaring the unit in other triples in the RDF dump would not solve this, since we assume many fixed (standing) queries to be in use, and these would not be able to adapt automatically to a new unit declaration. The normalisation scheme would also create problems for incremental updates: a single change in the conversion definitions would require changes in millions of simple values that are part of the export of items that have not changed at all.

A possible solution to work around the absence of a datatype and even in the absence of conversion support would be to create properties like "P1234inCm" and "P1234inInch". They would have plain number values that work in range queries. This would basically simulate the custom datatype with very similar effect on query answering (users would need to adjust queries to specify the unit that is queried for, but they would at least be sure that the data they query refers to this unit). The downside is that you need a different property for each unit, and that therefore you still have no good value to use for the simple value properties. However, I think this is how other datasets are doing it (has anybody checked DBpedia?).
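
A rough sketch of what the per-unit properties would look like in practice, using plain Python tuples as stand-in triples. The predicate naming scheme, the item ids and the helper names are all made up for illustration.

```python
from decimal import Decimal


def unit_predicate(property_id, unit_label):
    """Build a hypothetical per-unit predicate name such as 'P1234inCm'."""
    return f"{property_id}in{unit_label[:1].upper()}{unit_label[1:]}"


def unit_triple(subject, property_id, amount, unit_label):
    """Emit one (subject, predicate, plain number) triple for a quantity."""
    return (subject, unit_predicate(property_id, unit_label), Decimal(amount))


# Made-up items with values entered in different units.
triples = [
    unit_triple("Q1", "P1234", "180", "cm"),
    unit_triple("Q2", "P1234", "71", "inch"),
    unit_triple("Q3", "P1234", "150", "cm"),
]


def range_query(triples, predicate, low, high):
    """Return subjects whose value for `predicate` lies in [low, high]."""
    return [s for s, p, v in triples if p == predicate and low <= v <= high]


in_cm = range_query(triples, "P1234inCm", 100, 200)
```

Here `in_cm` contains Q1 and Q3 only. Q2 is invisible to the query because its value lives under a different predicate, which is the trade-off described above: the user is sure about the unit, but has to ask once per unit.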

If we could distinguish quantity-type properties that require a unit from those that do not allow units, there would be another option. Then we could use a compound value as the "simple" value for all properties with a unit to simulate the missing datatype. On the query level, this would be fully equivalent to having a custom datatype, since one can specify the unit and the (ranged) number individually. (In contrast, the P1234inCm properties support only the number, but no queries that refer to the unit.)

Using a compound value as a simple value is fine. It's not worse than a bnode if you do not want to look into the inner structure, but it has additional features for those who want. The only problem is that you should not mix number literals with URIs that refer to compound values for the same property -- this is why one would need to fix in the property datatype whether units are required (always there) or forbidden (never there). Mixing this would not work.
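
To illustrate why a compound value is equivalent to a custom datatype at query level, here is a small sketch with tuples standing in for triples. The node names (`value:a`, `unit:cm`) and the `amount`/`unit` predicates are placeholders, not the actual Wikibase RDF vocabulary.

```python
from decimal import Decimal

# The "simple" value of P1234 is a URI-like node that carries amount and unit.
triples = [
    ("Q1", "P1234", "value:a"),
    ("value:a", "amount", Decimal("180")),
    ("value:a", "unit", "unit:cm"),
    ("Q2", "P1234", "value:b"),
    ("value:b", "amount", Decimal("71")),
    ("value:b", "unit", "unit:inch"),
    ("Q3", "P1234", "value:c"),
    ("value:c", "amount", Decimal("150")),
    ("value:c", "unit", "unit:cm"),
]


def query(triples, prop, unit, low, high):
    """Subjects whose `prop` value has the given unit and an amount in range."""
    amounts = {s: o for s, p, o in triples if p == "amount"}
    units = {s: o for s, p, o in triples if p == "unit"}
    return [
        s
        for s, p, o in triples
        if p == prop
        and units.get(o) == unit
        and o in amounts
        and low <= amounts[o] <= high
    ]


def values_of(triples, prop):
    """Treat the compound node as opaque, like a bnode: just list the nodes."""
    return [o for s, p, o in triples if p == prop]
```

A caller can constrain unit and number independently (`query(triples, "P1234", "unit:cm", 100, 200)`), or ignore the inner structure entirely via `values_of`. The sketch also shows why mixing is a problem: if some objects of `P1234` were plain numbers instead of nodes, `query` would silently skip them.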

I think the discussion now lists all main ideas on how to handle this in RDF, but most of them are not feasible because of the very general way in which Wikibase implements unit support now. Given that there is no special RDF datatype for units and given that we have neither conversion support nor any way to specify that a property must/must not have units, only one of the options is actually possible now: export as string (no range queries, but minimally more informative than just using a blank node).

It would be possible to export data as numbers for unit-modified properties such as "P1234inCm" in addition. This can only be an additional feature though, since we still need a simple value in any case. It might not be worth doing this, since one can always use the complex value to access the number. Note that the properties "P1234inCm" would need to have very complicated, lengthy, unreadable names, since units in Wikibase are represented not as "cm", and not even as item ids, but as full URIs. But you cannot use a URI within another URI directly: you would need to escape certain characters. Moreover, the resulting string might not be allowed as a local name in abbreviations like wdt:P1234, so users would have to type the full URI. Therefore, it seems that using the (already existing) complex values in such queries would actually be more readable.
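
A quick illustration of the naming problem, using Python's standard `urllib.parse.quote`. The unit URI is just an arbitrary example entity URI, and the "in" naming scheme for the derived predicate is hypothetical.

```python
from urllib.parse import quote

# Namespace commonly abbreviated as wdt: in Wikidata queries.
WDT = "http://www.wikidata.org/prop/direct/"

# Arbitrary example of a unit given as a full entity URI.
unit_uri = "http://www.wikidata.org/entity/Q174728"

# Percent-encode every reserved character, including ':' and '/'.
escaped_unit = quote(unit_uri, safe="")

# Hypothetical per-unit predicate: property id + "in" + escaped unit URI.
local_name = "P1234in" + escaped_unit
predicate_uri = WDT + local_name

print(escaped_unit)
print(predicate_uri)
```

The escaped unit comes out as `http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ174728`, so the predicate's local name alone is over fifty characters long, compared to the five characters of `P1234`.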

daniel triaged this task as High priority. Sep 11 2015, 10:10 AM

Bumping to high. We have units now, we need to somehow represent them in RDF.

JanZerebecki renamed this task from Decide how to represent quantities with units in the "truthy" RDF mapping to [Story] Decide how to represent quantities with units in the "truthy" RDF mapping. Sep 11 2015, 10:41 AM
JanZerebecki moved this task from incoming to needs discussion or investigation on the Wikidata board.

Technically, we already are representing them, in full values. However, it indeed makes simple values which currently omit units less useful, especially if different values are expressed in different units.

The main challenge I see is how we bring them to the same unit - if we do not do this, having any additional data would be useless as no useful operation could be done on it.

> It would be possible to export data as numbers for unit-modified properties such as "P1234inCm" in addition

Unless we fix units on the ontology level (which I don't think we're doing right now, as units are just items) we can't have a property for a specific unit, since in a specific repository instance that unit may not exist. We _could_ do just that: import some unit ontology, and link it to items via some designated property. But that also needs a representation of other units in terms of that unit. For example, if yard is expressed in inches it's not enough; we'd also have to have it expressed in meters, since figuring out at runtime whether yards and meters can be inter-converted, and how, would be prohibitively expensive.
That's why I also proposed https://www.wikidata.org/wiki/Wikidata:Property_proposal/all#standard_unit which, combined with the ontology mentioned above, could enable properties like "P1234inCm".
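
As a sketch of the "express everything in terms of one unit ahead of time" idea: the chain yard → inch → metre can be resolved once, offline, so that nothing has to walk the chain at query time. The table below is hypothetical sample data, not an actual unit ontology, and the function name is made up.

```python
from decimal import Decimal

# Hypothetical unit definitions: unit -> (factor, unit it is defined in).
DEFINED_IN = {
    "yard": (Decimal("36"), "inch"),
    "inch": (Decimal("0.0254"), "metre"),
    "kilometre": (Decimal("1000"), "metre"),
}


def factor_to_base(unit, base="metre", defined_in=DEFINED_IN):
    """Follow the definition chain and return the factor into `base`.

    Returns None if the chain never reaches `base` (unknown unit or cycle),
    i.e. the unit cannot be inter-converted with the base unit.
    """
    factor = Decimal("1")
    seen = set()
    while unit != base:
        if unit in seen or unit not in defined_in:
            return None
        seen.add(unit)
        step, unit = defined_in[unit]
        factor *= step
    return factor


# Resolve all chains once, so later lookups are a single dictionary access.
BASE_FACTORS = {unit: factor_to_base(unit) for unit in DEFINED_IN}
```

With this table, `BASE_FACTORS["yard"]` is 0.9144, obtained as 36 × 0.0254. A unit with no path to the base (for example a unit of time) yields None and would have to be skipped or handled separately.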

Alternatively, we could have an ontology-neutral property like wstd:P1234 which would express P1234 in standard units, whatever they are. Or, going further, have the simple value always expressed in standard units while the "deep" value contains the original unit representation.

I think we actually have two separate items here:

  1. How to represent a value with units in any context, i.e. how to say "P1234 is 15 kilometers" (with a probable subtask of how to present the result nicely in the GUI)
  2. How to create usable "simple data" values given that the same quantity can be expressed in different units.

The second task necessarily depends on the first task, but I think it is a different one and can be solved separately.

daniel changed the task status from Open to Stalled. Jul 24 2017, 2:15 PM
daniel raised the priority of this task from High to Needs Triage.

Even though the current situation is bad, there have been few complaints. Since it has been on "high" with no progress for years, I'm putting it back on triage.

Smalyshev triaged this task as Medium priority. Dec 21 2017, 2:14 AM