
Better support for exact values in Quantity DataType
Open, Medium, Public


Some quantity properties "intrinsically" have exact values - like the number of people in a parliament, the number of planets in the solar system, etc. It's confusing to users that Wikibase will guess a margin of uncertainty of +/-1 in such cases.

Considering that this guess would be correct for e.g. the number of square meters in a room, we can't use a general rule to avoid this problem. We need some kind of hint that tells the parser which heuristic to apply when guessing the uncertainty. That hint could be

a) A per-property flag (possibly represented by a claim)
b) A separate "amount" or "number" DataType, which would also use QuantityValue, but would have different parsing rules from the "quantity" DataType used for measured values.

Note btw that "population" is a natural number count but should *not* default to exact values, but apply the +/-1 rule.

Version: unspecified
Severity: major
Whiteboard: u=dev c=frontend p=0



Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:25 AM
bzimport set Reference to bz66580.
bzimport added a subscriber: Unknown Object (MLST).

Sorry if this is the wrong bug. It appears the same problem occurs from the database.

When I wbsetclaim a quantity with value amount:+0.044405586, it reports success, but the stored value is amount:+0.04440558599999999689345031583798117935657501220703125
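The value reported back is exactly the IEEE 754 binary double closest to 0.044405586, which suggests the amount passed through a float somewhere along the storage path. A quick way to see this (a Python illustration only, unrelated to Wikibase's own code):

```python
from decimal import Decimal

# Decimal(float) exposes the exact binary value of the nearest double,
# which matches the value reported back by the API:
print(Decimal(0.044405586))
# 0.04440558599999999689345031583798117935657501220703125
```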

Still unclear to me why there is any "margin of uncertainty" by default at all. The quantity value is extracted from a reference. It is each particular reference that defines the margin. If no margin is specified by the reference, there is no way to guess what the margin would be. And still, that is what this task tries to achieve, with even more complicated logic. How would one determine that 100 has an uncertainty margin of 1, 10 or 100 if it is not stated by the reference?

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

"Note btw that "population" is a natural number count but should *not* default to exact values, but apply the +/-1 rule."

Why? Why +-1 and not +-100? has a population of 29. The statistics authority gives an exact number. has a population of 10.815.197. Also an exact number.
It is arbitrary to assume +/-1 or whatever for both numbers.

I also don't understand why there is any margin of uncertainty by default (or why that margin is +/-1). The default should be to assume that the number is correct. Most populations, for example, are from censuses and are exact numbers. While it's true there are some numbers that shouldn't be exact by default (elevation?), I can't think of a single case where it makes sense for the margin of uncertainty to be +/-1. Surely +/-0 is a better default.

@kaldari: The Quantity data type is mainly intended for measured values. These are never absolutely exact. This notion becomes crucial when applying unit conversion (we'll have that Really Soon Now TM): If a building is said to be 281 feet tall, and we convert this to meters for display, the result should not be 85.6488 - that would imply a level of accuracy not present in the original value. Conventionally, "281 feet tall" means 281 feet +/- 1 (or +/- 0.5 -- but that's a different discussion). With that assumption, we can say that the building is 85.6 meters tall, +/- 0.3. This results in one significant digit after the decimal point being included in the output, correctly reflecting the level of accuracy of the original value.
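As a rough sketch of the rounding described above (Python rather than Wikibase's actual PHP code; the function name and rounding rule are illustrative assumptions):

```python
import math
from decimal import Decimal

FT_TO_M = Decimal("0.3048")  # exact definition of the international foot

def feet_to_meters(amount: str, uncertainty: str):
    """Convert feet to meters, rounding the result to the decimal
    place implied by the converted uncertainty, to avoid false
    precision."""
    value_m = Decimal(amount) * FT_TO_M
    unc_m = Decimal(uncertainty) * FT_TO_M
    # Keep digits only down to the order of magnitude of the uncertainty.
    quantum = Decimal(1).scaleb(math.floor(math.log10(unc_m)))
    return value_m.quantize(quantum), unc_m

value, unc = feet_to_meters("281", "1")
print(value, unc)  # 85.6 0.3048
```

Without the uncertainty-driven `quantize` step, the raw product 85.6488 would be displayed, implying four decimal places of accuracy that the original "281 feet" never had.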

The case is different for exact counts, like the numbers of electrons in an atom, or the number of seats in a parliament. But in the bigger picture, these are the exception. The impression is distorted by the fact that we sadly still don't have unit support.

Introducing a separate data type for exact counts (not population!) would allow us to avoid the "assumption of uncertainty" for properties that typically have exact values.

@daniel: I don't think ±1 is somehow a better default for measured values than any other (±0.5, ±10, whatever). I think it would make more sense to mandate an explicit accuracy (with a helpful error message if missing, and possibly pre-populating the text field with "±0").

@DSGalaktos not for "some reason", but because it's the scientific standard to not assume absolute accuracy per default.

From "In science and engineering, convention dictates that unless a margin of error is explicitly stated, the number of significant figures used in the presentation of data should be limited to what is warranted by the precision of those data."

And from "If a calculation is done without analysis of the uncertainty involved, a result that is written with too many significant figures can be taken to imply a higher precision than is known".

If we want to be able to do any arithmetic with the Quantities (like unit conversion), we should not assume absolute certainty per default, since it would introduce "false precision".

@DSGalaktos ...that being said, I agree that we should make the precision (implicit or explicit) more visible and understandable.

So should there be an additional type - like exact integer - for exact values? Floats are more complicated, as computer floats are inherently inexact, but there could be a value of exactly 0.5... OTOH, we don't even know how to represent a value of exactly 1/6. But at least integers are a frequent use case. It's possible now to make it +/- 0, but it's completely unintuitive and hard to use. Maybe explicit precision (or something like what we do with years?) would be better.
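The representability point can be illustrated in a few lines (a Python sketch only; as noted below, Wikibase stores decimal strings rather than floats):

```python
from decimal import Decimal
from fractions import Fraction

half = Decimal("0.5")            # exactly representable as a decimal
sixth = Decimal(1) / Decimal(6)  # truncated at the context precision
exact_sixth = Fraction(1, 6)     # exact, but only as a rational number

print(half)         # 0.5
print(sixth)        # 0.1666...6667 (28 significant digits by default)
print(exact_sixth)  # 1/6
```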

@Smalyshev yes, see T72341.

Also, Quantity does not use floats. It's an arbitrary precision decimal string internally (limited to 127 characters in the UI).

There needs to be a precision unknown or precision unspecified as default, not any arbitrary value, +-1, +-0, or any other number. That's the most likely case for most numbers (population being probably the largest use of the number property).

I see some strong opinion that especially for population a "margin of uncertainty" is needed by default. I don't get it. Experience suggests that if there is a "margin of uncertainty" it will be mentioned in the sources. Common sense suggests that if it is not mentioned, then any default or manually specified margin, +/-1 or +/-50000, is arbitrary, since the default applies to numbers between zero and billions. They would not have the same level of uncertainty.

The population of a small village would not have the same margin of uncertainty as the population of a country. For example: a small island with a population of 1 (one). Is it ±1? Does it suggest that the population is in the range of 0-2? No, the census agents found only one person stating that he lives on that island. Other small islands and abandoned villages have a population of 0 (zero). The arbitrary "margin of uncertainty" would suggest that "maybe there is at least one" even if there is no building to house him.

And while the default ±1 for our small island may suggest that the census agents either missed a person or counted someone who does not live there, the same default ±1 suggests that the population of the country is precise to one person more or less, even though adding up the populations of all such cities, villages and islands would also require summing their margins of uncertainty.

On the other hand, for the USA we also have ±1 for a population of 318.697.314!!! Probably the default. But what does that mean? The census on that small island may have a 100% difference from reality, but an estimate for the whole USA does not have more than a 0,0000000001% error?
For Germany it is at ±500, for France ±50.000, Netherlands ±1...
No. None of the sources used for these populations mention any precision or margin of uncertainty.
If for an almost uninhabited island we specify a default ±1, then ±500 sounds better for a population of millions, and ±50.000 sounds even better. But why do I need to choose?

@geraki I agree that +/- 1 is a bad default for population. The only default that would be even worse is +/- 0.

The issue is that we need a heuristic that isn't specific to population, or to counts, but can be used for all quantities, including measured amounts like length, temperature, weight, etc. These are never absolutely exact.

In science, there's a convention to indicate the level of uncertainty using the number of digits given. So 3.21 would imply an uncertainty of +/- 0.01, and 7e2 would imply +/- 100. The same convention dictates that the uncertainty of the number 44376 is +/-1.
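The significant-digit convention described here can be sketched compactly (a hypothetical Python illustration; the function name is invented and this is not Wikibase's parser):

```python
from decimal import Decimal

def implied_uncertainty(text: str) -> Decimal:
    """Uncertainty implied by significant digits: +/- 1 in the
    place of the last digit given."""
    # Decimal preserves the exponent of the last digit as written,
    # so "3.21" keeps exponent -2 and "7e2" keeps exponent 2.
    exponent = Decimal(text).as_tuple().exponent
    return Decimal(1).scaleb(exponent)

print(implied_uncertainty("3.21"))   # 0.01
print(implied_uncertainty("7e2"))    # 1E+2 (i.e. 100)
print(implied_uncertainty("44376"))  # 1
```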

If the heuristic were just for population, I'd probably say the default should be +/- 1% or some such. But the software does not, and should not, know about "population". Now, we could have a statement on the "population" property that would change the uncertainty convention to be different from the "one size fits all" default. An idea worth discussing, but not at all easy to do, if you look into the nitty gritty.

@daniel The default uncertainty value is not bad only for population. It is also bad for atomic number (P1086), number of masts (P1099), seating capacity (P1083), HDI value (P1081), total quantity produced (P1092), number of cylinders (P1100), floors above ground (P1101), votes received (P1111) etc. In fact I don't see many properties with the number data type where a (default or not) level of uncertainty makes sense.

It is better (for the number datatype) for the default to be precision unknown or precision unspecified. Leave it to third-party applications to choose whether they actually need to apply a default level of uncertainty for a specific context.

@geraki: Yes, these properties refer to exact counts of things. That's exactly why I propose to introduce a separate data type for counts, see T72341. The current situation is distorted by the fact that we don't have measured amounts, because we don't support units yet. So currently most quantities are counts, and the defaults are kind of odd for that. But that's just the first use case, not the main use case, for the quantity data type (note that there is no "number" data type. It's a quantity, not a number).

Introducing "unspecified precision" into the data model is also worth considering, I think. But we'll have to consider this carefully.

@daniel: Wikipedia manages to do sensible automatic unit conversion without specifying uncertainty (probably by looking at the number of digits), so why can't Wikidata? I realize that setting the default uncertainty to 0 is wrong for measurements, but couldn't null be the default, and the conversion script switch to using number of digits in the case of null uncertainty? Also, I believe the implied uncertainty of whole number measurements is +/-0.5, not +/-1.

@daniel: Wikipedia manages to do sensible automatic unit conversion without specifying uncertainty (probably by looking at the number of digits), so why can't Wikidata?

No it doesn't - have a look at
It uses a sensible default, but in many cases, you actually have to specify the precision explicitly.

I realize that setting the default uncertainty to 0 is wrong for measurements, but couldn't null be the default, and the conversion script switch to using number of digits in the case of null uncertainty?

That's pretty much what we do: if no uncertainty is specified, we look at the number of digits to determine it.
We could also do this upon conversion, but then we'd need to store an extra flag for "auto precision". Null could work. I'm a bit worried that this would cause a performance problem when doing a lot of conversions at once (e.g. when normalizing measurements to a single unit, for indexing and comparison); Calculations are done on decimal strings, not floats. This can get expensive.

But what about API output and dumps? Should those omit the uncertainty, or contain the calculated uncertainty? If we omit it, people are likely to wrongly assume 0; if not, we force them to implement their own logic for significance arithmetic, leading to duplicate code and inconsistencies. If we always calculate and include it - then we can just as well calculate it right away and store it.

Also, I believe the implied uncertainty of whole number measurements is +/-0.5, not +/-1.

That was also my intuition, but @Denny convinced me that this was wrong. Reading up on this now, Accuracy_and_precision#Quantification seems to support your (and my) intuition. The only thing I could quickly find in support of Denny's version is Significant_figures#Estimating_tenths, but it doesn't really seem to apply.

We should probably re-examine this question. Filed a ticket: T105623: [Task] Investigate quantification of quantity precision (+/- 1 or +/- 0.5)

daniel set Security to None.

Note btw that "population" is a natural number count but should *not* default to exact values, but apply the +/-1 rule.

This doesn't make sense to me. A population count is a count, not a measurement. It doesn't have uncertainty. It will never be converted to other units. It may be totally wrong, but that's different than measurement uncertainty/precision. Even if it is an estimate, the estimate is still a specific number, unless an uncertainty is provided. I would strongly argue that a population should be considered an "amount" or "number", rather than a quantity with precision.

@kaldari Population numbers are very rarely counts. They are usually estimates or extrapolations, which always have uncertainty, even if they are not explicitly stated. Exact counts would be the number of electrons in an atom, or the number of seats in the parliament. There is no margin of error there, they are not measurements. Population is never an exact count.

It seems that the following consensus is forming:

  • if no precision is given in the input, we should just store the number without any uncertainty information (but not with "no uncertainty", which would be +/-0).
  • if an uncertainty interval is explicitly given, it is always shown.
  • when an uncertainty is needed to provide rounding to avoid "false precision" (in particular, after unit conversion), the uncertainty is derived from the number of significant digits. "Automatic" uncertainty would not be shown to the user per default.
    • The +/-0.5 approach is preferred for consistency with the rounding algorithm. Edge case: integers with trailing zeros, e.g. 1200, are ambiguous (could be 2 or 4 significant digits).
  • Properties that represent exact counts (seats in parliament, members in a team) should be converted to a new data type ("number") that defaults to +/-0 and has no units. (edge cases exist: is population an exact count? Should the number of dogs in a sled team have the unit "dog"?). Values of such properties that are currently marked as +/-1 can be changed to +/-0 by bot.
  • Properties for measured quantities that are currently marked with +/-1 or +/-0 can be changed to "unknown uncertainty" by bot. The rare cases that actually should have +/-0 can be skipped by the bot or fixed by hand.

any objections?
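To make the proposal concrete, the consensus could be modeled roughly like this (a hypothetical Python sketch; the class and method names are invented, and Wikibase's actual data model differs):

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

@dataclass
class QuantityValue:
    amount: str                        # arbitrary-precision decimal string
    uncertainty: Optional[str] = None  # None = "unspecified", not +/-0

    def effective_uncertainty(self) -> Decimal:
        """An explicit uncertainty is used as given; otherwise one is
        derived from the significant digits (+/- 0.5 in the place of
        the last digit, matching the rounding algorithm) and would
        not be shown to the user."""
        if self.uncertainty is not None:
            return Decimal(self.uncertainty)
        exponent = Decimal(self.amount).as_tuple().exponent
        return Decimal(5).scaleb(exponent - 1)

print(QuantityValue("85.6").effective_uncertainty())         # 0.05
print(QuantityValue("85.6", "0.3").effective_uncertainty())  # 0.3
```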

Population is never an exact count.

It seems to me that most population numbers in Wikidata are census counts, not estimates, although I could be wrong.

Even census counts are rarely absolutely exact, though in some contexts they are "assumed" to be, to make things simpler. I consider this an edge case that can be solved by either treating population as a measurement and setting +/-0 when appropriate, or treat it as a count and explicitly set +/-1000 or whatever when appropriate. It's up to the community to decide which they want to do.

  • Properties that represent exact counts (seats in parliament, members in a team) should be converted to a new data type

No! Not yet another one, please!

Macro over9000: over 9000 data types

@Ricordisamoa why not? What's the problem with another data type?

@Ricordisamoa why not? What's the problem with another data type?

First, the use case for "exact counts" having a data type on their own seems weak; then, it'd set a precedent for datatype changes becoming much more frequent than they should be.

I think if it were easier/more obvious how to make an exact value (+-0 is not the most obvious incantation), and if the display format for 100 mln were "100 mln" and not "100,000,000 +- 1,000,000" - just like we do with dates - the problem would be much less acute in this case.

I believe I've said this data type is wrong a few times (!) back at WMDE. It tries to both be a representation for a scalar datatype with a precision, and a range datatype. It fails badly at both.

Split the datatype into a scalar type and a range type, both with precision, or use a bignum representation of the numbers. [Does it already use bignum? This should be documented; Lua doesn't support bignum.]

If this is to be kept as is, then make it possible to use the value both as a scalar and as a range datatype, with an explicit precision. The precision should not say anything about the uncertainty of the numbers. That is, precision and uncertainty are two different concepts.

For measurements with uncertainty either make representations of three-point estimation, five-number summary, and seven-number summary, or add an explicit extension of the value for uncertainty.