[Task] Investigate quantification of quantity precision (+/- 1 or +/- 0.5)
Closed, ResolvedPublic

Description

We currently assume an uncertainty interval of +/-1 of the least significant digit of a given quantity:

10.2  =>  10.2 +/- 0.1  =>  interval [ 10.1, 10.3 ]  => uncertainty magnitude 0.2
13e3  => 13e3 +/- 1e3 =>  interval [ 12e3, 14e3 ]  => uncertainty magnitude 2e3

This means the magnitude of the uncertainty interval is twice the magnitude of the least significant digit. It seems more intuitive to assume that the size of the interval equals the magnitude of the least significant digit:

10.2  =>  10.2 +/- 0.05  =>  interval [ 10.15, 10.25 ]  =>  uncertainty magnitude 0.1
13e3  => 13e3 +/- 0.5e3 =>  interval [ 12.5e3, 13.5e3 ]  => uncertainty magnitude 1e3

This is also supported by the Wikipedia article on the quantification of accuracy and precision:
https://en.wikipedia.org/wiki/Accuracy_and_precision#Quantification

Perhaps the current implementation came about due to confusion about the magnitude of the uncertainty, and the +/- notation.
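
The two conventions compared above can be sketched in a few lines. This is only an illustration, not the actual parser code: the function name `implied_interval` and its `half_unit` flag are made up for this example, and the place value of the least significant digit is read from Python's `Decimal` exponent.

```python
from decimal import Decimal


def implied_interval(text: str, half_unit: bool) -> tuple[Decimal, Decimal]:
    """Return (lower, upper) implied by the last written digit of `text`.

    half_unit=False mimics the current +/-1 behaviour,
    half_unit=True mimics the proposed +/-0.5 behaviour.
    """
    amount = Decimal(text)
    # Place value of the least significant digit, e.g. 0.1 for "10.2".
    step = Decimal(1).scaleb(amount.as_tuple().exponent)
    delta = step / 2 if half_unit else step
    return amount - delta, amount + delta


print(implied_interval("10.2", half_unit=False))  # current behaviour
print(implied_interval("10.2", half_unit=True))   # proposed behaviour
print(implied_interval("13e3", half_unit=True))
```

With `half_unit=True` the width of the interval equals the place value of the last digit; with `half_unit=False` it is twice that.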

Event Timeline

daniel set Security to None.

Bumping to "high", since this may cause confusion and (slight) data corruption.

Jonas renamed this task from Investigate quantification of quantity precision (+/- 1 or +/- 0.5) to [Task] Investigate quantification of quantity precision (+/- 1 or +/- 0.5). Aug 13 2015, 8:56 PM
Jc3s5h added a subscriber: Jc3s5h. Aug 31 2015, 5:26 PM
Jc3s5h added a comment. Edited Aug 31 2015, 5:58 PM

The discussion of this in Land Surveyor Reference Manual 2nd ed., Andrew L. Harbin, Belmont CA: Professional Publications, pp. 1-8 through 1-10 agrees with my own experience. My experience includes creating voltage and current test specifications for mainframe computer chips.

A measurement usually contains a number of digits for which there is no doubt, and one more digit which has a similar magnitude as the uncertainty. Unless an unusual effort has been made to nail down the amount of uncertainty, it is worthless to include more than one uncertain digit in the quantity. If the uncertainty is important for the way the quantity is being used, the uncertainty will be specified explicitly, but for less critical use, it will be implied by only including one uncertain digit. So for less critical uses, 10.2±0.05, 10.2±0.1, or 10.2±0.4 would all be written 10.2. This does not lend itself to converting to an interval such as [10.1, 10.3]. Tough. The source decided not to record enough information to create that interval, so we can't.

Another point is that it is not normal in engineering or science when doing unit conversions, to try to keep traces of the original source units by keeping excessive significant digits in the uncertainty. If, for example, 10.2±0.2 m were converted to feet, it would be expressed as 33.5±0.7 ft. One reason for not attempting to keep traces of the original source units is that a calculation may contain units from several sources. For example, for a while in the US federally funded highway projects had to use metric units. The units going into a particular dimension might have come from a design value in meters, a distance measured in 1965 in feet, and an 18th century road width in rods. Following a non-standard procedure that only works for the simplest of calculations seems like a bad idea to me.

I would also mention that a quantity can be both approximate and exact at the same time. For example, the population of each state in the US is approximate; the Bureau of the Census does their best, but an exact count is just not feasible. But the count is treated as if it were an exact number when determining how many representatives in Congress each state gets, because attempting to treat the counts as approximations would just lead to endless bickering, if not another civil war.

thiemowmde added a project: DataValues. Edited Sep 3 2015, 10:58 AM

How is this ticket related to T95425: [Bug] Quantity formatter rounding causes significant data loss? As far as I understand they are closely related, but:

  • This ticket is about the precision auto-detection in the parser. Typically, when a user just enters "5", the parser auto-detects the before/after lower/upper fields as "+4/+6" and the formatter finally renders this as "5+/-1". This ticket suggests to change this to "+4.5/+5.5" in the parser, while keeping the "5+/-1" in the formatter. Is this correct? Personally I do agree with this. Go for it.
  • T95425 is exclusively about the formatter truncating relevant information, e.g. "1.5+/-1" is wrongly rendered as "2+/-1".
Jc3s5h added a comment. Edited Sep 3 2015, 1:22 PM

I think a few terms need to be corrected. Looking at the Wikidata Sandbox, which currently contains the numeric value 724.920±0.001

This is rendered in JSON as

{"mainsnak":{"snaktype":"value","property":"P1181","datavalue":{"value":{"amount":"+724.920","unit":"1","upperBound":"+724.921","lowerBound":"+724.919"},"type":"quantity"},"datatype":"quantity"},"type":"statement","id":"Q4115189$7056ca78-4ac9-4936-b168-9614a1d89ac3","rank":"normal"}
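
As a small illustration of how the ± value relates to the stored bounds, the sketch below reads the `value` object from the JSON above and recovers the uncertainty. The helper name `plus_minus` is invented for this example, and it assumes the bounds are symmetric around the amount.

```python
import json
from decimal import Decimal

# The "value" object copied from the statement JSON above.
RAW = r'''{"amount":"+724.920","unit":"1",
           "upperBound":"+724.921","lowerBound":"+724.919"}'''


def plus_minus(value: dict) -> Decimal:
    """Half the width of [lowerBound, upperBound], assuming symmetric bounds."""
    upper = Decimal(value["upperBound"])
    lower = Decimal(value["lowerBound"])
    return (upper - lower) / 2


value = json.loads(RAW)
print(Decimal(value["amount"]), "±", plus_minus(value))
```

The amounts are kept as strings in the JSON, so parsing them with `Decimal` rather than `float` preserves the trailing zero in 724.920.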

If I remember correctly, "before" and "after" are for dates, not quantities. (Dates are a horrible mess, but that's a topic for a different place.)

Except for using "before" and "after" I think thiemowmde's comment of Sep 3, 10:58 is correct.

kaldari added a comment. Edited Sep 9 2015, 9:53 PM

I agree with Daniel that the magnitude of the uncertainty interval should match the magnitude of the least significant digit. In other words, the default precision (if none is specified) should be +/-0.5, not +/-1. The whole point of significant digits is that you can assume they represent an error interval that is equivalent to the magnitude of the digit's place. In other words, if you measured the quantity more precisely, and rounded to the original significant digit, you would still get the same value. If that isn't true, you're supposed to add an explicit margin of error (e.g. +/-100). By having a default precision of +/-1, we are suggesting that no values in Wikidata can be taken at face value (i.e. 845 might actually mean 844 or 846). This is both confusing and inaccurate (in the majority of cases).

If the least significant digit is m times ten to the nth power, I would set the default uncertainty to 5 times ten to the nth power, which is the greatest uncertainty that might apply in that situation. If the uncertainty were any worse, the person who wrote the value should have used fewer significant digits. In other words, the default should be the worst reasonable uncertainty; let the editor specify if the error is better than this.

*bump* because of units

If the least significant digit is m times ten to the nth power, I would set the default uncertainty to 5 times ten to the nth power, which is the greatest uncertainty that might apply in that situation. If the uncertainty were any worse, the person who wrote the value should have used fewer significant digits. In other words, the default should be the worst reasonable uncertainty; let the editor specify if the error is better than this.

On further reflection, I would do what I described above, unless there was only one significant digit. In that case I would set the precision to 1 times ten to the n. Examples: 7000 ± 1000; 1920 ± 50.
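
Read literally, the rule proposed in this comment could be sketched as follows. This is an illustration only: the function name is hypothetical, and it adopts the commenter's assumption (disputed later in this thread) that trailing zeros to the left of the decimal point are not significant.

```python
from decimal import Decimal


def proposed_default_uncertainty(text: str) -> Decimal:
    """Default ± value under the rule as worded in the comment above.

    Least significant digit m * 10**n  ->  5 * 10**n,
    except with a single significant digit -> 1 * 10**n.
    """
    _sign, digits, exponent = Decimal(text).as_tuple()
    digits = list(digits)
    # Assumption: trailing zeros of a whole number are not significant.
    if exponent >= 0:
        while len(digits) > 1 and digits[-1] == 0:
            digits.pop()
            exponent += 1
    factor = 1 if len(digits) == 1 else 5
    return Decimal(factor).scaleb(exponent)


for sample in ("7000", "1920", "10.2"):
    print(sample, "±", proposed_default_uncertainty(sample))
```

For 7000 and 1920 this reproduces the two examples given in the comment (± 1000 and ± 50).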

  • This ticket is about the precision auto-detection in the parser. Typically, when a user just enters "5", the parser auto-detects the before/after lower/upper fields as "+4/+6" and the formatter finally renders this as "5+/-1". This ticket suggests to change this to "+4.5/+5.5" in the parser, while keeping the "5+/-1" in the formatter. Is this correct? Personally I do agree with this. Go for it.

To clarify: if the parser detected "+4.5/+5.5" for the input 5, the formatter should output this as "5+/-0.5" - or, if we decide to omit the uncertainty in the output if it's better than or equal to the default, just "5".

daniel added a comment. Edited Sep 11 2015, 10:04 PM

On further reflection, I would do what I described above, unless there was only one significant digit. In that case I would set the precision to 1 times ten to the n. Examples: 7000 ± 1000; 1920 ± 50.

We cannot assume that trailing zeros are insignificant. The default uncertainty for 7000 should be the same as the default uncertainty of 7777 (currently +/-1, possibly to be +/-0.5 in the future). "seven thousand" can be written as 7e3 to indicate that only the 7 is significant. This would currently be interpreted as 7000+/-1000 (possibly to become +/-500 in the future).

We cannot assume that trailing zeros are insignificant. The default uncertainty for 7000 should be the same as the default uncertainty of 7777 (currently +/-1, possibly to be +/-0.5 in the future). "seven thousand" can be written as 7e3 to indicate that only the 7 is significant. This would currently be interpreted as 7000+/-1000 (possibly to become +/-500 in the future).

Yes, we can interpret trailing zeros to the left of the decimal point as not significant, unless the tolerance is specified explicitly with the ± symbol. That is standard practice in science and engineering. For example, Land Surveyor Reference Manual 2nd ed. by Andrew Harbin p. 108 states:

"A zero is not significant if it occurs at the end of a measured number unless information is available which indicates that it is."

Some people may be ignorant of this convention, but by default we should adopt the greatest reasonable uncertainty.

daniel added a comment. Edited Sep 14 2015, 6:38 PM

@Jc3s5h Hm... it doesn't seem to be that clear cut. A quick search brought up a somewhat inconclusive picture:

Frontiers of Science: Scientific Habits of Mind (Columbia University) seems to agree with you:

  1. Trailing zeros in a whole number with no decimal shown are NOT significant. Writing just "540" indicates that the zero is NOT significant, and there are only TWO significant figures in this value.

Wikipedia (referencing Higham, Nicholas (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). SIAM. p. 3. and Myers, R. Thomas; Oldham, Keith B.; Tocci, Salvatore (2000). Chemistry), however, says:

The significance of trailing zeros in a number not containing a decimal point can be ambiguous. For example, it may not always be clear if a number like 1300 is precise to the nearest unit (and just happens coincidentally to be an exact multiple of a hundred) or if it is only shown to the nearest hundred due to rounding or uncertainty.

The ChemTeam Tutorial for High School Chemistry puts it this way:

How will you know how many significant figures are in a number like 200? In a problem like below, divorced of all scientific context, you will be told. If you were doing an experiment, the context of the experiment and its measuring devices would tell you how many significant figures to report to people who read the report of your work.

In other words: context is needed.

In practice, assuming that 3000 has just 1 significant digit is going to be right in 90% of the cases (exactly 90%: the chance that the last digit is 0 by accident is 1:9). 90% sounds pretty good, but it's going to be wrong thousands of times per day. Erring on the side of caution will more often be wrong, but will not lead to loss of (displayed) information.

I'm not sure what would be best. Since this kind of "guessing" only occurs during interactive input, we could improve the interaction to avoid mistakes. For example, when the user enters trailing zeros, we could show a popup asking them to tell us the number of significant digits; we could also gray out any insignificant digits in a preview/feedback display. This needs more thought, and probably a separate ticket.

This ticket is about deciding whether the magnitude of the default uncertainty should be halved. And I'm increasingly convinced that it should. Not surprising, since that was my initial intuition. But I want to be really sure before making a change like this. Being forced to change it again later would be awkward.

The fact that the parser "guesses" a precision based on basically zero information always was and still is wrong. It must default to ±0. Everything else is misleading and a source of significant confusion and actual errors.

I'm not sure if this is the correct ticket for that since we already have quite a lot related to this artificial and still unresolved issue.

What the user enters must be respected. When he enters "6" he does not mean "maybe 5, maybe 7, I do not know". This is misleading. He entered "6". Nothing else. That's what we know and what we must store. The same is true not only for integers but for all precisions, be it 6 million or 1.06. When the user enters "1.06" he did not mean to enter "I do have a number here but I'm not sure if it's 1.05 or 1.07, it could be something in between". When he enters "1.06" he does this because this is what's stated in the source. Exactly this and not something else. Assuming anything is just plain wrong.

  • "1.06 feet" is entered via the UI or API. It must be stored as { amount: 1.06, unit: feet, upperBound: 1.06, lowerBound: 1.06 }. This is what we know at this point. Nothing else. There is no "false precision" in this raw value.[*]
  • 1 ft = 0.3048 m, so when converting 1.06 feet to meters it would become 0.323088 m.
  • At this point, this is false precision. The original value was not that precise. The original 1.06 feet had 2 decimal places, the last being 0.01 feet = 0.003048 m.
  • This, along with the discussion above, means the precision of the converted value is at best ±0.005 feet = ±0.001524 meter.
  • This is what we store: { amount: 0.323088, unit: meter, upperBound: 0.324612, lowerBound: 0.321564 }.
  • A naive formatter would render this as 0.323088±0.001524 m.
  • A formatter with a rounding option could render this as 0.3231±0.0015 m or 0.323±0.002 m, but should never render this as 0.32±0.00 m.
  • When converted back to feet, 0.323088±0.001524 m becomes 1.06±0.005 and is stored as such, resulting in { amount: 1.06, unit: feet, upperBound: 1.065, lowerBound: 1.055 }. Yes, this is different from the original, raw value. It must be. Conversion was applied twice. We can not be sure any more what the original intent was.
  • When converting one of the rounded values back:
    • 0.3231±0.0015 m becomes 1.06003937±0.00492126 feet. A formatter with rounding can render this as 1.06±0.005 feet.
    • 0.323±0.002 m becomes 1.059711286±0.00656168 feet. A formatter with rounding can render this as 1.06±0.007 feet.

TL;DR: Store raw values with ±0. Recalculate the ± value and let it grow every time a value is converted.

[*]We could also store { upperBound: unknown, lowerBound: unknown }. This would not change the arguments below.
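
The conversion step in the walk-through above can be reproduced with a short sketch. The function name and the returned dictionary layout are illustrative only (they mirror the field names used in the bullets, not any actual storage code), and the ± value is derived from half a unit of the last written decimal place, as in the example.

```python
from decimal import Decimal

FEET_TO_METERS = Decimal("0.3048")


def convert_with_implied_uncertainty(amount_text: str, factor: Decimal) -> dict:
    """Convert a raw value and attach bounds of half a unit in its last place."""
    amount = Decimal(amount_text)
    # Half a unit of the last written digit, e.g. 0.005 for "1.06".
    half_unit = Decimal(5).scaleb(amount.as_tuple().exponent - 1)
    converted = amount * factor
    delta = half_unit * factor
    return {
        "amount": converted,
        "upperBound": converted + delta,
        "lowerBound": converted - delta,
    }


print(convert_with_implied_uncertainty("1.06", FEET_TO_METERS))
```

Using `Decimal` keeps the arithmetic exact here, so the result matches the figures in the bullets digit for digit.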

I must take issue with thiemowmde's argument:

  • "At this point, this is false precision. The original value was not that precise. The original 1.06 feet had 2 decimal places, the last being 0.01 feet = 0.003048 m."

Suppose we do as thiemowmde suggests; when the user enters 1.06 feet in the user interface, with no ± character and no uncertainty specified, we store { amount: 1.06, unit: feet, upperBound: 1.06, lowerBound: 1.06 }. We no longer have the user's original input; we now have a declaration that the value is exactly 1.06 feet. If the user had entered 1.06±0 because the user knew the value was exact, it would be stored exactly the same way. So if it is converted to meters the value becomes 0.323088±0 m.

I agree with Jc3s5h. Yes, using +/-1 is basically wrong, but we shouldn't throw out the idea of significant digits entirely. We should just implement them correctly. Otherwise, our converted values will have false certainty in most cases.

The fact that the parser "guesses" a precision based on basically zero information always was and still is wrong. It must default to ±0. Everything else is misleading and a source of significant confusion and actual errors.

A better approach if possible might be to default to -1, or something else indicating the absence of data, but otherwise I completely agree with this.

As posted on wikidata-l, taking the current approach of +- 1 SF is a really bad idea. As an example, say you have a length of 100m. Which significant digit do you assume is correct? Is this +- 100m, 10m or 1m? What if it's referring to the length of a 100m run, where the accuracy could be much higher than the significant digit given, e.g. 100m +- 1cm? Or what if it's the size of a crater on a distant planet, where it might be 100m+-50m? Or if the actual value is 100m +- 3m, but we say that it's +- 1m (which I see is the default in this case), which might be believable to readers but very misleading in reality?

If there's no uncertainty given, then please just say that rather than trying to make one up!

I must take issue with thiemowmde's argument:

  • "At this point, this is false precision. The original value was not that precise. The original 1.06 feet had 2 decimal places, the last being 0.01 feet = 0.003048 m."

Suppose we do as thiemowmde suggests; when the user enters 1.06 feet in the user interface, with no ± character and no uncertainty specified, we store { amount: 1.06, unit: feet, upperBound: 1.06, lowerBound: 1.06 }. We no longer have the user's original input; we now have a declaration that the value is exactly 1.06 feet. If the user had entered 1.06±0 because the user knew the value was exact, it would be stored exactly the same way. So if it is converted to meters the value becomes 0.323088±0 m.

In that case, it would make sense to round the result to 0.32m, i.e. to lose some of the precision in the original number to handle the absence of an uncertainty. Where the uncertainty in the first number is known, then this uncertainty should be propagated (see https://en.wikipedia.org/wiki/Propagation_of_uncertainty ) and the numbers given appropriately, i.e. to 1SF of the newly calculated uncertainty (although this is sometimes given to higher accuracy where the first significant digit in the uncertainty is 1, e.g. 0.11 or 0.19, to permit more accurate unit conversions in the future).

As posted on wikidata-l, taking the current approach of +- 1 SF is a really bad idea. As an example, say you have a length of 100m. Which significant digit do you assume is correct? Is this +- 100m, 10m or 1m? What if it's referring to the length of a 100m run, where the accuracy could be much higher than the significant digit given, e.g. 100m +- 1cm? Or what if it's the size of a crater on a distant planet, where it might be 100m+-50m? Or if the actual value is 100m +- 3m, but we say that it's +- 1m (which I see is the default in this case), which might be believable to readers but very misleading in reality?

The problem with defaulting to +/-0 or "unknown" significance is that we can't apply rounding - and if we can't apply rounding, we can't apply conversion without introducing false precision. For example, 2m +/-0 would convert to 6.56167979ft +/-0. This implies a precision to sub-micron levels, which would very likely be wrong, and also not useful in infoboxes. If however we assume 2m+/-0.5, that converts to 7ft (+/-1.6). That's much more sensible, and avoids false precision.

If we did not plan to support unit conversion, I would be ready to go along with your argument. We would simply say we don't know the precision. With unit conversion, however, we can't do this. And relying on the conventions for specifying significant digits seems the best we can do.
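
To make the rounding argument concrete, here is a minimal sketch of rounding a converted value to the place of the first significant digit of its uncertainty. The function name is made up, and rounding half up to one significant digit is an assumption chosen for this example, not a description of the existing formatter.

```python
from decimal import Decimal, ROUND_HALF_UP

METERS_PER_FOOT = Decimal("0.3048")


def round_to_uncertainty(value: Decimal, uncertainty: Decimal) -> tuple[Decimal, Decimal]:
    """Round value and uncertainty to the uncertainty's leading digit place."""
    if uncertainty <= 0:
        raise ValueError("a positive uncertainty is needed to pick a rounding place")
    # adjusted() is the exponent of the most significant digit.
    place = Decimal(1).scaleb(uncertainty.adjusted())
    return (
        value.quantize(place, rounding=ROUND_HALF_UP),
        uncertainty.quantize(place, rounding=ROUND_HALF_UP),
    )


# 2 m +/- 0.5 m expressed in feet, then rounded.
feet = Decimal(2) / METERS_PER_FOOT
feet_unc = Decimal("0.5") / METERS_PER_FOOT
print(round_to_uncertainty(feet, feet_unc))
```

With an uncertainty of zero there is no place to round to, which is the problem described above: the sketch raises an error instead of guessing.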

If there's no uncertainty given, then please just say that rather than trying to make one up!

We are not making one up. The precision is given implicitly in the decimal notation of the number, using the convention of significant digits. This is quite unambiguous for cases like 3.20 (three significant digits) or 2.3e3 (2300 with two significant digits). It's ambiguous for input like 200 or 1700 - there's a good chance that the zeros are insignificant, but we don't really know. We should improve our UI to help the user with correct input.

As posted on wikidata-l, taking the current approach of +- 1 SF is a really bad idea. As an example, say you have a length of 100m. Which significant digit do you assume is correct? Is this +- 100m, 10m or 1m? What if it's referring to the length of a 100m run, where the accuracy could be much higher than the significant digit given, e.g. 100m +- 1cm? Or what if it's the size of a crater on a distant planet, where it might be 100m+-50m? Or if the actual value is 100m +- 3m, but we say that it's +- 1m (which I see is the default in this case), which might be believable to readers but very misleading in reality?

The problem with defaulting to +/-0 or "unknown" significance is that we can't apply rounding - and if we can't apply rounding, we can't apply conversion without introducing false precision. For example, 2m +/-0 would convert to 6.56167979ft +/-0. This implies a precision to sub-micron levels, which would very likely be wrong, and also not useful in infoboxes. If however we assume 2m+/-0.5, that converts to 7ft (+/-1.6). That's much more sensible, and avoids false precision.
If we did not plan to support unit conversion, I would be ready to go along with your argument. We would simply say we don't know the precision. With unit conversion, however, we can't do this. And relying on the conventions for specifying significant digits seems the best we can do.

That makes sense in the back-end to make sure that converted values have reasonable levels of precision, but does it have to be in the front-end as well, or stored in the database? A line of code that checks whether the uncertainty has been set or not, and assumes a minimum uncertainty for conversion purposes, should handle this issue smoothly, without mis-estimates of the uncertainty of the given value being displayed to readers.
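
The "line of code" suggested here might look roughly like the sketch below. All names are hypothetical, and the fallback of half a unit in the last written place is just one possible choice of minimum uncertainty.

```python
from decimal import Decimal
from typing import Optional


def bounds_for_conversion(
    amount_text: str,
    lower: Optional[Decimal] = None,
    upper: Optional[Decimal] = None,
) -> tuple[Decimal, Decimal, bool]:
    """Return (lower, upper, assumed) to use when converting units.

    Stored bounds are used as-is; if none were stored, a fallback is
    derived from the written digits and flagged as assumed.
    """
    amount = Decimal(amount_text)
    if lower is not None and upper is not None:
        return lower, upper, False
    # Fallback used for conversion only: half a unit of the last digit.
    half_unit = Decimal(5).scaleb(amount.as_tuple().exponent - 1)
    return amount - half_unit, amount + half_unit, True


print(bounds_for_conversion("1.06"))
print(bounds_for_conversion("1.06", Decimal("1.04"), Decimal("1.08")))
```

The third element of the result lets a caller keep the assumed bounds out of what is displayed or stored.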

If there's no uncertainty given, then please just say that rather than trying to make one up!

We are not making one up. The precision is given implicitly in the decimal notation of the number, using the convention of significant digits. This is quite unambiguous for cases like 3.20 (three significant digits) or 2.3e3 (2300 with two significant digits). It's ambiguous for input like 200 or 1700 - there's a good chance that the zeros are insignificant, but we don't really know. We should improve our UI to help the user with correct input.

I'm not convinced that the implicit assumption you're making here will work for most situations, so it really shouldn't be displayed to the reader. We should definitely be encouraging editors to add more accurate estimates of uncertainties at the same time as the numbers are added, though!

One thing I'm particularly worried about here is that there doesn't seem to be a good way to tell assumed uncertainties and referenced uncertainties apart - which will be a huge headache to fix once this data format is in common usage! So please, let's get this right asap!

If we did not plan to support unit conversion, I would be ready to go along with your argument. We would simply say we don't know the precision. With unit conversion, however, we can't do this. And relying on the conventions for specifying significant digits seems the best we can do.

That makes sense in the back-end to make sure that converted values have reasonable levels of precision, but does it have to be in the front-end as well, or stored in the database? A line of code that checks whether the uncertainty has been set or not, and assumes a minimum uncertainty for conversion purposes, should handle this issue smoothly, without mis-estimates of the uncertainty of the given value being displayed to readers.

We can of course discuss if, when and how the explicit +/-X is shown to the user. I'm completely open to that. One sensible suggestion was to hide it if the actual uncertainty is the same as what we would assume from the decimal representation. In that case, it's OK to hide it, I think. Maybe also if the precision is better than what we would assume. Maybe. But in any case it's crucial to understand that we *have* to consider uncertainty everywhere if we want to allow conversion.

I think it makes sense to store the uncertainty in the database, since *if* we assume an uncertainty at some point, users should be able to see, check, modify, and compare it. Also, we need to be able to apply unit conversion for queries, otherwise we couldn't compare feet to meters. And we have to take uncertainty into account, so we know that 2m +/- 0.5 "matches" 7.2ft +/-0.1. It's not *exactly* the same of course, but these two values were not exact to begin with, so they should match.
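
One way to read "matches" here is that the two uncertainty intervals overlap once both are expressed in the same unit. The sketch below checks this for the example values; the helper names are invented for illustration and do not describe how the query service works.

```python
from decimal import Decimal

FEET_TO_METERS = Decimal("0.3048")


def interval(amount: Decimal, plus_minus: Decimal, factor: Decimal = Decimal(1)):
    """Return (lower, upper) of amount +/- plus_minus, scaled by a unit factor."""
    return (amount - plus_minus) * factor, (amount + plus_minus) * factor


def intervals_overlap(a, b) -> bool:
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


in_meters = interval(Decimal("2"), Decimal("0.5"))
in_feet = interval(Decimal("7.2"), Decimal("0.1"), FEET_TO_METERS)
print(in_meters, in_feet, intervals_overlap(in_meters, in_feet))
```

With ±0 on both sides the intervals collapse to single points, so 2 m and 7.2 ft would no longer match.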

We could store "unknown", and then re-calculate the uncertainty every time we need it, but why? What would that gain us?

We are not making one up. The precision is given implicitly in the decimal notation of the number, using the convention of significant digits. This is quite unambiguous for cases like 3.20 (three significant digits) or 2.3e3 (2300 with two significant digits). It's ambiguous for input like 200 or 1700 - there's a good chance that the zeros are insignificant, but we don't really know. We should improve our UI to help the user with correct input.

I'm not convinced that the implicit assumption you're making here will work for most situations, so it really shouldn't be displayed to the reader. We should definitely be encouraging editors to add more accurate estimates of uncertainties at the same time as the numbers are added, though!

I absolutely agree.

One thing I'm particularly worried about here is that there doesn't seem to be a good way to tell assumed uncertainties and referenced uncertainties apart - which will be a huge headache to fix once this data format is in common usage! So please, let's get this right asap!

Well, in scientific literature at least, a number like 2.30 or 2.3e3 has a definite uncertainty (or rather, a definite number of significant digits). It's given by convention of the notation. Would you consider that a guess, or a sourced uncertainty?

In T105623#1657039, @daniel wrote, in part:

We can of course discuss if, when and how the explicit +/-X is shown to the user. I'm completely open to that. One sensible suggestion was to hide it if the actual uncertainty is the same as what we would assume from the decimal representation. In that case, it's OK to hide it, I think. Maybe also if the precision is better than what we would assume. Maybe. But in any case it's crucial to understand that we *have* to consider uncertainty everywhere if we want to allow conversion.

I would always show the uncertainty if it comes from a source. This would let editors know that checking the uncertainty of a number is a lower priority than unsourced guesses about uncertainty. It also lets a reader know the referenced source could be checked to verify the uncertainty, in case what the reader was really interested in was the uncertainty of the number.

We could store "unknown", and then re-calculate the uncertainty every time we need it, but why? What would that gain us?

We could store the guess about uncertainty, but also mark it as unknown, so data consumers would be on notice they really ought to find a better source if they care about the uncertainty.
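
A data model along these lines could be sketched as below. The class, the `uncertainty_sourced` flag, and the display policy in `render` are all hypothetical; they only illustrate keeping a guessed uncertainty while marking it as not coming from a source.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quantity:
    amount: Decimal
    lower_bound: Decimal
    upper_bound: Decimal
    uncertainty_sourced: bool  # False: bounds were guessed from the digits


def render(q: Quantity) -> str:
    """Show the ± value only when it comes from a source (one possible policy)."""
    if not q.uncertainty_sourced:
        return str(q.amount)
    plus_minus = (q.upper_bound - q.lower_bound) / 2
    return f"{q.amount}±{plus_minus}"


sourced = Quantity(Decimal("724.920"), Decimal("724.919"), Decimal("724.921"), True)
guessed = Quantity(Decimal("5"), Decimal("4.5"), Decimal("5.5"), False)
print(render(sourced))
print(render(guessed))
```

The guessed bounds stay available for conversion and comparison, while data consumers can tell from the flag that they should look for a better source.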

Well, in scientific literature at least, a number like 2.30 or 2.3e3 has a definite uncertainty (or rather, a definite number of significant digits). It's given by convention of the notation. Would you consider that a guess, or a sourced uncertainty?

I would certainly consider 2.30 or 2.3e3 as carrying a sourced uncertainty if it came from a scientific source. For a number like 2300, I would also regard it as a sourced uncertainty. But I would also infer that the uncertainty of the number was not especially important in the article, or that the article, although from a scientific organization, was intended for a popular audience, or both. An example is a recent press release from the US Geological Survey, giving the elevation of Mt. Denali to the nearest foot. We can tell it is intended for a popular audience because the uncertainty was not explicitly stated, and because the elevation was given only in feet. One would expect that when the peer-reviewed journal article comes out, the primary unit of length will be the meter, with perhaps an occasional conversion to feet.

daniel added a comment. Edited Sep 21 2015, 1:47 PM

@Jc3s5h what, then, would be an example for a reliable/acceptable source giving us a number with no hint at the uncertainty? When should we consider an uncertainty unsourced? If Nature gives the size of a crater on Mars in kilometers, what uncertainty should we assume, and should it be considered sourced?

I'm afraid the distinction of sourced vs unsourced uncertainty makes things harder to handle in code and more difficult to understand for users.

I suggest we do what we always do, really: we assume that people follow the established conventions when entering data. Most people never think of significant digits or uncertainties explicitly, but we all use the concept intuitively, all the time, when we say that the store is "two hundred yards away" or it's "170 Miles to Sometown".

@Jc3s5h what, then, would be an example for a reliable/acceptable source giving us a number with no hint at the uncertainty? When should we consider an uncertainty unsourced? If Nature gives the size of a crater on Mars in kilometers, what uncertainty should we assume, and should it be considered sourced?

If a number is published without an uncertainty next to it, then assuming an uncertainty definitely shouldn't be counted as 'sourced'! Unless the article/journal specifically states that all numbers have an uncertainty of 1 in their last significant digit.

I'm afraid the distinction of sourced vs unsourced uncertainty makes things harder to handle in code and more difficult to understand for users.

I think this is vital, though. How else would you (or reusers) tell whether a number *actually* has an uncertainty of 1 in the last significant digit or whether that has just been assumed for conversion purposes?

I suggest we do what we always do, really: we assume that people follow the established conventions when entering data. Most people never think of significant digits or uncertainties explicitly, but we all use the concept intuitively, all the time, when we say that the store is "two hundred yards away" or it's "170 Miles to Sometown".

Please just keep it simple. Accept the given central value, but don't automatically assume an uncertainty for it, and don't show that in the user interface. Ask people to provide uncertainties wherever possible. If there isn't a given uncertainty, then use the number of significant digits when converting numbers to make sure that the post-conversion number has sensible numbers of digits, and include a discussion of that in the documentation describing the conversion process.

If we did not plan to support unit conversion, I would be ready to go along with your argument. We would simply say we don't know the precision. With unit conversion, however, we can't do this. And relying on the conventions for specifying significant digits seems the best we can do.

That makes sense in the back-end to make sure that converted values have reasonable levels of precision, but does it have to be in the front-end as well, or stored in the database? A line of code that checks whether the uncertainty has been set or not, and assumes a minimum uncertainty for conversion purposes, should handle this issue smoothly, without mis-estimates of the uncertainty of the given value being displayed to readers.

We can of course discuss if, when and how the explicit +/-X is shown to the user. I'm completely open to that. One sensible suggestion was to hide it if the actual uncertainty is the same as what we would assume from the decimal representation. In that case, it's OK to hide it, I think. Maybe also if the precision is better than what we would assume. Maybe. But in any case it's crucial to understand that we *have* to consider uncertainty everywhere if we want to allow conversion.

That wouldn't work: the uncertainty should be shown if it is an accurate/referenced uncertainty, and that shouldn't depend on whether it's more or less than the assumed uncertainty. We should simply say what the uncertainty is if we have it, or say that we don't have an uncertainty if we don't.

I think it makes sense to store the uncertainty in the database, since *if* we assume an uncertainty at some point, users should be able to see, check, modify, and compare it. Also, we need to be able to apply unit conversion for queries, otherwise we couldn't compare feet to meters. And we have to take uncertainty into account, so we know that 2m +/- 0.5 "matches" 7.2ft +/-0.1. It's not *exactly* the same of course, but these two values were not exact to begin with, so they should match.
We could store "unknown", and then re-calculate the uncertainty every time we need it, but why? What would that gain us?
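
The "matching" of 2m +/- 0.5 and 7.2ft +/-0.1 can be sketched in Python as an interval overlap test after converting both values to meters. This is only an illustration: `interval_m` and `matches` are made-up helper names, and treating a match as any overlap of the two closed intervals is an assumption, not a description of how queries are implemented.

```python
FEET_TO_METRES = 0.3048  # exact by definition of the international foot


def interval_m(value, plus_minus, factor=1.0):
    """Return (lower, upper) bounds in meters for value +/- plus_minus."""
    return ((value - plus_minus) * factor, (value + plus_minus) * factor)


def matches(a, b):
    """Two closed intervals match if they overlap anywhere."""
    return a[0] <= b[1] and b[0] <= a[1]


two_metres = interval_m(2.0, 0.5)                  # 1.5 m to 2.5 m
seven_feet = interval_m(7.2, 0.1, FEET_TO_METRES)  # about 2.16 m to 2.23 m
exact_two = interval_m(2.0, 0.0)                   # what +/-0 would claim
```

Here `matches(two_metres, seven_feet)` is true, while `matches(exact_two, seven_feet)` is false, which is the kind of query mismatch that assuming +/-0 would cause.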

It would be an accurate way to represent the data that we have, and to clearly mark where we don't have uncertainties. It would avoid corrupting the database by mixing sourced and assumed uncertainties. We shouldn't be encouraging people to alter the assumed uncertainty used for conversion purposes (which they might do, e.g. to tweak how the converted number shows), as that would corrupt the database even more - we should instead be asking them to source the actual uncertainties. IMO there's a lot of up-sides to adopting this approach, and no significant down-sides.

We are not making one up. The precision is given implicitly in the decimal notation of the number, using the convention of significant digits. This is quite unambiguous for cases like 3.20 (three significant digits) or 2.3e3 (2300 with two significant digits). It's ambiguous for input like 200 or 1700 - there's a good chance that the zeros are insignificant, but we don't really know. We should improve our UI to help the user with correct input.
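
As a rough Python illustration of the convention described above, the number of significant digits can be read off the decimal string as written. Both function names are hypothetical, and the rule used to flag ambiguous input (a plain integer ending in zero) is an assumption for this sketch.

```python
from decimal import Decimal


def significant_digits(text: str) -> int:
    """Count the digits as written, ignoring leading zeros."""
    return len(Decimal(text).as_tuple().digits)


def trailing_zeros_ambiguous(text: str) -> bool:
    """True for plain integers such as "200", where the zeros may or may not count."""
    t = text.strip().lower()
    is_plain_integer = "." not in t and "e" not in t
    return is_plain_integer and t.endswith("0") and t.lstrip("+-0") != ""
```

With these helpers, "3.20" has three significant digits and "2.3e3" has two, while "200" and "1700" are flagged as ambiguous.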

I'm not convinced that the implicit assumption you're making here will work for most situations, so it really shouldn't be displayed to the reader. We should definitely be encouraging editors to add more accurate estimates of uncertainties at the same time as the numbers are added though!

I absolutely agree.

One thing I'm particularly worried about here is that there doesn't seem to be a good way to tell assumed uncertainties and referenced uncertainties apart - which will be a huge headache to fix once this data format is in common usage! So please, let's get this right asap!

Well, in scientific literature at least, a number like 2.30 or 2.3e3 has a definite uncertainty (resp significant digits). It's given by convention of the notation. Would you consider that a guess, or a sourced uncertainty?

It's a guess unless it's explicitly stated that the uncertainty is at that level, or that the work is following that convention. The standard approach in astronomy (which is the part of the scientific literature that I'm most familiar with, as a scientist working in that field) is to quote a number along with the uncertainty and the significance level associated with that uncertainty.

In T105623#1657997, @daniel wrote in part:

@Jc3s5h what, then, would be an example for a reliable/acceptable source giving us a number with no hint at the uncertainty? When should we consider an uncertainty unsourced? If Nature gives the size of a crater on Mars in kilometers, what uncertainty should we assume, and should it be considered sourced?

We must also allow for the case where there is no reliable source. Normally a reliable source would give us an uncertainty, implicitly or explicitly; the main exception that comes to mind would be a quantity that is just mentioned in passing; something that is not the main focus of the document. "Explicitly" would include a description of the method used to determine the quantity, including the accuracy of the method, even if the description was in a different part of the document. "Implicitly" would usually be the number of significant figures for the item in Wikidata, together with the number of significant figures for other items in the source measured in the same way. If a source described a method for measuring the elevation of mountain summits, then said the elevation of Mt. X was 2013 m, Mt. Y was 2000 m, and Mt. Z was 7253 m, we have a sourced statement that the elevation of Mt. Y is 2000 m ± a few meters.

I'm afraid the distinction of sourced vs unsourced uncertainty makes things harder to handle in code and more difficult to understand for users.
I suggest we do what we always do, really: we assume that people follow the established conventions when entering data. Most people never think of significant digits or uncertainties explicitly, but we all use the concept intuitively, all the time, when we say that the store is "two hundred yards away" or it's "170 Miles to Sometown".

I agree.

In T105623#1660251, @Mike_Peel wrote, in part:

...The standard approach in astronomy (which is the part of the scientific literature that I'm most familiar with, as a scientist working in that field) is to quote a number along with the uncertainty and the significance level associated with that uncertainty.

I think that's the standard approach in most fields of science and engineering, in the most serious works, for numbers that are the main focus of the article, chapter, book, etc. But these kinds of sources are not always readily available to Wikidata contributors. Wikidata contributors may use other databases, or articles intended for a popular audience, which are reliable but lack explicit statements about uncertainty. The Wikidata numbers might also come from sources that mention a number in passing and so do not explicitly state an uncertainty.

Unfortunately trying to impose a rule that only the best sources may be used to introduce data into Wikidata just isn't going to happen.

daniel added a comment.EditedSep 21 2015, 6:43 PM

It seems like we are approaching an agreement on a few points:

  • it's nearly always wrong to assume absolute precision (+/-0) per default (notable exceptions are definitions and exact counts).
  • it's important to apply rounding based on uncertainty (resp significant digits) when converting, to avoid the introduction of false precision ("spurious" digits). This applies to conversion for display and also to normalization for indexing/querying.
  • the magnitude of the uncertainty interval should be order of magnitude of the least significant digit (not twice that -- so +/-0.5, not +/-1).

These are the most crucial points to me. Points that are still open are:

  • if no uncertainty is given in the input, should we derive and store it immediately? Or should we then store "unknown" uncertainty, and calculate the uncertainty interval when needed?
  • should we should the uncertainty interval per default if it was not explicitly entered?
  • should we should the uncertainty interval per default if it is different from the one we would have guessed?

I think we should reach an agreement about these as soon as possible, to avoid more "bad" data in the database.

It seems like we are approaching an agreement on a few points:

  • it's nearly always wrong to assume absolute precision (+/-0) per default (notable exceptions are definitions and exact counts).

I don't think anyone was suggesting that. Using a default of 0 in the database would be one way of recording that the uncertainty is unknown, not assuming absolute precision.

  • it's important to apply rounding based on uncertainty (resp significant digits) when converting, to avoid the introduction of false precision ("spurious" digits). This applies to conversion for display and also to normalization for indexing/querying.

Agree

  • the magnitude of the uncertainty interval should be order of magnitude of the least significant digit (not twice that -- so +/-0.5, not +/-1).

Agree

These are the most crucial points to me. Points that are still open are:

  • if no uncertainty is given in the input, should we derive and store it immediately? Or should we then store "unknown" uncertainty, and calculate the uncertainty interval when needed?

The latter, please.

  • should we should the uncertainty interval per default if it was not explicitly entered?
  • should we should the uncertainty interval per default if it is different from the one we would have guessed?

I think these got mangled?

I think we should reach an agreement about these as soon as possible, to avoid more "bad" data in the database.

Agree

@Mike_Peel Thanks for the links! Especially M3003 looks like a very useful reference.

  • it's nearly always wrong to assume absolute precision (+/-0) per default (notable exceptions are definitions and exact counts).

I don't think anyone was suggesting that. Using a default of 0 in the database would be one way of recording that the uncertainty is unknown, not assuming absolute precision.

We still need a way to specify absolute precision when applicable. For instance, a foot is exactly 0.3048m, because it is defined to be so. The speed of light is exactly 299792458m/s, because that's how the meter is defined.

  • should we should the uncertainty interval per default if it is different from the one we would have guessed?

I think these got mangled?

That's the sentence as I intended to write it... I'll try to rephrase:

We could omit the uncertainty from output if it is exactly what is implied by the decimal notation, using the applicable convention about significant digits. For example, if our algorithm would produce +/-0.5 for the input "3m", then 3m+/-0.5 would be written as simply 3m (because the stored uncertainty is equal to the uncertainty implied by the number as written). This may be a viable option to un-clutter the user visible output if we decide to always store the uncertainty interval explicitly, as we do now.
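
A small Python sketch of that display rule, under the assumption that the implied uncertainty is half a unit of the last written digit. The names `implied_uncertainty` and `render` are made up for illustration and do not refer to existing formatter code.

```python
from decimal import Decimal


def implied_uncertainty(amount: str) -> Decimal:
    """Half a unit of the last written digit, e.g. "3" -> 0.5, "3.20" -> 0.005."""
    return Decimal(5).scaleb(Decimal(amount).as_tuple().exponent - 1)


def render(amount: str, uncertainty: Decimal, unit: str) -> str:
    """Omit +/-X when it equals what the decimal notation already implies."""
    if uncertainty == implied_uncertainty(amount):
        return f"{amount}{unit}"
    return f"{amount}{unit}+/-{uncertainty}"
```

With this rule, `render("3", Decimal("0.5"), "m")` gives "3m", while a stored uncertainty of 0.1 is still shown as "3m+/-0.1".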

@Mike_Peel Thanks for the links! Especially M3003 looks like a very useful reference.

  • it's nearly always wrong to assume absolute precision (+/-0) per default (notable exceptions are definitions and exact counts).

I don't think anyone was suggesting that. Using a default of 0 in the database would be one way of recording that the uncertainty is unknown, not assuming absolute precision.

We still need a way to specify absolute precision when applicable. For instance, a foot is exactly 0.3048m, because it is defined to be so. The speed of light is exactly 299792458m/s, because that's how the meter is defined.

That's true.

  • should we should the uncertainty interval per default if it is different from the one we would have guessed?

I think these got mangled?

That's the sentence as I intended to write it... I'll try to rephrase:
We could omit the uncertainty from output if it is exactly what is implied by the decimal notation, using the applicable convention about significant digits. For example, if our algorithm would produce +/-0.5 for the input "3m", then 3m+/-0.5 would be written as simply 3m (because the stored uncertainty is equal to the uncertainty implied by the number as written). This may be a viable option to un-clutter the user visible output if we decide to always store the uncertainty interval explicitly, as we do now.

Did you mean 'show' rather than the second 'should' in each of those sentences? I firmly hold that we should only be showing the uncertainty if it has been entered by the editor, and we should show that uncertainty regardless of whether it matches an assumed uncertainty or not.

@Mike_Peel "show", of course! It's amazing how blind I am when I already "know" what I am reading...

Speaking as someone who typically updates entries that are not scientific in nature, defaulting to any level of precision other than ±0 is incorrect and infuriating. When I enter a numerical value for something I want the displayed and stored data to match the input I give exactly. e.g. when I say the number of trains on a particular funicular railway is 2, assuming I mean 2±1 is incorrect. When I input the length of the Sheffield Supertram system as 29km, assuming I mean 29±1km is incorrect - I assume it's actually 29±0.5km but the source does not say. When I enter the width for 2134mm track gauge as 7ft, assuming I mean 7±1 ft is incorrect - the gauge is defined as a nominal 7ft exactly, with different actual spacing and different tolerances in specific applications.

The simplest way around this from an end user's point of view that I can think of is to have a qualifier associated with all numerical values that is used to record the uncertainty. If this qualifier is not present then assume the uncertainty is unknown. In all cases where the entry is currently ±1 or ±0.5 we will have to assume the uncertainty is unknown - this will be correct in far more cases than it is not.

@Thryduulf Thanks for the input. I see where you are coming from. The trouble is: in order to apply conversion and meaningful comparison, we have to assume some level of uncertainty. Assuming +/-0 leads to strange results for display, and mismatches in queries.

It seems like an agreement is forming that we should have an explicit notion for "no uncertainty given", and not display an "assumed" uncertainty in such cases. But we still have to assume some uncertainty in order to perform meaningful calculations without introducing spurious digits. Assuming +/-0 would be wrong; when you say 27km, you do not mean 27km, 0meters, 0millimeters, 0micrometers, and 0 nanometers, as +/-0 would indicate. You probably mean "27km, give or take a few hundred meters" -- which is what 27km +/-0.5 means: it basically says "to the nearest kilometer".

So while there is definitely room for improvement and discussion about when and how to "guess" uncertainty, I see no way to get around it entirely.

Klortho removed a subscriber: Klortho.Sep 22 2015, 2:16 PM
In T105623#1660386, @daniel wrote in part:

.
.
.

  • the magnitude of the uncertainty interval should be order of magnitude of the least significant digit (not twice that -- so +/-0.5, not +/-1).

That's OK for a guessed uncertainty, but is not a requirement for a specified uncertainty.

Any news on progressing this issue?

Random PDFs found via this Google search:

Pro ±1

http://physicsed.buffalostate.edu/pubs/MeasurementAnalysis/MA1_9ed.pdf says:

Determining uncertainties is a bit more challenging since you—not the measuring device— must determine them. When determining an uncertainty from a measuring device, you need to first determine the smallest quantity that can be resolved on the device. Then [...] the uncertainty in the measurement is taken to be this value. For example, if a digital readout displays 1.35 g, then you should write that measurement as (1.35 ± 0.01) g. The smallest division you can clearly read is your uncertainty.

Note that this document is mostly talking about reading from analog devices, even if the quoted text says "digital readout". Personally, I think there is something confused.

Pro ±0.5

https://www.wmo.int/pages/prog/gcos/documents/gruanmanuals/UK_NPL/mgpg11.pdf says:

The divisions on the tape are millimetres. Reading to the nearest division gives an error of no more than ±0.5 mm. We can take this to be a uniformly distributed uncertainty (the true readings could lie variously anywhere in the 1 mm interval - i.e. ±0.5 mm).

http://www.calpoly.edu/~gthorncr/ME236/documents/Exp.1.IntroductiontoMeasurement.pdf says:

The procedure for taking [a] reading [from a digital display] is simple: Record exactly what you read from the digital display. [...] 0.30 g. Note that a value of 0.3 g or 0.300 g is incorrect, because either value misrepresents the resolution of the device. [...] the true value of the mass is within 0.2950 ≤ mass ≤ 0.3049. This is implied because the device will round any value in this range to 0.30, the closest reading on the display. It is standard to record this measurement as mass = 0.30 ± 0.005 g. In the above format, the value 0.30 is called the nominal value. The second term, ± 0.005, is called the reading error. It is equal to one-half the resolution (the smallest division between readings, sometimes called the least count for digital devices).

daniel added a comment.Oct 9 2015, 2:41 PM

Thank you for digging this up, Thiemo!

I think the last document you quote is most relevant to our use case: we need the uncertainty interval to apply correct rounding for display, especially after unit conversion. As Kaldari mentioned earlier, using +/- 1 is inconsistent with the rounding algorithms we use, since 17+/-1 would include 17.9, which would round to 18.

For the case that no uncertainty is explicitly given, I tend towards not deriving and saving an uncertainty, but instead we derive it when we need it. We could even work with different approaches for deriving the uncertainty interval for different use cases (though that could quickly get confusing).

In API output, we would include upper and lower bound only if explicitly given, and for the normalized value. For example, 6ft would be represented as 6ft (no uncertainty interval provided) with the normalized value of 1.8288+/-0.3048, which would display as 2m.

thiemowmde's quote from the buffalostate.edu site uses an ellipsis to omit a critical passage: "Then, for your work in PHYS 152L, the uncertainty in the measurement is taken to be this value." This document is used in conjunction with an (apparently undergraduate) university course which adopts some shortcuts and simplifications for expediency in the course. The ability to read the display on an instrument is a lower bound on the uncertainty. Often an instrument will have other limitations that will cause the actual uncertainty to be greater than the uncertainty in reading the display. A simple example is that a meter stick bought at the local hardware store (UK: ironmonger) for $10 will probably disagree by more than one scale division from a quality engine-divided meter scale bought from a reliable manufacturer such as Starrett for several hundred dollars.

@Jc3s5h Well, if we have additional information about the methods and tools used for the measurement, we should use them, sure. The source should state them, and we should record them. That's the simple case.

But the question we are trying to answer here is what to use if we have no such information. All we have to go by is a decimal string, and some inconclusive conventions about significant digits. So, what's the best for our primary use case, namely rounding after unit conversion? We could

  1. assume +/-0: This will introduce spurious digits (false precision), because it prevents any rounding from being applied.
  2. assume +/-1: This is inconsistent with the rounding algorithm we apply ("round half away from zero"): a nominal value of 17+/-1 includes 17.9, which would round to 18.
  3. assume +/-0.5: This is consistent with rounding, and does not lead to false precision. As far as I can tell, it doesn't lead to surprises.
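
The three options can be compared with a short Python sketch. It uses the standard `decimal` module, whose `ROUND_HALF_UP` mode rounds ties away from zero. The helper names `round_half_away` and `in_interval` are made up here, and the 6ft conversion is only an example input, not taken from real data.

```python
from decimal import Decimal, ROUND_HALF_UP

FEET_TO_METRES = Decimal("0.3048")  # exact by definition


def round_half_away(value: Decimal) -> Decimal:
    """Round to a whole number, ties going away from zero."""
    return value.quantize(Decimal("1"), rounding=ROUND_HALF_UP)


def in_interval(nominal: Decimal, plus_minus: Decimal, candidate: Decimal) -> bool:
    """True if candidate lies within nominal +/- plus_minus."""
    return abs(candidate - nominal) <= plus_minus


nominal = Decimal("17")
candidate = Decimal("17.9")

# Option 1: with +/-0 nothing may be rounded, so all converted digits are kept.
unrounded = Decimal("6") * FEET_TO_METRES

# Option 2: +/-1 admits 17.9, although 17.9 itself would be written as 18.
option2_admits = in_interval(nominal, Decimal("1"), candidate)

# Option 3: +/-0.5 excludes 17.9.
option3_admits = in_interval(nominal, Decimal("0.5"), candidate)
```

In this sketch `unrounded` is 1.8288, `option2_admits` is true and `option3_admits` is false. Note that with ties going away from zero, the upper endpoint 17.5 itself would round to 18, so strictly speaking the interval consistent with rounding is half-open.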

We have a bit more freedom with our second use case, comparison during queries, since uncertainty and rounding are not directly visible to the user. However, it seems prudent to use the same approach for both cases, to avoid confusion and inconsistencies.

Here is another take on the 1 vs 0.5 issue, from Uncertainties and Significant Figures http://facultyfiles.deanza.edu/gems/lunaeduardo/UncertaintyandSignificantFig.pdf (a random find, not an authoritative source):

  1. Uncertainty in a Scale Measuring Device is equal to the smallest increment divided by 2.
  2. Uncertainty in a Digital Measuring Device is equal to the smallest increment.

Ex. Meter Stick (scale device):
σ_x = 1mm/2 = 0.5mm = 0.05cm
Ex. Digital Balance (digital device):
0.7513kg
σ_x = 0.0001kg
For example, if we measure a length of 5.7 cm with a meter stick, this implies that the length can be anywhere in the range 5.65 cm ≤ L ≤ 5.75 cm.
Thus, L = 5.7 cm measured with a meter stick implies an uncertainty of 0.05 cm. A common rule of thumb is to take one-half the unit of the last decimal place in a measurement to obtain the uncertainty.

Are we still investigating this or have we reached the conclusion that +/- 0.5 is the most sensible default for rounding and unit conversion?

Having a default in general is not reasonable.

If there is a sourced uncertainty, the input has to include exactly this uncertainty, not a default one.

If there is no sourced uncertainty, guessing one is simply wrong. It is neither +/-1 nor +/-0.5, but it is unknown. This has to be expressed that way.

Moreover, a huge number of exact count values currently have the uncertainty +/-1. This leads to a huge amount of WRONG data. The number of games a sports player has played might be 15, but the displayed number is 15+/-1. This is absolutely confusing and does not lead to an improvement, but to the opposite.

Therefore, please consider the removal of a default uncertainty at least for whole numbers (integers). In most cases, any guessed uncertainty is wrong.

@Yellowcard: We have to choose an uncertainty in order to do unit conversion. There's no way around that fact.

@kaldari: I have nothing against uncertainty as such, only against a default uncertainty for values that don't have any. I'm especially talking about counted values without any unit.

From what I understand, unit conversion should also be possible without known uncertainty?

@Yellowcard:

I have nothing against uncertainty as such, only against a default uncertainty for values that don't have any. I'm especially talking about counted values without any unit.

For counted values, there is another bug, T68580, which proposes a new "amount" or "number" datatype. The bug here is for how to correctly handle values that may have uncertainty. This should be more clear once T68580 is resolved.

From what I understand, unit conversion should also be possible without known uncertainty?

Yes, but see T68580#1447999.

@kaldari: I understand the argumentation, but it doesn't convince me. Your statements in the linked bug seem very reasoned to me. We agree that information about uncertainty is very important, but for each value we have to know the amount of uncertainty specifically. Guessing leads to a much bigger error than ignoring it as long as there is no valid information about the uncertainty.

Therefore, related to this bug: There should be simply no information on uncertainty if it has not been stated with the input. All of +/-0, +/-.5 and +/-1 are (most likely) wrong guesses. A good example is the population of a country: All guesses - no matter if +/-0.5 or +/-1 - are wrong and imply a degree of accuracy that is not given at all. Providing no information on uncertainty instead would be correct, as the user of the data would know he has to be careful with it for exactly this reason.

In case there is no information on uncertainty provided, unit conversion can be based on the number of digits provided.

The status quo, however, is a big problem and should be resolved asap.

@Yellowcard: I agree with you in most cases. Most numbers in Wikidata should probably be considered exact values (per T68580) and should not have any assumed uncertainty. Numbers for measurements, however, have to have some assumed uncertainty. I think the problem is that this bug is really conflating several different bugs:

  1. The Quantity datatype should probably be an exact value datatype (i.e. T68580)
  2. There should be a new datatype called Measurement which has units and assumed uncertainty/precision (the current behavior of Quantity)
  3. The assumed uncertainty of Measurements should be half the resolution (+/- 0.5 for whole numbers), not equal to the resolution (+/- 1)

Would you agree with all of those points? What's your opinion @daniel?

@kaldari captures it well. Measurements have inherent uncertainty, since measurement instruments such as scales can only measure out to so many decimal places. So if a scale tells you something has 3.24 grams of mass, it could be 3.239 or 3.241, but the scale cannot measure at that level of precision, so it rounds. I would say that anything that is a measurement should have a default uncertainty of what the last decimal point is (for my example +/- 0.01). An option to override would be appropriate, of course.

https://www.wikidata.org/w/index.php?title=Q5089194&type=revision&diff=325544717&oldid=314121367
This is another example of how assuming a precision is incorrect - the value given in the source is 135,000,000 gallons without specifying the level of precision. It is very unlikely to be either ±0 or ±1 gallon (although not impossible) however how many significant figures are there? - anywhere from 3 to 9 is possible and, without further investigation, unknowable.

As someone who mainly enters exact integers, this is one of the most infuriating bugs. I always have to go back and correct the precision after the fact. There are also many, many numbers in Wikidata statements with the very obviously wrong precision +-1. At this point, this precision guessing is probably the number one cause for wrong data. Also, it is probably the most asked question in the various forums on Wikidata.

And while I understand the technical reasoning, from a user's point of view, the current behaviour is simply completely bonkers, at least for integers.

Did we ever figure out a way forward on this?

Yellowcard added a comment.EditedJun 13 2016, 9:44 PM

I strongly support @Srittau's comment. This is still an extremely annoying bug and should be fixed with high priority. There are plenty of exact integers and the guessed precision of +/-1 makes the statement wrong. I cannot understand why this is not fixed.

Izno added a subscriber: Izno.Jun 14 2016, 2:23 PM

I'm going to make sure my comment at T68580#1260737 is heard in this context as well.

What's mostly nuts is having any default precision which is not "we don't know what the precision is".

@Izno the assumption was that any decimal number does have an implied precision, by convention. But as it turns out, there isn't one convention, but at least two, and the resulting behavior is confusing.

@Yellowcard actually, exact integers are relatively rare. Exact integers with a unit are even extremely rare. But I agree that the +/-1 guess is wrong. It should be half that, to be consistent with rounding. And the guess should be performed only when and if needed, not while parsing.

@kaldari We have a good plan: don't guess until we need it, don't output a precision if none was entered explicitly, and if we guess, guess half the interval we currently use.

The implementation keeps being pushed back, partly due to the need of a UI refactor, partly because of a need for data cleanup.

For reference, another user caught by this bug on the German village pump: https://www.wikidata.org/wiki/Wikidata:Forum#Eigenschaft_f.C3.BCr_.22Anzahl_B.C3.A4nde.22.3F