
[Bug] Quantity formatter rounding causes significant data loss
Closed, ResolvedPublic

Description

Enter 1.991+-1 in a quantity statement. It is stored via the API as expected (Amount: 1.991, Upper bound: 2.991, Lower bound: 0.991), but rendered as 2+-1. This causes significant data loss whenever somebody uses the shown value.

Found this while reviewing https://github.com/wmde/WikidataBrowserTests/pull/66.

In T95425#1602181, I added:

This was never exclusively about editing. The editing issue was fixed in https://github.com/wmde/ValueView/pull/183/files#diff-6e3ed44714251084b2ff46be5bb6b80fR54, as a side effect while working on T110183. But when copy-pasting a formatted quantity value there is still the same significant data loss involved and the value does not survive (manual) round trips.
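To make the round-trip loss concrete, here is a minimal sketch (plain Python, with a hypothetical `bounds_from_plusminus` helper; not actual DataValues/Number code) of what happens when somebody copies the rendered "2±1" and uses it as a value:

```python
# Illustration of the round-trip loss described above, using the values from this report.
# bounds_from_plusminus is a hypothetical helper, not an existing library function.
def bounds_from_plusminus(amount, uncertainty):
    """What a reader (or a parser) reconstructs from a displayed 'amount±uncertainty' string."""
    return amount - uncertainty, amount + uncertainty

stored = {"amount": 1.991, "lowerBound": 0.991, "upperBound": 2.991}  # as saved via the API
displayed = "2±1"                                                     # as rendered by the formatter

print("stored bounds:", (stored["lowerBound"], stored["upperBound"]))  # (0.991, 2.991)
print("re-entered bounds:", bounds_from_plusminus(2, 1))               # (1, 3)
```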

Related Objects

Event Timeline

thiemowmde raised the priority of this task from to Needs Triage.
thiemowmde updated the task description.
thiemowmde added projects: Wikidata, DataValues.
thiemowmde added subscribers: thiemowmde, daniel.

It's not "sloppy" but doing exactly what it is supposed to do: omit irrelevant digits. This is what we want for display.

I agree of course that for editing, we want lossless formatting. I suppose we need a new formatter mode for that.

daniel renamed this task from Quantity formatter rounding is sloppy and causes data loss to Quantity rounding causes data loss when editing. Apr 14 2015, 10:23 AM
daniel updated the task description.
daniel set Security to None.
daniel triaged this task as High priority. Apr 14 2015, 10:26 AM

Bumping to high, since this causes data loss, and it's a general issue that may bite us with other data types too.

As a solution, I suggest adding FORMAT_LOSSLESS in addition to FORMAT_PLAIN. FORMAT_LOSSLESS would fall back to FORMAT_PLAIN, just like FORMAT_HTML_DIFF falls back to FORMAT_HTML.
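A rough sketch of that idea (hypothetical format constants and fallback table; not actual Wikibase code): the lossless mode would simply fall back to the plain mode wherever no dedicated lossless formatter is registered, mirroring the existing FORMAT_HTML_DIFF → FORMAT_HTML fallback.

```python
# Hypothetical sketch of the suggested formatter-mode fallback; not actual Wikibase code.
FORMAT_PLAIN = "text/plain"
FORMAT_HTML = "text/html"
FORMAT_HTML_DIFF = "text/html; disposition=diff"
FORMAT_LOSSLESS = "text/plain; disposition=lossless"  # the proposed new mode (made up here)

FALLBACK = {
    FORMAT_HTML_DIFF: FORMAT_HTML,  # existing behaviour referenced above
    FORMAT_LOSSLESS: FORMAT_PLAIN,  # proposed: lossless falls back to plain
}

def resolve_formatter(requested_format, registered_formatters):
    """Return the formatter registered for the requested format, walking the fallback chain."""
    fmt = requested_format
    while fmt not in registered_formatters and fmt in FALLBACK:
        fmt = FALLBACK[fmt]
    return registered_formatters[fmt]
```

A caller asking for FORMAT_LOSSLESS would then get the plain formatter until a dedicated lossless formatter exists.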

Sorry, I have to disagree. There may be a sweet spot where such a "formatting for display" is ok, but rounding "1.9+-1" to "2+-1" is clearly wrong, at least in my opinion. These two strings describe two completely different values.

They are two different intervals, sure. If you however consider the +/- as an uncertainty, they are "nearly" the same in the sense that the difference is insignificant. An uncertainty interval of +/-1 is the explicit statement that any difference <1 is insignificant.

If you want the rounding to work differently, please file a separate ticket for that issue, and provide a specification of how you want it to work, and a rationale. Please consider that one of the main use cases is unit conversion: per convention and common sense, we'd want 3ft (+/-1) to be shown as 1m after conversion, not as 0.9144m. The latter would imply a precision that is not present in the original value.

But in any case, let's keep the "data loss in round-trip" issue separate from the "rounding is evil" question.

doing exactly what it is supposed to do: omit irrelevant digits.

Who came up with this? When I enter 1.5+-1 I do not enter "irrelevant" digits. Why should I do that? Why do we have code that "knows better" than the user? I don't get that, sorry. When I enter 1.5+-1 I mean 1.5+-1 and not 2+-1. When I want 2+-1 I enter 2+-1.

let's keep the [...] issue separate

How? I don't understand. This is wrong in all cases. A formatter that silently manipulates 1.5+-1 to 2+-1 does change the actual value by 33%. You can't argue that such a major data loss is insignificant. These two values are not even "nearly" the same.

@thiemowmde: I think this becomes a lot clearer when you think about your original value being in meters, and you are trying to display it in feet. Then we must apply rounding based on the uncertainty, otherwise we would be introducing false precision. And if we do this for the case with conversion, we should also do it without conversion, for consistency.

The rounding is wrong no matter what. 24+/-10 becomes 20+/-10 with the current formatting. This is not even remotely the same value.

15+/-1 feet becomes 5+/-1 meter which becomes 16+/-1 feet.

The "before" and "after" values are not there to mark "irrelevant" digits. There is no such thing as an "irrelevant" digit in Quantity values. We do have this in Time, but not in Quantity.

What rounding would be correct then, and would also work with unit conversion?
I agree that the current output is a bit odd when the uncertainty interval is included.
24+/-10 can be written as 20 (because we don't care about the 4), but I agree that writing it as 20+/-10 is somewhat odd.

24+/-10 can be written as 20 (because we don't care about the 4)

This is simply wrong. These are two completely distinct values. But I'm already repeating myself.

Writing 1.4+-1 means you can't have a precision of 0.4. To be able to write this correctly you have to adjust your uncertainty precision to 1.0, so 1.4 +- 1.0 is correct. As for conversion, that is another problem.

I'm sorry, but I do not understand that example at all. Where does a "0.4" come from in your example? 1.4+-1 is internally stored as { amount: 1.4, before: 0.4, after: 2.4 }. When displayed via the formatter, this is shown, and later parsed, as 1.4+-1. You can say "precision is 1", but keep in mind that we do not store it that way. In reality there is no precision.

Displaying 1.4+-1.0 instead doesn't change anything. You can say: the precision value can't have a precision.

Jonas renamed this task from Quantity rounding causes data loss when editing to [Bug] Quantity rounding causes data loss when editing. Aug 15 2015, 12:54 PM

Here is another way to look at this:

  • 24+/-0.01 can be rendered as an interval on a line, ranging from 23.99 to 24.01, centered on 24.
  • 24+/-0.1 can be rendered as an interval on a line, ranging from 23.9 to 24.1, centered on 24.
  • 24+/-1 can be rendered as an interval on a line, ranging from 23 to 25, centered on 24.
  • 24+/-10 can be rendered as an interval on a line, ranging from 14 to 34, centered on 24. But wait, that's not what's happening. It's wrongly rendered as "20+/-10", centered on 20 instead of 24, messing up the series by introducing a very significant error.

One of the issues brought up is data loss. The normal method of converting units does indeed cause data loss: 5 yards +- 1 yard would typically be converted to 5 meters +- 1 meter. As I understand it, unit conversion is not yet implemented. When it is, there should be a far-reaching examination of all parts of the system to make it clearly visible to the reader when a unit has been converted, so the reader will be on notice that data loss has occurred.

thiemowmde renamed this task from [Bug] Quantity rounding causes data loss when editing to [Bug] Quantity formatter rounding causes significant data loss. Sep 3 2015, 1:54 PM
thiemowmde updated the task description.

I changed the task's description because when I reported this bug it was never exclusively about editing. The editing issue was fixed in https://github.com/wmde/ValueView/pull/183/files#diff-6e3ed44714251084b2ff46be5bb6b80fR54, as a side effect while working on T110183: [Task] Plaintext formatter should render unit. But when copy-pasting a formatted quantity value there is still the same significant data loss involved and the value does not survive (manual) round trips.

Round-trip stability for rendered output is nice, but was never a design goal. In fact, it doesn't work for most things.
If this is not about editing, saying it causes "data loss" is misleading. "Quantity rendering does not preserve precision" would be descriptive. But that's not a bug, that's working as designed. Rounding is lossy by definition.

We could just drop rounding, but that would lead to false precision when applying unit conversion. So we could apply rounding only if we do conversion - but that would be even more confusing, don't you think? And of course, if we do conversion, round trips will never work.

In T95425#1602377, @daniel wrote in part:

We could just drop rounding, but that would lead to false precision when applying unit conversion. So we could apply rounding only if we do conversion - but that would be even more confusing, don't you think? And of course, if we do conversion, round trips will never work.

Whether rounding only on conversion would be confusing depends on the background of the reader. The casual reader will be confused by many significant digits (or apparently significant digits) no matter what we do. A reader with a rigorous quantitative background, such as a scientist or engineer, will expect that the original value was entered correctly, and will recognize the inherent data loss involved in conversion. I think such a reader would expect us to preserve precision, since some cases require that preservation to properly present the original value (even if badly entered values will look bad). Such a user will expect us to round on conversion because that is customary among people with a strong quantitative background.

Preserving the input when there is no conversion also helps to bring badly-entered original values to the attention of editors who can then fix them.

This discussion was frustrating from the start and obviously not going anywhere. I think I already said multiple times that this bug is not about "false precision", whatever that means in the context of the given examples, but about respecting what the user entered. There is nothing "false" when I enter "24+-10". There is no "error" in my input that the formatter must magically "correct", with no way out. Why is it so hard to understand the obvious example in T95425#1601799?

@thiemowmde for the same reason 1964 with the precision set to century will be rendered as 20th century, not 1964+/-50.

Basically, we have two conflicting principles here: the principle of least surprise agrees with you, thiemo (I entered one thing, but see another, wtf?), and the principle of consistency (wrt rounding) points to my conclusion. In the end, this is a product-level decision. The current approach was decided by Denny a long time ago. We can change it, but that would not be a decision on the technical level.

In T95425#1602763, @daniel wrote in part:

@thiemowmde for the same reason 1964 with the precision set to century will be rendered as 20th century, not 1964+/-50.

If you want a solution to be analogous to the way time is handled, then you need to fix time first. I think it would be a mistake to even think about time for this purpose because it's going to take time to fix time.

As an example of what's wrong with time, the user interface does not allow entering a time zone. Apparently, the bots that add birth and death dates took their cue from the user interface and didn't add it. So every birth or death date with a precision of 1 day is wrong, except for those people who were born when and where the time zone offset was 0, such as in the United Kingdom in winter.

Time and other measurements are different from each other in the way people think about them. For example, if I carry a credit card that's 85 mm long from Hartford CT to London, it's still 85 mm long. If I go to a bar in New Zealand at 2 pm Sept 4, and show my passport showing I was born in Connecticut USA on Sept 4, 1997, the bartender will serve me even though it isn't September 4 yet in Connecticut. So I think we are going to need different approaches for time and other quantities.

@Jc3s5h I don't see how any of the differences you mention are relevant in the context of precision/uncertainty.

@Jc3s5h I don't see how any of the differences you mention are relevant in the context of precision/uncertainty.

Let me give you another example. If you put in a measurement of 725, it goes in the database as amount = 725, lowerBound = 724, upperBound = 726. The interpretation is fairly obvious.

The death date of Benjamin Franklin in the database is given (in JSON) as

"P570":[{"mainsnak":{"snaktype":"value","property":"P570","datavalue":{"value":{"time":"+1790-04-17T00:00:00Z","timezone":0,"before":0,"after":0,"precision":11,"calendarmodel":"http://www.wikidata.org/entity/Q1985727"},"type":"time"},"datatype":"time"}

So this means he died no earlier than 5 PM April 16, 1790, Philadelphia time, and no later than 5 PM April 17, 1790, Philadelphia time. Clearly the statement does not cover the actual range of times that Franklin died. Considering that the vast majority of dates in Wikidata suffer from this flaw, time as implemented in Wikidata serves as a poor model for anything.

Proof of concept: https://github.com/DataValues/Number/pull/43

@daniel wrote:

for the same reason 1964 with the precision set to century will be rendered as 20th century, not 1964+/-50.

Irrelevant, misleading comparison. Quantities do not have a precision, they have a lower and upper bound.

principle of consistency

Wut? How is the example in T95425#1601799 "consistent"?

not be a decision on the technical level.

So you refuse to call the broken behavior I showed in T95425#1601799 a bug? Again, I don't know how this could be more obvious.

@Jc3s5h wrote:

As an example of what's wrong with time [...] The death date of Benjamin Franklin [...]

Hurray, another discussion about time issues in a ticket where it does not belong, where it won't be read, cannot be considered, and does nothing but waste developers' time when they have to figure out what its relevance to the issue in this ticket is. Hint: it's zero.

principle of consistency

Wut? How is the example in T95425#1601799 "consistent"?

I'm not terribly familiar with the user interface of Phabricator. I don't know how to follow the link T95425#1601799 to get to the exact spot you are referring to.

@Jc3s5h wrote:

As an example of what's wrong with time [...] The death date of Benjamin Franklin [...]

Hurray, another discussion about time issues in a ticket where it does not belong, where it won't be read, cannot be considered, and does nothing but waste developers' time when they have to figure out what its relevance to the issue in this ticket is. Hint: it's zero.

You have missed my point, which is that since time is so messed up we should totally disregard how time represents uncertainty while working on representing the uncertainty of quantities.

As far as I know, this is resolved for the editing use case. Rounding still applies for HTML output. I think this should be either reworded or closed.

This is still causing incorrect data to be displayed.
I've entered a value of 350±150 because the source gives a range of 200-500. However, this is displayed as 400±200, which gives a range of 200-600, which is incorrect and misleading.

I don't see why there is any need or justification for displaying the output differently from the stored value - even more so when the difference is so significant (why is there ever a need to round to the nearest 100 for values of this magnitude? The nearest 10 would be understandable (but still wrong), but removing an order of magnitude more precision than that is baffling).

As far as I know, this is resolved for the editing use case. Rounding still applies for HTML output. I think this should be either reworded or closed.

This is still causing data loss, just for those who use the shown data rather than those who use the API stored value so the title still seems correct to me.

The reason 350±150 is shown as 400±200 is that rounding is applied based on the uncertainty, and the same rounding is applied to the uncertainty itself (basically, you cannot be more precise about the uncertainty than about the value itself). The reason we apply rounding here is for consistency: if unit conversion is applied, we have to apply rounding to avoid false precision. So for consistency, we always apply rounding.
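For illustration, the described rule can be approximated as "round both the amount and the uncertainty to the power of ten of the uncertainty's leading digit". The sketch below (plain Python, an approximation of the behaviour described in this task, not the actual DataValues/Number code) reproduces the 350±150 → 400±200 and 547±17 → 550±20 examples from this discussion:

```python
import math

def round_to_uncertainty(amount, uncertainty):
    """Approximation of the described rounding: round both numbers to the order of
    magnitude of the uncertainty (150 -> hundreds, 17 -> tens, 1 -> ones)."""
    order = 10 ** math.floor(math.log10(abs(uncertainty)))
    return round(amount / order) * order, round(uncertainty / order) * order

print(round_to_uncertainty(350, 150))  # (400, 200)
print(round_to_uncertainty(547, 17))   # (550, 20)
print(round_to_uncertainty(24, 10))    # (20, 10)
```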

But this is not set in stone. I'm coming around to the opinion that we should apply rounding only when needed, and show values as-is otherwise.

Regarding the "data loss" aspect of the HTML rendering I disagree: we have never guaranteed or even tried to make the HTML representation lossless. That would be rather hard to do, and would contradict the wish to have "nice" readable values in HTML. This is particularly true for values that are references to other entities - we show them using labels, not IDs, so when you copy&paste them, you lose information.

@daniel wrote:

The reason 350±150 is shown as 400±200 is that rounding is applied based on the uncertainty

I'm afraid this does not explain anything. The uncertainty is +/-150. Not +/-200. Rounding to +/-150 would mean ... rounding to 2 * 150 = 300? But why? The value is 350. What's wrong with stating that 350+/-150 is 350+/-150? What do we gain by stating that 350+/-150 is 400+/-200?

And why round the uncertainty based on ... what? Based on the uncertainty? How does this make sense?

Converting 350+/-150 feet to meters results in 106.68+/-45.72 meters. There is neither anything wrong with 350+/-150 feet nor with 106.68+/-45.72 meters. We can think about suppressing irrelevant digits and rendering this as a value that survives a round trip back to the original unit, e.g. 107+/-46 meters. Or rounding to 1% of the uncertainty. But this is not what happens currently.

Sure, an output like "106.68 meters" without the uncertainty would indeed introduce "false precision". But even this does not mean we must round it to 100. The error introduced by that is worse than the error from the false precision. We must still round this converted value to something that survives a round trip back to the original unit with an acceptable error of about 1%.
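A sketch of that round-trip criterion (hypothetical helper, assuming the 1% tolerance mentioned above and a hard-coded feet/meter factor; not actual library code): keep the fewest significant digits in the converted amount that still convert back to the original value within 1%.

```python
import math

def round_trip_safe_meters(amount_feet, max_relative_error=0.01):
    """Convert feet to meters, then keep the fewest significant digits that still
    convert back to within max_relative_error of the original value."""
    exact_meters = amount_feet * 0.3048
    exponent = math.floor(math.log10(abs(exact_meters)))
    for significant_digits in range(1, 16):
        candidate = round(exact_meters, significant_digits - 1 - exponent)
        back_in_feet = candidate / 0.3048
        if abs(back_in_feet - amount_feet) <= max_relative_error * abs(amount_feet):
            return candidate
    return exact_meters

print(round_trip_safe_meters(350))  # 107.0 -- survives the trip back to 350 ft within 1%
```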

we have never [...] tried to make the HTML representation lossless.

I'm not sure where this comes from. All string types are rendered in a perfectly lossless format. Identifiers, URLs, Commons media, all lossless. Time values are usually lossless, with rare edge cases.

references to other entities [...] when you copy&paste them, you lose information.

You copy them by right-clicking and copying the ID from the URL.

@thiemowmde: That's what happens when you use precision to express a range. I agree that it's confusing. +/-1 says that anything after the decimal point is insignificant. If you construct a range from that, the result is counter-intuitive.

The best way to avoid this would be to save just the number of significant digits instead of an uncertainty interval. That would make it unambiguous: 1.5 with one significant digit would be clear: that's 1. However, that would be a breaking change to our data (value) model.

This argument is invalid for multiple reasons:

  1. It does not matter if you call it a "range" or a "possibility". For all the arguments made here, it's the same thing.
    • In 1.5±1.0 it's possible that the value sits somewhere between 0.5 and 2.5. But it's impossible for it to sit at 3.0.
    • In 2±1 all possibilities do a magic jump of 0.5 to the right, rendering 0.5 impossible and 3.0 possible. This is just wrong; read: this is neither what 1.5±1.0 means nor what the user meant when he entered this. As I tried to illustrate above, an added error of 0.5 is highly significant. This is equivalent to 33% of the value or 50% of the precision.
  2. What we call "precision" is not an integer. It's a floating-point number. It does not say anything about "digits" and how "significant" they are. I think I already repeated this more often than I wanted to: there is no such thing as an "irrelevant" digit in here. 1.25+/-0.5 does not mean something like "half of the first digit after the decimal point is irrelevant". What is this even supposed to mean? How would you display this, so a user can read and understand it? Oh, I know: what about 1.25+/-0.5? What's unclear about this? Why do values with precision need rounding at all?

I propose this as an acceptance criterion for the formatter: silent manipulations of more than 1% of the original value are unacceptable.

In general: Situations where code thinks it knows better than the user are unacceptable.
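Expressed as a (hypothetical) check against whatever formatter/parser pair is in use — `format_quantity` and `parse_amount` below are placeholders, not existing functions — the criterion could look roughly like this:

```python
def check_no_silent_manipulation(format_quantity, parse_amount, amount, uncertainty,
                                 max_relative_error=0.01):
    """Fail if formatting and re-parsing a quantity changes the amount by more than 1%.
    format_quantity and parse_amount are placeholders for the real formatter/parser."""
    shown = format_quantity(amount, uncertainty)
    reparsed_amount = parse_amount(shown)
    relative_error = abs(reparsed_amount - amount) / abs(amount)
    assert relative_error <= max_relative_error, (
        f"{amount}±{uncertainty} was shown as {shown!r}; "
        f"that silently changes the value by {relative_error:.0%}"
    )
```

With the behaviour complained about in this task, such a check would fail for 1.5±1 (shown as 2±1, a 33% change) and for 350±150 (shown as 400±200, a roughly 14% change).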

This just came up on-wiki again at https://www.wikidata.org/wiki/Wikidata:Project_chat#Rounding_when_uncertainty_over_10_is_given where someone added a statement with 547±17 (which is 530-564) and instead it displays 550±20 (which is 530-570).

This just came up on-wiki again at https://www.wikidata.org/wiki/Wikidata:Project_chat#Rounding_when_uncertainty_over_10_is_given where someone added a statement with 547±17 (which is 530-564) and instead it displays 550±20 (which is 530-570).

That's about my comment.
The present situation is unacceptable. It does not matter that the value is stored correctly when it is displayed completely wrong and can be misleading. I see no reason why an automatic mechanism should get to decide when and how the values are rounded. That's absurd. The mechanism does not know whether the last digits are significant or not. We can't even specify what type of uncertainty it is.

In this case, the source stated 547±17 for some reason. Not 550±20. On what basis does the mechanism round the two significant digits of uncertainty to a useless "20"? Your mechanism cannot be smarter than the human. In my field of work such "rounding" may be very annoying, costly or even dangerous, because the "insignificant digits" matter.

I agree, we should really fix T117457: Do not apply rounding when formatting Quantities (unless unit conversion was applied). That should resolve the data loss issue.

The full solution is a bit more complicated, see T105623: [Task] Investigate quantification of quantity precision (+/- 1 or +/- 0.5). We have been working on this during the Wikimania hackathon, see https://github.com/DataValues/Number/pull/66

thiemowmde claimed this task.

Since https://github.com/DataValues/Number/pull/68 we are not applying artificial rounding any more when the precision is shown. \o/