
Display very large or very small quantity values using scientific notation
Open, Medium, Public

Description

After entering a value for the Planck constant (https://www.wikidata.org/wiki/Q122894) in terms of its SI unit joule-seconds, the value of 6.6x10^-34 displays in Wikidata as 0.0000000000 joule-second (and the uncertainty value also disappears). This isn't very useful. In a similar vein, entering the half-life of Bi-209 (https://www.wikidata.org/wiki/Q18888193), which is 1.9 +- 0.2 x 10^19 years, displays in Wikidata as 19,000,000,000,000,000,000±2,000,000,000,000,000,000, which is a little hard to read. I think Python's 'g' format defaults are reasonable: for anything below 10^-4 or above 10^6 (or longer than the default precision), display in scientific notation with an 'e'; otherwise display as a regular number. Or the quantity display format could perhaps be a user-specified setting, as with language.
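For illustration, this is roughly how Python's default 'g' formatting behaves on the values mentioned above (plain Python, not anything that exists in Wikibase):

```
# Python's default 'g' formatting on the values discussed above.
for value in (6.6e-34, 0.00012, 123456.0, 1.9e19):
    print(f"{value:g}")
# 6.6e-34   -> scientific: exponent is below -4
# 0.00012   -> plain: exponent is exactly -4
# 123456    -> plain: fits within the default precision of 6
# 1.9e+19   -> scientific: exponent is at or above the precision
```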

Event Timeline

ArthurPSmith raised the priority of this task from to Needs Triage.
ArthurPSmith updated the task description.
ArthurPSmith added a project: Wikidata.
ArthurPSmith subscribed.

Scientific values are not the only place where large numbers occur. For example, the total assets of https://www.wikidata.org/wiki/Q312 and the population of https://www.wikidata.org/wiki/Q46 are both above 10^6 and are not places where scientific notation would be expected. Looking at http://tinyurl.com/ybhuvam6, most properties with values over 10^6 are things relating to amounts of money and numbers of people. I think this should depend on the property.

I think scientific notation should be triggered by precision/uncertainty, not by value or property. By convention, scientific notation is used to avoid writing out insignificant digits:
E.g. 3,000,000+-10,000 can be written as 300e4, while 3,000,000+-0 must be written as 3,000,000.
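A minimal sketch of that convention, assuming the rule is "show only the digits the uncertainty leaves significant" (the function and its name are mine, purely for illustration, not an existing formatter):

```
def compact(value, uncertainty):
    """Illustrative sketch: push the digits covered by the uncertainty
    into an exponent instead of printing them."""
    if uncertainty == 0:
        # +-0 means "exact", so every digit is significant.
        return str(value)
    # Trailing digits covered by the uncertainty, e.g. 10,000 covers 4.
    insignificant = len(str(uncertainty)) - 1
    return f"{value // 10**insignificant}e{insignificant}"

print(compact(3_000_000, 10_000))  # 300e4
print(compact(3_000_000, 0))       # 3000000
```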

E.g. 3,000,000+-10,000 can be written as 300e4, while 3,000,000+-0 must be written as 3,000,000.

In theory, yes. In practice, we have tons of values in the DB which are +-0 only because nobody bothered to set the proper precision. In fact, I think most people have only a very vague idea of how that works. Also, given the current UI, it's kind of hard to set the correct one. If you are dealing with something like US GDP, we have 17,419,000,000,000 as a value, and it has +-1 as precision, which is clearly wrong; nobody knows it down to a dollar. But to make it right, one needs to type 1000000000 (my guess at the actual precision) without making a mistake in the count of zeroes, and do it consistently for every GDP figure. The odds of that happening on Wikidata seem very low to me.

OTOH, I would very much appreciate seeing the number above as 17.419 trln or 17,419 bln or at least 17.419×10¹². It's much easier to consume than a huge row of zeroes.

@Smalyshev To fix it, you type 17419e9. Also, +/-0 only happens if people set it explicitly. +/-1 used to be applied if you didn't do anything. This is no longer the case either.

The more you use and expose the data, the more incentive there is to fix it. Let's not try to be smart and "correct" wonky data. Show wonky data doing wonky things, so users will see and fix it.

"we'll guess what you meant and make it look nice" is bad. "To make this show like you wanted, go fix the precision" is good. The goal is to have correct data, not nicely looking data.

I did some basic research; right now, for values >= 10^12, we have:

  • total values: 918
  • no precision: 456 (~50%)
  • +-0 precision: 160 (17%)
  • small (<10) value for precision: 22 (2%)
  • bigger precision value: 280 (30%)

for values >= 10^9:

  • total values: 15670
  • no precision: 5677 (36%)
  • +-0 precision: 5250 (33%)
  • small (<10) value for precision: 435 (2%)
  • bigger precision value: 4308 (27%)

(This does not include the very small values, but the pattern seems clear enough that I didn't spend more time on it; it's easy to do if anybody wants to.)
I think this says that we can pretty much ignore the values with small precisions (i.e., assume that they are accurate even though in some cases they are not, and fix the wrong ones manually or semi-manually).

However, the question of the zero-precision ones remains. Many of them are not really zero-precision but just values entered without proper precision (we didn't have the ability to omit precision before) - and for those, I think we can compact trailing zeroes, i.e. display 3000000+-0 as "3 mln" (or whatever form we choose).
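A rough sketch of that trailing-zero compaction, with the suffixes just as examples of "whatever form we choose" (nothing here is an existing formatter):

```
def compact_zeros(value):
    """Illustrative sketch: collapse runs of trailing zeroes into a word."""
    for factor, word in ((10**12, "trln"), (10**9, "bln"), (10**6, "mln")):
        if value >= factor and value % factor == 0:
            return f"{value // factor} {word}"
    return str(value)

print(compact_zeros(3_000_000))           # 3 mln
print(compact_zeros(17_419_000_000_000))  # 17419 bln
```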

The goal is to have correct data, not nice-looking data.

I think the goal is both. If we have correct data that is hardly usable because it is incomprehensible, that's almost as bad as having no data. We could use combined displays (hovers? titles? any other JS/CSS tricks?) to enable seeing both compact and full forms, but right now I feel big values are rather hard to work with.

I think the goal is both. If we have correct data that is hardly usable because it is incomprehensible, that's almost as bad as having no data. We could use combined displays (hovers? titles? any other JS/CSS tricks?) to enable seeing both compact and full forms, but right now I feel big values are rather hard to work with.

I'm not saying we shouldn't make things look pretty. I'm saying we shouldn't make *wrong* things look pretty by guessing.

+/-0 was *never* assumed automatically (+/-1 was). If it's there, it's there because someone put it there. And it was always possible to specify precision.

If we support compact notation based on precision, that will make the impact of entering the "wrong" precision more obvious. I think this is a good thing, not something to be avoided.

If it's there, it's there because someone put it there.

Possible. But given that 1/4 to 1/3 of the data looks like this, that I have a very hard time believing these are all exactly precise numbers, and that the counts are such that manually reviewing them is out of the question, I think we have to look for a better solution. I'm not against starting with the no-precision and correct-precision ones, which covers 2/3 of the cases, but we should think about the remaining 1/3 too.

There was widespread use of ±0 by the community because, until recently, it was the only way to make it work the way people expected it to, which was to display just the value that they were trying to enter without doing anything weird to it. There were even bots which changed ±1 values to ±0 to remove the unwanted uncertainties. The majority of statements which now say ±0 really mean that we did not want to enter an uncertainty. (You might think people were wrong to do that, but that's what happens when it doesn't work the way users expect.)

I think it would be a mistake to use scientific notation without taking the property into account. Scientific notation is largely restricted to scientific contexts and is not universally understood. The quantity datatype has to be used for all quantities and any quantity can be given to a certain precision.

I wonder if it would be easy to find the changes made by those bots and change them from +-0 to "no precision".

As for the precise display format to use, that is a separate question - maybe we could have a property that dictates how a given property formats big (long) numbers?
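As a sketch of that idea (the property IDs and hint names below are placeholders I made up; no such formatting property exists on Wikidata today):

```
# Hypothetical per-property display hints; the IDs are placeholders only.
FORMAT_HINTS = {
    "P9999001": "scientific",  # e.g. a physical-constant-like property
    "P9999002": "grouped",     # e.g. a money- or population-like property
}

def format_for_property(property_id, value):
    """Illustrative sketch: pick a display style from a per-property hint."""
    hint = FORMAT_HINTS.get(property_id, "grouped")
    if hint == "scientific":
        return f"{value:.3e}"
    return f"{value:,}"

print(format_for_property("P9999001", 6.6e-34))         # 6.600e-34
print(format_for_property("P9999002", 17419000000000))  # 17,419,000,000,000
```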

Would it not be appropriate to have two separate formatting tools - one oriented towards scientific use and one towards business/economic use? My interest is in scientific use, where I suggest the following fields:
*value (mandatory)
*exponent (mandatory)
*uncertainty (optional)
*positive uncertainty (optional)
*negative uncertainty (optional)
*units (optional)
The uncertainty fields would be subject to the following constraints (a small sketch follows the list):
*It is not required that any uncertainty fields be specified.
*If the field "uncertainty" is specified, then neither the "positive uncertainty" nor the "negative uncertainty" fields may be specified.
*If either of the fields "positive uncertainty" or "negative uncertainty" is specified, then the other must also be specified.
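A small sketch of those constraints as a validity check (the field names follow the list above; the function itself is only illustrative):

```
def check_uncertainty_fields(uncertainty=None, positive=None, negative=None):
    """Illustrative sketch: do the optional uncertainty fields satisfy
    the constraints listed above?"""
    # Specifying no uncertainty fields at all is allowed.
    if uncertainty is None and positive is None and negative is None:
        return True
    # A symmetric "uncertainty" excludes the asymmetric fields.
    if uncertainty is not None:
        return positive is None and negative is None
    # Asymmetric uncertainties must be given as a pair.
    return positive is not None and negative is not None

print(check_uncertainty_fields(uncertainty=0.2))             # True
print(check_uncertainty_fields(positive=0.3))                # False
print(check_uncertainty_fields(positive=0.3, negative=0.1))  # True
```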