
Thousand separator for large numbers
Open, Needs Triage, Public

Description

Using the sequence {{#property:P1181}} or {{formatnum:{{formatnum:{{#property:P1181}}|R}}}}, the parser function formatnum will use non-breaking spaces. This creates some problems.

See for example w:no:Pi. In this case the number is so long that it overflows the right side of the page, even though it has “spaces”. This formatting (i.e. missing grouping after the decimal point) is wrong in Norwegian according to Norsk språkråd.[https://www.sprakradet.no/sprakhjelp/Skriveregler/Dato/#store]

Another example is w:en:Euler–Mascheroni constant which has no visual thousand separator. Still, what they want to do at enwiki should be up to that project.

A simple solution would be to replace every fourth occurrence of a non-breaking space with an ordinary space, but only when the number contains at least eight occurrences. In that case every fourth occurrence, counted from the decimal point, is replaced. This should leave most numbers unchanged, but still let very large numbers break as necessary.
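The replacement rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the behaviour of formatnum or of any existing module: the names `soften_separators` and `_soften` are hypothetical, the decimal separator is assumed to be a comma (as in Norwegian), and "counted from the decimal point" is read as counting leftwards in the integer part and rightwards in the fractional part.

```python
NBSP = "\u00a0"  # non-breaking space, the separator described in this task


def _soften(part: str, every: int) -> str:
    """Replace every `every`-th NBSP in `part`, counting from its start."""
    out = []
    seen = 0
    for ch in part:
        if ch == NBSP:
            seen += 1
            if seen % every == 0:
                ch = " "  # ordinary space, which allows a line break
        out.append(ch)
    return "".join(out)


def soften_separators(text: str, decimal: str = ",",
                      every: int = 4, minimum: int = 8) -> str:
    """Hypothetical helper: make every fourth separator breakable.

    Numbers with fewer than `minimum` NBSPs are returned unchanged.
    """
    if text.count(NBSP) < minimum:
        return text
    integer, sep, fraction = text.partition(decimal)
    # The integer part is reversed so that counting starts at the decimal point.
    return _soften(integer[::-1], every)[::-1] + sep + _soften(fraction, every)
```

With nine three-digit groups (eight separators) and no decimal part, the fourth and eighth separators counted from the right become ordinary spaces; a number with only a few groups is left untouched.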

(A better solution would be to always use medium mathematical space as this keeps the semantic info about this being a single number.)

For other languages a zero-width space could be injected after the thousand separator in a similar fashion. That can even work for languages that have no visible thousand separator.

Event Timeline

I'm not sure it's possible to automatically figure out the correct action with regard to line breaking here.

Made w:no:Module:LargeNum to handle formatting, and it seems to work reasonably well. See for example w:no:Pi.

Note that it replaces the non-breaking space with a medium mathematical space, which makes it possible to interpret the number groups as a single number.

Not sure whether I should close the task, as the ordinary formatting is still wrong.

The tradition for formatting long numbers has always been to have several grouping lengths, i.e. small groups of 3 digits separated by non-breaking spaces (best if these are narrow, i.e. NNBSP U+202F whose width is about 0.3em, not NBSP U+00A0 whose width is about 0.5em), but still allow larger groups to split on long lines.

Typographic tradition is to avoid overlong lines of more than 36em. In 36em you could fit 72 digits (at about 0.5em each), but with the extra group separators you can fit at most 54 digits (with 18 separators).
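The figures above can be checked with a short calculation. This is a sketch under assumptions: the function name `max_grouped_digits` is hypothetical, digits are taken to be 0.5em wide (which is what 72 digits in 36em implies), and the figure of 54 digits is reproduced only if each separator is also taken to be about 0.5em wide.

```python
from fractions import Fraction


def max_grouped_digits(line_em, digit_em, separator_em, group_size=3):
    """Count the digits of whole groups that fit on a line of `line_em` width.

    Widths are passed as ints or decimal strings (e.g. "0.5") so that
    Fraction keeps the arithmetic exact. n groups need n - 1 separators.
    """
    line = Fraction(line_em)
    digit = Fraction(digit_em)
    sep = Fraction(separator_em)
    # n * group_width + (n - 1) * sep <= line  =>  n <= (line + sep) / (group_width + sep)
    groups = (line + sep) // (group_size * digit + sep)
    return int(groups) * group_size


print(max_grouped_digits(36, "0.5", "0"))    # no separators: 72
print(max_grouped_digits(36, "0.5", "0.5"))  # 0.5em separators: 54
print(max_grouped_digits(36, "0.5", "0.3"))  # 0.3em separators: 60
```

With the narrower 0.3em separator mentioned earlier for NNBSP, the same line would hold 60 digits rather than 54, so the 54-digit figure is the conservative one.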

It also seems reasonable to allow for narrower columns of text, so we could group by at most 30 digits (with 10 group separators) before adding a secondary group separator, which is mostly the same but breakable.

So:

  • primary group separator: NNBSP every 3 digits
  • secondary group separator: THINSP every 30 digits (or every 10 primary groups), which offers a neat way for counting groups.
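A minimal sketch of this two-level grouping for the integer part of a number, assuming groups are counted from the right (from the decimal point). The function name `group_integer` and its parameters are hypothetical and do not describe any existing formatter.

```python
NNBSP = "\u202f"   # narrow no-break space: primary separator, not breakable
THINSP = "\u2009"  # thin space: secondary separator, allows a line break


def group_integer(digits: str, primary: int = 3, per_secondary: int = 10) -> str:
    """Group `digits` by `primary` digits, with a breakable separator
    after every `per_secondary` primary groups, counted from the right."""
    if not digits.isdigit():
        raise ValueError("expected a non-empty string of digits")
    # Cut the string into primary groups, starting from the right.
    groups = []
    end = len(digits)
    while end > 0:
        start = max(0, end - primary)
        groups.append(digits[start:end])
        end = start
    groups.reverse()
    out = [groups[0]]
    for i in range(1, len(groups)):
        groups_to_the_right = len(groups) - i
        # Every `per_secondary`-th boundary gets the breakable separator.
        sep = THINSP if groups_to_the_right % per_secondary == 0 else NNBSP
        out.append(sep)
        out.append(groups[i])
    return "".join(out)
```

For a 61-digit integer this yields 21 groups with 20 separators, two of which (at 30 and 60 digits from the right) are thin spaces. The `primary` and `per_secondary` parameters are there so that other group sizes can be tried.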

Note that English uses commas without any space, and a comma between two digits is normally not breakable either. Some Swiss German locales use apostrophes as group separators, which are likewise not breakable between two digits. Other locales use a full stop as group separator, but this is not recommended and the narrow non-breaking space is preferred (it is also preferred for ISO formats over the classic English comma).

CJK locales traditionally use their own sinograms for digits, and all sinograms are individually breakable (except after an opening punctuation mark or before a closing punctuation mark, since these marks are glued to the adjacent sinograms), and grouping is not used at all: grouping is naturally performed by the standard columns of text (with the modern horizontal presentation) or rows of text (with the traditional vertical presentation).
They can also use "halfwidth" digits (which are mostly like normal European digits, except that they behave like CJK sinograms and are breakable, these digits being just centered in one half of each sinographic square).

"Halfwidth" CJK digits are also sometimes used in non-CJK contexts, notably for presenting long numbers with arbitrary precision. Instead of explicit group separators, each digit has an exact and predictable width of 0.5em: you group digits using columns of text, and since all these CJK digits are breakable, you just have to adjust the column width to the desired grouping.

Note that "Formatnum:" was never really designed to format numbers with very high precision; it was always limited to the precision of IEEE doubles (i.e. about 15 to 17 significant decimal digits in the mantissa). Even if we add the sign, the decimal point and the exponent notation, we are still well below the 36em limit, and all these numbers fit inside an 18em-wide column.

So this is an issue only for packages dealing with big numbers, whose formatters should support secondary groups.

The secondary group should always be a multiple of the primary group. Be careful, though, with some Indian locales where these groups are 4 digits and not 3 digits: setting the secondary groups to be every 10 primary groups would create secondary groups of 40 digits instead of 30. Note also that for some Indian locales the digits are wider than European digits, so the secondary groups would need to be reduced, possibly to every 5 primary groups, i.e. every 20 digits.

So the secondary grouping could be locale-sensitive.