Page MenuHomePhabricator

Invariant failed: Bad UTF-8 (full string verification)
Closed, ResolvedPublicPRODUCTION ERROR

Description

message
Invariant failed: Bad UTF-8 (full string verification)

Notes

There are other bad utf-8 assertion bugs (start of string, end of string). I think this is the first instance of this particular assertion failing so far in all our testing so far.

New tracking category

A good & detailed explanation of the new tracking category is at the English Wikipedia's page: https://en.wikipedia.org/wiki/Category:Pages_with_non-numeric_formatnum_arguments

Details

Request ID
XcCwRQpAIDgAAA2zlU8AAAEC
Request URL
/w/rest.php/kn.wikipedia.org/v3/page/pagebundle/%E0%B2%AE%E0%B3%87%E0%B2%B0%E0%B2%BF%E0%B2%B2%E0%B3%8D%E0%B2%AF%E0%B2%BE%E0%B2%82%E0%B2%A1%E0%B3%8D/950047
Stack Trace
exception.trace
#0 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Utils/PHPUtils.php(276): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/PegTokenizer.php(94): Parsoid\Utils\PHPUtils::assertValidUTF8(string)
#2 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/ParserPipeline.php(127): Parsoid\Wt2Html\PegTokenizer->process(string, array)
#3 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Utils/PipelineUtils.php(110): Parsoid\Wt2Html\ParserPipeline->parse(string, array)
#4 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/TT/TemplateHandler.php(620): Parsoid\Utils\PipelineUtils::processContentInPipeline(Parsoid\Config\Env, Parsoid\Wt2Html\PageConfigFrame, string, array)
#5 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/TT/TemplateHandler.php(1399): Parsoid\Wt2Html\TT\TemplateHandler->processTemplateSource(array, array, string)
#6 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/TT/TemplateHandler.php(1451): Parsoid\Wt2Html\TT\TemplateHandler->onTemplate(Parsoid\Tokens\SelfclosingTagTk)
#7 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/TT/TokenHandler.php(211): Parsoid\Wt2Html\TT\TemplateHandler->onTag(Parsoid\Tokens\SelfclosingTagTk)
#8 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/TokenTransformManager.php(120): Parsoid\Wt2Html\TT\TokenHandler->process(array)
#9 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/TokenTransformManager.php(192): Parsoid\Wt2Html\TokenTransformManager->processChunk(array)
#10 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/TokenTransformManager.php(190): Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#11 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/HTML5TreeBuilder.php(431): Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#12 [internal function]: Parsoid\Wt2Html\HTML5TreeBuilder->processChunkily(string, array)
#13 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/DOMPostProcessor.php(895): Generator->current()
#14 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/ParserPipeline.php(148): Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#15 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/ParserPipeline.php(198): Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#16 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Wt2Html/ParserPipelineFactory.php(308): Parsoid\Wt2Html\ParserPipeline->parseToplevelDoc(string, array)
#17 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/WikitextContentModelHandler.php(78): Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#18 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Parsoid.php(86): Parsoid\WikitextContentModelHandler->toHTML(Parsoid\Config\Env)
#19 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/src/Parsoid.php(113): Parsoid\Parsoid->parseWikitext(MWParsoid\Config\PageConfig, array)
#20 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/extension/src/Rest/Handler/ParsoidHandler.php(543): Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#21 /srv/deployment/parsoid/deploy-cache/revs/a69ec92e21cc4be117daaadef4a8fc5bf5813fcf/src/extension/src/Rest/Handler/PageHandler.php(55): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(Parsoid\Config\Env, array)
#22 /includes/Rest/Router.php(315): MWParsoid\Rest\Handler\PageHandler->execute()
#23 /includes/Rest/Router.php(285): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#24 /includes/Rest/EntryPoint.php(116): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#25 /includes/Rest/EntryPoint.php(83): MediaWiki\Rest\EntryPoint->execute()
#26 /rest.php(31): MediaWiki\Rest\EntryPoint::main()
#27 /srv/mediawiki/w/rest.php(3): require(string)
#28 {main}

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 623744 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/Cite@master] Stop using Language::formatNum to localize separators

https://gerrit.wikimedia.org/r/623744

The tests in https://gerrit.wikimedia.org/r/623442 currently fail for several reasons. One is code in the Cite extension. I uploaded https://gerrit.wikimedia.org/r/623744 to change this code so it doesn't rely any more on the discussed behavior.

I also uploaded https://gerrit.wikimedia.org/r/623700 where I add test cases that demonstrate how https://gerrit.wikimedia.org/r/623442 changes the behavior of Language::formatNum(). This is merely a suggestion and additional input for the discussion, not meant to block anything, and not meant as an argument about what is right or wrong. Please feel free to change or revert anything I did.

However, I would like to point out that Language::formatNum() previously accepted any string, not only numeric ones. While this is not obvious from the name formatNum – even counter-intuitive – it is how formatNum always worked, and how it is used in several places. It doesn't really format numbers. For example, it's unable to change the number of leading or trailing zeros, unable to change the number of decimal places, unable to round numbers, unable to add units. That's what we expect from something called "format number", but that's not what formatNum does. What it does is converting numeric characters and delimiters into localized ones, and optionally adding delimiters. For example, it will happily turn a string like "In May, 2500 users did 1900000 edits" into "In May, 2,500 users did 1,900,000 edits".

formatNum is just badly named.

As the patch is now, it returns an empty string instead. As far as I'm concerned this is a breaking change that would need to be announced as such. I'm not arguing against this change. We can make it. But when reading this tasks description it appears like it's not necessary to make this change. Maybe we can fix the UTF-8 issue without that additional change? I will happily give it a try if it helps.

Hmm .. okay. I would rather us not bite off that formatNum change problem. There are always so many things to fix and change in MediaWiki that we want to be a bit deliberate about it. So, if there is another way to fix this UTF-8 issue without causing changes to users of commafy, then that is what I would prefer. Thanks for flagging this.

cscott added a comment.Sep 3 2020, 4:28 PM

However, I would like to point out that Language::formatNum() previously accepted any string, not only numeric ones. While this is not obvious from the name formatNum – even counter-intuitive – it is how formatNum always worked, and how it is used in several places. It doesn't really format numbers. For example, it's unable to change the number of leading or trailing zeros, unable to change the number of decimal places, unable to round numbers, unable to add units. That's what we expect from something called "format number", but that's not what formatNum does. What it does is converting numeric characters and delimiters into localized ones, and optionally adding delimiters. For example, it will happily turn a string like "In May, 2500 users did 1900000 edits" into "In May, 2,500 users did 1,900,000 edits".

formatNum is just badly named.

As the patch is now, it returns an empty string instead. As far as I'm concerned this is a breaking change that would need to be announced as such. I'm not arguing against this change. We can make it. But when reading this tasks description it appears like it's not necessary to make this change. Maybe we can fix the UTF-8 issue without that additional change? I will happily give it a try if it helps.

I think part of the problem here is that formatNum had an extremely-poorly specified contract. There is certainly existing code in formatNum which will (even before my patch) fail and return the empty string if you pass in strings like the ones above. Unfortunately, these paths *only execute for i18n locales* -- the special case that handles the "normal" 3-digit groupings will work fine as you state.

This means that (as usual) we have prioritized European/Latin-script languages, and silently broken other locales, by relying on behavior that only works for 3-digit groupings. This is not good.

It sounds like we should extract the numeric groups and then run a narrower formatNum on those groups, which would preserve the behavior you describe for all locales, not just 3-digit grouping locales.

Change 623744 merged by jenkins-bot:
[mediawiki/extensions/Cite@master] Stop using Language::formatNum to localize separators

https://gerrit.wikimedia.org/r/623744

Krinkle added a subscriber: Krinkle.

Moving to June to match the merged T236866 and to reflect when it was reported, not when it was found to have been caused etc.

Change 625587 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: C. Scott Ananian):
[mediawiki/core@master] languages: Minimal fix for UTF-8 corruption in Number::commafy

https://gerrit.wikimedia.org/r/625587

Change 626005 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Ensure we only try to formatNum() numeric strings

https://gerrit.wikimedia.org/r/626005

Change 626024 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] CoreParserFunctions: ensure formatNum is only called on numeric strings

https://gerrit.wikimedia.org/r/626024

Change 626053 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/extensions/AbuseFilter@master] AbuseFilterViewEdit: only invoke Language::filterNum on a numeric string

https://gerrit.wikimedia.org/r/626053

Change 626053 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] AbuseFilterViewEdit: only invoke Language::filterNum on a numeric string

https://gerrit.wikimedia.org/r/626053

Change 626024 merged by jenkins-bot:
[mediawiki/core@master] CoreParserFunctions: ensure formatNum is only called on numeric strings

https://gerrit.wikimedia.org/r/626024

ssastry assigned this task to cscott.Sep 16 2020, 2:12 PM

Change 625587 abandoned by Thiemo Kreuz (WMDE):
[mediawiki/core@master] languages: Minimal fix for UTF-8 corruption in Number::commafy

Reason:

https://gerrit.wikimedia.org/r/625587

Change 626005 merged by jenkins-bot:
[mediawiki/core@master] EditPage: ensure we only try to formatNum() numeric strings

https://gerrit.wikimedia.org/r/626005

Change 623442 merged by jenkins-bot:
[mediawiki/core@master] Language: ensure commafy does not corrupt UTF-8 strings

https://gerrit.wikimedia.org/r/623442

Change 629219 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Disable deprecated warning in Language::commafy() for non numeric string

https://gerrit.wikimedia.org/r/629219

Change 629188 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@wmf/1.36.0-wmf.10] Disable deprecated warning in Language::commafy() for non numeric string

https://gerrit.wikimedia.org/r/629188

Reedy added a subscriber: Reedy.Sep 22 2020, 9:04 PM

Due to T263592: Use of Language::commafy with a non-numeric string was deprecated in MediaWiki 1.36. [Called from Language::formatNum], can we change this to a more specific log for a while to try and fix up some usages rather than causing logspam?

Change 629188 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.10] Disable deprecated warning in Language::commafy() for non numeric string

https://gerrit.wikimedia.org/r/629188

Change 629203 had a related patch set uploaded (by Urbanecm; owner: Urbanecm):
[mediawiki/core@wmf/1.36.0-wmf.10] Revert "Disable deprecated warning in Language::commafy() for non numeric string"

https://gerrit.wikimedia.org/r/629203

Change 629203 merged by Urbanecm:
[mediawiki/core@wmf/1.36.0-wmf.10] Revert "Disable deprecated warning in Language::commafy() for non numeric string"

https://gerrit.wikimedia.org/r/629203

Mentioned in SAL (#wikimedia-operations) [2020-09-23T11:38:47Z] <Urbanecm> Revert https://gerrit.wikimedia.org/r/c/mediawiki/core/+/629188 and fetch to deploy1001 to unblock EU B&C deployment (T237467; cc twentyafterfour)

@mmodell I see you've +2'ed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/629188, but did not deploy the patch. For that reason, I reverted it, to unblock EU B&C window. Feel free to deploy it, but please do not +2 in a wmf branch without deploying. Thanks!

@Urbanecm I apologize, I meant to deploy that last night.

Change 629426 had a related patch set uploaded (by Ahmon Dancy; owner: Reedy):
[mediawiki/core@wmf/1.36.0-wmf.10] Disable deprecated warning in Language::commafy() for non numeric string

https://gerrit.wikimedia.org/r/629426

Change 629219 merged by jenkins-bot:
[mediawiki/core@master] Disable deprecated warning in Language::commafy() for non numeric string

https://gerrit.wikimedia.org/r/629219

Change 629426 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.10] Disable deprecated warning in Language::commafy() for non numeric string

https://gerrit.wikimedia.org/r/629426

Reedy added a comment.Sep 27 2020, 2:56 PM

Because it hasn't been registered correctly

Change 630377 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Register 'nonnumeric-formatnum' in TrackingCategories::$coreTrackingCategories

https://gerrit.wikimedia.org/r/630377

Change 630377 merged by jenkins-bot:
[mediawiki/core@master] Register 'nonnumeric-formatnum' in TrackingCategories::$coreTrackingCategories

https://gerrit.wikimedia.org/r/630377

The following wikicode samples assign the non-numeric error category on en.WP:

{{formatnum:−9.1}}
{{formatnum:−9000000}}

The input, negative 9.1 or negative nine million, are valid numbers in the appropriate format. Should they throw an error? My apologies if I am reporting this in the wrong ticket.

For {{formatnum:1,234}} it gives warning tracking category, so to preserve the old behavior editors would need to start with reverse format, followed by formatting e,: {{formatnum::{{{x}}|}} -> {{formatnum:{{formatnum:{{{x}}|R}}}} ?

or is there a different better way to format unknown input (with or without commas) into nicely formatted number?

matej_suchanek added a subscriber: matej_suchanek.

Could we get a summary of efforts and outcomes for Tech News?

Still seen in production today for wmf.11:

reqId: f6017f38-e1bf-4bf5-a3a1-5388d1acfa6e

#0 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(258): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(115): Wikimedia\Parsoid\Utils\PHPUtils::assertValidUTF8(string)
#2 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\PegTokenizer->processChunkily(string, array)
#3 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#4 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#5 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/HTML5TreeBuilder.php(420): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#6 [internal function]: Wikimedia\Parsoid\Wt2Html\HTML5TreeBuilder->processChunkily(string, array)
#7 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(900): Generator->current()
#8 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(152): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#9 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(202): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#10 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseToplevelDoc(string, array)
#11 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Core/WikitextContentModelHandler.php(81): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#12 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(161): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM(Wikimedia\Parsoid\Config\Env)
#13 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(193): Wikimedia\Parsoid\Parsoid->parseWikitext(MWParsoid\Config\PageConfig, array)
#14 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(588): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#15 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(88): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#16 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(1047): MWParsoid\Rest\Handler\PageHandler->realExecute()
#17 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(381): MWParsoid\Rest\Handler\ParsoidHandler->execute()
#18 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(316): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#19 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(155): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#20 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(119): MediaWiki\Rest\EntryPoint->execute()
#21 /srv/mediawiki/php-1.36.0-wmf.11/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#22 /srv/mediawiki/w/rest.php(3): require(string)
#23 {main}

The following wikicode samples assign the non-numeric error category on en.WP:

{{formatnum:−9.1}}
{{formatnum:−9000000}}

The input, negative 9.1 or negative nine million, are valid numbers in the appropriate format. Should they throw an error? My apologies if I am reporting this in the wrong ticket.

I can't reproduce this: https://en.wikipedia.org/wiki/User:Cscott/T237467

For {{formatnum:1,234}} it gives warning tracking category, so to preserve the old behavior editors would need to start with reverse format, followed by formatting e,: {{formatnum::{{{x}}|}} -> {{formatnum:{{formatnum:{{{x}}|R}}}} ?

or is there a different better way to format unknown input (with or without commas) into nicely formatted number?

https://en.wikipedia.org/wiki/Category:Pages_with_non-numeric_formatnum_arguments provides some alternatives. Best is to control your input so that it is not 'unknown' and ensure that it is always in the format you expect; this is the only solution which is robustly localizable to user preferences; otherwise there will always be ambiguity about the meanings of . and , etc.

Change 635374 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Re-enable deprecated warning in Language::commafy() for non-numeric string

https://gerrit.wikimedia.org/r/635374

The following wikicode samples assign the non-numeric error category on en.WP:

{{formatnum:−9.1}}
{{formatnum:−9000000}}

The input, negative 9.1 or negative nine million, are valid numbers in the appropriate format. Should they throw an error? My apologies if I am reporting this in the wrong ticket.

I can't reproduce this: https://en.wikipedia.org/wiki/User:Cscott/T237467

I can reproduce it in article space. Try searching for insource:/formatnum:−/

The problem currently appears to manifest here:
https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter
https://en.wikipedia.org/wiki/Almirante_Latorre-class_battleship

Change 635374 merged by jenkins-bot:
[mediawiki/core@master] Re-enable deprecation warning in Language::commafy() for non-numeric string

https://gerrit.wikimedia.org/r/635374

cscott added a comment.EditedOct 21 2020, 2:40 AM

I can reproduce it in article space. Try searching for insource:/formatnum:−/

The problem currently appears to manifest here:
https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter
https://en.wikipedia.org/wiki/Almirante_Latorre-class_battleship

Those aren't - (ie, \u2D, https://en.wikipedia.org/wiki/Hyphen-minus) those are \u2212 which may look pretty but is not considered numeric.

Note that the *input* to {{formatnum}} is expected to be a "computer number"/computed number without formatting: ASCII digits, no commas, and . to separate the decimal. If you wanted to use U+2212 to pretty-up the output, you'd do that in the *output* of {{formatnum}}, which is human-readable and *not* unambiguously computer-readable (ie, roles of . and , are variable).

The following wikicode samples assign the non-numeric error category on en.WP:

{{formatnum:−9.1}}
{{formatnum:−9000000}}

The input, negative 9.1 or negative nine million, are valid numbers in the appropriate format. Should they throw an error? My apologies if I am reporting this in the wrong ticket.

I can't reproduce this: https://en.wikipedia.org/wiki/User:Cscott/T237467

I can reproduce it in article space. Try searching for insource:/formatnum:−/

The problem currently appears to manifest here:
https://en.wikipedia.org/wiki/Savitzky%E2%80%93Golay_filter
https://en.wikipedia.org/wiki/Almirante_Latorre-class_battleship

Those aren't - (ie, \u2D, https://en.wikipedia.org/wiki/Hyphen-minus) those are \u2212 which may look pretty but is not considered numeric.

Except for the last clause, you're right; that's the bug: the Unicode minus sign, U+2212, is not recognized as a minus sign mathematical operator by formatnum. U+2212 is the Unicode minus sign mathematical operator.

https://en.wikipedia.org/wiki/Template:Minus
https://en.wikipedia.org/wiki/Plus_and_minus_signs#Minus_sign

https://www.unicode.org/charts/PDF/U2200.pdf (top of page labeled "2200 - Mathematical Operators - 2224")

Sorry, that's not how formatnum works. It takes a PHP numeric string and formats it for human consumption, as per http://unicode.org/reports/tr35/tr35-numbers.html#Number_Format_Patterns and https://www.php.net/manual/en/class.numberformatter.php. It is intended for making the output of computed values more human-friendly.
This is how it has always worked, it just used to be more tolerant and pass through non-numeric characters (like U+2212) unmodified, bypassing localization and user numeric preferences. Please open a different phab task if you would like to change it (I'd be supportive of making the *output* of formatnum use a U+2212 for instance).

Still seen in production today for wmf.11:

reqId: f6017f38-e1bf-4bf5-a3a1-5388d1acfa6e

#0 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(258): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(115): Wikimedia\Parsoid\Utils\PHPUtils::assertValidUTF8(string)
#2 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\PegTokenizer->processChunkily(string, array)
#3 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#4 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#5 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/HTML5TreeBuilder.php(420): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#6 [internal function]: Wikimedia\Parsoid\Wt2Html\HTML5TreeBuilder->processChunkily(string, array)
#7 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(900): Generator->current()
#8 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(152): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#9 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(202): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#10 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseToplevelDoc(string, array)
#11 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Core/WikitextContentModelHandler.php(81): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#12 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(161): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM(Wikimedia\Parsoid\Config\Env)
#13 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(193): Wikimedia\Parsoid\Parsoid->parseWikitext(MWParsoid\Config\PageConfig, array)
#14 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(588): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#15 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(88): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#16 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(1047): MWParsoid\Rest\Handler\PageHandler->realExecute()
#17 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(381): MWParsoid\Rest\Handler\ParsoidHandler->execute()
#18 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(316): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#19 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(155): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#20 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(119): MediaWiki\Rest\EntryPoint->execute()
#21 /srv/mediawiki/php-1.36.0-wmf.11/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#22 /srv/mediawiki/w/rest.php(3): require(string)
#23 {main}

This crasher is actually due to bad UTF-8 in the database... which I'm pretty sure we're not ever supposed to let get stored. Some sanity check is missing in either the PST or the edit API, I guess.

The page is https://ja.wikipedia.org/wiki/Wikipedia%3A%E5%89%8A%E9%99%A4%E8%A8%98%E9%8C%B2%2F%E9%81%8E%E5%8E%BB%E3%83%AD%E3%82%B0_2004%E5%B9%B411%E6%9C%88 and the bad UTF-8 is visible in the wikitext source. In the Parsoid/JS port the page source would get cleaned up by the action API when we fetched it, and the standalone-mode Parsoid works fine for the same reason:

$ php bin/parse.php --restURL  /w/rest.php/ja.wikipedia.org/v3/page/pagebundle/Wikipedia%3A%E5%89%8A%E9%99%A4%E8%A8%98%E9%8C%B2%2F%E9%81%8E%E5%8E%BB%E3%83%AD%E3%82%B0_2004%E5%B9%B411%E6%9C%88/2368296

But in integrated mode we get the source wikitext direct from the DB with the bad UTF-8 right there, and the tokenizer's assertion catches this pretty early.

Fix is probably (a) clean the DB, (b) find how this managed to get stored in the DB and fix it, and (c) clean bad UTF-8 when the wikitext source gets fetched, logging a warning with the page title instead of crashing.

Jonesey95 added a comment.EditedOct 21 2020, 4:11 AM

Sorry, that's not how formatnum works. It takes a PHP numeric string and formats it for human consumption, as per http://unicode.org/reports/tr35/tr35-numbers.html#Number_Format_Patterns and https://www.php.net/manual/en/class.numberformatter.php. It is intended for making the output of computed values more human-friendly.
This is how it has always worked, it just used to be more tolerant and pass through non-numeric characters (like U+2212) unmodified, bypassing localization and user numeric preferences. Please open a different phab task if you would like to change it (I'd be supportive of making the *output* of formatnum use a U+2212 for instance).

The error is not in formatnum; it's in the error-detection code. {{formatnum:-9.1}} (using a hyphen-minus, which is a non-numeric character) does not throw an error. I guess I'll just try to document this behavior and let someone else deal with it. We'll probably end up having to roll our own number formatting template at en.WP to replicate the old, error-message-free behavior of formatnum. That seems like a waste of volunteer resources.

As for your being supportive of having the output of formatnum generate an actual minus sign, that is good news. I believe that the 14-year-old feature request is T10327.

@Jonesey95 the issue is that the underlying unicode number-formatting code doesn't recognize a non-ascii sign there. As we move into the global templates era, where the same template code is going to (hopefully) be shared by multiple wikis, it's important to clearly document the expected input and output so that localization works correctly. If you use a non-ascii minus sign, it will get copied literally to the output, but it will not be understood as a minus sign and thus not localized. This means that (for instance) an Arabic ALM character will be generated in the wrong place -- between the U+2212 and the number, instead of before the minus sign. As it turns out, we strip the ALM/RLM/LRM characters currently, but this set of bug fixes is motivated by issues in commafy where folks were passing non-numeric data to commafy and it ended up generating bad UTF-8 as a result -- but only on wikis which didn't use the standard "3 digit grouping" rules for commas. Those sorts of bugs are hard to identify because they disproportionately affect non-US/non-Euro wikis. A bit of strictness here helps ensure we can adequately support wikis in all languages, even if they have "unusual" requirements for how negative numbers are formatted, or how commas are inserted, etc.

A common mistake is taking the output of formatnum and feeding it back to formatnum or to expr or whatnot. Again, that "usually" works when the numbers are small and you're in the US locale, but breaks horribly if you have 2-digit comma grouping, or a fractional part and swapped comma and period formatting. I guess a good argument for T10327: Language::formatNum() should prefix negative values with − (minus sign U+2212) is that it will break these uses even more often (by breaking them if the number is negative), which in theory helps us keep the inputs clean (and thus localizable...)

Incidentally, one of the options to the unicode number formatting code is the "accountant style" use of parenthesis for negative numbers, like (123,421). That's another example of how the formatted *output* from formatnum should be distinguished from the *input* to formatnum.

cscott closed this task as Resolved.Oct 21 2020, 1:37 PM

Resolving this issue, which was bad UTF-8 generated by commafy (indirectly from formatnum). Opened new task T266129: Invariant failed: Bad UTF-8 (full string verification) -- bad UTF-8 from database for the case which @thcipriani described in T237467#6543711 as that is a different root cause (bad UTF-8 in wikitext stored in the DB). As @Jonesey95 pointed out, T10327: Language::formatNum() should prefix negative values with − (minus sign U+2212) already exists for the issue of U+2212 in the *output* of formatnum; I suppose that's a reasonable place to continue the discussion of U+2212 in the *input* of formatnum as well if more needs to be said on that.

For the purposes of a potential item in Tech News, please could someone clarify whether this is a bug-fix that all the wikis need to learn about? I've tried to understand it by skimming the thread, but I am unable.
If you believe it does need to be communicated broadly, then please help by drafting 1-2 sentences explaining how, and how much, did it impact users. E.g. "There was a problem with X that resulted in Y. This has now been fixed." Note: we try to only send out information about the most high-impact bugs, that affected many wiki communities, so as to not deluge everyone with too much information. (cf. contribution docs)
If you believe it does not need to be communicated broadly, then just remove the #user-notice tag!
Much thanks.

Krinkle removed a subscriber: Krinkle.Oct 22 2020, 11:59 PM

@Quiddity worth notifying the wikis about the new Category:Pages with non-numeric formatnum arguments. The English wiki page has a good description of what causes pages to be added to this category and how to fix them; thanks x100 to @Jonesey95 for writing that up so well. The initiating bug (T237467) was some poorly-written code dealing with languages with non-default numeric grouping, which could cause invalid UTF-8 when fed non-numeric data (which it wasn't expecting). Cleaning up this crufty code (in part by more-rigorously defining expected inputs to formatNum/commafy etc) allowed (at the end of a long chain of clean up patches) https://gerrit.wikimedia.org/r/c/mediawiki/core/+/384006, which brings mediawiki's number formatting up to date with TR35 and latest CLDR using the native i18n features of PHP7 (T167088).

I think the user-visible change is that {{formatnum}} can now handle numbers with fractional parts (ie, decimals) on wikis which use non-standard digit grouping and/or minimum group size. That's probably arq/ar/be_tarask/ml/nl/pl at least (just judging from the test cases updated in ce8d0e9599a84565d53965481d1c163a90c4e6dd).

@cscott From interwiki links, I see a few are localized. Should that be done onwiki or at translatewiki?
Ah, yes, after searching down various rabbit-holes I found it at https://translatewiki.net/wiki/Special:Translations?message=Nonnumeric-formatnum&namespace=8
Aha, and from that, I found the page where these are meant to be documented! https://www.mediawiki.org/wiki/Help:Tracking_categories - It is not currently listed there so please add it! :-)
For Tech News, I've spent over an hour wrestling with how much to say, and it's still very complex yet missing details, but I need to freeze this week's edition for translation, so I will put it in next week's edition (for delivery 2 November) - please amend or improve the item at https://meta.wikimedia.org/wiki/Tech/News/2020/45 or comment on the talkpage there. Thanks!

Quiddity updated the task description. (Show Details)Oct 29 2020, 9:16 PM
Dalba awarded a token.Tue, Nov 3, 12:21 PM
Lofhi added a subscriber: Lofhi.EditedTue, Nov 10, 10:02 PM

Is it normal than a number under its scientific notation is invalid now? Example: 1.0E-6 is invalid. It seems that it was not the case before, because the regex was only used if the value was not a numeric for PHP with is_numeric. Now, even if the scientific notation is valid, the function getLegacyFormatNum adds the category because one test is done before. I.e: {{formatnum:{{#expr:0.000001}}}}` breaks. Am I talking nonsense?

Nope, good catch. Patch incoming.

Change 640575 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] language: Don't add formatNum tracking category for #s in exponential notation

https://gerrit.wikimedia.org/r/640575

Change 640575 merged by jenkins-bot:
[mediawiki/core@master] language: Don't add formatNum tracking category for #s in exponential notation

https://gerrit.wikimedia.org/r/640575