PegTokenizer: UTF-8 errors
Open, Medium | Public | PRODUCTION ERROR

Description

Error

MediaWiki version: 1.36.0-wmf.20

message
Invariant failed: Bad UTF-8 (full string verification)

Impact

Notes

Details

Request ID
X9CFMgpAAPAAADoVRCQAAABD
Request URL
https://fr.wikipedia.org/w/rest.php/fr.wikipedia.org/v3/page/pagebundle/Isil_(crat%C3%A8re_martien)/175673786
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(258): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(89): Wikimedia\Parsoid\Utils\PHPUtils::assertValidUTF8(string)
#2 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(136): Wikimedia\Parsoid\Wt2Html\PegTokenizer->process(string, array)
#3 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Utils/PipelineUtils.php(113): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parse(string, array)
#4 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/TT/TemplateHandler.php(685): Wikimedia\Parsoid\Utils\PipelineUtils::processContentInPipeline(Wikimedia\Parsoid\Config\Env, Wikimedia\Parsoid\Wt2Html\PageConfigFrame, string, array)
#5 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/TT/TemplateHandler.php(1525): Wikimedia\Parsoid\Wt2Html\TT\TemplateHandler->processTemplateSource(array, array, string)
#6 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/TT/TemplateHandler.php(1577): Wikimedia\Parsoid\Wt2Html\TT\TemplateHandler->onTemplate(Wikimedia\Parsoid\Tokens\SelfclosingTagTk)
#7 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/TT/TokenHandler.php(202): Wikimedia\Parsoid\Wt2Html\TT\TemplateHandler->onTag(Wikimedia\Parsoid\Tokens\SelfclosingTagTk)
#8 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(123): Wikimedia\Parsoid\Wt2Html\TT\TokenHandler->process(array)
#9 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(195): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunk(array)
#10 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(193): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#11 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/HTML5TreeBuilder.php(417): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#12 [internal function]: Wikimedia\Parsoid\Wt2Html\HTML5TreeBuilder->processChunkily(string, array)
#13 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(920): Generator->current()
#14 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(174): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#15 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(235): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#16 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseToplevelDoc(string, array)
#17 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Core/WikitextContentModelHandler.php(106): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#18 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Parsoid.php(162): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM(Wikimedia\Parsoid\Config\Env)
#19 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Parsoid.php(194): Wikimedia\Parsoid\Parsoid->parseWikitext(MWParsoid\Config\PageConfig, array)
#20 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(589): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#21 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(88): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#22 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/Router.php(389): MWParsoid\Rest\Handler\PageHandler->execute()
#23 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/Router.php(316): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#24 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/EntryPoint.php(153): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#25 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/EntryPoint.php(117): MediaWiki\Rest\EntryPoint->execute()
#26 /srv/mediawiki/php-1.36.0-wmf.20/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#27 /srv/mediawiki/w/rest.php(3): require(string)
#28 {main}

Event Timeline

ssastry triaged this task as Medium priority. Dec 9 2020, 2:07 PM

This is still happening. See reqId X9pJDApAMKEAAFOXG0UAAAAV

For the frwiki:Rheasilvia page, it is the infobox that is the problem:

{{Infobox Relief
 | nom        = 
 | image      = 
 | légende    = 
 | latitude   =
 | longitude  =
 | références =
 | région     =
 | type       = 
 | diamètre   =
 | hauteur    = 
 | pculminant =
 | dcaldeira  =
 | éponyme    = 
 | ±latitude  =
 | ±longitude =
}}

It looks very likely that we are receiving bad utf-8 from the template expansion. We should instrument the code on scandium to emit the expanded wikitext and inspect it.
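A minimal sketch of how the emitted wikitext could be inspected once it has been dumped, assuming the dump is available as raw bytes. The helper name `find_invalid_utf8` and the sample bytes are hypothetical illustrations, not part of Parsoid or of any instrumentation that exists on scandium.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes, context: int = 10) -> List[Tuple[int, bytes]]:
    """Return (byte offset, surrounding bytes) for each invalid UTF-8 sequence."""
    hits = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first invalid sequence in the tail.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            # Keep a few bytes on either side so the location is recognizable.
            hits.append((start, data[max(0, start - context):end + context]))
            pos = end  # resume scanning just past the bad sequence
    return hits


# Hypothetical sample: two stray continuation bytes in front of a latitude.
sample = b'value="-71.95">' + b"\x88\x92" + b"71,95"
for offset, snippet in find_invalid_utf8(sample):
    print(offset, snippet)
```

Reporting byte offsets with some context makes it easier to map the bad bytes back to the template or module call that produced them.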

Here is the dump of what we get from the preprocessor. The latitude values appear broken in the output. @cscott: does this look like another formatnum issue in a template / module?

<div class="infobox infobox_v3 large"><div class="entete " style="background-color:#FFDEAD;color:#000000"><div>Rheasilvia</div></div><div><div class="images" style="padding:2px 0">[[Fichier:Vesta from Dawn, July 17.jpg|frameless|upright=1.2]]</div><div class="legend">Vue de l'hémisphère sud de Vesta. On peut y voir le cratère Rheasilvia.</div></div><table><caption colspan="2" style="color:#000000;text-align:center;background-color:#FFDEAD">Géographie</caption><tr class=""><th scope="row">Astre</th><td class=""><div>
<span class="wd_p376">[[(4) Vesta|(4) Vesta]][[Category:Page utilisant P376]]<span class="noprint wikidata-linkback">[[File:Blue pencil.svg|Voir et modifier les données sur Wikidata|10px|baseline|class=noviewer|link=https://www.wikidata.org/wiki/Q2631008?uselang=fr#P376]]</span></span></div></td></tr><tr class=""><th scope="row">Coordonnées</th><td class=""><div>
<span class="wd_p625"><span class="plainlinks nourlexpansion" title="Cartes, vues aériennes, etc.">[http://tools.wmflabs.org/geohack/geohack.php?language=fr&pagename=Rheasilvia&params=71.95_S_86.3_E__globe:vesta<span class="h-geo geo-dec"><data class="p-latitude" value="-71.95">��71,95° S</data>, <data class="p-longitude" value="86.3">86,3° E</data><data class="p-globe" value="vesta"></data></span>]</span><indicator name="coordinates"><span id="coordinates" class="noprint"><span class="plainlinks nourlexpansion" title="Cartes, vues aériennes, etc.">[http://tools.wmflabs.org/geohack/geohack.php?language=fr&pagename=Rheasilvia&params=71.95_S_86.3_E__globe:vesta<span class="h-geo geo-dec"><data class="p-latitude" value="-71.95">��71,95° S</data>, <data class="p-longitude" value="86.3">86,3° E</data><data class="p-globe" value="vesta"></data></span>]</span></span></indicator>[[Category:Page géolocalisée par Wikidata]][[Category:Article géolocalisé extraterrestre]]<span class="noprint wikidata-linkback">[[File:Blue pencil.svg|Voir et modifier les données sur Wikidata|10px|baseline|class=noviewer|link=https://www.wikidata.org/wiki/Q2631008?uselang=fr#P625]]</span></span></div></td></tr><tr class=""><th scope="row">Diamètre</th><td class=""><div>
<span class="wd_p2386">450  km[[Category:Page utilisant P2386]]<span class="noprint wikidata-linkback">[[File:Blue pencil.svg|Voir et modifier les données sur Wikidata|10px|baseline|class=noviewer|link=https://www.wikidata.org/wiki/Q2631008?uselang=fr#P2386]]</span></span></div></td></tr></table><table><caption colspan="2" style="color:#000000;text-align:center;background-color:#FFDEAD">Géologie</caption><tr class=""><th scope="row">Type</th><td class=""><div>
<span class="wd_p31">[[Cratère d'impact|Cratère d'impact]][[Category:Page utilisant P31]]<span class="noprint wikidata-linkback">[[File:Blue pencil.svg|Voir et modifier les données sur Wikidata|10px|baseline|class=noviewer|link=https://www.wikidata.org/wiki/Q2631008?uselang=fr#P31]]</span></span></div></td></tr></table><table><caption colspan="2" style="color:#000000;text-align:center;background-color:#FFDEAD">Exploration</caption><tr class=""><th scope="row">Éponyme</th><td class=""><div>
<span class="wd_p138">[[Rhéa Silvia|Rhéa Silvia]][[Category:Page utilisant P138]]<span class="noprint wikidata-linkback">[[File:Blue pencil.svg|Voir et modifier les données sur Wikidata|10px|baseline|class=noviewer|link=https://www.wikidata.org/wiki/Q2631008?uselang=fr#P138]]</span></span></div></td></tr></table><div class="img_toogle"><div class="geobox" style="clear:right;word-wrap:break-word;align-items:center;width:auto;justify-content:space-around;text-align:center;max-width:99%;height:auto;line-height:1.4em;margin:0 0 0.5em 1em;font-size:0.9em">Localisation sur la carte de Vesta<table class="DebutCarte" cellpadding="0" border="0" cellspacing="0" style="border:none;padding:0;margin:0;width:auto"><tr><td><div style="text-align:right;width:100%;margin:auto;position:relative">[[File:Vesta map for GeoHack.png|frameless|280px|voir sur la carte de Vesta|class=noviewer]]<div style="top:89.972222222222%;left:73.972222222222%;border:none;position:absolute"><div style="top:-4px;position:absolute;line-height:0;left:-4px;width:8px">[[Fichier:Red pog.svg|8px|class=noviewer]]<span style="width:150px;text-align:left;position:absolute"></span></div></div></div></td></tr></table></div><span></span></div><p class="navbar noprint bordered " style="border-top:1px solid #FFDEAD"><span class="plainlinks" style="text-align:left">[//fr.wikipedia.org/w/index.php?title=Rheasilvia&veaction=edit&section=0 modifier] - [//fr.wikipedia.org/w/index.php?title=Rheasilvia&action=edit&section=0 modifier le code] - [[d:Q2631008|modifier Wikidata]]</span><span style="text-align:right">[[Fichier:Info Simple.svg|12px|link=Modèle:Infobox Relief|Documentation du modèle]]</span></p></div>[[Category:Page utilisant P18]][[Catégorie:Article utilisant l'infobox Relief]][[Catégorie:Article utilisant une Infobox]]

I traced this to the following code in https://fr.wikipedia.org/w/index.php?title=Module:Coordinates. So yes, the negative sign is coming out of formatNum as bad UTF-8.

function p.displaydec(latitude, longitude, format)
   lat = lang:formatNum( latitude )
   long = lang:formatNum( longitude )
...

I cannot imagine something like this is generically broken; we would see more widespread breakage. It appears that the associated coords are looked up on Wikidata, so maybe that is where something is wrong?

The infoboxes on https://fr.wikipedia.org/wiki/Vibidia_(crat%C3%A8re) and https://fr.wikipedia.org/wiki/Rheasilvia both display the tofu chars.
The original page on this bug report, https://fr.wikipedia.org/wiki/Isil_(crat%C3%A8re_martien), has the same problem,
as does https://fr.wikipedia.org/wiki/Darwin_(crat%C3%A8re_martien).

https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Crat%C3%A8re_d%27impact_sur_Mars has more pages, and I imagine that if we went through this list we'd find more instances.

So, maybe it is not wikidata data and is actually the formatting / processing code.

The Unicode minus sign is from formatnum. It shouldn't be getting chopped up into bad UTF-8, unless someone somewhere is doing a naive substr(1, ...) or something like that. I'll look.

I strongly suspect that someone is converting "-71.3" degrees to "71.3 S" by chopping off the first *byte*, instead of the first *character*.
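To illustrate the suspicion, here is a small sketch in Python (not the module's actual Lua code) of what happens when the leading Unicode minus sign (U+2212, three bytes in UTF-8) loses only its first byte. The function names are hypothetical.

```python
# Hypothetical reconstruction of the suspected bug; not the module's actual code.
MINUS = "\u2212"  # Unicode minus sign, encoded in UTF-8 as the 3 bytes E2 88 92


def chop_first_byte(formatted: str) -> bytes:
    """Drop the first *byte*, as a byte-oriented substring call would."""
    return formatted.encode("utf-8")[1:]


def chop_first_char(formatted: str) -> str:
    """Drop the first *character* (code point), which is what was intended."""
    return formatted[1:]


formatted = MINUS + "71,95"
broken = chop_first_byte(formatted)

print(broken)                                    # b'\x88\x9271,95'
print(broken.decode("utf-8", errors="replace"))  # two U+FFFD, then 71,95
print(chop_first_char(formatted))                # 71,95
```

The two orphaned continuation bytes (88 92) are consistent with the two replacement characters that precede "71,95° S" in the preprocessor dump.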

Yeah, this is a bug in the lua code. I've attempted to contact the author: https://fr.wikipedia.org/w/index.php?title=Discussion_module%3ACoordinates&type=revision&diff=177884216&oldid=173976505

Since most of the remaining cases of this issue are due to errors in lua string-handling code, I'll look into trying to put a UTF-8 validity check on the output of Scribunto, so that these sorts of errors can be turned into a tracking category and don't show up further down the line as production errors from Parsoid.
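A rough sketch of what such a check could look like, written in Python for illustration only. The function names and the tracking category title are assumptions; this is not Scribunto's API or an existing implementation.

```python
from typing import Tuple

# Hypothetical category title, used only for illustration.
TRACKING_CATEGORY = "[[Category:Pages with invalid UTF-8 in module output]]"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def sanitize_module_output(data: bytes) -> Tuple[str, bool]:
    """Return (text, flagged).

    Valid output passes through unchanged. Invalid output has each bad byte
    replaced with U+FFFD and gets a tracking category appended, so the
    problem is visible on-wiki instead of failing later in the parser.
    """
    if is_valid_utf8(data):
        return data.decode("utf-8"), False
    text = data.decode("utf-8", errors="replace")
    return text + TRACKING_CATEGORY, True


text, flagged = sanitize_module_output(b"\x88\x9271,95\xc2\xb0 S")
print(flagged)
print(text)
```

Replacing the bad bytes at the module boundary means the downstream tokenizer always receives valid UTF-8, while the category gives editors a list of pages whose modules need fixing.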

Looks like ptwiki has some similar module / template issue. The infobox in https://pt.wikipedia.org/wiki/Cam%C3%B5es_(cratera) has the same issue (and a parsoid log message in logstash). Just need to trace the module calls and figure out where this shows up.