
WrapTemplates: UTF-8 errors
Open, Medium / Public / PRODUCTION ERROR

Description

Error

MediaWiki version: 1.36.0-wmf.20

message
Invariant failed: Bad UTF-8 at start of string

Impact

Notes

The causes are varied. In some cases, broken markup in templates triggers this because Parsoid cannot gracefully recover from the resulting mess. In other cases, edge-case bugs in Parsoid (see T277415) trigger it.

Details

Request ID
X9C2UApAIOYAAB4xo5wAAAAO
Request URL
https://ar.wikipedia.org/w/rest.php/ar.wikipedia.org/v3/page/pagebundle/%D9%86%D9%82%D8%A7%D8%B4%3A%D9%86%D9%88%D8%B1_%D8%A7%D9%84%D8%AF%D9%8A%D9%86_%D8%AC%D9%87%D8%A7%D9%86%D9%83%D9%8A%D8%B1/51013602
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(192): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Tokens/SourceRange.php(82): Wikimedia\Parsoid\Utils\PHPUtils::safeSubstr(string, integer, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(1082): Wikimedia\Parsoid\Tokens\SourceRange->substr(string)
#3 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(1245): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates::encapsulateTemplates(DOMDocument, Wikimedia\Parsoid\Wt2Html\PageConfigFrame, array, array)
#4 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(1258): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates::wrapTemplatesInTree(DOMDocument, Wikimedia\Parsoid\Wt2Html\PageConfigFrame, DOMElement)
#5 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(159): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates->run(Wikimedia\Parsoid\Config\Env, DOMElement, array, boolean)
#6 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(857): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->Wikimedia\Parsoid\Wt2Html\{closure}(DOMElement, array, boolean)
#7 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(907): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->doPostProcess(DOMElement)
#8 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(924): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->process(DOMElement)
#9 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(174): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#10 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(235): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#11 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseToplevelDoc(string, array)
#12 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Core/WikitextContentModelHandler.php(106): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#13 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Parsoid.php(162): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM(Wikimedia\Parsoid\Config\Env)
#14 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/src/Parsoid.php(194): Wikimedia\Parsoid\Parsoid->parseWikitext(MWParsoid\Config\PageConfig, array)
#15 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(589): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#16 /srv/mediawiki/php-1.36.0-wmf.20/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(88): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#17 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/Router.php(389): MWParsoid\Rest\Handler\PageHandler->execute()
#18 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/Router.php(316): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#19 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/EntryPoint.php(153): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#20 /srv/mediawiki/php-1.36.0-wmf.20/includes/Rest/EntryPoint.php(117): MediaWiki\Rest\EntryPoint->execute()
#21 /srv/mediawiki/php-1.36.0-wmf.20/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#22 /srv/mediawiki/w/rest.php(3): require(string)
#23 {main}

Event Timeline

ssastry triaged this task as Medium priority. Dec 9 2020, 2:05 PM

The wikitext on that arwiki page effectively has this pattern:

{{1x|<table><tr>''}}
{{1x|<tr><td>x</tr></td>}}
some string
{{1x|</table>}}

This messes up the DOM enough to interfere with DSR computation and template wrapping. In the arwiki case, template wrapping then tries to extract a substring using bogus offsets that do not fall on character boundaries, which triggers the UTF-8 error.
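To illustrate why a bogus offset surfaces as a UTF-8 error, here is a minimal Python sketch. It assumes the offsets are byte offsets into UTF-8 encoded source text. The `safe_substr` helper below is hypothetical and only mimics the kind of boundary check that an invariant such as "Bad UTF-8 at start of string" implies; it is not Parsoid's actual `PHPUtils::safeSubstr` code.

```python
def safe_substr(data: bytes, start: int, length: int) -> bytes:
    """Slice `data` by byte offsets, refusing to cut a UTF-8 character in half.

    Hypothetical helper for illustration only; not Parsoid's implementation.
    """
    def is_continuation(pos: int) -> bool:
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        return pos < len(data) and (data[pos] & 0xC0) == 0x80

    if is_continuation(start):
        # Message text taken from the error reported in this task.
        raise ValueError("Bad UTF-8 at start of string")
    end = start + length
    if is_continuation(end):
        # This second message is an assumption, added for symmetry.
        raise ValueError("Bad UTF-8 at end of string")
    return data[start:end]


# Arabic letters take two bytes each in UTF-8, so odd byte offsets
# inside an Arabic word land in the middle of a character.
source = "نقاش x".encode("utf-8")

good = safe_substr(source, 0, 4)  # covers the first two letters exactly
```

With ASCII-only wikitext every byte offset is also a character boundary, so bogus offsets would silently yield the wrong substring. On a wiki whose text is mostly multibyte characters, the same bug is much more likely to land mid-character and trip the check.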

If the '' is stripped from that first template, all is well. The stray '' opens an italic tag, and its auto-inserted closing tag ends up outside the template end meta marker, which then causes havoc downstream. We can see whether the QuoteTransformer can be made a bit smarter so that it does not auto-close the stray i-tag after a template end marker. But that does not address the core issue: broken wikitext that interacts badly with an HTML5 tree builder produces messed-up DOMs.
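As a rough illustration of the idea, the toy Python sketch below contrasts the two placements of the auto-inserted close tag. The token names, the flat list representation, and both functions are invented for this example; they do not reflect how the QuoteTransformer is actually implemented.

```python
TPL_START = "tpl-start"  # stand-in for a template start meta marker
TPL_END = "tpl-end"      # stand-in for a template end meta marker


def close_quotes_naive(line_tokens):
    """Append a closing italic tag at end of line if one is still open."""
    out = list(line_tokens)
    if out.count("<i>") > out.count("</i>"):
        out.append("</i>")
    return out


def close_quotes_before_marker(line_tokens):
    """Same, but place the auto-inserted close before trailing end markers."""
    out = list(line_tokens)
    if out.count("<i>") > out.count("</i>"):
        # Walk back over any template end markers at the end of the line.
        pos = len(out)
        while pos > 0 and out[pos - 1] == TPL_END:
            pos -= 1
        out.insert(pos, "</i>")
    return out


# Toy token stream for the first line of the reduced pattern above.
line = [TPL_START, "<table>", "<tr>", "<i>", TPL_END]

naive = close_quotes_naive(line)          # "</i>" lands after tpl-end
smarter = close_quotes_before_marker(line)  # "</i>" stays inside the template
```

In the naive variant the closing tag falls outside the template's range, which is the situation described above. The second variant keeps the stray italic contained, but it does nothing about the misnested table markup.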

The core problem to solve here is the one outlined in T191641: Graceful degradation of generated HTML in the face of template-wrapping, section-wrapping, or other edit-client-support failures.

For completeness, in case we want to fix the template on arwiki, the reduced arwiki wikitext that triggers the above UTF-8 error is:

{{ﺏﺩﺎﻳﺓ ﻖﺼﻳﺩﺓ}}
{{ﺐﻴﺗ|ﺍﺯ(ﻢﻧ) ﻢﻧ(ﺄﻧﺍ) ﻢﺗﺎﺑ(ﻻ ﺕﻮﻟ) ﺮﺧ(ﺎﻟﻮﺠﻫ) ﻚﻫ(ﺎﻟﺬﻳ) ﻦﻴﻣ(ﻦﺼﻓ) ﺐﯾ(ﺏﺩﻮﻧ) ﺕﻭ(ﺄﻨﺗ) ﻲﻛ(ﻭﺎﺣﺩ) ﻦﻔﺳ(ﺍﻼﺴﺘﻨﺷﺎﻗ/ﻥ)|ﻲﻛ(ﻭﺎﺣﺩ) ﺪﻟ(ﻖﻠﺑ) ﺶﻜﺴﺘﻧ(ﻚﺳﺭ) ﺕﻭ(ﺄﻨﺗ) ﺏ+ﺹﺩ(ﺏ+ﻡﺎﺋﺓ) ﺥﻮﻧ(ﺪﻣ) ﺏﺭﺎﺑﺭ(ﻢﺳﺍﻮﻳ) ﺎﺴﺗ(ﻲﻛﻮﻧ)}}
ﻢﻨﻳ ﻼﺗﻮﻠﻳ ﺎﻟﻮﺠﻫ ﺎﻟﺬﻳ ﺏﺩﻮﻨﻛ ﻦﺼﻓ ﺍﻼﺴﺘﻨﺷﺎﻗ ﺎﻟﻭﺎﺣﺩ=ﻻ ﺕﻮﻠﻳ ﻊﻠﻳ ﺎﻟﻮﺠﻫ ﺢﺘﯾ ﻞﻔﺗﺭﺓ ﻮﺠﻳﺯﺓ/ﻚﺳﺭ ﻖﻠﺒﻛ ﻡﺭﺓ ﻭﺎﺣﺩﺓ ﻲﻛﻮﻧ ﻢﺳﺍﻮﻳﺍ ﺐﻣﺎﺋﺓ ﺪﻣ
{{ﻦﻫﺎﻳﺓ ﻖﺼﻳﺩﺓ}}

That first template expands to:

{|style=" ; background-color:transparent; margin:auto auto;"

|-
''

So, if we go in and fix that template, we can avoid the UTF-8 error here.

But, on the Parsoid end, it is time to address T191641: Graceful degradation of generated HTML in the face of template-wrapping, section-wrapping, or other edit-client-support failures.

ssastry updated the task description.

Change 672161 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: ListHandler: when in EOL state, close lists always

https://gerrit.wikimedia.org/r/672161

Change 672161 merged by jenkins-bot:
[mediawiki/services/parsoid@master] ListHandler: when in EOL state, close lists always

https://gerrit.wikimedia.org/r/672161

Change 675310 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738

Change 675310 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675310

Change 675738 merged by jenkins-bot:
[mediawiki/vendor@wmf/1.36.0-wmf.37] Bump wikimedia/parsoid to 0.13.0-a30

https://gerrit.wikimedia.org/r/675738