
Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at start of string
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at start of string
exception.trace
from /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/assert/src/Assert.php(231)
#0 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(178): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/PP/Processors/DOMRangeBuilder.php(977): Wikimedia\Parsoid\Utils\PHPUtils::safeSubstr(string, integer, integer)
#2 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/PP/Processors/DOMRangeBuilder.php(1303): Wikimedia\Parsoid\Wt2Html\PP\Processors\DOMRangeBuilder->encapsulateTemplates(array)
#3 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(21): Wikimedia\Parsoid\Wt2Html\PP\Processors\DOMRangeBuilder->execute(Wikimedia\Parsoid\DOM\Element)
#4 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(157): Wikimedia\Parsoid\Wt2Html\PP\Processors\WrapTemplates->run(Wikimedia\Parsoid\Config\Env, Wikimedia\Parsoid\DOM\Element, array, boolean)
#5 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(868): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->Wikimedia\Parsoid\Wt2Html\{closure}(Wikimedia\Parsoid\DOM\Element, array, boolean)
#6 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(909): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->doPostProcess(Wikimedia\Parsoid\DOM\Element)
#7 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(927): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->process(Wikimedia\Parsoid\DOM\Element)
#8 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(180): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#9 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#10 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Wikitext/ContentModelHandler.php(130): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#11 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Parsoid.php(174): Wikimedia\Parsoid\Wikitext\ContentModelHandler->toDOM(Wikimedia\Parsoid\Ext\ParsoidExtensionAPI)
#12 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/src/Parsoid.php(216): Wikimedia\Parsoid\Parsoid->parseWikitext(MediaWiki\Parser\Parsoid\Config\PageConfig, ParserOutput, array)
#13 /srv/mediawiki/php-1.41.0-wmf.4/includes/parser/Parsoid/ParsoidOutputAccess.php(298): Wikimedia\Parsoid\Parsoid->wikitext2html(MediaWiki\Parser\Parsoid\Config\PageConfig, array, NULL, ParserOutput)
#14 /srv/mediawiki/php-1.41.0-wmf.4/includes/parser/Parsoid/ParsoidOutputAccess.php(465): MediaWiki\Parser\Parsoid\ParsoidOutputAccess->parseInternal(MediaWiki\Parser\Parsoid\Config\PageConfig, array)
#15 /srv/mediawiki/php-1.41.0-wmf.4/includes/parser/Parsoid/ParsoidOutputAccess.php(244): MediaWiki\Parser\Parsoid\ParsoidOutputAccess->parse(MediaWiki\Page\PageStoreRecord, ParserOptions, array, MediaWiki\Revision\RevisionStoreRecord)
#16 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/Handler/Helper/HtmlOutputRendererHelper.php(705): MediaWiki\Parser\Parsoid\ParsoidOutputAccess->getParserOutput(MediaWiki\Page\PageStoreRecord, ParserOptions, MediaWiki\Revision\RevisionStoreRecord, integer)
#17 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/Handler/Helper/HtmlOutputRendererHelper.php(532): MediaWiki\Rest\Handler\Helper\HtmlOutputRendererHelper->getParserOutputInternal(ParserOptions)
#18 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/Handler/Helper/HtmlOutputRendererHelper.php(628): MediaWiki\Rest\Handler\Helper\HtmlOutputRendererHelper->getParserOutput()
#19 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/Handler/ParsoidHandler.php(913): MediaWiki\Rest\Handler\Helper\HtmlOutputRendererHelper->getPageBundle()
#20 /srv/mediawiki/php-1.41.0-wmf.4/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(92): MediaWiki\Rest\Handler\ParsoidHandler->wt2html(MediaWiki\Parser\Parsoid\Config\PageConfig, array)
#21 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/Router.php(517): MWParsoid\Rest\Handler\PageHandler->execute()
#22 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/Router.php(422): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#23 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/EntryPoint.php(195): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#24 /srv/mediawiki/php-1.41.0-wmf.4/includes/Rest/EntryPoint.php(135): MediaWiki\Rest\EntryPoint->execute()
#25 /srv/mediawiki/php-1.41.0-wmf.4/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#26 /srv/mediawiki/w/rest.php(3): require(string)
#27 {main}
Impact
Notes

Details

Request URL
https://ps.wikipedia.org/w/rest.php/ps.wikipedia.org/v3/page/pagebundle/%D9%84%D9%8A%D9%8A%D8%AA/295634

Event Timeline

Reproducible locally as below:

~/work/wmf/parsoid (master ✘)✭ ᐅ php bin/parse.php --domain ps.wikipedia.org --pageName لييت --oldid 295634 < /dev/null
Wikimedia\Assert\InvariantException from line 231 of /home/subbu/work/wmf/parsoid/vendor/wikimedia/assert/src/Assert.php: Invariant failed: Bad UTF-8 at start of string
#0 /home/subbu/work/wmf/parsoid/src/Utils/PHPUtils.php(178): Wikimedia\Assert\Assert::invariant()
#1 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/DOMRangeBuilder.php(977): Wikimedia\Parsoid\Utils\PHPUtils::safeSubstr()
#2 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/DOMRangeBuilder.php(1303): Wikimedia\Parsoid\Wt2Html\PP\Processors\DOMRangeBuilder->encapsulateTemplates()
#3 /home/subbu/work/wmf/parsoid/src/Wt2Html/PP/Processors/WrapTemplates.php(21): Wikimedia\Parsoid\Wt2Html\PP\Processors\DOMRangeBuilder->execute()
...

Probably another bad-dsr edge case.

This is reproducible and, after a brief investigation, it looks like a result of fostered content in a table causing bad DSR values for the table. To be continued.

This bug essentially boils down to a difference in how the HTML5 parsing spec fosters text versus elements out of a table! You can try this in your browser. Create a file with each of the HTML strings below. Open the file, inspect the body element and dump its inner HTML.

Consider this html string:

<table><tbody><tr>
foo

<td>x</td>
</tr></tbody></table>

When that string is parsed to a DOM and then serialized, the output string is:

foo

<table><tbody><tr><td>x</td>
</tr></tbody></table>

Notice how all the empty newlines following foo were also fostered out.

But consider this HTML string:

<table><tbody><tr>
<i>foo</i>

<td>x</td>
</tr></tbody></table>

When this string is parsed to a DOM and then serialized, the output string is:

<i>foo</i><table><tbody><tr>


<td>x</td>
</tr></tbody></table>

Now, see how only the i-tag is fostered out but the two newlines were left behind!!

That, in a nutshell, is the source of trouble. Those newlines mess up the DSR computation which then downstream leads to invalid offsets when trying to extract substrings of a multi-byte utf-8 string.
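
To illustrate why an offset that is off by even one byte trips the invariant, here is a minimal Python sketch. It assumes the offsets are byte offsets into the UTF-8 source. The helpers `is_utf8_boundary` and `safe_substr` are hypothetical stand-ins written for this example; they are not Parsoid's `PHPUtils::safeSubstr`, only something that fails in a similar way.

```python
def is_utf8_boundary(data: bytes, offset: int) -> bool:
    """Return True if offset does not point into the middle of a UTF-8 character."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def safe_substr(data: bytes, start: int, length: int) -> bytes:
    """Hypothetical checked substring (illustration only, not Parsoid's code)."""
    end = start + length
    if not is_utf8_boundary(data, start):
        raise ValueError("Bad UTF-8 at start of string")
    if not is_utf8_boundary(data, end):
        raise ValueError("Bad UTF-8 at end of string")
    return data[start:end]


# 'é' is two bytes in UTF-8 (0xC3 0xA9) and starts at byte offset 3 here.
source = "a\n\né\n\nx".encode("utf-8")

good = safe_substr(source, 3, 2).decode("utf-8")  # correct offset: 'é'

try:
    safe_substr(source, 4, 2)  # offset shifted by one byte
    error = None
except ValueError as exc:
    error = str(exc)  # "Bad UTF-8 at start of string"
```

With ASCII-only text every byte offset is a character boundary, so a miscounted newline would silently yield the wrong substring instead of raising.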

To fix this, we'll need to make the DSR algorithm more robust.

My phab skills are not helping right now, but this is another reason to explore possibilities where we catch exceptions in DOM passes that are optional for wt2html read-view HTML generation and just mark the page with error info that prevents it from being opened for editing in VE clients and also signals to users that meaningful semantic markup is missing.

The page in question also has garbage wikitext that should be fixed separately. At some point, trying to catch all the ways broken wikitext is used on pages becomes a game of diminishing returns.

Here is the smallest piece of convoluted wikitext (which mimics that found on the pswiki page actually!) that reproduces the problem:

{|
|-

é

''x''

* {{1x|
{{{!}}
{{!}}-
}}

The multi-byte char on line 4 with the accent is required for a bad utf-8 offset to be exercised.
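
As a rough check of that claim, the following Python sketch lists the byte offsets in the first few lines of the snippet that fall inside a character. The function name `mid_char_offsets` is made up for this example, and treating the offsets as byte offsets into the UTF-8 source is an assumption.

```python
def mid_char_offsets(text: str) -> list:
    """Byte offsets that land on a UTF-8 continuation byte (10xxxxxx)."""
    data = text.encode("utf-8")
    return [i for i, b in enumerate(data) if (b & 0xC0) == 0x80]


# First lines of the repro, with and without the accented character.
repro = "{|\n|-\n\né\n\n''x''\n"
ascii_variant = repro.replace("é", "e")

with_accent = mid_char_offsets(repro)             # [8]: second byte of 'é'
without_accent = mid_char_offsets(ascii_variant)  # []: every offset is a boundary
```

In the ASCII variant there is no offset at which a substring could start mid-character, so the same DSR error would go unnoticed by a UTF-8 boundary check.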

Change 939396 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] MigrateTemplateMarkerMetas: Update fostered markers for metas

https://gerrit.wikimedia.org/r/939396

ssastry triaged this task as Medium priority. Jul 20 2023, 4:09 AM

Change 939396 merged by jenkins-bot:

[mediawiki/services/parsoid@master] MigrateTemplateMarkerMetas: Update fostered markers for metas

https://gerrit.wikimedia.org/r/939396

Change 940966 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a18

https://gerrit.wikimedia.org/r/940966

Change 940966 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a18

https://gerrit.wikimedia.org/r/940966