
"Invariant failed: Bad UTF-8 (full string verification)" due to content in database (from Pasoid PegTokenizer)
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error message
Invariant failed: Bad UTF-8 (full string verification)
Notes

The bad UTF-8 can also be seen directly in the source for:
https://ja.wikipedia.org/wiki/Wikipedia:%E5%89%8A%E9%99%A4%E8%A8%98%E9%8C%B2/%E9%81%8E%E5%8E%BB%E3%83%AD%E3%82%B0_2004%E5%B9%B411%E6%9C%88

This task is forked from T237467, which ended up being an issue with Language::commafy generating bad UTF-8. In contrast, in this task the bad UTF-8 is coming directly from the DB. As described in T237467#6566785, we need the following mitigations:

  1. Bad UTF-8 is not supposed to make it past the pre-save transform (PST) and get stored in the DB in the first place. So we need to track down how it got in there and clean it up, and perhaps also clean up other articles that managed to get saved with bad UTF-8.
  2. Fix core to plug this hole so that bad UTF-8 is not stored in the DB.
  3. Validate wikitext source we get from the DB and fix up bad UTF-8 we get, downgrading this from a crasher to a warning. (The assertion is still appropriate if we encounter bad UTF-8 later, since that would be generated by Parsoid from valid inputs; but Parsoid operates under the assumption that all of its inputs are valid.)
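
Mitigation 3 can be sketched roughly as follows. This is a minimal Python illustration only, not Parsoid's PHP code: the function name scrub_wikitext, the logger name, and the choice to substitute U+FFFD for invalid sequences are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("utf8-sketch")  # hypothetical logger name


def scrub_wikitext(raw: bytes, page: str) -> str:
    """Return text for `raw`, logging a warning instead of failing on bad UTF-8."""
    try:
        # Strict decoding raises UnicodeDecodeError on any malformed sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Downgrade from a crasher to a warning, then repair the input by
        # replacing each invalid sequence with U+FFFD.
        logger.warning(
            "Bad UTF-8 in %s at byte offset %d; replacing invalid sequences",
            page,
            err.start,
        )
        return raw.decode("utf-8", errors="replace")
```

With input cleaned up at the boundary like this, a later assertion on UTF-8 validity would only fire for bad output produced from valid input, which is the case the assertion is meant to catch.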

Details

Request ID
f6017f38-e1bf-4bf5-a3a1-5388d1acfa6e
Request URL
/w/rest.php/ja.wikipedia.org/v3/page/pagebundle/Wikipedia%3A%E5%89%8A%E9%99%A4%E8%A8%98%E9%8C%B2%2F%E9%81%8E%E5%8E%BB%E3%83%AD%E3%82%B0_2004%E5%B9%B411%E6%9C%88/2368296
Stack Trace
#0 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(258): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(115): Wikimedia\Parsoid\Utils\PHPUtils::assertValidUTF8(string)
#2 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\PegTokenizer->processChunkily(string, array)
#3 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#4 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#5 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/HTML5TreeBuilder.php(420): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#6 [internal function]: Wikimedia\Parsoid\Wt2Html\HTML5TreeBuilder->processChunkily(string, array)
#7 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(900): Generator->current()
#8 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(152): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#9 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(202): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#10 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseToplevelDoc(string, array)
#11 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Core/WikitextContentModelHandler.php(81): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#12 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(161): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM(Wikimedia\Parsoid\Config\Env)
#13 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(193): Wikimedia\Parsoid\Parsoid->parseWikitext(MWParsoid\Config\PageConfig, array)
#14 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(588): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#15 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(88): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#16 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(1047): MWParsoid\Rest\Handler\PageHandler->realExecute()
#17 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(381): MWParsoid\Rest\Handler\ParsoidHandler->execute()
#18 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(316): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#19 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(155): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#20 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(119): MediaWiki\Rest\EntryPoint->execute()
#21 /srv/mediawiki/php-1.36.0-wmf.11/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#22 /srv/mediawiki/w/rest.php(3): require(string)
#23 {main}

Event Timeline

ssastry triaged this task as Medium priority. Oct 22 2020, 9:08 PM
ssastry moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.
ssastry renamed this task from Invariant failed: Bad UTF-8 (full string verification) -- bad UTF-8 from database to Bad UTF-8 content in database: UTF-8 errors in PegTokenizer. Dec 9 2020, 2:09 PM
Krinkle renamed this task from Bad UTF-8 content in database: UTF-8 errors in PegTokenizer to "Invariant failed: Bad UTF-8 (full string verification)" due to content in database (from Pasoid PegTokenizer). Sep 8 2022, 4:26 PM
Krinkle subscribed.

The Logstash query message:"Bad UTF-8" AND "full string verification" on the mediawiki-errors dashboard shows that there is still a trickle of these errors, roughly one every other day. This suggests that a subset of pages continues to be inaccessible when read or edited via Parsoid.

Well, that page doesn't render with the legacy parser either, and logstash tells us: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded. Here is a partial trace:

from /srv/mediawiki/php-1.40.0-wmf.5/includes/MagicWordArray.php(319)
#0 /srv/mediawiki/php-1.40.0-wmf.5/includes/parser/Parser.php(4101): MagicWordArray->matchAndRemove(string)
#1 /srv/mediawiki/php-1.40.0-wmf.5/includes/parser/Parser.php(1624): Parser->handleDoubleUnderscore(string)
#2 /srv/mediawiki/php-1.40.0-wmf.5/includes/parser/Parser.php(712): Parser->internalParse(string)
#3 /srv/mediawiki/php-1.40.0-wmf.5/includes/content/WikitextContentHandler.php(301): Parser->parse(string, Title, ParserOptions, boolean, boolean, integer)
...

This one is genuinely bad content in the database that needs to be fixed up.

Other than this, there are 32 logstash entries in the last 3 months, from zhwiki, arwiki, and viwikisource. So, definitely much better than before.

Well, that page doesn't render with the legacy parser either, and logstash tells us: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded. Here is a partial trace:
..
This one is genuinely bad content in the database that needs to be fixed up.

Someone with edit rights on the page can at least fix it up (it dates from 2004!). There are bad UTF-8 characters on this line, around <li>12.43, 8 mai 2004 [[Brûker:Robbot|Robbot]] "[[:Ofbyld:Test.png|Test.png]]" oanbean <em>(Tes ...., and similarly on the next line. This won't do anything for the previous revision, but it will at least let this page render and eliminate the Parsoid UTF-8 errors as well.
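
For anyone doing that cleanup, a small helper that reports where the invalid bytes sit can make them easier to find. The following is a minimal Python sketch and not an existing MediaWiki tool; the name find_invalid_utf8 is made up for illustration, and it assumes the raw revision text is available as bytes.

```python
def find_invalid_utf8(raw: bytes) -> list[tuple[int, bytes]]:
    """Return (byte offset, offending bytes) for each invalid UTF-8 sequence."""
    bad = []
    pos = 0
    while pos < len(raw):
        try:
            # Strictly decode the remainder; success means nothing else is wrong.
            raw[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift by pos.
            start, end = pos + err.start, pos + err.end
            bad.append((start, raw[start:end]))
            pos = end  # resume scanning just past the invalid sequence
    return bad
```

For example, find_invalid_utf8(b"Br\xfbker") reports a single invalid byte at offset 2. Those bytes are a hypothetical sample, not the actual stored content of the page. Re-slicing on each error keeps the code short but is quadratic in the worst case, which should be acceptable for a one-off check of a single page.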

MSantos claimed this task.
MSantos subscribed.

This seems to be resolved; please reopen if I'm missing something.