
Bad UTF-8 content in database: UTF-8 errors in PegTokenizer
Open, Medium, Public, PRODUCTION ERROR

Description

Error message
Invariant failed: Bad UTF-8 (full string verification)
Notes

The bad UTF-8 can also be seen directly in the source for:
https://ja.wikipedia.org/wiki/Wikipedia:%E5%89%8A%E9%99%A4%E8%A8%98%E9%8C%B2/%E9%81%8E%E5%8E%BB%E3%83%AD%E3%82%B0_2004%E5%B9%B411%E6%9C%88

This task is forked from T237467, which turned out to be caused by Language::commafy generating bad UTF-8. In this task, by contrast, the bad UTF-8 comes directly from the DB. As described in T237467#6566785, we need the following mitigations:

  1. Bad UTF-8 is not supposed to get past the pre-save transform (PST) and be stored in the DB in the first place. We need to track down how it got in and clean it up, and perhaps also clean up other articles that were saved with bad UTF-8.
  2. Fix core to plug this hole so that bad UTF-8 is not stored in the DB.
  3. Validate the wikitext source we get from the DB and repair any bad UTF-8 in it, downgrading this from a crasher to a warning. (The assertion is still appropriate if we encounter bad UTF-8 later in the pipeline, since that would have been generated by Parsoid from valid inputs; Parsoid operates under the assumption that all of its inputs are valid.)
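
As a rough illustration of mitigation 3 (and of the scanning needed for mitigation 1), here is a minimal sketch in Python rather than in Parsoid's PHP. The function names, the logger name, and the choice to substitute U+FFFD for invalid bytes are all assumptions made for this example. This is not Parsoid's actual implementation, and the real fix may repair the input differently.

```python
import logging
from typing import Optional, Tuple

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("utf8_guard")


def first_invalid_offset(raw: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def sanitize_wikitext(raw: bytes, rev_id: int) -> Tuple[str, bool]:
    """Decode wikitext bytes fetched from storage.

    Returns (text, was_valid). Invalid sequences are replaced with
    U+FFFD and a warning is logged instead of raising, which mirrors
    the "downgrade from a crasher to a warning" idea.
    """
    offset = first_invalid_offset(raw)
    if offset is None:
        return raw.decode("utf-8"), True
    log.warning("Bad UTF-8 in revision %d at byte offset %d", rev_id, offset)
    return raw.decode("utf-8", errors="replace"), False
```

Running first_invalid_offset over stored revision text would flag candidates for cleanup, while sanitize_wikitext shows the read-side guard: the strict assertion can stay in place further down the pipeline because everything past this point is valid UTF-8.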

Details

Request ID
f6017f38-e1bf-4bf5-a3a1-5388d1acfa6e
Request URL
/w/rest.php/ja.wikipedia.org/v3/page/pagebundle/Wikipedia%3A%E5%89%8A%E9%99%A4%E8%A8%98%E9%8C%B2%2F%E9%81%8E%E5%8E%BB%E3%83%AD%E3%82%B0_2004%E5%B9%B411%E6%9C%88/2368296
Stack Trace
#0 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(258): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/PegTokenizer.php(115): Wikimedia\Parsoid\Utils\PHPUtils::assertValidUTF8(string)
#2 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\PegTokenizer->processChunkily(string, array)
#3 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#4 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/TokenTransformManager.php(189): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#5 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/HTML5TreeBuilder.php(420): Wikimedia\Parsoid\Wt2Html\TokenTransformManager->processChunkily(string, array)
#6 [internal function]: Wikimedia\Parsoid\Wt2Html\HTML5TreeBuilder->processChunkily(string, array)
#7 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/DOMPostProcessor.php(900): Generator->current()
#8 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(152): Wikimedia\Parsoid\Wt2Html\DOMPostProcessor->processChunkily(string, array)
#9 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipeline.php(202): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseChunkily(string, array)
#10 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Wt2Html/ParserPipelineFactory.php(299): Wikimedia\Parsoid\Wt2Html\ParserPipeline->parseToplevelDoc(string, array)
#11 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Core/WikitextContentModelHandler.php(81): Wikimedia\Parsoid\Wt2Html\ParserPipelineFactory->parse(string)
#12 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(161): Wikimedia\Parsoid\Core\WikitextContentModelHandler->toDOM(Wikimedia\Parsoid\Config\Env)
#13 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/src/Parsoid.php(193): Wikimedia\Parsoid\Parsoid->parseWikitext(MWParsoid\Config\PageConfig, array)
#14 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(588): Wikimedia\Parsoid\Parsoid->wikitext2html(MWParsoid\Config\PageConfig, array, NULL)
#15 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/PageHandler.php(88): MWParsoid\Rest\Handler\ParsoidHandler->wt2html(MWParsoid\Config\PageConfig, array)
#16 /srv/mediawiki/php-1.36.0-wmf.11/vendor/wikimedia/parsoid/extension/src/Rest/Handler/ParsoidHandler.php(1047): MWParsoid\Rest\Handler\PageHandler->realExecute()
#17 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(381): MWParsoid\Rest\Handler\ParsoidHandler->execute()
#18 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/Router.php(316): MediaWiki\Rest\Router->executeHandler(MWParsoid\Rest\Handler\PageHandler)
#19 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(155): MediaWiki\Rest\Router->execute(MediaWiki\Rest\RequestFromGlobals)
#20 /srv/mediawiki/php-1.36.0-wmf.11/includes/Rest/EntryPoint.php(119): MediaWiki\Rest\EntryPoint->execute()
#21 /srv/mediawiki/php-1.36.0-wmf.11/rest.php(31): MediaWiki\Rest\EntryPoint::main()
#22 /srv/mediawiki/w/rest.php(3): require(string)
#23 {main}

Event Timeline

ssastry triaged this task as Medium priority. Oct 22 2020, 9:08 PM
ssastry moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.
ssastry renamed this task from "Invariant failed: Bad UTF-8 (full string verification) -- bad UTF-8 from database" to "Bad UTF-8 content in database: UTF-8 errors in PegTokenizer". Dec 9 2020, 2:09 PM