Page MenuHomePhabricator

Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at start of string (parsoid/src/Html2Wt/SerializerState.php)
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
labels.normalized_message
[{reqId}] {exception_url}   Wikimedia\Assert\InvariantException: Invariant failed: Bad UTF-8 at start of string
error.stack_trace
from /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/assert/src/Assert.php(231)
#0 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Utils/PHPUtils.php(172): Wikimedia\Assert\Assert::invariant(boolean, string)
#1 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Tokens/SourceRange.php(85): Wikimedia\Parsoid\Utils\PHPUtils::safeSubstr(string, integer, integer)
#2 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/SerializerState.php(409): Wikimedia\Parsoid\Tokens\SourceRange->substr(string)
#3 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/Separators.php(899): Wikimedia\Parsoid\Html2Wt\SerializerState->getOrigSrc(Wikimedia\Parsoid\Tokens\SourceRange)
#4 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/SerializerState.php(571): Wikimedia\Parsoid\Html2Wt\Separators->buildSep(Wikimedia\Parsoid\DOM\Element)
#5 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/SerializerState.php(614): Wikimedia\Parsoid\Html2Wt\SerializerState->emitSepForNode(Wikimedia\Parsoid\DOM\Element)
#6 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/WikitextSerializer.php(1206): Wikimedia\Parsoid\Html2Wt\SerializerState->emitChunk(Wikimedia\Parsoid\Html2Wt\ConstrainedText\ConstrainedText, Wikimedia\Parsoid\DOM\Element)
#7 [internal function]: Wikimedia\Parsoid\Html2Wt\WikitextSerializer->serializeNodeInternal(Wikimedia\Parsoid\DOM\Element, Wikimedia\Parsoid\Html2Wt\DOMHandlers\HeadingHandler)
#8 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/WikitextSerializer.php(1331): call_user_func(array, Wikimedia\Parsoid\DOM\Element, Wikimedia\Parsoid\Html2Wt\DOMHandlers\HeadingHandler)
#9 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/SerializerState.php(718): Wikimedia\Parsoid\Html2Wt\WikitextSerializer->serializeNode(Wikimedia\Parsoid\DOM\Element)
#10 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/SerializerState.php(744): Wikimedia\Parsoid\Html2Wt\SerializerState->serializeChildren(Wikimedia\Parsoid\DOM\Element, NULL)
#11 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/WikitextSerializer.php(1595): Wikimedia\Parsoid\Html2Wt\SerializerState->kickOffSerialize(Wikimedia\Parsoid\DOM\Element)
#12 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Html2Wt/SelectiveSerializer.php(227): Wikimedia\Parsoid\Html2Wt\WikitextSerializer->serializeDOM(Wikimedia\Parsoid\DOM\Element, boolean)
#13 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Wikitext/ContentModelHandler.php(229): Wikimedia\Parsoid\Html2Wt\SelectiveSerializer->serializeDOM(Wikimedia\Parsoid\DOM\Document)
#14 /srv/mediawiki/php-1.43.0-wmf.16/vendor/wikimedia/parsoid/src/Parsoid.php(391): Wikimedia\Parsoid\Wikitext\ContentModelHandler->fromDOM(Wikimedia\Parsoid\Ext\ParsoidExtensionAPI, Wikimedia\Parsoid\Core\SelserData)
#15 /srv/mediawiki/php-1.43.0-wmf.16/includes/parser/Parsoid/HtmlToContentTransform.php(613): Wikimedia\Parsoid\Parsoid->dom2wikitext(MediaWiki\Parser\Parsoid\Config\PageConfig, Wikimedia\Parsoid\DOM\Document, array, Wikimedia\Parsoid\Core\SelserData)
#16 /srv/mediawiki/php-1.43.0-wmf.16/includes/parser/Parsoid/HtmlToContentTransform.php(593): MediaWiki\Parser\Parsoid\HtmlToContentTransform->htmlToText()
#17 /srv/mediawiki/php-1.43.0-wmf.16/includes/Rest/Handler/Helper/HtmlInputTransformHelper.php(562): MediaWiki\Parser\Parsoid\HtmlToContentTransform->htmlToContent()
#18 /srv/mediawiki/php-1.43.0-wmf.16/extensions/VisualEditor/includes/DirectParsoidClient.php(203): MediaWiki\Rest\Handler\Helper\HtmlInputTransformHelper->getContent()
#19 /srv/mediawiki/php-1.43.0-wmf.16/extensions/VisualEditor/includes/ApiParsoidTrait.php(121): MediaWiki\Extension\VisualEditor\DirectParsoidClient->transformHTML(MediaWiki\Title\Title, LanguageAr, string, integer, string)
#20 /srv/mediawiki/php-1.43.0-wmf.16/extensions/VisualEditor/includes/ApiVisualEditorEdit.php(251): MediaWiki\Extension\VisualEditor\ApiVisualEditorEdit->transformHTML(MediaWiki\Title\Title, string, integer, string)
#21 /srv/mediawiki/php-1.43.0-wmf.16/extensions/VisualEditor/includes/ApiVisualEditorEdit.php(225): MediaWiki\Extension\VisualEditor\ApiVisualEditorEdit->getWikitextNoCache(MediaWiki\Title\Title, array, array)
#22 /srv/mediawiki/php-1.43.0-wmf.16/extensions/VisualEditor/includes/ApiVisualEditorEdit.php(413): MediaWiki\Extension\VisualEditor\ApiVisualEditorEdit->getWikitext(MediaWiki\Title\Title, array, array)
#23 /srv/mediawiki/php-1.43.0-wmf.16/includes/api/ApiMain.php(1954): MediaWiki\Extension\VisualEditor\ApiVisualEditorEdit->execute()
#24 /srv/mediawiki/php-1.43.0-wmf.16/includes/api/ApiMain.php(930): ApiMain->executeAction()
#25 /srv/mediawiki/php-1.43.0-wmf.16/includes/api/ApiMain.php(901): ApiMain->executeActionWithErrorHandling()
#26 /srv/mediawiki/php-1.43.0-wmf.16/includes/api/ApiEntryPoint.php(153): ApiMain->execute()
#27 /srv/mediawiki/php-1.43.0-wmf.16/includes/MediaWikiEntryPoint.php(200): MediaWiki\Api\ApiEntryPoint->execute()
#28 /srv/mediawiki/php-1.43.0-wmf.16/api.php(44): MediaWiki\MediaWikiEntryPoint->run()
#29 /srv/mediawiki/w/api.php(3): require(string)
#30 {main}
Notes

First seen in wmf.16. Other errors with the same generic symptom ("Invariant failed: Bad UTF-8 at start of string") were present in previous versions, but are unrelated.

We're currently seeing ~1K of these a day, which adds a bit of noise to the logs.

It looks like this change could be the cause: https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1057922

Event Timeline

It looks like this change could be the cause: https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/1057922

It's from this patch, yes,
https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/540620/8

but the difference is going from substr to PHPUtils::safeSubstr which adds the assertion. In other words, it's likely surfacing the issue, rather than being the cause.

jnuche triaged this task as Unbreak Now! priority.Aug 8 2024, 10:27 AM

We got a continuous stream of this error and of T371617 after deploying wmf.17 to group2:

image.png (438×1 px, 68 KB)

Out of caution, I'm rolling back the train and setting these two errors as train blockers until someone can verify the actual impact

The root cause is usually old revisions with bad UTF-8, so the volume tends to vary based on how many folks are digging through back history.

I'll take a look to see if I can narrow things down more specifically, but as @arlo says this isn't a new issue, just a new log.

I'll take a look to see if I can narrow things down more specifically, but as @arlo says this isn't a new issue, just a new log.

Would it be possible to add more specific information about the error to the logged message field?

The data in the log entry is pretty generic: message is "Invariant failed: Bad UTF-8 at start of string", source file is "assert/src/Assert.php", etc. The only place with discriminating info is the stacktrace itself, but that field is not indexed and we cannot add filters for it on the dashboards. So the errors are creating a significant level of noise in the logs that we can't filter out at the moment or more precisely, I'm worried about filtering out too much by accident.

I'm having trouble reproducing the stack trace in the example; it's a html-to-wikitext conversion driven by an edit in Visual Editor. I suspect the issue is the following heading:

== أعراض الألم النفسي<ref>{{استشهاد ويب|عنوان=الألم النفسي: أسبابه، وأعراضه، وعلاجه|مسار=https://www.annajah.net/%D8%A7%D9%84%D8%A3%D9%84%D9%85-%D8%A7%D9%84%D9%86%D9%81%D8%B3%D9%8A-%D8%A3%D8%B3%D8%A8%D8%A7%D8%A8%D9%87-%D9%88%D8%A3%D8%B9%D8%B1%D8%A7%D8%B6%D9%87-%D9%88%D8%B9%D9%84%D8%A7%D8%AC%D9%87-article-33779|تاريخ الوصول=2024-08-06|صحيفة=النجاح|لغة=ar-SA}}</ref><ref>{{استشهاد ويب|عنوان=الألم النفسي وكيفية التعامل معه|مسار=https://altibbi.com/%D9%85%D9%82%D8%A7%D9%84%D8%A7%D8%AA-%D8%B7%D8%A8%D9%8A%D8%A9/%D8%A7%D9%84%D8%B5%D8%AD%D8%A9-%D8%A7%D9%84%D9%86%D9%81%D8%B3%D9%8A%D8%A9/%D8%A7%D9%84%D8%A7%D9%84%D9%85-%D8%A7%D9%84%D9%86%D9%81%D8%B3%D9%8A-9171|تاريخ الوصول=2024-08-06|صحيفة=الطبي|مؤلف=Altibbi.com|لغة=ar-US}}</ref> ==

which is quite an impressive mess.

I'll take a look to see if I can narrow things down more specifically, but as @arlo says this isn't a new issue, just a new log.

Would it be possible to add more specific information about the error to the logged message field?

The data in the log entry is pretty generic: message is "Invariant failed: Bad UTF-8 at start of string", source file is "assert/src/Assert.php", etc. The only place with discriminating info is the stacktrace itself, but that field is not indexed and we cannot add filters for it on the dashboards. So the errors are creating a significant level of noise in the logs that we can't filter out at the moment or more precisely, I'm worried about filtering out too much by accident.

I'd suggest filtering on the url field, /w/api.php, which also helps isolate "problems we can realistically debug" (ie, things which come from a page render) from "things we can't realistically debug" (ie, UTF-8 issues stemming from bad user input).

I'd suggest filtering on the url field, /w/api.php, which also helps isolate "problems we can realistically debug" (ie, things which come from a page render) from "things we can't realistically debug" (ie, UTF-8 issues stemming from bad user input).

That sounds good, it will at least limit the filters to Parsoid calls.

@cscott thanks a lot for taking a look, it sounds like we are reasonably confident nothing has changed other than the logging itself. So I'm going to make the call and roll the train forward again.

jnuche lowered the priority of this task from Unbreak Now! to Needs Triage.Aug 8 2024, 12:28 PM

About 9150 entries in the last 2 weeks ... about 1 every 2 minutes which is higher than I would like. I am not convinced this is actually bad UTF in revisions. This is across a number of different wikis (jawiki contributing almost 45% of the errors).

But, I am having a hard time reproducing this with some random edits on some of the pages in the logs. Need to figure out a different strategy and/or dump more logs since this is clearly user-activity triggered (edits).

Digging into the exception traces a bit more, it looks like these 4 lines account for pretty much every exception (except 7):

  • "Separators.php(899)"
  • "Separators.php(902)"
  • "Separators.php(907)"
  • "Separators.php(642)"

Of those the last one has only 7 instances as well. So, it is really the first three we should focus on first. Those all construct new DSR ranges by doing offset arithmetic by looking at DSR values of two nodes. So, that is a clue to inspect more and if the code guarantees that those constructed offsets are always guaranteed to yield valid UTF8 offsets or if the code needs to have other checks in place to avoid these exception scenarios. I think the code there tries to make best-faith efforts to infer a separator string and rejects the string if it is not a valid separator - so any invalid offsets should not cause exceptions but instead feed the valid separator check. We'll take the followup investigation off-phan now and that might yield a patch.

Change #1062776 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: html2wt: Add option to skip utf8-validity checks

https://gerrit.wikimedia.org/r/1062776

Change #1063059 had a related patch set uploaded (by C. Scott Ananian; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Html2Wt: don't construct substrings with bad UTF-8

https://gerrit.wikimedia.org/r/1063059

Change #1063060 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Html2wt: Fix string creation with bad UTF-8 by validating offsets in DSRs

https://gerrit.wikimedia.org/r/1063060

Change #1063059 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Html2Wt: don't construct substrings with bad UTF-8

https://gerrit.wikimedia.org/r/1063059

Change #1062776 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Html2Wt: Remove substring UTF-8 checks where they are unsafe

https://gerrit.wikimedia.org/r/1062776

Change #1063811 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a17

https://gerrit.wikimedia.org/r/1063811

Change #1063811 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a17

https://gerrit.wikimedia.org/r/1063811

Change #1063060 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Html2wt: Fix string creation with bad UTF-8 by validating offsets in DSRs

https://gerrit.wikimedia.org/r/1063060

Change #1066683 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a18

https://gerrit.wikimedia.org/r/1066683

Change #1066683 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a18

https://gerrit.wikimedia.org/r/1066683