
Exception: "Malformed UTF-8 characters" in Parser\MagicWordArray (via WikitextContentHandler)
Open, Needs Triage | Public | PRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded
exception.trace
from /srv/mediawiki/php-1.40.0-wmf.6/includes/MagicWordArray.php(319)
#0 /srv/mediawiki/php-1.40.0-wmf.6/includes/parser/Parser.php(4101): MagicWordArray->matchAndRemove(string)
#1 /srv/mediawiki/php-1.40.0-wmf.6/includes/parser/Parser.php(1624): Parser->handleDoubleUnderscore(string)
#2 /srv/mediawiki/php-1.40.0-wmf.6/includes/parser/Parser.php(712): Parser->internalParse(string)
#3 /srv/mediawiki/php-1.40.0-wmf.6/includes/content/WikitextContentHandler.php(301): Parser->parse(string, Title, ParserOptions, boolean, boolean, integer)
#4 /srv/mediawiki/php-1.40.0-wmf.6/includes/content/ContentHandler.php(1721): WikitextContentHandler->fillParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams, ParserOutput)
#5 /srv/mediawiki/php-1.40.0-wmf.6/includes/content/Renderer/ContentRenderer.php(47): ContentHandler->getParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams)
#6 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RenderedRevision.php(266): MediaWiki\Content\Renderer\ContentRenderer->getParserOutput(WikitextContent, Title, integer, ParserOptions, boolean)
#7 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RenderedRevision.php(237): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(WikitextContent, boolean)
#8 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RevisionRenderer.php(221): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array)
#9 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RevisionRenderer.php(158): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
#10 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array)
#11 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RenderedRevision.php(199): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array)
#12 /srv/mediawiki/php-1.40.0-wmf.6/includes/poolcounter/PoolWorkArticleView.php(87): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#13 /srv/mediawiki/php-1.40.0-wmf.6/includes/poolcounter/PoolWorkArticleViewCurrent.php(92): PoolWorkArticleView->renderRevision()
#14 /srv/mediawiki/php-1.40.0-wmf.6/includes/poolcounter/PoolCounterWork.php(163): PoolWorkArticleViewCurrent->doWork()
#15 /srv/mediawiki/php-1.40.0-wmf.6/includes/page/ParserOutputAccess.php(299): PoolCounterWork->execute()
#16 /srv/mediawiki/php-1.40.0-wmf.6/includes/page/Article.php(708): MediaWiki\Page\ParserOutputAccess->getParserOutput(WikiPage, ParserOptions, MediaWiki\Revision\RevisionStoreRecord, integer)
#17 /srv/mediawiki/php-1.40.0-wmf.6/includes/page/Article.php(522): Article->generateContentOutput(User, ParserOptions, integer, OutputPage, array)
#18 /srv/mediawiki/php-1.40.0-wmf.6/includes/actions/ViewAction.php(78): Article->view()
#19 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(542): ViewAction->show()
#20 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(322): MediaWiki->performAction(Article, Title)
#21 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(904): MediaWiki->performRequest()
#22 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(562): MediaWiki->main()
#23 /srv/mediawiki/php-1.40.0-wmf.6/index.php(50): MediaWiki->run()
#24 /srv/mediawiki/php-1.40.0-wmf.6/index.php(46): wfIndexMain()
#25 /srv/mediawiki/w/index.php(3): require(string)
#26 {main}
Impact
Notes

Event Timeline

Throwing an exception is intentional and comes from f27945728f1dcbc296ff397de68123fe78eec06c which was released with 1.40.0-wmf.5. It is related to T319218.

So looking at https://meta.wikimedia.org/wiki/User_talk:Moosh~metawiki?action=raw

It contains the sentence "Bonjour ici également Mosh :-)". The é is encoded as the single byte 0xE9, which is its ISO-8859-1 encoding, not its UTF-8 encoding (0xC3 0xA9). The edit was made in 2002. Some subsequent edits were made, but they were all new-section additions, so I guess they did not trigger conversion of the rest of the page. MetaWiki does not have $wgLegacyEncoding set, so there is no live conversion. Hard to say what happened in the distant past to allow such an encoding error to occur.
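The byte-level difference can be checked directly. The snippet below is a small Python illustration, not MediaWiki code; the sample bytes are rebuilt from the sentence quoted above, and `is_valid_utf8` is a helper name made up for this example.

```python
# The sentence from the page, once as ISO-8859-1 bytes (as stored in 2002)
# and once as proper UTF-8 bytes.
SENTENCE = "Bonjour ici également Mosh :-)"
legacy_bytes = SENTENCE.encode("iso-8859-1")
utf8_bytes = SENTENCE.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(b"\xe9" in legacy_bytes)      # True: é stored as the single byte 0xE9
print(b"\xc3\xa9" in utf8_bytes)    # True: é stored as the two bytes 0xC3 0xA9
print(is_valid_utf8(legacy_bytes))  # False: 0xE9 followed by "g" is malformed UTF-8
print(is_valid_utf8(utf8_bytes))    # True
```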

In theory, any edit to the page should fix the issue (assuming the edit goes through without triggering the exception). The behaviour before the exception was introduced was to just show a blank page, which is also bad behaviour in this situation.

I took the liberty of fixing the encoding - https://meta.wikimedia.org/w/index.php?title=User_talk%3AMoosh%7Emetawiki&type=revision&diff=23955501&oldid=11957226

Surprisingly, when directly viewing the page, it seemed to treat it as if the page didn't exist instead of an exception. Perhaps some cache was involved.

@Bawolff if you have access to be able to edit the page mentioned in T266129#8317919, please edit. Otherwise, I'll check with someone on staff.

Sorry, I don't have the ability to edit that page.

So when I made this patch originally, there was already code that checked for this error and logged a warning. I am guessing that those warnings did not make it to logstash: at the time I saw zero, but there clearly isn't zero.

I guess the question is what to do now?

Should we just ignore these exceptions? After all, the previous state was to just break silently; arguably an exception is better, even if we don't fix it.

I suppose another alternative is to detect the error and try to force-convert the page when it occurs.
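A rough sketch of that detect-and-convert idea, written in Python for illustration rather than as MediaWiki code. The function name `to_utf8_text` and the default fallback encoding are assumptions for the example; a real wiki would take the fallback from its own configuration.

```python
def to_utf8_text(raw: bytes, legacy_encoding: str = "windows-1252") -> str:
    """Decode stored page bytes, falling back to a legacy encoding.

    Hypothetical helper: try strict UTF-8 first, and only if that fails
    reinterpret the whole byte string in the assumed legacy encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values windows-1252 leaves undefined.
        return raw.decode(legacy_encoding, errors="replace")


print(to_utf8_text(b"Bonjour ici \xe9galement Mosh :-)"))
```

One caveat with this all-or-nothing fallback: a page that mixes old legacy-encoded text with newer UTF-8 sections would be decoded entirely as legacy text, which garbles the parts that were already valid UTF-8.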

If you go to https://fy.wikipedia.org/wiki/Wikipedy%3AOanbied_log?oldid=20420, you will see that it breaks loudly (and I think also still logs in logstash) and isn't necessarily any better.

I think this is okay for now. If we see lots of exceptions here, we can figure out other strategies of fixing these pages.

Just noting that https://phabricator.wikimedia.org/T331228 is very likely caused by this as well. (Bug report is closed, and according to https://github.com/edwardspec/mediawiki-extension-JsCalendar/issues/7 it's not the fault of the extension)

I think it's very unlikely these bugs have the same cause (Miraheze didn't exist before the UTF-8 transition).

In T331228, this error happens inside $parser->recursiveTagParseFully( $row->text )
(where $parser is a Parser object, and $row->text is the value of the text.old_text field in the MediaWiki database).

Whatever is written in the database, Parser has trouble processing it.

Let's continue discussing this in https://github.com/edwardspec/mediawiki-extension-JsCalendar/issues/7 (or a separate bug; it is possibly not the fault of that extension either, but some third cause) to avoid this bug getting too off-topic.

dawiki has $wgLegacyEncoding set to 'windows-1252', so it is not wrong to find old text in that encoding there. I'm not sure at what point the conversion happens; see T128150.

Krinkle renamed this task from Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded to Exception: "Malformed UTF-8 characters" in in Parser\MagicWordArray (via WikitextContentHandler). Jun 5 2023, 10:27 PM
Krinkle renamed this task from Exception: "Malformed UTF-8 characters" in in Parser\MagicWordArray (via WikitextContentHandler) to Exception: "Malformed UTF-8 characters" in Parser\MagicWordArray (via WikitextContentHandler).

This came up again at m:Tech#A diff page on meta.wikimedia.org generates "Internal error" (a report about an oldid link, has been fixed in later revisions). Given the error message (Fatal exception of type "Exception"), one has no clue what’s going on. Could at least the code be changed to throw a more expressively named exception, like InvalidEncodingInContentException, instead of \Exception?
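To make the request concrete, here is a minimal Python sketch of a dedicated exception type. It is not the MediaWiki patch: only the class name comes from the comment above, while `check_utf8`, the `revision_id` parameter, and the sample revision number are hypothetical.

```python
class InvalidEncodingInContentException(Exception):
    """Raised when stored revision content is not valid UTF-8."""

    def __init__(self, revision_id: int, offset: int):
        self.revision_id = revision_id
        self.offset = offset
        super().__init__(
            f"Revision {revision_id} contains malformed UTF-8 at byte offset {offset}"
        )


def check_utf8(raw: bytes, revision_id: int) -> str:
    """Decode content, turning a generic decode error into the named exception."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte.
        raise InvalidEncodingInContentException(revision_id, err.start) from err


try:
    check_utf8(b"Bonjour ici \xe9galement", revision_id=12345)  # hypothetical id
except InvalidEncodingInContentException as exc:
    print(type(exc).__name__, exc)
```

With a named type, the error page and the logs can at least say what kind of problem occurred and where, instead of a bare "Exception".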

Change 969169 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/core@master] WIP: UTF8 content errors: Use a more informative exception name

https://gerrit.wikimedia.org/r/969169

One of the edits I can find via logstash looks like this:

  • In revision 1875001 everything is encoded via &#12454;-style numeric character reference sequences.
  • Just a few minutes later the same user uploads revision 1875002. There is no summary line, but it looks like they tried to turn the page into proper UTF-8, and failed. It's hard to tell what went wrong. But things like this happen in a world before UTF-8 became the norm.
  • The user notices and uploads revision 1875003 with the summary line "UTf-8 ka?" This is finally proper UTF-8.

All that happened in September 2002.

My impression is that this has almost nothing to do with the MagicWordArray class. Let's say we somehow manage to fix this class and e.g. make it silently skip broken wikitext like this. What then? There will certainly be just another preg_… call somewhere else that fails for the same reason. Do we want to fix them all? Maybe we should.
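One way to avoid patching every regex call individually is to sanitize the text once, before it reaches any of them. The Python sketch below illustrates that idea only; it is not how the MediaWiki parser is implemented. The names `sanitize` and `match_and_remove` and the three-word pattern are stand-ins chosen for the example.

```python
import re

# Toy stand-in for a magic-word matcher; only three double-underscore words.
DOUBLE_UNDERSCORE = re.compile(r"__(NOTOC|NOEDITSECTION|FORCETOC)__")


def sanitize(raw: bytes) -> str:
    """Decode once at the boundary; malformed bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def match_and_remove(text: str) -> tuple[set[str], str]:
    """Return the magic words found and the text with them removed."""
    found: set[str] = set()

    def _collect(match: re.Match) -> str:
        found.add(match.group(1))
        return ""

    remaining = DOUBLE_UNDERSCORE.sub(_collect, text)
    return found, remaining


found, remaining = match_and_remove(sanitize(b"__NOTOC__Bonjour ici \xe9galement"))
print(found)      # {'NOTOC'}
print(remaining)  # Bonjour ici \ufffdgalement, shown with a replacement character
```

The trade-off is that the broken character is rendered as a replacement character instead of being recovered, but every later pattern match then operates on valid text.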

Change 969301 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/core@master] Make MagicWordArray not fail on old revs with broken UTF-8

https://gerrit.wikimedia.org/r/969301

Change 969310 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/core@master] Rewrite MagicWordArray::matchAndRemove to use single preg_… call

https://gerrit.wikimedia.org/r/969310

Change 969169 abandoned by Subramanya Sastry:

[mediawiki/core@master] MagicWordArray utf-8 content errors: Use a more informative exception

Reason:

https://gerrit.wikimedia.org/r/969169

Change 969301 merged by jenkins-bot:

[mediawiki/core@master] Make MagicWordArray not fail on old revs with broken UTF-8

https://gerrit.wikimedia.org/r/969301

Change 982838 had a related patch set uploaded (by Paladox; author: Thiemo Kreuz (WMDE)):

[mediawiki/core@REL1_40] Make MagicWordArray not fail on old revs with broken UTF-8

https://gerrit.wikimedia.org/r/982838

Change 982838 merged by jenkins-bot:

[mediawiki/core@REL1_40] Make MagicWordArray not fail on old revs with broken UTF-8

https://gerrit.wikimedia.org/r/982838

Change 982839 had a related patch set uploaded (by Paladox; author: Thiemo Kreuz (WMDE)):

[mediawiki/core@REL1_41] Make MagicWordArray not fail on old revs with broken UTF-8

https://gerrit.wikimedia.org/r/982839

Change 982839 merged by jenkins-bot:

[mediawiki/core@REL1_41] Make MagicWordArray not fail on old revs with broken UTF-8

https://gerrit.wikimedia.org/r/982839

Change 1007890 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/core@master] No need to crash the Parser on old revs with broken UTF-8

https://gerrit.wikimedia.org/r/1007890

Change #969310 merged by jenkins-bot:

[mediawiki/core@master] Rewrite MagicWordArray::matchAndRemove to use single preg_… call

https://gerrit.wikimedia.org/r/969310