Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded
Open, Needs Triage · Public · PRODUCTION ERROR

Description

Error
normalized_message
[{reqId}] {exception_url}   Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded
exception.trace
from /srv/mediawiki/php-1.40.0-wmf.6/includes/MagicWordArray.php(319)
#0 /srv/mediawiki/php-1.40.0-wmf.6/includes/parser/Parser.php(4101): MagicWordArray->matchAndRemove(string)
#1 /srv/mediawiki/php-1.40.0-wmf.6/includes/parser/Parser.php(1624): Parser->handleDoubleUnderscore(string)
#2 /srv/mediawiki/php-1.40.0-wmf.6/includes/parser/Parser.php(712): Parser->internalParse(string)
#3 /srv/mediawiki/php-1.40.0-wmf.6/includes/content/WikitextContentHandler.php(301): Parser->parse(string, Title, ParserOptions, boolean, boolean, integer)
#4 /srv/mediawiki/php-1.40.0-wmf.6/includes/content/ContentHandler.php(1721): WikitextContentHandler->fillParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams, ParserOutput)
#5 /srv/mediawiki/php-1.40.0-wmf.6/includes/content/Renderer/ContentRenderer.php(47): ContentHandler->getParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams)
#6 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RenderedRevision.php(266): MediaWiki\Content\Renderer\ContentRenderer->getParserOutput(WikitextContent, Title, integer, ParserOptions, boolean)
#7 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RenderedRevision.php(237): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(WikitextContent, boolean)
#8 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RevisionRenderer.php(221): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array)
#9 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RevisionRenderer.php(158): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, array)
#10 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array)
#11 /srv/mediawiki/php-1.40.0-wmf.6/includes/Revision/RenderedRevision.php(199): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array)
#12 /srv/mediawiki/php-1.40.0-wmf.6/includes/poolcounter/PoolWorkArticleView.php(87): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#13 /srv/mediawiki/php-1.40.0-wmf.6/includes/poolcounter/PoolWorkArticleViewCurrent.php(92): PoolWorkArticleView->renderRevision()
#14 /srv/mediawiki/php-1.40.0-wmf.6/includes/poolcounter/PoolCounterWork.php(163): PoolWorkArticleViewCurrent->doWork()
#15 /srv/mediawiki/php-1.40.0-wmf.6/includes/page/ParserOutputAccess.php(299): PoolCounterWork->execute()
#16 /srv/mediawiki/php-1.40.0-wmf.6/includes/page/Article.php(708): MediaWiki\Page\ParserOutputAccess->getParserOutput(WikiPage, ParserOptions, MediaWiki\Revision\RevisionStoreRecord, integer)
#17 /srv/mediawiki/php-1.40.0-wmf.6/includes/page/Article.php(522): Article->generateContentOutput(User, ParserOptions, integer, OutputPage, array)
#18 /srv/mediawiki/php-1.40.0-wmf.6/includes/actions/ViewAction.php(78): Article->view()
#19 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(542): ViewAction->show()
#20 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(322): MediaWiki->performAction(Article, Title)
#21 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(904): MediaWiki->performRequest()
#22 /srv/mediawiki/php-1.40.0-wmf.6/includes/MediaWiki.php(562): MediaWiki->main()
#23 /srv/mediawiki/php-1.40.0-wmf.6/index.php(50): MediaWiki->run()
#24 /srv/mediawiki/php-1.40.0-wmf.6/index.php(46): wfIndexMain()
#25 /srv/mediawiki/w/index.php(3): require(string)
#26 {main}
Impact
Notes

Details

Request URL
https://meta.wikimedia.org/wiki/User_talk:Moosh~metawiki

Event Timeline

Throwing an exception is intentional and comes from f27945728f1dcbc296ff397de68123fe78eec06c which was released with 1.40.0-wmf.5. It is related to T319218.

So looking at https://meta.wikimedia.org/wiki/User_talk:Moosh~metawiki?action=raw

It contains the sentence "Bonjour ici également Mosh :-)". The é is encoded as the single byte 0xE9, which is its ISO-8859-1 encoding, not its UTF-8 encoding (0xC3 0xA9). The edit was made in 2002. Some subsequent edits were made, but they were all new-section additions, so I guess they did not trigger conversion of the rest of the page. MetaWiki does not have $wgLegacyEncoding set, so there is no live conversion. It is hard to say what happened in the distant past to allow such an encoding error to occur.
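To make the byte-level difference concrete, here is a small Python sketch (an illustration only, not MediaWiki code). It rebuilds the sentence with é stored as the single byte 0xE9 and shows that a strict UTF-8 decoder rejects it, which is roughly the condition PCRE reports as "Malformed UTF-8 characters". The helper name `is_valid_utf8` is made up for this example.

```python
# The sentence from the page, with "é" stored as the ISO-8859-1 byte 0xE9.
SENTENCE = "Bonjour ici également Mosh :-)"
legacy_bytes = SENTENCE.encode("iso-8859-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two encodings of "é" differ: one byte versus two bytes.
legacy_e_acute = "é".encode("iso-8859-1")  # b"\xe9"
utf8_e_acute = "é".encode("utf-8")         # b"\xc3\xa9"

# Interpreting the stored bytes as ISO-8859-1 and re-encoding them as UTF-8
# is the kind of repair a manual edit to the page performs.
repaired = legacy_bytes.decode("iso-8859-1").encode("utf-8")

print(is_valid_utf8(legacy_bytes))  # the legacy bytes are not valid UTF-8
print(is_valid_utf8(repaired))      # the re-encoded bytes are
```

The byte 0xE9 looks like the start of a three-byte UTF-8 sequence, but the following "g" is not a continuation byte, so the decoder gives up at that position.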

In theory, any edit to the page should fix the issue (assuming the edit goes through without triggering the exception). Before the exception was introduced, the behaviour was to just show a blank page, which is also bad in this situation.

I took the liberty of fixing the encoding - https://meta.wikimedia.org/w/index.php?title=User_talk%3AMoosh%7Emetawiki&type=revision&diff=23955501&oldid=11957226

Surprisingly, when directly viewing the page, it seemed to treat it as if the page didn't exist instead of an exception. Perhaps some cache was involved.

@Bawolff if you have access to be able to edit the page mentioned in T266129#8317919, please edit. Otherwise, I'll check with someone on staff.

Sorry, I don't have the ability to edit that page.

So when I originally made this patch, there was already code that checked for this error and logged a warning. I am guessing that those warnings did not make it to logstash: at the time I saw zero, but the count clearly isn't zero.

I guess the question is what to do now?

Should we just ignore these exceptions? After all, the previous state was to break silently; arguably an exception is better, even if we don't fix it.

I suppose another alternative is to detect the error and try to force-convert the page when it occurs.
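As a rough illustration of what such a force-conversion could look like, here is a Python sketch (not MediaWiki code, and not a proposal for the actual patch). It assumes the legacy bytes are ISO-8859-1, as in the page above, and converts only the bytes that fail UTF-8 decoding, so sections that were later saved as valid UTF-8 are left alone. The names `latin1_fallback` and `force_utf8` are invented for this example.

```python
import codecs


def _latin1_fallback(err: UnicodeError):
    """Decode only the offending bytes as ISO-8859-1 (assumed legacy encoding)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_bytes = err.object[err.start:err.end]
    # Return the replacement text and the position at which decoding resumes.
    return bad_bytes.decode("iso-8859-1"), err.end


# Register the handler under a made-up name so bytes.decode() can use it.
codecs.register_error("latin1_fallback", _latin1_fallback)


def force_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8, treating invalid bytes as ISO-8859-1 characters."""
    return raw.decode("utf-8", errors="latin1_fallback")


# A page mixing a valid UTF-8 "é" with a legacy 0xE9 byte.
mixed = "é".encode("utf-8") + b" \xe9galement"
print(force_utf8(mixed))
```

This is a heuristic: a legacy byte that happens to be followed by bytes that look like UTF-8 continuation bytes would still be decoded as UTF-8, so a real conversion would need review of the result.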

If you go to https://fy.wikipedia.org/wiki/Wikipedy%3AOanbied_log?oldid=20420, you will see that it breaks loudly (and I think also still logs in logstash) and isn't necessarily any better.

I think this is okay for now. If we see lots of exceptions here, we can figure out other strategies of fixing these pages.

Just noting that https://phabricator.wikimedia.org/T331228 is very likely caused by this as well. (Bug report is closed, and according to https://github.com/edwardspec/mediawiki-extension-JsCalendar/issues/7 it's not the fault of the extension)

I think it's very unlikely these bugs have the same cause (Miraheze didn't exist before the UTF-8 transition).

In T331228, this error happens inside $parser->recursiveTagParseFully( $row->text )
(where $parser is Parser object, and $row->text is the value of text.old_text field in the MediaWiki database).

Whatever is written in the database, Parser has trouble processing it.
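One way to check whether a stored text value is malformed UTF-8, and where, is to scan the raw bytes for positions a strict decoder rejects. The Python sketch below is an illustration under that assumption; it does not query the database, and `invalid_utf8_offsets` is a hypothetical helper. Whether invalid UTF-8 is actually the cause in T331228 is not established in this thread.

```python
def invalid_utf8_offsets(raw: bytes) -> list[int]:
    """Return the byte offsets at which strict UTF-8 decoding of raw fails."""
    offsets = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice raw[pos:].
            offsets.append(pos + err.start)
            pos += err.end  # skip past the invalid sequence and keep scanning
    return offsets


# Example: "ab", a legacy 0xE9 byte, "cd", then a byte that is never valid UTF-8.
print(invalid_utf8_offsets(b"ab\xe9cd\xff"))
```

An empty list means the value is valid UTF-8 and the cause lies elsewhere.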

Let's continue discussing this in https://github.com/edwardspec/mediawiki-extension-JsCalendar/issues/7 (or a separate bug; it is possibly not the fault of that extension either, but some third cause) to avoid this bug getting too off topic.