LuaSandbox should clean up short_src for multibyte-safe truncation
Closed, Duplicate (Public)

Description

We've recently been seeing errors like:

[XkxDVwpAICkAAJlsYmYAAAAC] /wiki/%D0%A4%D1%80%D1%8D%D0%B4%D1%8D%D1%80%D1%8B%D0%BA_%D0%9F%D0%B0%D1%81%D1%96   Exception from line 1570 of /srv/mediawiki/php-1.35.0-wmf.19/includes/resourceloader/ResourceLoader.php: JSON serialization of config data failed. This usually means the config data is not valid UTF-8.

#0 /srv/mediawiki/php-1.35.0-wmf.19/includes/OutputPage.php(3181): ResourceLoader::makeConfigSetScript(array)
#1 /srv/mediawiki/php-1.35.0-wmf.19/includes/skins/Skin.php(687): OutputPage->getBottomScripts()
#2 /srv/mediawiki/php-1.35.0-wmf.19/includes/skins/SkinTemplate.php(459): Skin->bottomScripts()
#3 /srv/mediawiki/php-1.35.0-wmf.19/includes/skins/SkinTemplate.php(217): SkinTemplate->prepareQuickTemplate()
#4 /srv/mediawiki/php-1.35.0-wmf.19/includes/OutputPage.php(2593): SkinTemplate->outputPage()
#5 /srv/mediawiki/php-1.35.0-wmf.19/includes/MediaWiki.php(978): OutputPage->output(boolean)
#6 /srv/mediawiki/php-1.35.0-wmf.19/includes/MediaWiki.php(991): MediaWiki->{closure}()
#7 /srv/mediawiki/php-1.35.0-wmf.19/includes/MediaWiki.php(534): MediaWiki->main()
#8 /srv/mediawiki/php-1.35.0-wmf.19/index.php(47): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): require(string)
#10 {main}

This happens because the Scribunto portion of the parser limit report contains invalid UTF-8. The profiler report contains strings like function_name <pagename:123>, which are generated in luasandbox_timer_profiler_hook using short_src, a truncated version of the page name. It appears that Lua generates short_src internally by truncating the page name to 60 bytes, with no awareness of multibyte characters. At first glance a split seems unlikely, because 60 is a multiple of 2, 3 and 4, but any single-byte character in the name (such as a /) shifts the alignment, so the cut can land in the middle of a multibyte character. That leaves behind invalid UTF-8, which then causes problems later on, including the exception above.
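To illustrate the alignment argument, here is a minimal C sketch. The helper name cut_splits_utf8_char is hypothetical and not part of Lua or LuaSandbox; it only checks whether a byte-wise cut at a given offset would fall inside a UTF-8 character, and assumes the input was valid UTF-8 before the cut.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only (not part of Lua or LuaSandbox).
 * Returns 1 if keeping only the first `cut` bytes of s (len bytes long)
 * would split a UTF-8 character, i.e. if the first dropped byte is a
 * continuation byte of the form 10xxxxxx. Assumes s was valid UTF-8. */
static int cut_splits_utf8_char(const char *s, size_t len, size_t cut)
{
    if (cut >= len) {
        return 0; /* nothing is dropped, so nothing is split */
    }
    return ((unsigned char)s[cut] & 0xC0) == 0x80;
}
```

For a name made only of 2-byte characters, an even cut offset is safe: cut_splits_utf8_char("\xD1\x8D\xD1\x8D\xD1\x8D", 6, 4) returns 0. Prefixing a single / shifts every character by one byte, so the same offset now lands mid-character: cut_splits_utf8_char("/\xD1\x8D\xD1\x8D\xD1\x8D", 7, 4) returns 1.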

The non-multibyte-aware truncation is done by Lua itself, so we can't control it from LuaSandbox. However, we could clean up short_src in LuaSandbox afterwards with a multibyte-aware truncation, using logic similar to what we do in PHP in Language::truncateInternal().
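A rough sketch of what such a cleanup could look like in C follows. The function name utf8_safe_prefix_len is hypothetical, and this is not the actual LuaSandbox code or a port of Language::truncateInternal(). It assumes the damage is a partial character at the end of the buffer; if Lua cuts the name elsewhere (for example at the front, after a "..." prefix), a different or additional check would be needed.

```c
#include <stddef.h>

/* Hypothetical helper, not part of LuaSandbox. Returns the length of the
 * longest prefix of s[0..len) that does not end in a partial UTF-8
 * sequence. Only the tail is inspected; the rest of the buffer is
 * assumed to be valid UTF-8 already. */
static size_t utf8_safe_prefix_len(const char *s, size_t len)
{
    size_t i = len;
    size_t need, have;
    unsigned char lead;

    /* Step back over at most three trailing continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && ((unsigned char)s[i - 1] & 0xC0) == 0x80) {
        i--;
    }
    if (i == 0) {
        return 0; /* empty, or continuation bytes with no lead byte */
    }

    /* s[i - 1] should now be the lead byte of the last character. */
    lead = (unsigned char)s[i - 1];
    if (lead < 0x80) {
        need = 1;
    } else if ((lead & 0xE0) == 0xC0) {
        need = 2;
    } else if ((lead & 0xF0) == 0xE0) {
        need = 3;
    } else if ((lead & 0xF8) == 0xF0) {
        need = 4;
    } else {
        return len; /* malformed before the cut; leave it unchanged */
    }

    have = len - (i - 1); /* bytes actually present from the lead byte on */
    if (have < need) {
        return i - 1;          /* drop the partial character */
    }
    return (i - 1) + need;     /* drop any stray continuation bytes */
}
```

A caller would copy short_src into its own buffer and write a terminating NUL at the returned offset before formatting the profiler string. For instance, the one-byte fragment "\xD0" (the first half of a Cyrillic letter) yields 0, while the complete "a\xD0\xA4" yields 3.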