
mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by Lua
Open, Medium, Public

Description

Error

Request URL: https://be-tarask.wikipedia.org/wiki/20th_Century_Fox
Request ID: XTb7DApAIC0AAGlGmg0AAACC

message
[XTb7DApAIC0AAGlGmg0AAACC] /wiki/20th_Century_Fox   Exception from line 1522 of /srv/mediawiki/php-1.34.0-wmf.14/includes/resourceloader/ResourceLoader.php: JSON serialization of config data failed. This usually means the config data is not valid UTF-8.
trace
#0 /srv/mediawiki/php-1.34.0-wmf.14/includes/OutputPage.php(3170): ResourceLoader::makeConfigSetScript(array)
#1 /srv/mediawiki/php-1.34.0-wmf.14/includes/skins/Skin.php(683): OutputPage->getBottomScripts()
#2 /srv/mediawiki/php-1.34.0-wmf.14/includes/skins/SkinTemplate.php(457): Skin->bottomScripts()
#3 /srv/mediawiki/php-1.34.0-wmf.14/includes/skins/SkinTemplate.php(217): SkinTemplate->prepareQuickTemplate()
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/OutputPage.php(2580): SkinTemplate->outputPage()
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(891): OutputPage->output(boolean)
#6 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(903): Closure$MediaWiki::main()
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(515): MediaWiki->main()
#8 /srv/mediawiki/php-1.34.0-wmf.14/index.php(42): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): include(string)
#10 {main}

Impact

TBD

Notes

I'm assuming this is a simple config change that can be fixed in SWAT. Thus, I'm not treating it as a train blocker, but the problem should still be fixed, so I'm marking it as UBN.

Only happens on https://be-tarask.wikipedia.org/

Event Timeline

Restricted Application added a subscriber: Aklapper. Jul 23 2019, 12:48 PM
LarsWirzenius triaged this task as Unbreak Now! priority. Jul 23 2019, 12:49 PM
Restricted Application added a subscriber: Liuxinyu970226. Jul 23 2019, 12:49 PM
hashar updated the task description. Jul 23 2019, 1:01 PM
Anomie lowered the priority of this task from Unbreak Now! to Medium. Jul 23 2019, 3:29 PM
Anomie added a subscriber: Anomie.

Not occurring very frequently, so lowering the priority. And not the "simple config change" theorized either.

I couldn't reproduce it with the URL given, but I found that https://be-tarask.wikipedia.org/wiki/Lady_Gaga was reproducing it. It turns out that the cached ParserOutput with key be_x_oldwiki:pcache:idhash:114544-0!canonical and timestamp 20190723103909 and revision id 1836172 included the string init <\320\234\320\276\320\264\321\203\320\273\321\214:\320\222\321\226\320\272\321\226\320\267\321\214\320\262\320\265\321\201\321\202\320\272\321\226/\320\272\320\260\320\275\321\204\321\226\320\263\321\203\321\200\320\260\321\206\321> in $data['scribunto-limitreport-profile'][3][0]. Just before the > is a truncated UTF-8 character. If the page gets reparsed and that module's method doesn't take so much time as to show up in the limit report, the error will go away.

The underlying cause is that Lua 5.1 defines LUA_IDSIZE as 60, so the name gets truncated at 59 bytes by Lua before the data structure is passed to LuaSandbox's luasandbox_timer_profiler_hook(). Unfortunately we probably can't change that without compiling Lua into LuaSandbox (see also T149552), so probably the easiest current fix would be to normalize the strings returned by LuaSandbox's getProfilerFunctionReport() in Scribunto_LuaSandboxEngine::getLimitReportData().
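To make the truncation concrete, here is a minimal Python sketch that inspects the bytes quoted in the limit report above. The octal escapes are copied verbatim from the cached ParserOutput string; the helper name `first_invalid_offset` is made up for this illustration and is not part of Scribunto or LuaSandbox.

```python
# Bytes of the module name exactly as quoted in the limit report above
# (the part between "<" and ">", octal escapes copied verbatim).
REPORTED_NAME = (
    b"\320\234\320\276\320\264\321\203\320\273\321\214:"
    b"\320\222\321\226\320\272\321\226\320\267\321\214\320\262\320\265\321\201\321\202\320\272\321\226/"
    b"\320\272\320\260\320\275\321\204\321\226\320\263\321\203\321\200\320\260\321\206\321"
)

LUA_IDSIZE = 60  # value stated above for Lua 5.1


def first_invalid_offset(data: bytes):
    """Return the byte offset where strict UTF-8 decoding fails, or None if valid.

    Hypothetical helper for this sketch only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


offset = first_invalid_offset(REPORTED_NAME)

# The name is exactly LUA_IDSIZE - 1 bytes long, i.e. it was cut at the buffer limit.
print(len(REPORTED_NAME) == LUA_IDSIZE - 1)
# Offset of the lone lead byte that makes the string invalid UTF-8.
print(offset)
# Everything before that offset is still well-formed.
print(REPORTED_NAME[:offset].decode("utf-8"))
```

The final byte is the lead byte of a two-byte sequence whose continuation byte fell outside the 59-byte limit, which is why JSON serialization of the limit report fails.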

Krinkle renamed this task from Config error in be-tarask wiki (non-UTF8) to mw.config empty on some pages due to non-UTF8 data from LuaSandbox. Jul 23 2019, 3:46 PM
mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:06 PM
Krinkle added a subscriber: Krinkle. Oct 3 2019, 2:30 AM

Still found in Logstash. The latest URL is https://bn.wikipedia.org/w/index.php?printable=yes&mobileaction=toggle_view_desktop&title=%E0%A6%95%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A7%87%E0%A6%AD, where I'm able to reproduce the error consistently.

Did some ad-hoc logging on mwdebug1001 and got:

> json_last_error_msg()
"Malformed UTF-8 characters, possibly incorrectly encoded"

> array_keys($configuration)
["wgPageParseReport"]

At that point I found this task and realised the issue was known already.

In T245573 @Catrope suggested another potential way to fix this: cleaning up the string inside LuaSandbox after it comes from Lua, in luasandbox_timer_profiler_hook() and luasandbox_push_structured_trace().
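As a rough illustration of what such a cleanup step might do, the sketch below trims a trailing UTF-8 sequence that was cut short. It is written in Python for brevity; the actual change would have to be made in C inside LuaSandbox, and the function names here are hypothetical. It only handles a truncated tail, not malformed bytes elsewhere in the string.

```python
def expected_length(lead: int) -> int:
    """Bytes a UTF-8 sequence starting with `lead` should have (0 if not a lead byte)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead byte


def trim_incomplete_tail(data: bytes) -> bytes:
    """Drop a trailing UTF-8 sequence that was cut short; leave everything else as-is."""
    # Look back at most 3 bytes: the longest incomplete sequence is 3 bytes of a 4-byte one.
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, keep looking for its lead byte
        if expected_length(byte) > back:
            return data[:-back]  # lead byte promises more bytes than are present
        return data  # last sequence is complete, or not something this helper handles
    return data


# A two-byte character whose second byte was cut off, as in the limit report.
print(trim_incomplete_tail(b"ab\xd1"))
```

Trimming keeps the result a strict prefix of the original name, at the cost of silently losing the partial character; replacing it with U+FFFD instead would make the truncation visible to readers of the report.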

Jdforrester-WMF renamed this task from mw.config empty on some pages due to non-UTF8 data from LuaSandbox to mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by LuaSandbox. Feb 19 2020, 11:55 PM
Anomie renamed this task from mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by LuaSandbox to mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by Lua. Feb 20 2020, 4:23 PM

> In T245573 @Catrope suggested another potential way to fix this: cleaning up the string inside LuaSandbox after it comes from Lua, in luasandbox_timer_profiler_hook() and luasandbox_push_structured_trace().

I didn't realize Language::normalize() existed; it's probably easier to use that in LuaSandboxEngine, than to try to fix it in C.
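For comparison, normalizing on the PHP side could look roughly like the following Python sketch. It assumes the normalization replaces malformed sequences with U+FFFD and applies NFC, which is my understanding of Language::normalize() but is not verified here. The row layout (name, milliseconds, percent) and the function names are hypothetical; only the name being at index 0 comes from the limit report described above.

```python
import json
import unicodedata


def normalize_text(raw: bytes) -> str:
    """Decode as UTF-8, replacing malformed sequences with U+FFFD, then apply NFC."""
    return unicodedata.normalize("NFC", raw.decode("utf-8", errors="replace"))


def normalize_profile(rows):
    """Normalize the name column of hypothetical (name, ms, percent) profiler rows."""
    return [(normalize_text(name), ms, percent) for name, ms, percent in rows]


# A shortened stand-in for the reported entry: valid text, then a lone lead byte before ">".
rows = [(b"init <\xd0\x9c\xd0\xbe\xd0\xb4\xd1>", 120, 45.0)]

clean = normalize_profile(rows)
# After normalization the data serializes to JSON without error.
encoded = json.dumps(clean, ensure_ascii=False)
print(encoded)
```

Doing this in the Scribunto engine keeps the fix in PHP and covers any other malformed bytes in the report, whereas a C-side fix would also protect other consumers of LuaSandbox.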