Page MenuHomePhabricator

mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by Lua
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

Request URL: https://be-tarask.wikipedia.org/wiki/20th_Century_Fox
Request ID: XTb7DApAIC0AAGlGmg0AAACC

message
[XTb7DApAIC0AAGlGmg0AAACC] /wiki/20th_Century_Fox   Exception from line 1522 of /srv/mediawiki/php-1.34.0-wmf.14/includes/resourceloader/ResourceLoader.php: JSON serialization of config data failed. This usually means the config data is not valid UTF-8.
trace
#0 /srv/mediawiki/php-1.34.0-wmf.14/includes/OutputPage.php(3170): ResourceLoader::makeConfigSetScript(array)
#1 /srv/mediawiki/php-1.34.0-wmf.14/includes/skins/Skin.php(683): OutputPage->getBottomScripts()
#2 /srv/mediawiki/php-1.34.0-wmf.14/includes/skins/SkinTemplate.php(457): Skin->bottomScripts()
#3 /srv/mediawiki/php-1.34.0-wmf.14/includes/skins/SkinTemplate.php(217): SkinTemplate->prepareQuickTemplate()
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/OutputPage.php(2580): SkinTemplate->outputPage()
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(891): OutputPage->output(boolean)
#6 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(903): Closure$MediaWiki::main()
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/MediaWiki.php(515): MediaWiki->main()
#8 /srv/mediawiki/php-1.34.0-wmf.14/index.php(42): MediaWiki->run()
#9 /srv/mediawiki/w/index.php(3): include(string)
#10 {main}

Impact

The page-specific JSON data for client-side JavaScript is consistently unable to be serialised on some articles, due to Scribunto inserting invalid UTF-8 strings into the dataset.

As a result, client-side JavaScript is unable to initialise completely and some features may be broken or (if we're lucky) missing, due to variables that are expected to always be defined, now sometimes not be defined.

Notes

I'm assuming this is a simple config change that can be fixed in SWAT. Thus, I'm not treating it as a train blocker, but the problem should still be fixed, so I'm marking it as UBN.

Only happens on https://be-tarask.wikipedia.org/

Event Timeline

LarsWirzenius triaged this task as Unbreak Now! priority.Jul 23 2019, 12:49 PM
Anomie lowered the priority of this task from Unbreak Now! to Medium.Jul 23 2019, 3:29 PM
Anomie subscribed.

Not occurring very frequently, so lowering the priority. And not the "simple config change" theorized either.

I couldn't reproduce it with the URL given, but I found that https://be-tarask.wikipedia.org/wiki/Lady_Gaga was reproducing it. It turns out that the cached ParserOutput with key be_x_oldwiki:pcache:idhash:114544-0!canonical and timestamp 20190723103909 and revision id 1836172 included the string init <\320\234\320\276\320\264\321\203\320\273\321\214:\320\222\321\226\320\272\321\226\320\267\321\214\320\262\320\265\321\201\321\202\320\272\321\226/\320\272\320\260\320\275\321\204\321\226\320\263\321\203\321\200\320\260\321\206\321> in $data['scribunto-limitreport-profile'][3][0]. Just before the > is a truncated UTF-8 character. If the page gets reparsed and that module's method doesn't take so much time as to show up in the limit report, the error will go away.

The underlying cause is that Lua 5.1 defines LUA_IDSIZE as 60, so the name gets truncated at 59 bytes by Lua before the data structure is passed to LuaSandbox's luasandbox_timer_profiler_hook(). Unfortunately we probably can't change that without compiling Lua into LuaSandbox (see also T149552), so probably the easiest current fix would be to normalize the strings returned by LuaSandbox's getProfilerFunctionReport() in Scribunto_LuaSandboxEngine::getLimitReportData().

Krinkle renamed this task from Config error in be-tarask wiki (non-UTF8) to mw.config empty on some pages due to non-UTF8 data from LuaSandbox.Jul 23 2019, 3:46 PM
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:06 PM

Still found in Logstash. Latest url is https://bn.wikipedia.org/w/index.php?printable=yes&mobileaction=toggle_view_desktop&title=%E0%A6%95%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A7%87%E0%A6%AD, which I'm able to repro consistently.

Did some ad-hoc logging on mwdebug1001 and got:

> json_last_error_msg()
"Malformed UTF-8 characters, possibly incorrectly encoded"

> array_keys($configuration)
["wgPageParseReport"]

At that point I found this task and realised the issue was known already.

In T245573 @Catrope suggested another potential way to fix this: cleaning up the string inside LuaSandbox after it comes from Lua, in luasandbox_timer_profiler_hook() and luasandbox_push_structured_trace().

Jdforrester-WMF renamed this task from mw.config empty on some pages due to non-UTF8 data from LuaSandbox to mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by LuaSandbox.Feb 19 2020, 11:55 PM
Anomie renamed this task from mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by LuaSandbox to mw.config empty on some pages (and fatal errors emitted) due to Unicode-unaware handling of UTF8 data by Lua.Feb 20 2020, 4:23 PM

In T245573 @Catrope suggested another potential way to fix this: cleaning up the string inside LuaSandbox after it comes from Lua, in luasandbox_timer_profiler_hook() and luasandbox_push_structured_trace().

I didn't realize Language::normalize() existed; it's probably easier to use that in LuaSandboxEngine, than to try to fix it in C.

Aklapper added a subscriber: AMooney.

@AMooney: Assuming that "Set projects" was accidentally used instead of "Add projects", hence restoring some previous project tags.

Still seen over the past 7 days. About a hundred per day.

Requesing re-triage of production error still seen in production one year later.

Change 640270 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/Scribunto@master] Fix invalid UTF-8 in LuaSandbox profiler data

https://gerrit.wikimedia.org/r/640270

If we ever compile our own Lua package, it would be nice to increase LUA_IDSIZE. It does say you are meant to change it:

/*
@@ LUA_IDSIZE gives the maximum size for the description of the source
@* of a function in debug information.
** CHANGE it if you want a different size.
*/
#define LUA_IDSIZE	60

I don't think it's worth making a new package just for that though. It's used for a fixed-size buffer on the stack, so it can't be infinity, but 300 seems reasonable.

Change 640270 merged by jenkins-bot:
[mediawiki/extensions/Scribunto@master] Fix invalid UTF-8 in LuaSandbox profiler data

https://gerrit.wikimedia.org/r/640270