Page MenuHomePhabricator

Scribunto_LuaSandboxInterpreter::callFunction: LuaSandboxFunction::call returned false
Closed, ResolvedPublicPRODUCTION ERROR

Description

2017-05-25 10:23:50 [WSawtgpAIDYAADAqCrEAAABK] mw1184 frwiki 1.30.0-wmf.1 exception ERROR: [WSawtgpAIDYAADAqCrEAAABK] /wiki/Cat%C3%A9gorie:D%C3%A9c%C3%A8s_en_novembre_1698   MWException from line 335 of /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php: Scribunto_LuaSandboxInterpreter::callFunction: LuaSandboxFunction::call returned false {"exception_id":"WSawtgpAIDYAADAqCrEAAABK","caught_by":"mwe_handler"} 
[Exception MWException] (/srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php:335) Scribunto_LuaSandboxInterpreter::callFunction: LuaSandboxFunction::call returned false
  #0 /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(154): Scribunto_LuaSandboxInterpreter->callFunction(LuaSandboxFunction, array)
  #1 /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaCommon/SiteLibrary.php(91): Scribunto_LuaEngine->registerInterface(string, array, array)
  #2 /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(488): Scribunto_LuaSiteLibrary->register()
  #3 /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaCommon/LuaCommon.php(125): Scribunto_LuaEngine->instantiatePHPLibrary(string, string, boolean)
  #4 /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php(37): Scribunto_LuaEngine->load()
  #5 /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/common/Hooks.php(115): Scribunto_LuaSandboxEngine->getResourceUsage(integer)
  #6 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Parser.php(3408): ScribuntoHooks::invokeHook(Parser, PPTemplateFrame_Hash, array)
  #7 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Parser.php(3128): Parser->callParserFunction(PPTemplateFrame_Hash, string, array)
  #8 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Preprocessor_Hash.php(1039): Parser->braceSubstitution(array, PPTemplateFrame_Hash)
  #9 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Parser.php(3282): PPFrame_Hash->expand(PPNode_Hash_Tree)
  #10 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Preprocessor_Hash.php(1039): Parser->braceSubstitution(array, PPTemplateFrame_Hash)
  #11 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Preprocessor_Hash.php(1472): PPFrame_Hash->expand(PPNode_Hash_Tree, integer)
  #12 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Parser.php(3279): PPTemplateFrame_Hash->cachedExpand(string, PPNode_Hash_Tree)
  #13 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Preprocessor_Hash.php(1039): Parser->braceSubstitution(array, PPFrame_Hash)
  #14 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Parser.php(2942): PPFrame_Hash->expand(PPNode_Hash_Tree, integer)
  #15 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Parser.php(1297): Parser->replaceVariables(string)
  #16 /srv/mediawiki/php-1.30.0-wmf.1/includes/parser/Parser.php(451): Parser->internalParse(string)
  #17 /srv/mediawiki/php-1.30.0-wmf.1/includes/cache/MessageCache.php(1172): Parser->parse(string, Title, ParserOptions, boolean)
  #18 /srv/mediawiki/php-1.30.0-wmf.1/includes/Message.php(1221): MessageCache->parse(string, Title, boolean, boolean, Language)
  #19 /srv/mediawiki/php-1.30.0-wmf.1/includes/Message.php(872): Message->parseText(string)
  #20 /srv/mediawiki/php-1.30.0-wmf.1/includes/Message.php(958): Message->toString(string)
  #21 /srv/mediawiki/php-1.30.0-wmf.1/includes/CategoryViewer.php(751): Message->parseAsBlock()
  #22 /srv/mediawiki/php-1.30.0-wmf.1/includes/CategoryViewer.php(414): CategoryViewer->getCountMessage(integer, string, string)
  #23 /srv/mediawiki/php-1.30.0-wmf.1/includes/CategoryViewer.php(119): CategoryViewer->getSubcategorySection()
  #24 /srv/mediawiki/php-1.30.0-wmf.1/includes/page/CategoryPage.php(117): CategoryViewer->getHTML()
  #25 /srv/mediawiki/php-1.30.0-wmf.1/includes/page/CategoryPage.php(72): CategoryPage->closeShowCategory()
  #26 /srv/mediawiki/php-1.30.0-wmf.1/includes/actions/ViewAction.php(68): CategoryPage->view()
  #27 /srv/mediawiki/php-1.30.0-wmf.1/includes/MediaWiki.php(499): ViewAction->show()
  #28 /srv/mediawiki/php-1.30.0-wmf.1/includes/MediaWiki.php(293): MediaWiki->performAction(CategoryTreeCategoryPage, Title)
  #29 /srv/mediawiki/php-1.30.0-wmf.1/includes/MediaWiki.php(862): MediaWiki->performRequest()
  #30 /srv/mediawiki/php-1.30.0-wmf.1/includes/MediaWiki.php(523): MediaWiki->main()
  #31 /srv/mediawiki/php-1.30.0-wmf.1/index.php(43): MediaWiki->run()
  #32 /srv/mediawiki/w/index.php(3): include(string)
  #33 {main}

Event Timeline

Unfortunately, this doesn't tell us much. Visiting https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:D%C3%A9c%C3%A8s_en_novembre_1698 doesn't seem to re-trigger the error, nor does purging the page.

Do we log any of PHP's warnings that should correlate with these exceptions?

Yes, they're just in a different file (hhvm.log) so you have to manually line up request times/servers. At a very quick glance, I noticed that
LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329

is showing up in fatalmonitor, unsure if that's related or a separate issue.

Looks like the corresponding warnings for the exception in the topic are probably

May 25 10:23:50 mw1184:  #012Warning: LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329
May 25 10:23:50 mw1184:  [Thu May 25 10:23:50 2017] [hphp] [...] [] \nWarning: LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329
May 25 10:23:50 mw1184:  #012Warning: LuaSandboxFunction::call(): unable to convert argument 1 to a lua value in /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329
May 25 10:23:50 mw1184:  [Thu May 25 10:23:50 2017] [hphp] [...] [] \nWarning: LuaSandboxFunction::call(): unable to convert argument 1 to a lua value in /srv/mediawiki/php-1.30.0-wmf.1/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329

Given those warnings, what seems to be going on in this particular stack trace is that the $info being constructed in Scribunto_LuaSiteLibrary::register() is somehow getting an object or array that contains itself. Any similar stack trace that has "SiteLibrary.php(91)" at #1 is probably the same thing.

But nothing in that method jumps out at me as being at all likely to produce such an object/array.

Should we add some debug logging then?

Change 358988 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Scribunto@wmf/1.30.0-wmf.5] Add some logging to debug T166348

https://gerrit.wikimedia.org/r/358988

Change 358988 merged by jenkins-bot:
[mediawiki/extensions/Scribunto@wmf/1.30.0-wmf.5] Add some logging to debug T166348

https://gerrit.wikimedia.org/r/358988

Change 358993 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Scribunto@wmf/1.30.0-wmf.5] Add some logging to debug T166348, take 2

https://gerrit.wikimedia.org/r/358993

Change 358993 merged by jenkins-bot:
[mediawiki/extensions/Scribunto@wmf/1.30.0-wmf.5] Add some logging to debug T166348, take 2

https://gerrit.wikimedia.org/r/358993

Change 361508 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Scribunto@wmf/1.30.0-wmf.6] Add some logging to debug T166348

https://gerrit.wikimedia.org/r/361508

Change 361508 merged by jenkins-bot:
[mediawiki/extensions/Scribunto@wmf/1.30.0-wmf.6] Add some logging to debug T166348

https://gerrit.wikimedia.org/r/361508

Anomie added a subscriber: tstarling.

Looking at the logged data, it appears that there's some sort of a false positive in the luasandbox extension's recursion detection. I'm seeing it report the error when passing something as simple as [ "lang" => "ru" ].

I don't see any reason why it would false-positive though, unless it's possible for something to somehow throw an exception and exit from the middle of luasandbox_push_hashtable so it misses removing the entry from protectionSet, then later it happens to use the same pointer address for a different HashTable and hits the false-positive. Or unless IMPLEMENT_THREAD_LOCAL doesn't actually work.

mmodell raised the priority of this task from High to Unbreak Now!.Jun 28 2017, 9:28 PM

This will block the train from moving tomorrow.

It seems like a strange decision to block the train now when this has been going on since March 29 or earlier (exception.log-20170329.gz is the earliest available on mwlog1001).

Anomie: the logspam level has just been increasing and it's downing out other errors. It's gotten very frequent since wmf.6 so something has changed recently that is triggering this more often.

I would normally let @greg make the call but he's on vacation this week. @demon, @thcipriani, what do you think?

The top 3 errors on my fatalmonitor dashboard are

Warning: LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.6/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329 	31
Warning: LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.7/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 312 	14
Warning: LuaSandboxFunction::call(): unable to convert argument 1 to a lua value in /srv/mediawiki/php-1.30.0-wmf.6/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 329 	6

(those counts are for a 30 minute window)

I could just filter these but we've made a policy of zero tolerance for logspam. The decision was to block the train on any significant new Wikimedia-production-error events that are introduced with a deployment. I'm not sure how this one flew under the radar for so long but it looks significant to me as of wmf.6.

So ~60 per hour in wmf.6 and half that in wmf.7 which is only at group1.

The decision was to block the train on any significant new Wikimedia-production-error events that are introduced with a deployment.

Except this wasn't introduced with a deployment, at least not any deployment in the past three months.

If my suspicions[1] are correct, no change deployed with the train caused these and no change deployed with the train is going fix these log messages. It'll need a new version of the hhvm-luasandbox package built and deployed to fix the problem.

[1]: After looking at it a bit more, I'm pretty sure the "unless it's possible for something to somehow throw an exception and exit from the middle of luasandbox_push_hashtable" I mentioned in T166348#3382705 is indeed possible.

mmodell lowered the priority of this task from Unbreak Now! to High.Jun 29 2017, 3:11 PM

Looking at the logged data, it appears that there's some sort of a false positive in the luasandbox extension's recursion detection. I'm seeing it report the error when passing something as simple as [ "lang" => "ru" ].

I don't see any reason why it would false-positive though, unless it's possible for something to somehow throw an exception and exit from the middle of luasandbox_push_hashtable so it misses removing the entry from protectionSet, then later it happens to use the same pointer address for a different HashTable and hits the false-positive. Or unless IMPLEMENT_THREAD_LOCAL doesn't actually work.

My C++ is too terrible/non-existant to diagnose this; however, the error rate climb does seem to line-up with the rollout of HHVM 3.18, and it's entirely possible this error is coming from hhvm-luasandbox (https://github.com/wikimedia/mediawiki-php-luasandbox/blob/fe4addd002163a33101e19c9c40d260b0ae6ee52/data_conversion.c#L210). @Anomie your theory is that some interface in HHVM changed which means that protectionSet is not being managed properly (i.e., not thread-safe or luasandbox_push_hashtable is blowing up somewhere we can't see).

If this is the case, there's certainly nothing we can do about this error message from the perspective of the train. Maybe @MoritzMuehlenhoff and @tstarling can verify the problem/take a look at the extension's use of HHVM bits?

As far as I can tell it was happening even before HHVM 3.18. In exception.log-20170330.gz, I see the error logged by mw1189, mw1229, mw1235, mw1280, mw1281, and mw1282. According to https://wikitech.wikimedia.org/wiki/Server_Admin_Log, however, mw1189 got HHVM 3.18 on May 11, mw1229 and mw1235 got it on June 12, and the other three got it on June 13.

Change 362301 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/php/luasandbox@master] Redo PHP->Lua data structure recursion protection

https://gerrit.wikimedia.org/r/362301

As far as I can tell it was happening even before HHVM 3.18. In exception.log-20170330.gz, I see the error logged by mw1189, mw1229, mw1235, mw1280, mw1281, and mw1282. According to https://wikitech.wikimedia.org/wiki/Server_Admin_Log, however, mw1189 got HHVM 3.18 on May 11, mw1229 and mw1235 got it on June 12, and the other three got it on June 13.

Yeah, I doublechecked the apt logs, these we all running HHVM 3.12 in 20170330

Change 364268 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] scap: logstash_checker ignore for T166348

https://gerrit.wikimedia.org/r/364268

Change 364268 merged by Andrew Bogott:
[operations/puppet@production] scap: logstash_checker ignore for T166348

https://gerrit.wikimedia.org/r/364268

Change 362301 merged by jenkins-bot:
[mediawiki/php/luasandbox@master] Redo PHP->Lua data structure recursion protection

https://gerrit.wikimedia.org/r/362301

Change 362301 merged by jenkins-bot:
[mediawiki/php/luasandbox@master] Redo PHP->Lua data structure recursion protection

https://gerrit.wikimedia.org/r/362301

Next step is for someone to build an updated deb for luasandbox and push it out, then we can see whether the warnings go away (and stay gone, I suspect restarting HHVM on a server would also stop/reduce the warnings for a short time).

HHVM-Luasandbox has been upgraded to 2.0.13 in production, let's see whether this fixes the bug.

I see on fluorine mwlog1001 that the "LuaSandboxFunction::call(): recursion detected", "LuaSandboxFunction::call(): unable to convert argument 1 to a lua value", and "LuaSandboxFunction::call returned false" messages have not been logged since 09:56:57 UTC. That's a good sign.

Anomie closed this task as Resolved.EditedAug 28 2017, 7:16 PM
Anomie claimed this task.

No instances of the log messages since 2017-08-23 09:56:57 UTC. Let's call this fixed.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:10 PM