Page MenuHomePhabricator

Stack overflow when Redis is down
Closed, DeclinedPublic

Description

As seen on beta:
https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json

2018-01-16 23:40:04 [Wl6NVApEFhUAAAMB6D4AAAAC] deployment-mediawiki05 dewiki 1.31.0-alpha fatal ERROR: [ef765e72] PHP Fatal Error: Stack overflow {"fatal_exception":{"class":"ErrorException","message":"PHP Fatal Error: Stack overflow","code":16777217,"file":"","line":0,"trace":{"0":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"1":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"2":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"3":{"function":"auth","class":"Redis","type":"->","args":["string"]},"4":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"5":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"6":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"7":{"function":"auth","class":"Redis","type":"->","args":["string"]},"8":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"9":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"10":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"11":{"function":"auth","class":"Redis","type":"->","args":["string"]},"12":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"13":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"14":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"15":{"function":"auth","class":"Redis","type":"->","args":["string"]},"16":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"17":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"18":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"19":{"function":"auth","class":"Redis","type":"->","args":["string"]},"20":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"21":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"22":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"23":{"function":"auth","class":"Redis","type":"->","args":["string"]},"24":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"25":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"26":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"27":{"function":"auth","class":"Redis","type":"->","args":["string"]},"28":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"29":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"30":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"31":{"function":"auth","class":"Redis","type":"->","args":["string"]},"32":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"33":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"34":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"35":{"function":"auth","class":"Redis","type":"->","args":["string"]},"36":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"37":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"38":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},
...

This is causing siteinfo API requests (probably all API requests) to fail, which is causing the selenium-CentralAuth browsertest (probably a bunch of others as well) to fail. Also kills the job queue presumably.

There are two issues:

  • MediaWiki can't connect to Redis in Beta
  • HHVM is handling the connection error poorly

Event Timeline

Apparently HHVM's Redis implementation goes into an infinite loop when the Redis server is down :/

The redis-server service was down on redis-01/02, I did a manual restart, but that only changed the methods involved in the loop somewhat:

2018-01-17 00:09:03 [Wl6UHwpEFhUAAAMB6EYAAAAC] deployment-mediawiki05 enwiki 1.31.0-alpha fatal ERROR: [5214e78a] PHP Fatal Error: Stack overflow {"fatal_exception":{"class":"ErrorException","message":"PHP Fatal Error: Stack overflow","code":16777217,"file":"","line":0,"trace":{"0":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"1":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"2":{"function":"auth","class":"Redis","type":"->","args":["string"]},"3":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"4":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"5":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"6":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"7":{"function":"auth","class":"Redis","type":"->","args":["string"]},"8":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"9":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"10":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"11":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"12":{"function":"auth","class":"Redis","type":"->","args":["string"]},"13":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"14":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"15":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"16":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"17":{"function":"auth","class":"Redis","type":"->","args":["string"]},"18":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"19":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"20":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"21":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"22":{"function":"auth","class":"Redis","type":"->","args":["string"]},"23":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"24":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"25":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"26":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"27":{"function":"auth","class":"Redis","type":"->","args":["string"]},"28":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"29":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"30":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"31":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"32":{"function":"auth","class":"Redis","type":"->","args":["string"]},"33":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"34":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"35":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"36":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"37":{"function":"auth","class":"Redis","type":"->","args":["string"]},"38":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"39":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"40":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"41":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"42":{"function":"auth","class":"Redis","type":"->","args":["string"]},"43":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"44":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"45":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]}

The first loop is

Redis::processArrayCommand
Redis::checkConnection
Redis::auth
Redis::processCommand
Redis::processArrayCommand
...

The second loop is

Redis::auth
Redis::processBooleanResponse
Redis::sockReadData
Redis::sockReadLine
Redis::checkConnection
Redis::auth
...

Caused by some issue with the redis servers:

redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a <password> get test
(error) ERR Connection refused

This is different from the error given on a wrong password.

greg added subscribers: aaron, greg.

Asking for help from @aaron (kind of object cache) and SRE / Performance-Team to diagnose this. Ori was previously the best person to ask re these things.

greg triaged this task as High priority.Jan 17 2018, 4:49 PM

This is almost UBN! per "This is causing siteinfo API requests (probably all API requests) to fail, which is causing the selenium-CentralAuth browsertest (probably a bunch of others as well) to fail."

Is this task about session storage Redis or JobQueue Redis? There is a RedisBagOStuff, but afaik we don't use that for objectcache in beta or prod.

I would this would come from JobQueue Redis given that ApiSiteInfo is supports outputting a jobs count statistic. On the other hand, the code path for SiteStats::jobs() is only invoked by ApiSiteInfo for siprop=statistics, which is not being used here. So maybe it's coming from some place else.

Either way, I suppose it doesn't matter given the recursion is caused by HHVM native, not MediaWiki PHP.

EDIT: Grepping includes/api/ for various different terms, I found another call into Redis in ApiMain for the implementation of getMaxLag and JobQueueIncludeInMaxLagFactor.

EDIT 2: Any API query on Beta is failing, not ApiSiteInfo. See plain https://en.wikipedia.beta.wmflabs.org/w/api.php for example. So it's probably caused by the maxlag checker.

So I cannot contact redis via nutcracker on tin. I noticed the password was not actually set for redis (trying to AUTH when no password set results in an error); using CONFIG SET requirepass <x> didn't make a difference though. In any case, I can use redis-cli to talk to the local redis instance on 01/02 themselves. I'm not sure how much of this is nutcracker vs redis. Restarting either does not help.

On logstash-beta i see

wiki:enwiki exception.trace:#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array, array) #1 [internal function]: Redis->processArrayCommand(string, array) #2 [internal function]: Redis->processCommand(string, string) #3 [internal function]: Redis->auth(string) #4 [internal function]: Redis->checkConnection() #5 [internal function]: Redis->sockReadLine() #6 [internal function]: Redis->sockReadData(NULL) #7 [internal function]: Redis->processBooleanResponse()

All of Beta MediaWiki seems to be down now.

tgr@deployment-mediawiki04:~$ curl -v -H 'Host: en.wikipedia.beta.wmflabs.org' http://localhost
...
<p>If you report this error to the Wikimedia System Administrators, please include the details below.</p>
<p class="text-muted"><code>
  PHP fatal error: <br/>
  Stack overflow</code></p></div>
...
Anomie subscribed.

This has nothing to do with MediaWiki-Action-API, or as far as I can tell anything to do with code in MediaWiki at all. It seems to be completely internal to HHVM's Redis class and the fact that the redis servers are down in Beta.

tgr@deployment-mediawiki04:~$ telnet deployment-redis01.deployment-prep.eqiad.wmflabs 6379
Trying 10.68.16.177...
telnet: Unable to connect to remote host: Connection refused
tgr@deployment-redis01:~$ sudo netstat -ntlp | grep redis
tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      1624/redis-server 1

I'm a bit outside my comfort zone with network admin stuff, but doesn't this mean that redis is listening on the loopback interface and any external connection attempts are hopeless?

So this is partially my fault, sorry :/ I started redis with sudo service redis-server start but the redis service that is properly configured to serve external requests is called redis-instance-tcp_6379. (In my defense, only redis-server autocompletes, presumably because the other scripts are in /etc/init and not /etc/init.d so it never occurred to me I might be using the wrong service.) As for why the service needed manual starting in the first place, unlike most other Beta hosts redis01/02 is still on trusty, and presumably the upstart branch of the redis puppet module is not really maintained anymore.

I stopped the wrong service and started the right one, and redis works correctly now. nutcracker is still dead :(

Tgr lowered the priority of this task from High to Medium.

nutcracker is still dead :(

It just needed a restart, so beta is back again, yay!

(And it only took four hours or so of random flailing around with tools I half-understand. Not so yay.)

Not marking resolved yet as this will probably reoccur as soon as the redis servers get restarted for some reason.

Suggested followups:

  • migrate the remaining trusty servers or fix the redis upstart script in puppet
  • find a contact person for "system services are screwed up on Beta". It causes a lot of disruption for developers (all the browser and integration tests go wild, there is no way to test patches in a production-like environment etc) so it would be nice if it got resolved quickly; and in hindsight this was not a complicated problem.
  • upstream the HHVM recursion issue (did not test but just by eyeballing the code it seems to be still present in master)
  • document debug steps for redis & nutcracker on wikitech

Also I have zero clue why redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a <correct password> and redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a <wrong password> returned different errors on mediawiki04 even though the redis server didn't even listen for connections, that's super weird. (And it made me assume that Redis is at least reachable, and that led the investigation into unproductive directions.) Does nutcracker pre-check the password? Or does it do some kind of caching?

Imarlier moved this task from Limbo to Perf issue on the Performance-Team (Radar) board.
Imarlier subscribed.
This comment has been deleted.
Aklapper removed Tgr as the assignee of this task.Jun 19 2020, 4:19 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips how to manage individual work in Phabricator (noisy notifications, lists of task, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the records, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

We're no longer on HHVM.
We also use Redis for fewer things now.

Declining the old ticket for now. Feel free to re-open if it happens again.