Page MenuHomePhabricator

Stack overflow when Redis is down
Open, NormalPublic

Description

As seen on beta:
https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json

2018-01-16 23:40:04 [Wl6NVApEFhUAAAMB6D4AAAAC] deployment-mediawiki05 dewiki 1.31.0-alpha fatal ERROR: [ef765e72] PHP Fatal Error: Stack overflow {"fatal_exception":{"class":"ErrorException","message":"PHP Fatal Error: Stack overflow","code":16777217,"file":"","line":0,"trace":{"0":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"1":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"2":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"3":{"function":"auth","class":"Redis","type":"->","args":["string"]},"4":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"5":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"6":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"7":{"function":"auth","class":"Redis","type":"->","args":["string"]},"8":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"9":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"10":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"11":{"function":"auth","class":"Redis","type":"->","args":["string"]},"12":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"13":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"14":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"15":{"function":"auth","class":"Redis","type":"->","args":["string"]},"16":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"17":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"18":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"19":{"function":"auth","class":"Redis","type":"->","args":["string"]},"20":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"21":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"22":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"23":{"function":"auth","class":"Redis","type":"->","args":["string"]},"24":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"25":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"26":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"27":{"function":"auth","class":"Redis","type":"->","args":["string"]},"28":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"29":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"30":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"31":{"function":"auth","class":"Redis","type":"->","args":["string"]},"32":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"33":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"34":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"35":{"function":"auth","class":"Redis","type":"->","args":["string"]},"36":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"37":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"38":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},
...

This is causing siteinfo API requests (probably all API requests) to fail, which is causing the selenium-CentralAuth browsertest (probably a bunch of others as well) to fail. Also kills the job queue presumably.

There are two issues:

  • MediaWiki can't connect to Redis in Beta
  • HHVM is handling the connection error poorly

Event Timeline

Tgr created this task.Jan 16 2018, 11:44 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 16 2018, 11:44 PM
Tgr added a comment.Jan 16 2018, 11:56 PM

Apparently HHVM's Redis implementation goes into an infinite loop when the Redis server is down :/

Tgr added a comment.Jan 17 2018, 12:11 AM

The redis-server service was down on redis-01/02, I did a manual restart, but that only changed the methods involved in the loop somewhat:

2018-01-17 00:09:03 [Wl6UHwpEFhUAAAMB6EYAAAAC] deployment-mediawiki05 enwiki 1.31.0-alpha fatal ERROR: [5214e78a] PHP Fatal Error: Stack overflow {"fatal_exception":{"class":"ErrorException","message":"PHP Fatal Error: Stack overflow","code":16777217,"file":"","line":0,"trace":{"0":{"function":"processArrayCommand","class":"Redis","type":"->","args":["string","array"]},"1":{"function":"processCommand","class":"Redis","type":"->","args":["string","string"]},"2":{"function":"auth","class":"Redis","type":"->","args":["string"]},"3":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"4":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"5":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"6":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"7":{"function":"auth","class":"Redis","type":"->","args":["string"]},"8":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"9":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"10":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"11":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"12":{"function":"auth","class":"Redis","type":"->","args":["string"]},"13":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"14":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"15":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"16":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"17":{"function":"auth","class":"Redis","type":"->","args":["string"]},"18":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"19":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"20":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"21":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"22":{"function":"auth","class":"Redis","type":"->","args":["string"]},"23":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"24":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"25":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"26":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"27":{"function":"auth","class":"Redis","type":"->","args":["string"]},"28":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"29":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"30":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"31":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"32":{"function":"auth","class":"Redis","type":"->","args":["string"]},"33":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"34":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"35":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"36":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"37":{"function":"auth","class":"Redis","type":"->","args":["string"]},"38":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"39":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"40":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]},"41":{"function":"processBooleanResponse","class":"Redis","type":"->","args":[]},"42":{"function":"auth","class":"Redis","type":"->","args":["string"]},"43":{"function":"checkConnection","class":"Redis","type":"->","args":[]},"44":{"function":"sockReadLine","class":"Redis","type":"->","args":[]},"45":{"function":"sockReadData","class":"Redis","type":"->","args":["NULL"]}
Tgr added a comment.EditedJan 17 2018, 12:17 AM

The first loop is

Redis::processArrayCommand
Redis::checkConnection
Redis::auth
Redis::processCommand
Redis::processArrayCommand
...

The second loop is

Redis::auth
Redis::processBooleanResponse
Redis::sockReadData
Redis::sockReadLine
Redis::checkConnection
Redis::auth
...
Tgr added a comment.Jan 17 2018, 12:34 AM

Caused by some issue with the redis servers:

redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a <password> get test
(error) ERR Connection refused

This is different from the error given on a wrong password.

Tgr updated the task description. (Show Details)Jan 17 2018, 12:36 AM
Tgr updated the task description. (Show Details)Jan 17 2018, 12:49 AM
greg added subscribers: aaron, greg.

Asking for help from @aaron (kind of object cache) and Operations / Performance-Team to diagnose this. Ori was previously the best person to ask re these things.

Tgr updated the task description. (Show Details)Jan 17 2018, 4:39 PM
greg triaged this task as High priority.Jan 17 2018, 4:49 PM

This is almost UBN! per "This is causing siteinfo API requests (probably all API requests) to fail, which is causing the selenium-CentralAuth browsertest (probably a bunch of others as well) to fail."

Krinkle added a subscriber: Krinkle.EditedJan 17 2018, 5:53 PM

Is this task about session storage Redis or JobQueue Redis? There is a RedisBagOStuff, but afaik we don't use that for objectcache in beta or prod.

I would this would come from JobQueue Redis given that ApiSiteInfo is supports outputting a jobs count statistic. On the other hand, the code path for SiteStats::jobs() is only invoked by ApiSiteInfo for siprop=statistics, which is not being used here. So maybe it's coming from some place else.

Either way, I suppose it doesn't matter given the recursion is caused by HHVM native, not MediaWiki PHP.

EDIT: Grepping includes/api/ for various different terms, I found another call into Redis in ApiMain for the implementation of getMaxLag and JobQueueIncludeInMaxLagFactor.

EDIT 2: Any API query on Beta is failing, not ApiSiteInfo. See plain https://en.wikipedia.beta.wmflabs.org/w/api.php for example. So it's probably caused by the maxlag checker.

Krinkle added a project: MediaWiki-JobQueue.
aaron added a comment.Jan 17 2018, 6:07 PM

So I cannot contact redis via nutcracker on tin. I noticed the password was not actually set for redis (trying to AUTH when no password set results in an error); using CONFIG SET requirepass <x> didn't make a difference though. In any case, I can use redis-cli to talk to the local redis instance on 01/02 themselves. I'm not sure how much of this is nutcracker vs redis. Restarting either does not help.

On logstash-beta i see

wiki:enwiki exception.trace:#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array, array) #1 [internal function]: Redis->processArrayCommand(string, array) #2 [internal function]: Redis->processCommand(string, string) #3 [internal function]: Redis->auth(string) #4 [internal function]: Redis->checkConnection() #5 [internal function]: Redis->sockReadLine() #6 [internal function]: Redis->sockReadData(NULL) #7 [internal function]: Redis->processBooleanResponse()

Tgr added a comment.Jan 17 2018, 7:08 PM

All of Beta MediaWiki seems to be down now.

tgr@deployment-mediawiki04:~$ curl -v -H 'Host: en.wikipedia.beta.wmflabs.org' http://localhost
...
<p>If you report this error to the Wikimedia System Administrators, please include the details below.</p>
<p class="text-muted"><code>
  PHP fatal error: <br/>
  Stack overflow</code></p></div>
...
Anomie removed a project: MediaWiki-API.EditedJan 17 2018, 7:12 PM
Anomie added a subscriber: Anomie.

This has nothing to do with MediaWiki-API, or as far as I can tell anything to do with code in MediaWiki at all. It seems to be completely internal to HHVM's Redis class and the fact that the redis servers are down in Beta.

Tgr added a comment.Jan 18 2018, 12:39 AM
tgr@deployment-mediawiki04:~$ telnet deployment-redis01.deployment-prep.eqiad.wmflabs 6379
Trying 10.68.16.177...
telnet: Unable to connect to remote host: Connection refused
tgr@deployment-redis01:~$ sudo netstat -ntlp | grep redis
tcp        0      0 127.0.0.1:6379          0.0.0.0:*               LISTEN      1624/redis-server 1

I'm a bit outside my comfort zone with network admin stuff, but doesn't this mean that redis is listening on the loopback interface and any external connection attempts are hopeless?

Tgr added a comment.Jan 18 2018, 1:12 AM

So this is partially my fault, sorry :/ I started redis with sudo service redis-server start but the redis service that is properly configured to serve external requests is called redis-instance-tcp_6379. (In my defense, only redis-server autocompletes, presumably because the other scripts are in /etc/init and not /etc/init.d so it never occurred to me I might be using the wrong service.) As for why the service needed manual starting in the first place, unlike most other Beta hosts redis01/02 is still on trusty, and presumably the upstart branch of the redis puppet module is not really maintained anymore.

I stopped the wrong service and started the right one, and redis works correctly now. nutcracker is still dead :(

Tgr claimed this task.Jan 18 2018, 1:25 AM
Tgr lowered the priority of this task from High to Normal.

nutcracker is still dead :(

It just needed a restart, so beta is back again, yay!

(And it only took four hours or so of random flailing around with tools I half-understand. Not so yay.)

Not marking resolved yet as this will probably reoccur as soon as the redis servers get restarted for some reason.

Suggested followups:

  • migrate the remaining trusty servers or fix the redis upstart script in puppet
  • find a contact person for "system services are screwed up on Beta". It causes a lot of disruption for developers (all the browser and integration tests go wild, there is no way to test patches in a production-like environment etc) so it would be nice if it got resolved quickly; and in hindsight this was not a complicated problem.
  • upstream the HHVM recursion issue (did not test but just by eyeballing the code it seems to be still present in master)
  • document debug steps for redis & nutcracker on wikitech
Tgr added a comment.Jan 18 2018, 1:30 AM

Also I have zero clue why redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a <correct password> and redis-cli -s /var/run/nutcracker/redis_eqiad.sock -a <wrong password> returned different errors on mediawiki04 even though the redis server didn't even listen for connections, that's super weird. (And it made me assume that Redis is at least reachable, and that led the investigation into unproductive directions.) Does nutcracker pre-check the password? Or does it do some kind of caching?

Imarlier moved this task from Limbo to Perf issue on the Performance-Team (Radar) board.
Imarlier added a subscriber: Imarlier.
This comment has been deleted.