
tools-redis broken
Closed, Resolved · Public

Description

valhallasw@tools-trusty:~$ telnet tools-redis 22
Trying 10.68.16.18...
Connected to tools-redis.eqiad.wmflabs.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

valhallasw@tools-trusty:~$ telnet tools-redis 6379
Trying 10.68.16.18...
Connected to tools-redis.eqiad.wmflabs.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

valhallasw@tools-trusty:~$ telnet tools-redis 1234
Trying 10.68.16.18...
telnet: Unable to connect to remote host: Connection refused

So the host accepts TCP connections on the expected ports, but the services behind them do not respond at all.
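For reference, redis-cli's PING distinguishes "TCP port open" from "service actually answering" (a sketch; assumes redis-cli is installed on the bastion):

redis-cli -h tools-redis -p 6379 ping
# a healthy server replies PONG immediately; a hung one accepts the
# TCP connection but never answers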

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
Restricted Application added a subscriber: Aklapper. · Apr 18 2015, 7:47 PM

OOM again? Console output shows

[3275819.012492] Killed process 4663 (redis-server) total-vm:15794460kB, anon-rss:15313012kB, file-rss:92kB
[3275840.851752] Out of memory: Kill process 27882 (redis-server) score 932 or sacrifice child

a gazillion times, then

[3277323.536049] INFO: task redis-server:4720 blocked for more than 120 seconds.
[3277323.540633]       Not tainted 3.13.0-45-generic #74-Ubuntu
[3277323.541238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

another gazillion times. Those time stamps are not very useful :(
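Those kernel timestamps are seconds since boot, so they can be mapped back to wall-clock time (a sketch; assumes GNU date, that the instance wasn't suspended, and that it is run before the reboot below resets the counter):

# boot time = now - uptime; then add the dmesg offset
BOOT=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
date -d "@$(( BOOT + 3275840 ))"
# newer util-linux can also do this directly with: dmesg -T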

Rebooted the instance; redis is back online. We still have to figure out what happened, though...

In any case, memory is an issue:

root@tools-redis:/var/log/redis# free -h
             total       used       free     shared    buffers     cached
Mem:           15G        15G       246M       384K        46M       2.8G
-/+ buffers/cache:        12G       3.1G
Swap:         487M         0B       487M
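A common guard against exactly this (a sketch, not what was configured here) is to cap Redis well below physical RAM in /etc/redis/redis.conf, so a BGSAVE fork still has headroom:

maxmemory 12gb
# the eviction policy is workload-dependent; for a job queue, silently
# evicting keys may be worse than having writes fail
maxmemory-policy noeviction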

The memory pressure also causes the slave to fail re-syncing:

[1339] 18 Apr 20:07:43.874 * Slave asks for synchronization
[1339] 18 Apr 20:07:43.874 * Full resync requested by slave.
[1339] 18 Apr 20:07:43.875 * Starting BGSAVE for SYNC
[1339] 18 Apr 20:07:43.875 # Can't save in background: fork: Cannot allocate memory
[1339] 18 Apr 20:07:43.875 * Replication failed, can't BGSAVE
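Whether a ~13GB process can fork() at all on a 15GB box depends on the kernel's overcommit policy; checking it is cheap (a sketch; 0 = heuristic, 1 = always allow, 2 = strict accounting):

cat /proc/sys/vm/overcommit_memory
# current commit charge vs. the limit:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo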

But in general, the server just seems to have spontaneously died during the night; even the redis log is empty in that period:

[27882] 18 Apr 02:12:05.094 * Background saving started by pid 4720
[1339] 18 Apr 19:55:43.721 # Unable to set the max number of files limit to 10032 (Operation not permitted), setting the max clients configuration to 3984.

There seem to be *two* running redis processes every now and then, both using ~12GB of RAM, even though the host doesn't have that much. This might be during a BGSAVE, as that fork()s the process...

ps auxf confirms this...

redis     1339  4.3 78.3 12994928 12878596 ?   Ssl  19:55   2:59 /usr/bin/redis-server *:6379
redis     7657  101 78.3 12995320 12878424 ?   R    21:04   0:18  \_ redis-rdb-bgsave *:6379

@yuvipanda just reminded me that Linux uses copy-on-write for fork(), so that should be OK. The free memory does drop from 3.1G to ~2.4G, but that leaves more than enough breathing room -- so I'm not sure anymore this really was a memory issue.
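Copy-on-write keeps the *actual* usage low, but the kernel still has to account for the child's whole address space, so the commit charge roughly doubles during a BGSAVE even though little is physically copied. That is visible live (a sketch):

watch -n1 "grep -E 'CommitLimit|Committed_AS' /proc/meminfo"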

coren added a comment. · Apr 20 2015, 1:16 PM
[3275840.851752] Out of memory: Kill process 27882 (redis-server) score 932 or sacrifice child

This unambiguously tells us that the OOM killer was woken up - so there is a point at which memory use increases until the system runs out.

We might want to make sure no memory overcommit is possible on that box and let the sbrk()s fail - it won't fix the underlying issue, but it will prevent the OOM killer from waking. (Also, if the process that does the allocation checks its mallocs as it should, then it will be able to deal with the error in an appropriate way.)
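In sysctl terms, that suggestion is strict accounting mode (a sketch; oversized allocations then fail with ENOMEM instead of waking the OOM killer):

sysctl -w vm.overcommit_memory=2
# in mode 2 the commit limit is swap + overcommit_ratio% of RAM
sysctl -w vm.overcommit_ratio=80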

valhallasw closed this task as Resolved. · Apr 27 2015, 12:27 PM
valhallasw claimed this task.

Overcommit was disabled by @yuvipanda and the biggest memory usage was solved by {T91979: Audit redis usage on toollabs}, so hopefully this is resolved.

So I actually *enabled* overcommit - redis can't actually work properly without it - http://redis.io/topics/faq ctrl-f 'overcommit'.

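Concretely, the FAQ's recommendation is mode 1 (always allow), so the BGSAVE fork() cannot fail on commit accounting:

sysctl -w vm.overcommit_memory=1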

This might suggest overcommit wasn't on before the crash, but it was in fact enabled initially as well. Timeline from IRC:

22:04 <YuviPanda> months ago
22:04 <YuviPanda> before the apr 20th failure
22:04 <YuviPanda> when redis died
22:04 <YuviPanda> I put it back up
22:04 <YuviPanda> and enabled overcommit manually
22:04 <YuviPanda> and everything was fine
22:05 <YuviPanda> bgsave worked fine
22:05 <YuviPanda> and then on apr 20th it died
22:05 <YuviPanda> and server reboot
22:05 <YuviPanda> overcommit flag was cleared, and disabled.

which then caused bgsave to fail completely. @yuvipanda then re-enabled overcommit, and bgsave was alive again.
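Note that `sysctl -w` only changes the running kernel, which is exactly why the reboot cleared the flag. Persisting it survives reboots (a sketch; the file layout may vary per distro, and on Toolforge this would presumably be puppetized):

echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf
sysctl -p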

And with T91979, we don't have 15GB of data anymore, so also no issues for now.

Restricted Application added a project: Cloud-Services. · Jun 7 2017, 6:52 PM