@Krenair poked -ops to ask whether anyone knew about the large number of redis connection failures in the logs on fluorine, such as:
Mar 13 00:39:45 mw1243: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:39:47 mw1241: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:39:48 mw1242: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:39:51 mw1240: #012Warning: Failed connecting to redis server at 10.64.0.163: Connection timed out
Mar 13 00:40:23 mw1228: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:40:23 mw1161: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:40:42 mw1253: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:40:43 mw1174: #012Warning: Failed connecting to redis server at 10.64.0.163: Connection timed out
Mar 13 00:40:50 mw1234: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
First instance from the current log at that time:
Mar 12 07:14:37 mw1079: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
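To see how the failures split across the two redis servers, a quick tally of the log helps. This is a minimal sketch, assuming the entries keep the exact format pasted above; `tally_failures` and `SAMPLE` are names I made up for illustration, and the sample is just the first four entries from the excerpt.

```python
import re
from collections import Counter

# Matches entries of the form seen in the log excerpt, e.g.
# "Mar 13 00:39:45 mw1243: #012Warning: Failed connecting to redis server
#  at 10.64.0.162: Connection timed out"
ENTRY = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+): #012Warning: Failed connecting to redis server at "
    r"(?P<server>\d{1,3}(?:\.\d{1,3}){3}): Connection timed out"
)


def tally_failures(text):
    """Return (failures per redis server, failures per client host)."""
    by_server = Counter()
    by_host = Counter()
    # finditer works whether the entries are one per line or run together.
    for match in ENTRY.finditer(text):
        by_server[match.group("server")] += 1
        by_host[match.group("host")] += 1
    return by_server, by_host


# Hypothetical input: the first four entries from the excerpt above.
SAMPLE = (
    "Mar 13 00:39:45 mw1243: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out\n"
    "Mar 13 00:39:47 mw1241: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out\n"
    "Mar 13 00:39:48 mw1242: #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out\n"
    "Mar 13 00:39:51 mw1240: #012Warning: Failed connecting to redis server at 10.64.0.163: Connection timed out\n"
)

if __name__ == "__main__":
    servers, hosts = tally_failures(SAMPLE)
    for server, count in servers.most_common():
        print(server, count)
```

Run against the full log file instead of `SAMPLE`, this would show whether one redis server accounts for most of the timeouts and whether particular app servers are affected more than others.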
The seemingly relevant log entry from redis is:
[15410] 13 Mar 00:36:16.094 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Redis is essentially failing to keep up with AOF fsyncs, and as a result it is blocking or dropping clients at random. We have seen very little evidence of this in icinga, but...
root@rbf1001:~# redis-cli --latency -h 127.0.0.1 -p 6379
min: 0, max: 91, avg: 11.14 (1184 samples)^C
root@rbf1001:~# redis-cli --latency -h 127.0.0.1 -p 6379
min: 0, max: 79, avg: 14.89 (805 samples)^C
That's not great, I think: redis-cli reports latency in milliseconds, so these are averages of roughly 11 to 15 ms against localhost.
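To compare latency runs over time, the `redis-cli --latency` summary lines can be parsed and combined. This is a rough sketch, assuming the summary format shown in the output above; `parse_latency`, `combined_average`, and `SAMPLE` are hypothetical names of mine, not part of any existing tooling.

```python
import re

# Matches the summary that redis-cli --latency prints, e.g.
# "min: 0, max: 91, avg: 11.14 (1184 samples)"
SUMMARY = re.compile(
    r"min: (?P<min>\d+), max: (?P<max>\d+), avg: (?P<avg>\d+(?:\.\d+)?) "
    r"\((?P<samples>\d+) samples\)"
)


def parse_latency(text):
    """Return a list of (min_ms, max_ms, avg_ms, samples) tuples."""
    return [
        (int(m["min"]), int(m["max"]), float(m["avg"]), int(m["samples"]))
        for m in SUMMARY.finditer(text)
    ]


def combined_average(runs):
    """Sample-weighted average latency in ms across runs, or None if empty."""
    total = sum(samples for _, _, _, samples in runs)
    if total == 0:
        return None
    return sum(avg * samples for _, _, avg, samples in runs) / total


# The two runs pasted above.
SAMPLE = (
    "min: 0, max: 91, avg: 11.14 (1184 samples)\n"
    "min: 0, max: 79, avg: 14.89 (805 samples)\n"
)

if __name__ == "__main__":
    runs = parse_latency(SAMPLE)
    print(f"{combined_average(runs):.2f} ms over {len(runs)} runs")
```

Weighting by sample count keeps a short run from skewing the combined figure.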
Interesting links:
http://engineering.sharethrough.com/blog/2013/06/06/how-redis-took-us-offline-and-what-we-did-about-it/
https://github.com/antirez/redis/issues/641
It seems these have been in place since September: https://phabricator.wikimedia.org/rOPUP69322ed601a0634ef694d922b7e17b5cadb086ca
From what I can gather in IRC, this is an efficiency mechanism that falls back to the standard lookup when redis is unavailable, so I am not sounding the alarm right now. That said, these boxes are only going to degrade further.