We usually see the following icinga alerts that auto-recover after a while:
16:33 <icinga-wm> PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 630 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195378 keys, up 127 days 8 hours - replication_delay is 630 16:33 <icinga-wm> PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4194556 keys, up 127 days 8 hours - replication_delay is 651
rdb2006 is a slaveof rdb2005, that in turn is a slaveof rdb1007. Alex found the following error in the rdb1007's logs:
[1932] 07 Mar 16:22:29.058 # Client id=46013294502 addr=10.192.32.133:39192 fd=93 name= age=3134 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=9652 omem=375231792 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buff er limits. [1932] 07 Mar 16:22:29.156 # Connection with slave 10.192.32.133:6479 lost. [1932] 07 Mar 16:22:30.195 * Slave asks for synchronization [1932] 07 Mar 16:22:30.195 * Unable to partial resync with the slave for lack of backlog (Slave request was: 29621966377054). [1932] 07 Mar 16:22:30.195 * Starting BGSAVE for SYNC [1932] 07 Mar 16:22:30.268 * Background saving started by pid 25963 [25963] 07 Mar 16:22:56.859 * DB saved on disk [25963] 07 Mar 16:22:56.919 * RDB: 135 MB of memory used by copy-on-write [1932] 07 Mar 16:22:57.083 * Background saving terminated with success [1932] 07 Mar 16:23:10.887 * Starting automatic rewriting of AOF on 100% growth [1932] 07 Mar 16:23:10.974 * Background append only file rewriting started by pid 26074 [1932] 07 Mar 16:23:19.054 * Background AOF buffer size: 80 MB [26074] 07 Mar 16:23:37.761 * SYNC append only file rewrite performed [26074] 07 Mar 16:23:37.841 * AOF rewrite: 72 MB of memory used by copy-on-write [1932] 07 Mar 16:23:37.977 * Background AOF rewrite terminated with success [1932] 07 Mar 16:23:38.123 * Parent diff successfully flushed to the rewritten AOF (180596208 bytes) [1932] 07 Mar 16:23:38.123 * Background AOF rewrite finished successfully [1932] 07 Mar 16:23:54.591 # Client id=46017035166 addr=10.192.32.133:40892 fd=281 name= age=85 idle=84 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=13908 omem=536958816 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits. [1932] 07 Mar 16:23:54.641 # Connection with slave 10.192.32.133:6479 lost. [1932] 07 Mar 16:23:55.771 * Slave asks for synchronization [1932] 07 Mar 16:23:55.771 * Full resync requested by slave. [...]
Please note that our icinga alarms are meant to retry for several minutes before triggering a IRC notification.
It seems that we are hitting https://redislabs.com/blog/top-redis-headaches-for-devops-replication-buffer, so Alex tried the following manual command on rdb1007:tcp_6479:
config set client-output-buffer-limit "normal 0 0 0 slave 2536870912 2536870912 60 pubsub 33554432 8388608 60"
In a matter of a couple of minutes Icinga showed rdb200[56] back in sync. We should investigate optimal value for the job queues and make them permanently.
Maybe this work should be done after reducing the complexity of rdb replication (we want to end up with a simple 1:1 replication scheme between eqiad and codfw).