JobQueue Redis codfw replicas periodically lagging
Open, NormalPublic

Description

We periodically see the following Icinga alerts, which auto-recover after a while:

16:33  <icinga-wm> PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 630 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195378 keys, up 127 days 8 hours - replication_delay is 630
16:33  <icinga-wm> PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4194556 keys, up 127 days 8 hours - replication_delay is 651

rdb2006 is a slave of rdb2005, which in turn is a slave of rdb1007. Alex found the following error in rdb1007's logs:

[1932] 07 Mar 16:22:29.058 # Client id=46013294502 addr=10.192.32.133:39192 fd=93 name= age=3134 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=9652 omem=375231792 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buffer limits.
[1932] 07 Mar 16:22:29.156 # Connection with slave 10.192.32.133:6479 lost.
[1932] 07 Mar 16:22:30.195 * Slave asks for synchronization
[1932] 07 Mar 16:22:30.195 * Unable to partial resync with the slave for lack of backlog (Slave request was: 29621966377054).
[1932] 07 Mar 16:22:30.195 * Starting BGSAVE for SYNC
[1932] 07 Mar 16:22:30.268 * Background saving started by pid 25963
[25963] 07 Mar 16:22:56.859 * DB saved on disk
[25963] 07 Mar 16:22:56.919 * RDB: 135 MB of memory used by copy-on-write
[1932] 07 Mar 16:22:57.083 * Background saving terminated with success
[1932] 07 Mar 16:23:10.887 * Starting automatic rewriting of AOF on 100% growth
[1932] 07 Mar 16:23:10.974 * Background append only file rewriting started by pid 26074
[1932] 07 Mar 16:23:19.054 * Background AOF buffer size: 80 MB
[26074] 07 Mar 16:23:37.761 * SYNC append only file rewrite performed
[26074] 07 Mar 16:23:37.841 * AOF rewrite: 72 MB of memory used by copy-on-write
[1932] 07 Mar 16:23:37.977 * Background AOF rewrite terminated with success
[1932] 07 Mar 16:23:38.123 * Parent diff successfully flushed to the rewritten AOF (180596208 bytes)
[1932] 07 Mar 16:23:38.123 * Background AOF rewrite finished successfully
[1932] 07 Mar 16:23:54.591 # Client id=46017035166 addr=10.192.32.133:40892 fd=281 name= age=85 idle=84 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=13908 omem=536958816 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
[1932] 07 Mar 16:23:54.641 # Connection with slave 10.192.32.133:6479 lost.
[1932] 07 Mar 16:23:55.771 * Slave asks for synchronization
[1932] 07 Mar 16:23:55.771 * Full resync requested by slave.
[...]

Please note that our Icinga alarms are meant to retry for several minutes before triggering an IRC notification.

It seems that we are hitting the issue described in https://redislabs.com/blog/top-redis-headaches-for-devops-replication-buffer, so Alex tried the following manual command on rdb1007:tcp_6479:

config set client-output-buffer-limit "normal 0 0 0 slave 2536870912 2536870912 60 pubsub 33554432 8388608 60"

Within a couple of minutes Icinga showed rdb200[56] back in sync. We should investigate the optimal values for the job queues and make them permanent.
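
For reference, each client class in client-output-buffer-limit takes three values: a hard limit (bytes), a soft limit (bytes), and the number of seconds the soft limit may be exceeded before the connection is closed. The permanent equivalent of the manual command above would look roughly like this in redis.conf (a sketch with the values copied verbatim from the command, not a tuned recommendation):

client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 2536870912 2536870912 60
client-output-buffer-limit pubsub 33554432 8388608 60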

Maybe this work should be done after reducing the complexity of rdb replication (we want to end up with a simple 1:1 replication scheme between eqiad and codfw).

elukey created this task. Tue, Mar 7, 5:09 PM
elukey added a project: User-Elukey.

Looking at the RDB file size for that specific instance, it's currently 1.4G on rdb1007. It's clear that the configured limits (500 MB hard, 200 MB per 60 s soft) are not enough for that. A valid question, however, is how the DB grew to that size.

elukey moved this task from Backlog to Ops Backlog on the User-Elukey board. Wed, Mar 8, 1:14 PM
elukey added a comment (edited). Mon, Mar 13, 4:08 PM
elukey@neodymium:~$ sudo -i salt -E 'rdb100[1357].eqiad.wmnet' cmd.run "du -hs /srv/redis/* | sort -h"
rdb1007.eqiad.wmnet:
    12K /srv/redis/rdb1007-6378.rdb
    20K /srv/redis/rdb1007-6378.aof
    334M        /srv/redis/rdb1007-6381.rdb
    335M        /srv/redis/rdb1007-6380.rdb
    1.1G        /srv/redis/rdb1007-6379.rdb
    1.1G        /srv/redis/rdb1007-6380.aof
    1.5G        /srv/redis/rdb1007-6381.aof
    2.5G        /srv/redis/rdb1007-6379.aof
rdb1003.eqiad.wmnet:
    650M        /srv/redis/rdb1003-6378.rdb
    820M        /srv/redis/rdb1003-6381.rdb
    830M        /srv/redis/rdb1003-6380.rdb
    831M        /srv/redis/rdb1003-6379.rdb
    2.1G        /srv/redis/rdb1003-6381.aof
    2.4G        /srv/redis/rdb1003-6380.aof
    2.6G        /srv/redis/rdb1003-6378.aof
    3.3G        /srv/redis/rdb1003-6379.aof
rdb1005.eqiad.wmnet:
    48K /srv/redis/rdb1005-6378.rdb
    76K /srv/redis/rdb1005-6378.aof
    301M        /srv/redis/rdb1005-6379.rdb
    301M        /srv/redis/rdb1005-6380.rdb
    301M        /srv/redis/rdb1005-6381.rdb
    1.4G        /srv/redis/rdb1005-6381.aof
    1.5G        /srv/redis/rdb1005-6380.aof
    1.6G        /srv/redis/rdb1005-6379.aof
rdb1001.eqiad.wmnet:
    64K /srv/redis/rdb1001-6378.rdb
    338M        /srv/redis/rdb1001-6380.rdb
    341M        /srv/redis/rdb1001-6381.rdb
    484M        /srv/redis/rdb1001-6378.aof
    866M        /srv/redis/rdb1001-6379.rdb
    1.5G        /srv/redis/rdb1001-6380.aof
    1.5G        /srv/redis/rdb1001-6381.aof
    1.8G        /srv/redis/rdb1001-6379.aof

Number of keys:

elukey@rdb1007:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=4,expires=0,avg_ttl=0
6379
db0:keys=4043206,expires=4040878,avg_ttl=1160204506
6380
db0:keys=4043764,expires=4041289,avg_ttl=1220147207
6381
db0:keys=4044297,expires=4041872,avg_ttl=1050451847

elukey@rdb1003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=4705607,expires=0,avg_ttl=0
6379
db0:keys=8753222,expires=4048090,avg_ttl=1259105679
6380
db0:keys=8747860,expires=4042678,avg_ttl=1273141405
6381
db0:keys=8660122,expires=4043189,avg_ttl=1125295546

elukey@rdb1001:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=15,expires=0,avg_ttl=0
6379
db0:keys=8743507,expires=4038267,avg_ttl=1153199227
6380
db0:keys=4046272,expires=4043976,avg_ttl=1237364313
6381
db0:keys=4045598,expires=4043277,avg_ttl=1222512715

elukey@rdb1005:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=3,expires=0,avg_ttl=0
6379
db0:keys=4044909,expires=4042693,avg_ttl=1128640267
6380
db0:keys=4048189,expires=4045966,avg_ttl=1273315676
6381
db0:keys=4043837,expires=4041633,avg_ttl=1276522812

I ran redis-cli --bigkeys -i 0.1 against rdb1008:6379 (slaveof rdb1007) and got this result:

-------- summary -------

Sampled 4000451 keys in the keyspace!
Total key length in bytes is 332250282 (avg len 83.05)

Biggest string found 'wikidatawiki:jobqueue:htmlCacheUpdate:rootjob:d8efd448b6c2ab1e96b1cee6c6a96d0dcfdc0601' has 14 bytes
Biggest   list found 'enwiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed' has 2419 items
Biggest    set found 'jobqueue:aggregator:s-wikis:v2' has 946 members
Biggest   hash found 'ocg_job_status' has 609723 fields
Biggest   zset found 'metawiki:jobqueue:translationNotificationJob:z-abandoned' has 900 members

3998641 strings with 55980973 bytes (99.95% of keys, avg size 14.00)
107 lists with 8132 items (00.00% of keys, avg size 76.00)
2 sets with 1371 members (00.00% of keys, avg size 685.50)
1304 hashs with 646510 fields (00.03% of keys, avg size 495.79)
397 zsets with 4425 members (00.01% of keys, avg size 11.15)

So, according to the du output above, various replications would not recover if broken, due to the buffer limits we have in place. That makes me think we should raise those limits anyway.

Joe added a comment. Thu, Mar 16, 9:19 AM

@akosiaris are you sure about that? If a replica is broken, the rdb file is transferred, and from what I see only some are larger than 500 MB.

We should probably do the following, and we'd better do it before the switchover:

  • raise client-output-buffer-limits to 2 GB
  • inspect the data in the Redis instances to see whether some GC-like process is needed to trim unused keys. I suspect we have quite a bit of useless data there.

@Joe we could use SCAN/KEYS in a loop and check the TTL and OBJECT IDLETIME of all the Redis keys on the eqiad slaves, getting some useful statistics about the job queues.
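
A minimal sketch of such a loop (my own sketch, untested in production; it reuses the masterauth grep from the commands above, assumes redis-cli --scan is available on 2.8.17, and only samples a subset of keys to keep the load on the slave low):

port=6379
auth="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${port}.conf)"
# --scan iterates with SCAN under the hood (non-blocking, unlike KEYS);
# print "<ttl> <idletime> <key>" for the first 100k keys for later aggregation.
redis-cli -a "${auth}" -p "${port}" --scan | head -n 100000 | while read -r key; do
    ttl=$(redis-cli -a "${auth}" -p "${port}" TTL "${key}")
    idle=$(redis-cli -a "${auth}" -p "${port}" OBJECT IDLETIME "${key}")
    echo "${ttl} ${idle} ${key}"
done > /tmp/jobqueue_key_stats.txt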

@akosiaris are you sure about that? If a replica is broken, the rdb file is transferred, and from what I see only some are larger than 500 MB.

I think we are agreeing here. Various replications would not recover due to various rdb files being larger than 500MB. I count 6/16 in @elukey's numbers above.

We should probably do the following, and we'd better do it before the switchover:

  • raise client-output-buffer-limits to 2 GB

I concur.

  • inspect the data in the Redis instances to see whether some GC-like process is needed to trim unused keys. I suspect we have quite a bit of useless data there.

As far as keys go, @elukey's commands show that only one instance (rdb1003:6378) has many keys with no TTL set. It's one of the 6 I've counted above as "would break". We should definitely have a look at that and see why it happens; maybe we can find a bug and fix it. That seems like low-hanging fruit.

@akosiaris: I would also look at it in a different way - is it normal that the avg_ttl is so high for the keys in the job queues? If I got it right those are seconds, so there are definitely some keys that are not going to expire for a very long time.

Change 343027 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet] redis: Increase the hard limit for slave output buffer

https://gerrit.wikimedia.org/r/343027

Recapping the IRC discussion: the average TTL is around 13 days (the values are in ms), which seems quite reasonable given the job queue history.
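
As a quick sanity check on the units (my own arithmetic, using the avg_ttl reported above for rdb1007:6379):

awk 'BEGIN { print 1160204506 / 1000 / 86400 }'   # ms -> days, prints ~13.4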

Mentioned in SAL (#wikimedia-operations) [2017-03-20T11:39:51Z] <akosiaris> return rdb1007 client-output-buffer-limit config to initially configured value T159850

Mentioned in SAL (#wikimedia-operations) [2017-03-20T14:32:23Z] <akosiaris> disable puppet on all rdb* nodes to shepherd https://gerrit.wikimedia.org/r/343027 into production. T159850

Change 343027 merged by Alexandros Kosiaris:
[operations/puppet] redis: Increase the hard limit for slave output buffer

https://gerrit.wikimedia.org/r/343027

Change 343669 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet] Change the redis slave client_output_buffer_limit

https://gerrit.wikimedia.org/r/343669

Change 343669 merged by Alexandros Kosiaris:
[operations/puppet] Change the redis slave client_output_buffer_limit

https://gerrit.wikimedia.org/r/343669

Mentioned in SAL (#wikimedia-operations) [2017-03-20T17:35:52Z] <akosiaris> slow rolling restart of redis databases in codfw T159850

The redis databases in codfw are picking up the change as we speak. I'll be monitoring and will do eqiad tomorrow morning (European time).