
JobQueue Redis codfw replicas periodically lagging
Closed, Resolved · Public

Description

We periodically see the following Icinga alerts, which auto-recover after a while:

16:33  <icinga-wm> PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 630 600 - REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 4195378 keys, up 127 days 8 hours - replication_delay is 630
16:33  <icinga-wm> PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 4194556 keys, up 127 days 8 hours - replication_delay is 651

rdb2006 is a slave of rdb2005, which in turn is a slave of rdb1007. Alex found the following in rdb1007's logs:

[1932] 07 Mar 16:22:29.058 # Client id=46013294502 addr=10.192.32.133:39192 fd=93 name= age=3134 idle=1 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=9652 omem=375231792 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buffer limits.
[1932] 07 Mar 16:22:29.156 # Connection with slave 10.192.32.133:6479 lost.
[1932] 07 Mar 16:22:30.195 * Slave asks for synchronization
[1932] 07 Mar 16:22:30.195 * Unable to partial resync with the slave for lack of backlog (Slave request was: 29621966377054).
[1932] 07 Mar 16:22:30.195 * Starting BGSAVE for SYNC
[1932] 07 Mar 16:22:30.268 * Background saving started by pid 25963
[25963] 07 Mar 16:22:56.859 * DB saved on disk
[25963] 07 Mar 16:22:56.919 * RDB: 135 MB of memory used by copy-on-write
[1932] 07 Mar 16:22:57.083 * Background saving terminated with success
[1932] 07 Mar 16:23:10.887 * Starting automatic rewriting of AOF on 100% growth
[1932] 07 Mar 16:23:10.974 * Background append only file rewriting started by pid 26074
[1932] 07 Mar 16:23:19.054 * Background AOF buffer size: 80 MB
[26074] 07 Mar 16:23:37.761 * SYNC append only file rewrite performed
[26074] 07 Mar 16:23:37.841 * AOF rewrite: 72 MB of memory used by copy-on-write
[1932] 07 Mar 16:23:37.977 * Background AOF rewrite terminated with success
[1932] 07 Mar 16:23:38.123 * Parent diff successfully flushed to the rewritten AOF (180596208 bytes)
[1932] 07 Mar 16:23:38.123 * Background AOF rewrite finished successfully
[1932] 07 Mar 16:23:54.591 # Client id=46017035166 addr=10.192.32.133:40892 fd=281 name= age=85 idle=84 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=13908 omem=536958816 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
[1932] 07 Mar 16:23:54.641 # Connection with slave 10.192.32.133:6479 lost.
[1932] 07 Mar 16:23:55.771 * Slave asks for synchronization
[1932] 07 Mar 16:23:55.771 * Full resync requested by slave.
[...]

Please note that our Icinga alarms are meant to retry for several minutes before triggering an IRC notification.

It seems that we are hitting the issue described in https://redislabs.com/blog/top-redis-headaches-for-devops-replication-buffer, so Alex ran the following command manually on rdb1007:tcp_6479:

config set client-output-buffer-limit "normal 0 0 0 slave 2536870912 2536870912 60 pubsub 33554432 8388608 60"

Within a couple of minutes Icinga showed rdb200[56] back in sync. We should investigate the optimal values for the job queues and make them permanent.
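For the record, the limit that is actually live on an instance can be double-checked at runtime with CONFIG GET; a small sketch, where the port and config-file path follow the pattern used elsewhere in this task and may need adjusting per instance (the output should simply echo back the value set above):

redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6479.conf)" -p 6479 CONFIG GET client-output-buffer-limit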

Maybe this work should be done after reducing the complexity of rdb replication (we want to end up with a simple 1:1 replication scheme between eqiad and codfw).
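As a side note, the "Unable to partial resync with the slave for lack of backlog" line above also suggests keeping an eye on the replication backlog: a bigger backlog lets a slave perform a partial resync after a longer disconnection instead of falling back to a full one. A hedged sketch for inspecting (and, if we ever decide to, bumping) it at runtime, with auth omitted and the 256 MB value purely illustrative:

redis-cli -p 6479 CONFIG GET repl-backlog-size
redis-cli -p 6479 CONFIG GET repl-backlog-ttl
redis-cli -p 6479 CONFIG SET repl-backlog-size 268435456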

Event Timeline

Looking at the RDB file size for that specific instance, it's currently 1.4 GB on rdb1007. It's clear that the configured limits (500 MB hard, 200 MB per 60 seconds soft) are not enough for that. A valid question, however, is how the DB grew to that size.

elukey@neodymium:~$ sudo -i salt -E 'rdb100[1357].eqiad.wmnet' cmd.run "du -hs /srv/redis/* | sort -h"
rdb1007.eqiad.wmnet:
    12K /srv/redis/rdb1007-6378.rdb
    20K /srv/redis/rdb1007-6378.aof
    334M        /srv/redis/rdb1007-6381.rdb
    335M        /srv/redis/rdb1007-6380.rdb
    1.1G        /srv/redis/rdb1007-6379.rdb
    1.1G        /srv/redis/rdb1007-6380.aof
    1.5G        /srv/redis/rdb1007-6381.aof
    2.5G        /srv/redis/rdb1007-6379.aof
rdb1003.eqiad.wmnet:
    650M        /srv/redis/rdb1003-6378.rdb
    820M        /srv/redis/rdb1003-6381.rdb
    830M        /srv/redis/rdb1003-6380.rdb
    831M        /srv/redis/rdb1003-6379.rdb
    2.1G        /srv/redis/rdb1003-6381.aof
    2.4G        /srv/redis/rdb1003-6380.aof
    2.6G        /srv/redis/rdb1003-6378.aof
    3.3G        /srv/redis/rdb1003-6379.aof
rdb1005.eqiad.wmnet:
    48K /srv/redis/rdb1005-6378.rdb
    76K /srv/redis/rdb1005-6378.aof
    301M        /srv/redis/rdb1005-6379.rdb
    301M        /srv/redis/rdb1005-6380.rdb
    301M        /srv/redis/rdb1005-6381.rdb
    1.4G        /srv/redis/rdb1005-6381.aof
    1.5G        /srv/redis/rdb1005-6380.aof
    1.6G        /srv/redis/rdb1005-6379.aof
rdb1001.eqiad.wmnet:
    64K /srv/redis/rdb1001-6378.rdb
    338M        /srv/redis/rdb1001-6380.rdb
    341M        /srv/redis/rdb1001-6381.rdb
    484M        /srv/redis/rdb1001-6378.aof
    866M        /srv/redis/rdb1001-6379.rdb
    1.5G        /srv/redis/rdb1001-6380.aof
    1.5G        /srv/redis/rdb1001-6381.aof
    1.8G        /srv/redis/rdb1001-6379.aof

Number of keys:

elukey@rdb1007:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=4,expires=0,avg_ttl=0
6379
db0:keys=4043206,expires=4040878,avg_ttl=1160204506
6380
db0:keys=4043764,expires=4041289,avg_ttl=1220147207
6381
db0:keys=4044297,expires=4041872,avg_ttl=1050451847

elukey@rdb1003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=4705607,expires=0,avg_ttl=0
6379
db0:keys=8753222,expires=4048090,avg_ttl=1259105679
6380
db0:keys=8747860,expires=4042678,avg_ttl=1273141405
6381
db0:keys=8660122,expires=4043189,avg_ttl=1125295546

elukey@rdb1001:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=15,expires=0,avg_ttl=0
6379
db0:keys=8743507,expires=4038267,avg_ttl=1153199227
6380
db0:keys=4046272,expires=4043976,avg_ttl=1237364313
6381
db0:keys=4045598,expires=4043277,avg_ttl=1222512715

elukey@rdb1005:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep keys=; done
6378
db0:keys=3,expires=0,avg_ttl=0
6379
db0:keys=4044909,expires=4042693,avg_ttl=1128640267
6380
db0:keys=4048189,expires=4045966,avg_ttl=1273315676
6381
db0:keys=4043837,expires=4041633,avg_ttl=1276522812

I ran redis-cli --bigkeys -i 0.1 on rdb1008:6379 (slave of rdb1007) and got this result:

-------- summary -------

Sampled 4000451 keys in the keyspace!
Total key length in bytes is 332250282 (avg len 83.05)

Biggest string found 'wikidatawiki:jobqueue:htmlCacheUpdate:rootjob:d8efd448b6c2ab1e96b1cee6c6a96d0dcfdc0601' has 14 bytes
Biggest   list found 'enwiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed' has 2419 items
Biggest    set found 'jobqueue:aggregator:s-wikis:v2' has 946 members
Biggest   hash found 'ocg_job_status' has 609723 fields
Biggest   zset found 'metawiki:jobqueue:translationNotificationJob:z-abandoned' has 900 members

3998641 strings with 55980973 bytes (99.95% of keys, avg size 14.00)
107 lists with 8132 items (00.00% of keys, avg size 76.00)
2 sets with 1371 members (00.00% of keys, avg size 685.50)
1304 hashs with 646510 fields (00.03% of keys, avg size 495.79)
397 zsets with 4425 members (00.01% of keys, avg size 11.15)

So, according to the rdb file sizes above, various replications would not recover if broken, due to the buffer limits we have in place. Makes me think we should raise those limits anyway.

@akosiaris are you sure about that? If the replica is broken, the rdb file is transferred, and from what I see only some are larger than 500 MB.

We should probably do the following, and we'd better do it before the switchover:

  • raise client-output-buffer-limits to 2 GB
  • inspect the data in the Redis instances to see whether some GC-like process is needed to trim unused keys. I suspect we have quite a bit of useless data there.

@Joe we could use SCAN (or KEYS) in a loop and check the TTL and OBJECT IDLETIME of all the Redis keys on the eqiad slaves, getting some useful statistics about the job queues.
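Something along these lines should do for the statistics gathering. It is a rough sketch only: it runs against a local slave instance, uses SCAN so we don't block the server, and since it spawns one redis-cli call per key it is slow for millions of keys, so it's more suitable for sampling than for a full pass:

#!/bin/bash
# Sketch: dump TTL and idle time for every key of a local Redis instance.
PORT=6379
AUTH="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${PORT}.conf)"
CURSOR=0
while :; do
    # SCAN returns the next cursor first, then a batch of keys
    REPLY=($(redis-cli -a "$AUTH" -p "$PORT" SCAN "$CURSOR" COUNT 1000))
    CURSOR=${REPLY[0]}
    for KEY in "${REPLY[@]:1}"; do
        TTL=$(redis-cli -a "$AUTH" -p "$PORT" TTL "$KEY")
        IDLE=$(redis-cli -a "$AUTH" -p "$PORT" OBJECT IDLETIME "$KEY")
        printf '%s\t%s\t%s\n' "$TTL" "$IDLE" "$KEY"
    done
    [ "$CURSOR" -eq 0 ] && break
done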

@akosiaris are you sure about that? If the replica is broken, the rdb file is transferred, and from what I see only some are larger than 500 MB.

I think we are agreeing here. Various replications would not recover due to several rdb files being larger than 500 MB. I count 6 out of 16 in @elukey's numbers above.

We should probably do the following, and we'd better do it before the switchover:

  • raise client-output-buffer-limits to 2 GB

I concur.

  • inspect the data in the Redis instances to see whether some GC-like process is needed to trim unused keys. I suspect we have quite a bit of useless data there.

As far as keys go, @elukey's commands show that only one instance (rdb1003:6378) has many keys with no TTL set. It's one of the six I've counted above as "would break". We should definitely have a look at that and see why it happens; maybe we can find a bug and fix it. That seems like low-hanging fruit.

@akosiaris: I would also look at it in a different way: is it normal that the avg_ttl is so high for the job queue keys? If I got it right those are seconds, so some keys are definitely not going to expire for a very long time.

Change 343027 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet] redis: Increase the hard limit for slave output buffer

https://gerrit.wikimedia.org/r/343027

Recapping the IRC discussion: the average TTL is around 13 days (the values are in milliseconds), which seems quite reasonable given the jobqueue history.
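A quick sanity check of that conversion, using the avg_ttl reported for rdb1007:6379 above (milliseconds to days):

$ echo $(( 1160204506 / 1000 / 86400 ))
13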

Mentioned in SAL (#wikimedia-operations) [2017-03-20T11:39:51Z] <akosiaris> return rdb1007 client-output-buffer-limit config to initially configured value T159850

Mentioned in SAL (#wikimedia-operations) [2017-03-20T14:32:23Z] <akosiaris> disable puppet on all rdb* nodes to shepherd https://gerrit.wikimedia.org/r/343027 into production. T159850

Change 343027 merged by Alexandros Kosiaris:
[operations/puppet] redis: Increase the hard limit for slave output buffer

https://gerrit.wikimedia.org/r/343027

Change 343669 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet] Change the redis slave client_output_buffer_limit

https://gerrit.wikimedia.org/r/343669

Change 343669 merged by Alexandros Kosiaris:
[operations/puppet] Change the redis slave client_output_buffer_limit

https://gerrit.wikimedia.org/r/343669

Mentioned in SAL (#wikimedia-operations) [2017-03-20T17:35:52Z] <akosiaris> slow rolling restart of redis databases in codfw T159850

The Redis databases in codfw are picking up the change as we speak. I'll be monitoring and will do eqiad tomorrow during the European morning.

akosiaris claimed this task.

It seems we may have resolved this. I'll be resolving the task; we can always reopen it.

And reopening :-). Replication failed for rdb2005 and rdb2006 (it's a cascading replication, so a failure of rdb2005 is expected to break rdb2006). rdb2005 is under heavy IO load, so it might not be able to catch up in time. It seems like increasing the soft limit will help with this, as the failures on rdb1007 mention output memory buffer sizes of 300+ MB.

Change 344399 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Increase client_output_buffer_limit soft limit

https://gerrit.wikimedia.org/r/344399

Change 344399 merged by Alexandros Kosiaris:
[operations/puppet@production] Increase client_output_buffer_limit soft limit

https://gerrit.wikimedia.org/r/344399

Change 344411 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] Followup increasing client_output_buffer_limit

https://gerrit.wikimedia.org/r/344411

Change 344411 merged by Alexandros Kosiaris:
[operations/puppet@production] Followup increasing client_output_buffer_limit

https://gerrit.wikimedia.org/r/344411

With the above changes merged and shepherded into the fleet, I am going to re-declare this resolved (for now).

Mentioned in SAL (#wikimedia-operations) [2017-04-13T11:37:18Z] <elukey> temporary set config set client-output-buffer-limit "slave 2147483648 2147483648 60" on rdb1007:6379 to give time to rdb2005's replication to catch up - T159850

A bit of a recap to explain what happened today. While I was testing T125735#3177819 on rdb2005:6380, I noticed some Icinga alarms about rdb2005:6479 not being in sync with its master (rdb1007:6379).

I reverted my change just to be sure I wasn't the cause, and checked the logs on rdb1007/rdb2005. The connection between master and slave was dropped and the slave asked for a full resync, as usual.

The only value that made everything work again was the following:

6379
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 5368709120 5368709120 180 pubsub 33554432 8388608 60"

This is the last (successful) sync for rdb2005:

[10831] 13 Apr 13:02:01.017 # Connection with slave 10.192.48.44:6479 lost.
[10831] 13 Apr 13:02:01.858 * Connecting to MASTER 10.64.32.18:6379
[10831] 13 Apr 13:02:01.858 * MASTER <-> SLAVE sync started
[10831] 13 Apr 13:02:01.895 * Non blocking connect for SYNC fired the event.
[10831] 13 Apr 13:02:01.931 * Master replied to PING, replication can continue...
[10831] 13 Apr 13:02:02.003 * Trying a partial resynchronization (request ed58b7dcc8d466f882c5c2933345db60f2fb5221:2981544255531).
[10831] 13 Apr 13:02:02.040 * Full resync from master: ed58b7dcc8d466f882c5c2933345db60f2fb5221:2986219651912
[10831] 13 Apr 13:02:02.040 * Discarding previously cached master state.
[10831] 13 Apr 13:02:34.032 * MASTER <-> SLAVE sync: receiving 1653643937 bytes from master
[10831] 13 Apr 13:10:18.189 * MASTER <-> SLAVE sync: Flushing old data
[10831] 13 Apr 13:10:25.398 * MASTER <-> SLAVE sync: Loading DB in memory
[10831] 13 Apr 13:10:42.769 * MASTER <-> SLAVE sync: Finished with success

AFAICS the total amount of bytes that rdb1007 should have sent is 1653643937 (1.6 GB), which is confirmed by the size of the rdb file on rdb1007:

elukey@rdb1007:~$ ls -lh /srv/redis/
total 12G
-rw-r--r-- 1 redis redis  18K Jan 26 18:10 rdb1007-6378.aof
-rw-r--r-- 1 redis redis  11K Mar 23 16:15 rdb1007-6378.rdb
-rw-r--r-- 1 redis redis 4.4G Apr 13 13:17 rdb1007-6379.aof

-rw-r--r-- 1 redis redis 1.6G Apr 13 13:02 rdb1007-6379.rdb <----

-rw-r--r-- 1 redis redis 1.5G Apr 13 13:17 rdb1007-6380.aof
-rw-r--r-- 1 redis redis 414M Mar 23 16:15 rdb1007-6380.rdb
-rw-r--r-- 1 redis redis 899M Apr 13 13:17 rdb1007-6381.aof
-rw-r--r-- 1 redis redis 410M Mar 23 16:15 rdb1007-6381.rdb

I had tried the following values without success:

  • slave 2147483648 2147483648 60
  • slave 2536870912 2536870912 60
  • slave 3221225472 3221225472 180
  • slave 5368709120 5368709120 180

The file should have been created as part of the BGSAVE performed by the master (we no longer save rdb files periodically but only rely on AOF, AFAIK).

Clearly Redis sends a lot more data than what is contained in the rdb file, but why? Should we raise the output buffer limits again before the switchover?
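One thing that could help the next time this happens is watching the slave output buffer grow on the master while the sync is in flight, e.g. with CLIENT LIST (a sketch, auth omitted; obl/oll/omem are the same buffer fields that show up in the log snippets above):

watch -n5 'redis-cli -p 6379 CLIENT LIST | grep flags=S'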

Mentioned in SAL (#wikimedia-operations) [2017-04-13T16:55:37Z] <elukey> restored default value of client-output-buffer-limit on rdb1007:6379 - T159850

From http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Replication.Redis.Versions.html:

Redis Versions Prior to 2.8.22

Redis backup and synchronization in versions prior to 2.8.22, is a three-step process.

Fork, and in the background process, serialize the cluster data to disk. This creates a point-in-time snapshot.

In the foreground, accumulate a change log in the client output buffer.

Important
If the change log exceeds the client output buffer size, the backup or synchronization fails. For more information, see Ensuring You Have Sufficient Memory to Create a Redis Snapshot.

Finally, transmit the cache data and then the change log to the replica cluster.

So AFAICS the client output buffer size is not related to the size of the rdb but to the amount of commands landing on Redis from the moment it creates the rdb file onward.

It is still weird to think that during the last breakage Redis needed 4-5 GB to buffer commands after the snapshot.
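A back-of-the-envelope check based on the log above (hedged, since we don't have the actual replication stream rate at hand): the transfer plus load took about 520 seconds (13:02:02 to 13:10:42), so filling a 5 GB buffer in that window would imply a sustained replication stream of roughly

$ echo $(( (5 * 1024) / 520 ))
9

i.e. around 9-10 MB/s towards that single slave, which would indeed be surprising.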

Yes, it seems the client output buffer has nothing to do with the snapshot size (good to learn that).
Last time I checked, the issue was the soft limit and not the hard limit. See rOPUPb25fc62c1395 and rOPUP866d420f636. Back then it seems rdb2005 was unable to fetch all the data due to heavy load, with the 60-second timeout being reached while more than 200 MB (but less than 500 MB) sat in the buffer. That does not seem to be what happened this time around, though, judging from the rdb2005 and rdb1007 Grafana dashboards.
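(For reference, the directive takes three numbers per client class, and the hard/soft distinction is exactly what matters here: the hard limit disconnects the client as soon as the buffer exceeds it, while the soft limit only disconnects the client if the buffer stays above it for the configured number of seconds. Illustrative redis.conf syntax, with made-up values:)

# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
client-output-buffer-limit slave 512mb 200mb 60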

It is still weird to think that during the last breakage Redis needed 4-5 GB to buffer commands after the snapshot.

I can't help but wonder how EVALSHA and Lua blocking Redis are related to all of this (they may very well not be related at all, or might even be beneficial, acting as a limiter?).

Mentioned in SAL (#wikimedia-operations) [2017-04-18T14:04:35Z] <elukey> executed CONFIG SET appendfsync no on rdb2005:6479 to test if fsync stalls affect replication - T159850

Mentioned in SAL (#wikimedia-operations) [2017-04-18T14:11:48Z] <elukey> executed CONFIG SET appendfsync everysec (default) to restore defaults on rdb2005:6479- T159850

Mentioned in SAL (#wikimedia-operations) [2017-04-18T14:22:12Z] <elukey> executed config set client-output-buffer-limit "normal 0 0 0 slave 2147483648 2147483648 300 pubsub 33554432 8388608 60" on rdb2005:6749 as attempt to solve slave lagging - T159850

Now that I have put more thought into it: the client output buffer needs to cope with the amount of data arriving while rdb100X:YYYY sends an rdb snapshot to its codfw replica, so we should probably focus on why the rdb snapshots are so big (as Joe pointed out before).

We tried to inspect the Redis keys, but it is not that straightforward to pinpoint what is causing the massive size; more effort is needed :)

elukey@rdb2005:~$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)" -p 6479 --bigkeys -i 0.1

# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type.  You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).

[00.00%] Biggest string found so far 'incubatorwiki:jobqueue:htmlCacheUpdate:rootjob:00ae1b398f46300f3cd6910c08a6b4ad6a468c0d' with 14 bytes
[00.12%] Biggest list   found so far 'frwiktionary:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed' with 8 items
[00.30%] Biggest hash   found so far 'lawiki:jobqueue:wikibase-addUsagesForPage:h-data' with 1 fields
[01.07%] Biggest zset   found so far 'frwiki:jobqueue:cirrusSearchIncomingLinkCount:z-claimed' with 1 members
[01.48%] Biggest zset   found so far 'arwiki:jobqueue:cirrusSearchIncomingLinkCount:z-delayed' with 2 members
[01.50%] Biggest zset   found so far 'chapcomwiki:jobqueue:ThumbnailRender:z-abandoned' with 4 members
[01.56%] Biggest hash   found so far 'frwiktionary:jobqueue:RestbaseUpdateJobOnDependencyChange:h-data' with 8 fields
[02.33%] Biggest hash   found so far 'commonswiki:jobqueue:gwtoolsetUploadMediafileJob:h-data' with 367 fields
[02.46%] Biggest zset   found so far 'dewiki:jobqueue:cirrusSearchCheckerJob:z-delayed' with 27 members
[03.40%] Biggest list   found so far 'iswiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed' with 15 items
[04.40%] Biggest hash   found so far 'commonswiki:jobqueue:refreshLinks:h-idBySha1' with 8816 fields
[09.39%] Biggest list   found so far 'enwiktionary:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed' with 1984 items
[10.52%] Biggest zset   found so far 'metawiki:jobqueue:translationNotificationJob:z-abandoned' with 900 members
[11.36%] Biggest set    found so far 'jobqueue:aggregator:s-wikis:v2' with 946 members
[15.00%] Biggest list   found so far 'commonswiki:jobqueue:refreshLinks:l-unclaimed' with 8825 items
[25.70%] Biggest hash   found so far 'ocg_job_status' with 675119 fields
[35.17%] Sampled 1000000 keys so far
[70.34%] Sampled 2000000 keys so far

-------- summary -------

Sampled 2840970 keys in the keyspace!
Total key length in bytes is 233258629 (avg len 82.11)

Biggest string found 'incubatorwiki:jobqueue:htmlCacheUpdate:rootjob:00ae1b398f46300f3cd6910c08a6b4ad6a468c0d' has 14 bytes
Biggest   list found 'commonswiki:jobqueue:refreshLinks:l-unclaimed' has 8825 items
Biggest    set found 'jobqueue:aggregator:s-wikis:v2' has 946 members
Biggest   hash found 'ocg_job_status' has 675119 fields
Biggest   zset found 'metawiki:jobqueue:translationNotificationJob:z-abandoned' has 900 members

2840112 strings with 39761567 bytes (99.97% of keys, avg size 14.00)
77 lists with 27043 items (00.00% of keys, avg size 351.21)
2 sets with 1159 members (00.00% of keys, avg size 579.50)
596 hashs with 741372 fields (00.02% of keys, avg size 1243.91)
183 zsets with 3088 members (00.01% of keys, avg size 16.87)

ocg_job_status seems to be by far the biggest key, and I am pretty sure that running DEBUG OBJECT on it will show a huge serialized value.
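A couple of quick checks that should confirm that without scanning the whole keyspace (a sketch, auth omitted; note that DEBUG OBJECT's serializedlength is roughly the RDB-serialized size, not the in-memory footprint):

redis-cli -p 6379 HLEN ocg_job_status
redis-cli -p 6379 DEBUG OBJECT ocg_job_status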

@cscott Hi! Sorry to bring you into this thread, but I am wondering whether we could use any of the maintenance scripts listed in https://wikitech.wikimedia.org/wiki/OCG#Maintenance_scripts to check why the ocg_job_status hash is so big (and possibly trim it). It seems to be one of the likely reasons why our Redis RDB snapshots grew bigger than 1 GB, causing some issues while replicating to codfw.

This is not urgent now, since we are preparing for the MediaWiki switchover, but we might want to fix the issue in the next few days.

That's probably "as expected" behavior -- the ocg_job_status contains one entry for every cached PDF, and we cache quite a few of them. I'm not certain exactly what the "size" of the hash means for you -- I could review the fields that are stored in redis, perhaps some of them can be purged once the PDF generation is complete?

If you look at ocg.pdf.job_queue_length in https://graphite.wikimedia.org/ you'll see that the job queue is actually quite small. The ocg.pdf.status_objects metric is the # of cached PDFs, and that is consistently between 600k and 700k objects, with old entries expiring and being garbage collected about once a day.

Thanks a lot for the explanation @cscott! I'll try to add more details. The eqiad rdb shards replicate to codfw via the Redis protocol: a full sync consists of taking a snapshot (a Redis RDB file), sending it to codfw, buffering all the commands that arrive in the meantime, and finally sending those over too. This does not always happen, because Redis also offers partial resynchronization, but if the connection between master and slave is impaired for some reason, Redis forces the slave to ask for a full sync.

What we are observing now is that the eqiad Redis shards produce big snapshots (in this case 1.5 GB) that take time to reach codfw, which causes the buffer for incoming commands to fill up, leading to replication failures that cascade over time (this is explained better in https://redislabs.com/blog/the-endless-redis-replication-loop-what-why-and-how-to-solve-it/).
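(As an aside, the state of this dance can be watched from both ends with INFO replication; a sketch of the fields we typically look at, auth omitted:)

redis-cli -p 6379 INFO replication | grep -E 'role|slave'
redis-cli -p 6479 INFO replication | grep -E 'master_link_status|master_last_io_seconds_ago|master_sync_in_progress'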

From T159850#3190116 it seems that Redis uses quite a few bytes to store the ocg_job_status hash, and I suspect it contributes to the big RDB snapshot size. This might not be a new issue, but it is now causing replication failures and drawing ops' attention.

So my question is whether it is possible to reduce the number of elements in ocg_job_status, to see if the snapshots get smaller (confirming or refuting my theory).

Today Joe added some Redis instances to the ocg hosts to decouple OCG from the job queues, so hopefully we shouldn't see this error anymore. The recent alarms were caused mostly by rdb2005, which was a replica of rdb1007, the rdb host hosting the big ocg hash table.

I am inclined to close this task again and re-open it only if the issue recurs.

Mentioned in SAL (#wikimedia-operations) [2017-05-04T10:22:07Z] <elukey> executed DEL ocg_job_status on rdb1007:6379 (new ocg_job_status hash is stored on the ocg* hosts) - T159850