
rbf1001 and rbf1002 are timing out / dropping clients for Redis
Closed, ResolvedPublicPRODUCTION ERROR

Description

@Krenair poked -ops to ask if anyone knew that there were lots of redis failures in the logs on fluorine, such as:

Mar 13 00:39:45 mw1243:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:39:47 mw1241:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:39:48 mw1242:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:39:51 mw1240:  #012Warning: Failed connecting to redis server at 10.64.0.163: Connection timed out
Mar 13 00:40:23 mw1228:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:40:23 mw1161:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:40:42 mw1253:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out
Mar 13 00:40:43 mw1174:  #012Warning: Failed connecting to redis server at 10.64.0.163: Connection timed out
Mar 13 00:40:50 mw1234:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out

First instance from the current log at that time:

Mar 12 07:14:37 mw1079:  #012Warning: Failed connecting to redis server at 10.64.0.162: Connection timed out

The seemingly relevant log line from Redis is:

[15410] 13 Mar 00:36:16.094 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

Redis is essentially unable to keep up with AOF writes, and it is blocking / dropping clients at random. We have seen very little evidence of this in Icinga, but...

root@rbf1001:~# redis-cli --latency -h 127.0.0.1 -p 6379
min: 0, max: 91, avg: 11.14 (1184 samples)^C
root@rbf1001:~# redis-cli --latency -h 127.0.0.1 -p 6379
min: 0, max: 79, avg: 14.89 (805 samples)^C

That's not great, I think.
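For reference, a rough way to check whether AOF fsync stalls are actually the bottleneck on a box like rbf1001 (a sketch using standard redis-cli commands; nothing here is WMF-specific, and the exact INFO field set depends on the Redis version):

# aof_delayed_fsync is bumped each time Redis logs the
# "Asynchronous AOF fsync is taking too long" warning above.
redis-cli -h 127.0.0.1 -p 6379 INFO persistence | grep -E 'aof_enabled|aof_delayed_fsync|aof_rewrite_in_progress'

# Confirm which fsync policy is in effect (usually "everysec").
redis-cli -h 127.0.0.1 -p 6379 CONFIG GET appendfsync

# Sample latency repeatedly instead of a single run.
redis-cli -h 127.0.0.1 -p 6379 --latency-history -i 15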

Interesting links:

http://engineering.sharethrough.com/blog/2013/06/06/how-redis-took-us-offline-and-what-we-did-about-it/
https://github.com/antirez/redis/issues/641

Seems like these settings have been in place since September: https://phabricator.wikimedia.org/rOPUP69322ed601a0634ef694d922b7e17b5cadb086ca

From what I can gather on IRC, this is an efficiency mechanism that falls back to a standard lookup when Redis is unavailable, so I'm not sounding the alarm just yet, but these boxes are only going to degrade further.

Thanks @GWicke, @Legoktm, @Krenair, @bd808

Event Timeline

chasemp assigned this task to aaron.
chasemp raised the priority of this task to High.
chasemp updated the task description. (Show Details)
chasemp added a project: acl*sre-team.
chasemp added subscribers: chasemp, ori, GWicke and 3 others.
Restricted Application added a subscriber: Aklapper. · Mar 13 2015, 1:18 AM
demon raised the priority of this task from High to Unbreak Now! · Mar 13 2015, 3:30 PM
demon added a subscriber: demon.
greg added a subscriber: greg.

This is spammy spammy spammy:

for rbf1001 we've had 44200 error logs in the past 15 minutes
rbf1002 (I presume that's 10.64.0.163) is seeing fewer, but still ~60 in the last 15 minutes

greg set Security to None. · Mar 13 2015, 4:46 PM
greg added a subscriber: aaron.

gave aaron a call and left a voicemail

called ori, who understood the workload; he said he was minutes away from the office and would get into it

thanks @ori

for posterity

fluorine:~# grep '10.64.0.162: Connection timed out'  /a/mw-log/hhvm.log | wc -l
2241856

I disabled puppet with a reference to this issue, turned off AOF, and restarted Redis. It seems to have helped in one case but not much in the other.
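For posterity, roughly what those steps look like (a sketch; the disable message, config path, and service name are illustrative, not copied from the host):

# Disable puppet so it does not revert the config on its next run.
puppet agent --disable 'T92591: redis AOF stalls, appendonly turned off by hand'

# Turn AOF off in the rendered config and restart the instance.
sed -i 's/^appendonly yes/appendonly no/' /etc/redis/redis.conf
service redis-server restart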

Failures in the last 10K

root@fluorine:~# tail -n 10000 /a/mw-log/hhvm.log | grep 'Connection timed out' | grep 0.162 | wc -l
9968
root@fluorine:~# tail -n 10000 /a/mw-log/hhvm.log | grep 'Connection timed out' | grep 0.163 | wc -l
19

It seems the master/slave status is defunct, and if I understand correctly @aaron thinks:

that would make some sense, the code would get stuck in an update loop on 1/2 the filter checks

I'm not sure this could ever have been working, as the same redis role is applied in both cases, so both boxes are set up as masters that know nothing of each other?

node /^rbf100[1-2]\.eqiad\.wmnet/ {
    role db::redis
    include admin

    # Background save may fail under low memory condition unless
    # vm.overcommit_memory is 1.
    sysctl::parameters { 'vm.overcommit_memory':
        values => { 'vm.overcommit_memory' => 1, },
    }
}
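A quick way to see what each box currently thinks it is (plain redis-cli; run it locally on each host if the instances don't accept remote connections):

# A healthy pair would show role:master / connected_slaves:1 on rbf1001
# and role:slave / master_link_status:up on rbf1002.
redis-cli -h rbf1001.eqiad.wmnet -p 6379 INFO replication | grep -E 'role|connected_slaves'
redis-cli -h rbf1002.eqiad.wmnet -p 6379 INFO replication | grep -E 'role|master_host|master_link_status'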

@aaron seems to have turned this off for now. No connected clients and 0 error logs. Good odds that fixing the missing master/slave setup, along with the AOF settings on the "slave", will clear this up; we'll need to coordinate further.

I was told aaron went to lunch, so I guess this is on hold. This needs an incident report per @greg, and it should not be re-enabled until an ops person with access is around to coordinate, monitor, and debug.

chasemp added a comment (edited). · Mar 13 2015, 8:53 PM

side note, I think the replication config was moved to hiera and maybe this just isn't working?

https://phabricator.wikimedia.org/rOPUP457d58535e9b3e49e0eb7a91c42b76316f84c44f

The currently enforced slave config from puppet for rbf1002:

-slaveof 10.64.0.162 6379
+# slaveof <masterip> <masterport>

@Joe ideas?

Joe added a comment (edited). · Mar 16 2015, 2:25 PM

@chasemp what exactly doesn't work in the hiera change? I see the config on the rbf1002 server seems to be correct at the moment.

In T92591#1121388, @Joe wrote:

@chasemp what exactly doesn't work in the hiera change? I see the config on the rbf1002 server seems to be correct at the moment.

Caught up via IRC, but for posterity: the change in question was hard-coded from Friday's testing.

chasemp lowered the priority of this task from Unbreak Now! to High. · Mar 24 2015, 10:21 PM
Joe added a comment. · Mar 26 2015, 6:39 AM

I see appendonly is manually disabled on the host and puppet is disabled again. I am pretty sure I fixed the replication glitch in puppet... @chasemp can you update the status of this?

In T92591#1152373, @Joe wrote:

I see appendonly is manually disabled on the host and puppet is disabled again. I am pretty sure I fixed the replication glitch in puppet... @chasemp can you update the status of this?

I think puppet was never re-enabled and run after your fixes. I'm not sure what @aaron wants to do here; I had let it sit, as it was my understanding he is moving away from the master/slave model entirely? Really not sure, but this is assigned to him so I'll let him chime in.

In the meantime, I enabled and ran puppet and things seem normal.

Two thoughts about the rbf1002 changes from puppet:

Just a note: this will break if DNS ever has issues (previously I have used IPs for redis replication):

-slaveof 10.64.0.162 6379
+# slaveof <masterip> <masterport>
+slaveof rbf1001 6379
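If DNS did act up, the replication target can be re-pointed at the IP at runtime while puppet gets sorted out (a sketch; SLAVEOF is the standard command, but puppet would rewrite redis.conf on its next run):

# Check what the name resolves to first.
host rbf1001.eqiad.wmnet

# Re-point the slave at the master's IP directly.
redis-cli -h rbf1002.eqiad.wmnet -p 6379 SLAVEOF 10.64.0.162 6379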

If this is going to be a slave it seems wasteful to do persistence on it, and the value of persistence should probably be looked at for rbf1001 as well, IMO:

-appendonly no
+appendonly yes

Gilles added a subscriber: Gilles. · Apr 2 2015, 12:23 PM
bd808 moved this task from Backlog to Done on the MediaWiki-Core-Team board. · Apr 7 2015, 4:41 PM
mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:12 PM