
Nutcracker needs to automatically recover from MC failure - rebalancing issues
Closed, ResolvedPublic

Description

Nutcracker (memcached proxy) needs to be tested with more failure modes and fixed to automatically recover from memcached failures without requiring manual intervention.

Also, since we need it for T83551, we need to understand how to "hot swap" servers so that changing one IP does not mean rebalancing all the keys in the cluster.

Event Timeline

chasemp created this task.Feb 5 2015, 7:58 PM
chasemp raised the priority of this task from to Needs Triage.
chasemp updated the task description.
chasemp added a subscriber: chasemp.
Restricted Application added a subscriber: Aklapper.Feb 5 2015, 7:58 PM
chasemp triaged this task as Medium priority.Feb 5 2015, 7:59 PM
chasemp set Security to None.
Joe added a subscriber: Joe.Feb 7 2015, 4:13 PM

Twemproxy docs state:
"Enabling auto_eject_hosts: ensures that a dead server can be ejected out of the hash ring after server_failure_limit: consecutive failures have been encountered on that said server. A non-zero server_retry_timeout: ensures that we don't incorrectly mark a server as dead forever especially when the failures were really transient. The combination of server_retry_timeout: and server_failure_limit: controls the tradeoff between resiliency to permanent and transient failures."

And also:
server_retry_timeout: The timeout value in msec to wait for before retrying on a temporarily ejected server, when auto_eject_host is set to true. Defaults to 30000 msec

We actually set:
server_failure_limit: 3
auto_eject_hosts: true

so in theory twemproxy should retry the connection after 30 seconds. In practice, we noticed that memcached requests kept failing until twemproxy was restarted.

We also set "timeout" to 250 msec, so a hanging dead connection could not have been responsible for this. It genuinely looks like a bug.
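For reference, here is a sketch of how the options discussed above sit together in a twemproxy pool definition (the pool name, listen address and server list are illustrative, not our production values):

```yaml
memcached_pool:               # illustrative pool name
  listen: 127.0.0.1:11212
  hash: md5
  distribution: ketama
  timeout: 250                # msec; fail fast instead of hanging on dead sockets
  auto_eject_hosts: true
  server_failure_limit: 3     # eject after 3 consecutive failures
  server_retry_timeout: 30000 # msec; the documented default
  servers:
   - 10.0.0.2:11211:1
   - 10.0.0.3:11211:1
```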

Joe renamed this task from Nutcracker needs to automatically recover from MC failure to Nutcracker needs to automatically recover from MC failure - rebalancing issues.Feb 9 2015, 9:07 AM
Joe claimed this task.
Joe raised the priority of this task from Medium to Unbreak Now!.
Joe updated the task description.
Joe added a comment.Feb 9 2015, 12:00 PM

So, for the rebalancing issue I just tested a workable workaround. Currently we define:

servers:
 - 10.0.0.2:11211:1
 - 10.0.0.3:11211:1
 - 10.0.0.4:11212:1

which constructs hashes based on host:port. It is, however, possible to define a label to do consistent hashing on instead; the trick that makes "hot swapping" possible is to use the host:port pairs as explicit labels as well - so

servers:
 - 10.0.0.2:11211:1 10.0.0.2:11211
 - 10.0.0.3:11211:1 10.0.0.3:11211
 - 10.0.0.4:11212:1 10.0.0.4:11212

so that if we need to insert a new server from a different VLAN and want to "hot swap" it rather than do a full rebalance of the cluster, we just need:

servers:
 - 10.0.0.2:11211:1 10.0.0.2:11211
 - 10.100.12.43:11211:1 10.0.0.3:11211
 - 10.0.0.4:11212:1 10.0.0.4:11212

(so leaving the "label" intact). I experimented with this in labs and it works: switching back and forth between backends, where I had previously injected a different value for the same key, returned the respective value from each server. Changing just the IP without setting labels made the key unfetchable (due to the cluster rebalancing).

Joe added a comment.Feb 9 2015, 12:42 PM

Changing the label of 1 out of 4 servers in a cluster (this would be equivalent to changing one IP in our current configuration) makes 30% of the keys unavailable.

So projecting from this result, we could expect that around 7-10% of the keys will become unavailable for each swap.

I think this is a price we can afford to pay, if we use this occasion to configure things correctly (i.e. introduce labels for all new servers).
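The remapping effect measured above can be sketched with a toy ketama-style ring in Python. This is not nutcracker's actual hashing code; the labels, virtual-point count and key count are made up for illustration. It shows why swapping an IP behind an unchanged label moves nothing (the ring is built from labels only), while changing a label moves a chunk of the keyspace:

```python
import hashlib
from bisect import bisect

def _hash(s):
    """Map a string to a point on the ring via MD5 (stand-in hash)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(labels, points_per_server=160):
    """Place virtual points for each server label on a sorted ring."""
    pts = sorted((_hash(f"{label}-{i}"), label)
                 for label in labels for i in range(points_per_server))
    return [h for h, _ in pts], [owner for _, owner in pts]

def locate(ring, key):
    """A key belongs to the owner of the next ring point clockwise."""
    hashes, owners = ring
    return owners[bisect(hashes, _hash(key)) % len(hashes)]

keys = [f"key:{i}" for i in range(5000)]
old = build_ring(["10.0.0.2:11211", "10.0.0.3:11211", "10.0.0.4:11212"])
# Swapping the backend IP while keeping the label leaves the ring, and
# hence every key's placement, untouched by construction.  Changing the
# label (which an un-labelled IP swap effectively does) moves keys:
new = build_ring(["10.0.0.2:11211", "10.100.12.43:11211", "10.0.0.4:11212"])
moved = sum(locate(old, k) != locate(new, k) for k in keys)
print(f"{moved / len(keys):.0%} of keys changed owner")
```

The exact fraction depends on the hash and the number of servers, but it is always a substantial slice of the keyspace, matching the behaviour observed in labs.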

Joe added a comment.EditedFeb 9 2015, 3:05 PM

Testing again, this time for connection failures. I first populated a cluster with 1000 keys, then tried to cause network failures by issuing, on the client machine:

iptables -A OUTPUT -p TCP -d 10.0.0.4 --dport 11211 -j DROP

and then doing the same on one of the memcached servers

iptables -A INPUT -p TCP --dport 11212 -j DROP

and in both cases my python clients became incapable of reading any key from any machine in the cluster (!!!!).

This seems to hold only if you keep using the same python client without reconnecting to nutcracker - so this is a client flaw. In a better test, reconnecting for every request, I didn't notice such behaviour, so nutcracker itself behaves correctly:

  • It returns an error whenever the server is down
  • It re-inserts the server into rotation once a later automatic reconnection check succeeds.

I strongly suspect this may have to do with the php client library and how we use it as well.
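The reconnect-per-request behaviour that made the second test pass can be sketched as a small wrapper: on any transport error the client discards the connection and retries once on a fresh one, instead of reusing a possibly-poisoned socket. ReconnectingClient and FlakyTransport are hypothetical names; a real client would dial a socket to nutcracker rather than this stand-in transport:

```python
class ReconnectingClient:
    def __init__(self, connect):
        self._connect = connect   # factory returning a fresh transport
        self._conn = None

    def request(self, payload):
        for attempt in (1, 2):    # at most one retry, on a new connection
            if self._conn is None:
                self._conn = self._connect()
            try:
                return self._conn.send(payload)
            except OSError:
                self._conn = None # drop the broken transport
                if attempt == 2:
                    raise

class FlakyTransport:
    """Stand-in transport: the first connection made is broken, replacements work."""
    made = 0
    def __init__(self):
        FlakyTransport.made += 1
        self._broken = FlakyTransport.made == 1
    def send(self, payload):
        if self._broken:
            raise OSError("connection reset")
        return f"OK {payload}"

client = ReconnectingClient(FlakyTransport)
print(client.request("get key1"))  # first transport dies, the retry succeeds
```

A client that instead keeps the first (broken) transport forever reproduces the "errors for every key" symptom described above.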

Joe added a comment.Feb 9 2015, 4:06 PM

HHVM by default maintains a connection pool to memcached. In my test from the cli, I saw two connection threads.

Once I dropped connections to a backend, one of the two connections was left in a bad state, continuing to spit out errors for EVERY key request, while the other was left alone and continued to operate correctly (probably because its connection was never truncated in-flight).

All connections recovered after some time once I flushed the iptables rules.

So this still doesn't reproduce what we've seen live on the site. Such an investigation would probably need a) a full HHVM FastCGI setup and b) a suitable load.

faidon lowered the priority of this task from Unbreak Now! to High.Feb 19 2015, 6:10 PM
faidon added a subscriber: faidon.
Joe added a comment.Feb 20 2015, 7:30 AM

For the record: yesterday I deliberately inserted a server into the nutcracker prod config while it was still rebooting. On 80% of the cluster, the server was reachable within 30 seconds of coming online; on the remainder, a nutcracker restart was needed. So it seems this is actually a bug caused by some sort of race condition under high load, which is why it's hard to reproduce in the "lab".

RandomDSdevel added a subscriber: RandomDSdevel.
ori added a subscriber: ori.Jul 16 2015, 7:14 PM

@Joe Any update on this?

Restricted Application added a subscriber: Matanya.Jul 16 2015, 7:14 PM
Nemo_bis added a subscriber: Nemo_bis.
Joe added a comment.EditedJul 17 2015, 8:28 AM

@ori nope, I never had a way to reproduce the issue, so it wasn't possible to do any organized analysis.

We have repeated evidence of servers suddenly (and for no apparent good reason) ending up not speaking to backends. I'm pretty sure it's a bug in nutcracker but I had no occasion to pin it down properly.

ori added a comment.Jul 29 2015, 9:13 PM

It's because we don't set server_retry_timeout. Per Nutcracker's recommendation document, "a non-zero server_retry_timeout ensures that we don't incorrectly mark a server as dead forever especially when the failures were really transient".

Put differently: when server_retry_timeout is zero (which is the default), a transient failure to reach a server will cause it to be permanently ejected from the hash ring.
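In configuration terms the fix amounts to setting the option explicitly in each pool. Roughly (a sketch of the relevant keys only, not the actual patch):

```yaml
  auto_eject_hosts: true
  server_failure_limit: 3
  server_retry_timeout: 30000   # msec; retry a temporarily ejected server after 30s
```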

Change 227881 had a related patch set uploaded (by Ori.livneh):
nutcracker: prevent servers from being marked as dead indefinitely

https://gerrit.wikimedia.org/r/227881

Change 227881 merged by Ori.livneh:
nutcracker: prevent servers from being marked as dead indefinitely

https://gerrit.wikimedia.org/r/227881

ori closed this task as Resolved.Jul 30 2015, 5:08 AM
ori reopened this task as Open.Jul 30 2015, 5:48 AM

@Joe noticed that the default value in the source code (pace the documentation) was already 30k.

Joe removed Joe as the assignee of this task.Apr 4 2016, 7:49 AM
Joe lowered the priority of this task from High to Medium.
Joe added a comment.Apr 4 2016, 8:38 AM

I'm releasing this ticket because:

  1. I don't have ideas/time to work on it
  2. This has not been happening lately

elukey added a subscriber: elukey.Jul 22 2016, 3:16 PM
greg added a subscriber: greg.Sep 29 2016, 7:42 PM

This follow-up task from an incident report has not been updated recently. If it is no longer valid, please add a comment explaining why. If it is still valid, please prioritize it appropriately relative to your other work. If you have any questions, feel free to ask me (Greg Grossmeier).

elukey moved this task from Backlog to Ops Backlog on the User-Elukey board.May 2 2017, 4:28 PM
elukey closed this task as Resolved.Jan 7 2019, 3:20 PM
elukey claimed this task.

Closing this since nutcracker has been replaced by mcrouter. Please re-open if I am missing anything :)