
mcrouter does not remove a memcached shard from consistent hashing when timeouts happen
Closed, ResolvedPublic

Description

In T203786 we discovered that mcrouter does not work as we thought when a shard is marked as temporarily down due to too many registered timeouts. We were used to nutcracker's behaviour, namely removing the shard from consistent hashing transparently to MediaWiki; mcrouter does not do that, as confirmed in https://github.com/facebook/mcrouter/issues/271.

This is what happens with mcrouter:

  • Ten 1s timeouts are reached for a specific shard, so it is marked as TKO and no more traffic is routed to it as protection.
  • All the GET/SET/etc.. for the keys handled by the failing shard are not re-hashed elsewhere, therefore they immediately lead to errors during this timeframe (see the sketch below).
  • mcrouter waits 3s before starting to send health checks to the shard, and it starts sending traffic back to it only when the first health check passes.
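
To make the difference with nutcracker's behaviour more concrete, here is a toy model of the two strategies (plain Python with a simplified MD5-based ring, not mcrouter's actual ch3/furc hashing; key names are made up):

```python
import bisect
import hashlib

SHARDS = [f"mc10{i}" for i in range(19, 37)]  # mc1019..mc1036

def build_ring(shards, vnodes=100):
    """Toy consistent-hash ring: sorted (hash point, shard) pairs."""
    return sorted(
        (int(hashlib.md5(f"{s}-{v}".encode()).hexdigest(), 16), s)
        for s in shards
        for v in range(vnodes)
    )

def lookup(ring, key):
    """Find the shard owning `key` on the ring."""
    point = int(hashlib.md5(key.encode()).hexdigest(), 16)
    idx = bisect.bisect([p for p, _ in ring], point)
    return ring[idx % len(ring)][1]

keys = [f"WANCache:v:example:{i}" for i in range(10000)]
full_ring = build_ring(SHARDS)
dead = "mc1029"

# mcrouter-like behaviour: the ring is unchanged, keys owned by the TKO'd
# shard simply error out until it recovers.
erroring = [k for k in keys if lookup(full_ring, k) == dead]

# nutcracker-like behaviour: the dead shard is ejected from the ring, so the
# same keys transparently re-hash to the surviving shards (as cache misses).
reduced_ring = build_ring([s for s in SHARDS if s != dead])
rehomed = [k for k in erroring if lookup(reduced_ring, k) != dead]

print(f"{len(erroring)}/{len(keys)} keys error under mcrouter's TKO")
print(f"{len(rehomed)}/{len(erroring)} of them would be served elsewhere by nutcracker")
```

In both cases roughly 1/18 of the keys are affected; the difference is whether they turn into cache misses on other shards or into hard errors.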

Opening this task to discuss the solutions proposed in the Facebook GitHub issue, or to come up with a different strategy for handling memcached timeouts.

Event Timeline

elukey triaged this task as Medium priority.Nov 7 2018, 7:41 AM
elukey created this task.

In the Facebook GitHub issue the following was mentioned:

So the recommended solution is to have a separate pool of servers for backup (a.k.a. "gutter"). It can be a much smaller pool. The failover route should be set with two children - the first child would be the normal (full size) pool, and the second child would be the gutter (small pool) pool route. This way, when a single host fails (either with TKO or any other error), mcrouter will fall back to the gutter pool, and there, the key will be consistently hashed to one of those servers. There are some consistency issues though like values lingering in the gutter pool. One way to address that is by using FailoverWithExptimeRoute.

The idea is to have a separate pool of memcached shards dedicated only to failover when a TKO happens, completely separate from the main pool. This suggestion could be a good one to test and possibly implement in our production environment, but it needs to be discussed more broadly first :)
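
To make the discussion more concrete, this is roughly what such a route could look like (field names taken from the upstream FailoverWithExptimeRoute documentation; pool names, hostnames and the 60s failover TTL are placeholders for discussion, not a tested production config):

```python
import json

# Sketch of the config mcrouter expects for the gutter-pool idea described in
# the GitHub comment above. Hostnames below are hypothetical.
gutter_config = {
    "pools": {
        "main": {
            # ...the full mc1019-mc1036 shard list in reality
            "servers": ["mc1019.eqiad.wmnet:11211", "mc1020.eqiad.wmnet:11211"],
        },
        "gutter": {
            # hypothetical hostnames for the small backup pool
            "servers": ["mc-gp1001.eqiad.wmnet:11211", "mc-gp1002.eqiad.wmnet:11211"],
        },
    },
    "route": {
        "type": "FailoverWithExptimeRoute",
        "normal": "PoolRoute|main",      # first child: the normal, full-size pool
        "failover": "PoolRoute|gutter",  # second child: the gutter pool
        "failover_exptime": 60,          # short TTL on the gutter side to limit stale values
    },
}

print(json.dumps(gutter_config, indent=2))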

It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Is there existing hardware for this, or would that need an order?

If we do go this route, we should have monitoring/alerting on access to the gutter boxes.

It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Is there existing hardware for this, or would that need an order?

I am going to wait for @Joe to open a new task or update this one with a more precise list of things to do before thinking about new hw; we absolutely need to test mcrouter extensively and decide the config to use first :)

If we do go this route, we should have monitoring/alerting on access to the gutter boxes.

Definitely, and mcrouter offers metrics about, for example, how many times failover has been used (but I am not sure if the Prometheus exporter gets them; if not, we'll need to send a pull request upstream). This is also part of the discussion that I mentioned above :)
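
For reference, something like the following could be used to see what mcrouter exposes over its memcached-protocol interface; the local port and the exact stat names are assumptions and need to be verified on a real mw host before wiring anything into the Prometheus exporter:

```python
import socket

def mcrouter_stats(host="127.0.0.1", port=11213):
    """Send `stats all` to a local mcrouter and parse STAT lines into a dict.

    Assumes mcrouter listens on `port` and answers memcached-style stats;
    adjust both to the actual setup.
    """
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(b"stats all\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    stats = {}
    for line in buf.decode().splitlines():
        if line.startswith("STAT "):
            _, name, value = line.split(" ", 2)
            stats[name] = value
    return stats

if __name__ == "__main__":
    # Stat names containing "failover"/"tko" are a guess about what to look for.
    for name, value in sorted(mcrouter_stats().items()):
        if "failover" in name or "tko" in name:
            print(name, value)
```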

Thanks for the suggestions!

The parent task has been completed; we now see far fewer and only sporadic TKO events and memcached requests dropped by mcrouter.

We should now have a discussion (probably between the Performance and SRE teams) about what kind of solution is acceptable in the medium to long term. There was an interesting discussion in Dublin during the SRE Summit about using mcrouter as an example of a "service" that could be handled in a more "formal" SRE way, setting an SLO and error budget for example (https://landing.google.com/sre/sre-book/chapters/service-level-objectives/). The error budget could be a driving factor in deciding whether or not the gutter pool is needed, or whether sporadic events are tolerable (in these events we should include reboots, hardware failures, etc.. not only application problems).
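
As a starting point for that discussion, the error budget math is simple enough to sketch; all the numbers below (SLO target, request rate) are placeholders, not measurements:

```python
# Back-of-the-envelope error budget math for the memcached/mcrouter path.
SLO_TARGET = 0.999          # e.g. 99.9% of memcached requests succeed (placeholder)
WINDOW_DAYS = 30
REQS_PER_SEC = 200_000      # placeholder cluster-wide request rate
SHARDS = 18                 # mc1019..mc1036

total_reqs = REQS_PER_SEC * 86400 * WINDOW_DAYS
budget_reqs = total_reqs * (1 - SLO_TARGET)

# With no gutter pool, one TKO'd shard fails roughly 1/SHARDS of all requests.
failing_rate = REQS_PER_SEC / SHARDS
budget_seconds = budget_reqs / failing_rate

print(f"error budget: {budget_reqs:.0f} failed requests per {WINDOW_DAYS} days")
print(f"~ {budget_seconds / 3600:.1f} hours of a single shard being TKO")
```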

I would be interested in having a chat with the Performance and SRE teams together, let me know if you like the idea or not.

Some sort of meeting sounds reasonable.

What I'd like to discuss in the meeting (or even in here) is the following:

It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Intuitively I agree, but I don't have a clear picture of what the failure scenarios are and how bad they would be for stability. For example, let's imagine the following scenarios:

  1. mc1029 (or any shard) is down for maintenance (hw failure, reboots, etc..). How long would it be acceptable to wait before thinking about doing something like removing it from mcrouter's config? Also, would that be an acceptable solution? It would probably cause a re-hash of a lot of keys..
  2. How many shard failures can we tolerate before starting to hit problems (for example, unsustainable increased db load, etc..)?
  3. How slow should a rolling restart of memcached be in case of maintenance (sw upgrade, reboot of the hosts for kernel upgrade, etc..)?

Adding also @Marostegui to the conversation as FYI since they might be interested in participating (I briefly mentioned the subject to him during the SRE offsite).

What I'd like to discuss in the meeting (or even in here) is the following:

It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Intuitively I agree, but I don't have a clear picture of what the failure scenarios are and how bad they would be for stability. For example, let's imagine the following scenarios:

  1. mc1029 (or any shard) is down for maintenance (hw failure, reboots, etc..). How long would it be acceptable to wait before thinking about doing something like removing it from mcrouter's config? Also, would that be an acceptable solution? It would probably cause a re-hash of a lot of keys..
  2. How many shard failures can we tolerate before starting to hit problems (for example, unsustainable increased db load, etc..)?
  3. How slow should a rolling restart of memcached be in case of maintenance (sw upgrade, reboot of the hosts for kernel upgrade, etc..)?

Adding also @Marostegui to the conversation as FYI since they might be interested in participating (I briefly mentioned the subject to him during the SRE offsite).

Regarding 1 - that should be based solely on the error budget we set. 2 - In the past we were able to recover from a full cache wipe within one hour, but I guess we can't tolerate more than one consistent shard failure, and that it should be solved in a timely manner. As far as 3 goes, I'd say as a rule of thumb we'd need to wait for a restarted server to get back to its usual cache hit rate before we restart the next one.

Some sort of meeting sounds reasonable.

We will first need to write down the SLIs/SLOs and gather some data about them, so that we can get into the meeting with reasonable numbers regarding the expected error budget. We will also need to come to an agreement about capacity.

So I'd postpone organizing it until the Service Operations team is able to form a meaningful proposal.

Adding some info about racking: https://netbox.wikimedia.org/search/?q=mc10&obj_type=

  • mc1019-23 are in rack A6
  • mc1024-27 are in rack B6
  • mc1028-32 are in rack C5
  • mc1033-36 are in rack D4

This means that a rack-down event could bring down 4 or 5 shards at the same time. We would probably survive, but it is not clear to me what the impact on databases/etc.. would be in this case.
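
A rough way to quantify it, assuming keys are spread evenly across the 18 shards:

```python
# Impact estimate if a whole rack goes down, based on the racking above.
racks = {
    "A6": ["mc1019", "mc1020", "mc1021", "mc1022", "mc1023"],
    "B6": ["mc1024", "mc1025", "mc1026", "mc1027"],
    "C5": ["mc1028", "mc1029", "mc1030", "mc1031", "mc1032"],
    "D4": ["mc1033", "mc1034", "mc1035", "mc1036"],
}
total = sum(len(hosts) for hosts in racks.values())

for rack, hosts in racks.items():
    print(f"rack {rack} down -> {len(hosts)}/{total} shards lost "
          f"(~{len(hosts) / total:.0%} of the keyspace erroring or re-hashed)")
```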

elukey added a subtask: Unknown Object (Task).Sep 25 2019, 7:43 AM
elukey added a subtask: Unknown Object (Task).Nov 22 2019, 3:51 PM

We have ordered the hosts, 3 for eqiad and 3 for codfw (see related subtasks).

The next step is to test in labs or a similar environment how mcrouter behaves with failover routes, and then, if everything goes as planned, add the new hosts as the failover pool.

One thing that I would do now, though, is add a failover pool to the codfw proxies before touching the eqiad ones. Currently, for an eqiad mcrouter on mw1*, there is no difference between a mc10XX target and a mw2XXX target, since both are capable of accepting/parsing commands following the memcached protocol. If a mw2XXX host acting as mcrouter proxy goes down for some reason (maintenance, hw failure, etc..) then all the mw1* hosts will mark it as TKO (due to the timeouts) and traffic will fail to replicate, generating exceptions, noise, etc.. We could add 4 mw2XXX hosts as a failover pool for the current codfw proxies, in order to make replication more resilient and also to test mcrouter failover in a less impactful scenario before changing the config of the eqiad traffic.
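
Sketching the idea (a plain FailoverRoute wrapping the current proxies; hostnames, ports and pool names below are placeholders, not the actual puppet-generated config):

```python
import json

# Replication path sketch: if a primary codfw proxy is TKO'd, the same
# replicated traffic is retried against a backup proxy instead of being dropped.
codfw_replication_route = {
    "pools": {
        # current proxies (placeholder hosts/ports)
        "codfw-proxies": {"servers": ["mw2255.codfw.wmnet:11211", "mw2256.codfw.wmnet:11211"]},
        # the 4 extra mw2XXX hosts mentioned above (placeholder hosts/ports)
        "codfw-proxies-backup": {"servers": ["mw2257.codfw.wmnet:11211", "mw2258.codfw.wmnet:11211"]},
    },
    "route": {
        "type": "FailoverRoute",
        "children": [
            "PoolRoute|codfw-proxies",
            "PoolRoute|codfw-proxies-backup",
        ],
    },
}

print(json.dumps(codfw_replication_route, indent=2))
```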

jijiki removed a subtask: Unknown Object (Task).Jan 5 2020, 7:58 PM
elukey claimed this task.

This can be closed in my opinion; we have already worked on the gutter pool, see the parent task and related subtasks :)