
mcrouter does not remove a memcached shard from consistent hashing when timeouts happen
Open, Normal, Public

Description

In T203786 we discovered that mcrouter does not work as we thought when a shard is marked as temporarily down after too many registered timeouts. We were used to nutcracker's behaviour, namely removing the shard from consistent hashing transparently to MediaWiki; mcrouter does not do that, as confirmed in https://github.com/facebook/mcrouter/issues/271.

This is what happens with mcrouter:

  • Ten 1s timeouts are reached for a specific shard, so it is marked as TKO and, as a protection, no more traffic is routed to it.
  • All the GETs/SETs/etc. for the keys handled by the failing shard are not re-hashed elsewhere, so during this timeframe they immediately lead to errors.
  • mcrouter waits 3s before starting to send health checks to the shard, and it starts sending traffic back to it only when the first health check passes (the relevant startup options are sketched below).
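
For context, these thresholds are mcrouter startup options rather than part of the route config. A minimal invocation sketch, assuming the flag names from mcrouter's command-line options documentation (values are illustrative, not a statement of our production settings, and should be double-checked against mcrouter --help):

  mcrouter -p 11213 \
      --config-file=/etc/mcrouter/config.json \
      --timeouts-until-tko=10 \
      --probe-timeout-initial=3000    # ms before the first TKO health-check probe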

Opening this task to discuss the solutions proposed in the Facebook GitHub issue, or to come up with a different strategy for handling memcached timeouts.

Event Timeline

elukey triaged this task as Normal priority. Nov 7 2018, 7:41 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Nov 7 2018, 7:41 AM

In the Facebook GitHub issue the following was mentioned:

So the recommended solution is to have a separate pool of servers for backup (a.k.a. "gutter"). It can be a much smaller pool. The failover route should be set with two children - the first child would be the normal (full size) pool, and the second child would be the gutter (small pool) pool route. This way, when a single host fails (either with TKO or any other error), mcrouter will fall back to the gutter pool, and there, the key will be consistently hashed to one of those servers. There are some consistency issues though like values lingering in the gutter pool. One way to address that is by using FailoverWithExptimeRoute.

The idea is to have a separate pool of memcached shards dedicated only to failover when a TKO happens, completely separate from the main pool. This suggestion could be a good one to test and possibly implement in our production environment, but it needs to be discussed more broadly first :)
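
For illustration, a minimal sketch of what such a config could look like, based on mcrouter's documented FailoverWithExptimeRoute (the pool names, server addresses and the 600s failover expiry are made-up placeholders, not a proposed production layout):

  {
    "pools": {
      "main":   { "servers": ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"] },
      "gutter": { "servers": ["10.0.1.1:11211"] }
    },
    "route": {
      "type": "FailoverWithExptimeRoute",
      "normal": "PoolRoute|main",
      "failover": ["PoolRoute|gutter"],
      "failover_exptime": 600
    }
  }

Here failover_exptime caps how long values written to the gutter pool live, which is the mechanism the upstream comment suggests for limiting stale data lingering there after the main shard comes back.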

aaron added a comment. Nov 20 2018, 8:04 PM

It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Is there existing hardware for this, or would that need an order?

If we do go this route, we should have monitoring/alerting on access to the gutter boxes.

> It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.
> Is there existing hardware for this, or would that need an order?

I am going to wait for @Joe to open a new task or update this one with a more precise list of things to do before thinking about new hardware; we absolutely need to test mcrouter extensively and decide on the config to use :)

> If we do go this route, we should have monitoring/alerting on access to the gutter boxes.

Definitely, and mcrouter offers metrics about how many times failover has been used, for example (I am not sure the Prometheus exporter collects them; if not, we will need to send a pull request upstream). This is also part of the discussion that I mentioned above :)
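
As a quick way to see which counters mcrouter itself exposes, a small Python sketch that sends the plain-text stats command to a local mcrouter and prints anything failover-related; the 127.0.0.1:11213 endpoint is an assumption, and the exact counter names should be read off a running instance rather than taken from here:

  import socket

  def router_stats(host="127.0.0.1", port=11213):
      """Send the ASCII 'stats' command and return the STAT lines as a dict."""
      with socket.create_connection((host, port), timeout=2) as sock:
          sock.sendall(b"stats\r\n")   # "stats all" may be needed for the full counter set
          buf = b""
          while not buf.endswith(b"END\r\n"):
              chunk = sock.recv(4096)
              if not chunk:
                  break
              buf += chunk
      stats = {}
      for line in buf.decode().splitlines():
          if line.startswith("STAT "):
              _, key, value = line.split(" ", 2)
              stats[key] = value
      return stats

  if __name__ == "__main__":
      # mcrouter speaks the same plain-text protocol as memcached, so 'stats'
      # works against its listening port; filter for failover-related counters.
      for key, value in sorted(router_stats().items()):
          if "failover" in key:
              print(key, value)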

Thanks for the suggestions!

elukey moved this task from Backlog to Stalled on the User-Elukey board. Dec 5 2018, 2:10 PM
jijiki added a subscriber: jijiki. Dec 17 2018, 2:38 PM
elukey moved this task from Stalled to Backlog on the User-Elukey board. Feb 27 2019, 9:17 AM

Very interesting reading about FailoverRoute and deletes: https://github.com/facebook/mcrouter/issues/105

elukey moved this task from Backlog to Stalled on the User-Elukey board. Apr 26 2019, 8:53 AM
elukey updated the task description. Jun 14 2019, 10:04 AM

The parent task has been completed; we now have far fewer, and only sporadic, TKO events and requests to memcached dropped by mcrouter.

We should now have a discussion (probably between the Performance and SRE teams) about what kind of solution is acceptable in the medium to long term. There was an interesting discussion in Dublin during the SRE Summit about using mcrouter as an example of a "service" that could be handled in a more "formal" SRE way, for example by setting an SLO and an error budget (https://landing.google.com/sre/sre-book/chapters/service-level-objectives/). The error budget could be a driving factor in deciding whether or not the gutter pool is needed, or whether sporadic events are tolerable (and in these events we should include reboots, hardware failures, etc., not only application problems).
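
To make the error-budget idea concrete, a back-of-the-envelope sketch in Python; the 99.9% target, the 30-day window and the 10-minute TKO are purely illustrative assumptions, and picking the real numbers is exactly what the proposed discussion would be for:

  # Illustrative only: the SLO target and window are assumptions, not agreed values.
  slo_target = 0.999                 # e.g. "99.9% of memcached requests succeed"
  window_minutes = 30 * 24 * 60      # 30-day rolling window
  budget_minutes = (1 - slo_target) * window_minutes
  print(f"error budget: {budget_minutes:.0f} minutes of full unavailability per window")

  # With the 18 shards listed later in this task, a single shard being TKO for
  # 10 minutes affects roughly 1/18 of the keyspace, i.e. it spends about
  # 10/18 ≈ 0.6 "full outage" minutes of the ~43-minute budget.
  shards, tko_minutes = 18, 10
  print(f"one {tko_minutes}-minute single-shard TKO ≈ {tko_minutes / shards:.1f} budget minutes")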

I would be interested in having a chat with the Performance and SRE teams together; let me know if you like the idea or not.

aaron added a comment. Jun 17 2019, 8:20 AM

Some sort of meeting sounds reasonable.

What I'd like to discuss in the meeting (or even in here) is the following:

> It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.

Intuitively I agree, but I don't have a clear picture of the failure scenarios and of how bad they would be for stability. For example, let's imagine the following scenarios:

  1. mc1029 (or any shard) is down for maintenance (hardware failure, reboots, etc.). How long would it be acceptable to wait before doing something like removing it from mcrouter's config? And would that even be an acceptable solution? It would probably cause a re-hash of a lot of keys..
  2. How many shard failures can we tolerate before starting to hit problems (for example, unsustainably increased database load)?
  3. How slow should a rolling restart of memcached be in case of maintenance (software upgrade, reboot of the hosts for a kernel upgrade, etc.)?

Adding also @Marostegui to the conversation as an FYI, since he might be interested in participating (I briefly mentioned the subject to him during the SRE offsite).

Joe added a comment. Jun 17 2019, 8:58 AM

> What I'd like to discuss in the meeting (or even in here) is the following:
>
>> It definitely seems like something worth doing. Having the potential for high use cache keys becoming unusable for undefined periods of time is too much of a stability concern.
>
> Intuitively I agree, but I don't have a clear picture of the failure scenarios and of how bad they would be for stability. For example, let's imagine the following scenarios:
>
>   1. mc1029 (or any shard) is down for maintenance (hardware failure, reboots, etc.). How long would it be acceptable to wait before doing something like removing it from mcrouter's config? And would that even be an acceptable solution? It would probably cause a re-hash of a lot of keys..
>   2. How many shard failures can we tolerate before starting to hit problems (for example, unsustainably increased database load)?
>   3. How slow should a rolling restart of memcached be in case of maintenance (software upgrade, reboot of the hosts for a kernel upgrade, etc.)?
>
> Adding also @Marostegui to the conversation as an FYI, since he might be interested in participating (I briefly mentioned the subject to him during the SRE offsite).

Regarding 1 - that should be based solely on the error budget we set. Regarding 2 - in the past we were able to recover from a full cache wipe within one hour, but I guess we can't tolerate more than one consistent shard failure, and that it should be resolved in a timely manner. As for 3, I'd say as a rule of thumb we'd need to wait for a restarted server to get back to its usual cache hit rate before we restart the next one.
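
As an aside, the "wait for the hit rate to recover" gate could be scripted against memcached's standard get_hits/get_misses counters; a minimal Python sketch, assuming direct access to each shard's memcached port (hostname, port, threshold and polling interval are placeholders):

  import socket
  import time

  def hit_rate(host, port=11211):
      """Return get_hits / (get_hits + get_misses) from memcached's 'stats' output.

      The counters are cumulative since the process started, which is what we
      want for a freshly restarted shard."""
      with socket.create_connection((host, port), timeout=2) as sock:
          sock.sendall(b"stats\r\n")
          buf = b""
          while not buf.endswith(b"END\r\n"):
              chunk = sock.recv(4096)
              if not chunk:
                  break
              buf += chunk
      stats = dict(line.split(" ", 2)[1:] for line in buf.decode().splitlines()
                   if line.startswith("STAT "))
      hits, misses = int(stats["get_hits"]), int(stats["get_misses"])
      return hits / (hits + misses) if hits + misses else 0.0

  def wait_for_warmup(host, target=0.90, poll_seconds=60):
      """Block until the restarted shard's cumulative hit rate crosses the target."""
      while hit_rate(host) < target:
          time.sleep(poll_seconds)

  # wait_for_warmup("mc1029.eqiad.wmnet")   # placeholder hostname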

Joe added a comment. Jun 17 2019, 9:02 AM

> Some sort of meeting sounds reasonable.

We will first need to write down the SLIs/SLOs and gather some data about them, so that we can get into the meeting with reasonable numbers regarding the expected error budget. We will also need to come to an agreement about capacity.

So I'd postpone organizing it until the Service Operations team is able to form a meaningful proposal.

elukey moved this task from Stalled to Mcrouter/Memcached on the User-Elukey board. Jul 5 2019, 7:02 AM

Adding some info about racking: https://netbox.wikimedia.org/search/?q=mc10&obj_type=

  • mc1019-23 are in rack A6
  • mc1024-27 are in rack B6
  • mc1028-32 are in rack C5
  • mc1033-36 are in rack D4

This means that a rack-down event could bring down four or five shards at the same time. We would probably survive, but it is not clear to me what the impact on databases etc. would be in that case.
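
As a rough back-of-the-envelope estimate (assuming keys are spread evenly across the 18 shards listed above): losing the five shards in rack A6 or C5 would make roughly 5/18 ≈ 28% of the keyspace unavailable at once, and without a gutter pool all of those requests would turn into errors/misses and fall through to the databases for the duration of the outage.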

elukey added a subtask: Unknown Object (Task). Sep 25 2019, 7:43 AM