
mcrouter memcached flapping in gutter pool
Closed, Resolved · Public

Description

During an incident on 2020-06-08 we observed the gutter pool taking over for a memcached server under stress, but we also saw flapping between that server and its gutter pool replacement as the server recovered and was then hit again. The flapping causes service disruptions, and we should investigate how to minimize it.

In the 2020-06-08 incident Giuseppe stopped the flapping by firewalling the affected memcached server mc1029 for the duration of the incident.

Memcache performance: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=1591588800000&to=1591592399000

Gutter Pool: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=1591588800000&to=1591592399000&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All

Mcrouter: https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1591588800000&to=1591592399000

Event Timeline

Adding some info about how mcrouter behaves at the moment :)

Every mcrouter makes independent decisions about which shards are "healthy" and which are not (TKO), using the following criteria:

  • If 10 consecutive 1s timeouts are registered (--timeouts-until-tko) for a specific mc10xx shard, it is marked as "TKO" and its traffic is shifted to the gutter pool.
  • If a shard is marked as TKO, mcrouter waits 3s (--probe-timeout-initial) before starting to send health probes to it. Once a health probe succeeds, the shard's TKO mark is cleared and traffic is restored (the gutter pool is abandoned).

The above settings were a compromise made before we had the gutter pool, to avoid marking a shard TKO too soon (and blackholing its traffic), something that used to happen very frequently.
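
As a rough sketch, those two knobs map to mcrouter startup flags roughly like this (config path and other flags omitted or illustrative; if I recall correctly --probe-timeout-initial is expressed in milliseconds, so 3s is 3000):

    # Sketch only: the TKO-related flags discussed above.
    mcrouter --config-file=/etc/mcrouter/config.json \
        --timeouts-until-tko=10 \
        --probe-timeout-initial=3000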

@RLazarus @CDanis maybe as an interim solution, while we think about a more "final" one, we could change --probe-timeout-initial from 3s to something like 30/60/300s, to avoid constant flaps if another outage occurs. What do you think?

Change 607026 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mediawiki::mcrouter_wancache: send probe after 60s

https://gerrit.wikimedia.org/r/607026

Summarizing here a conversation @elukey and I had in #wikimedia-serviceops:

Currently, when we fail over to the gutter pool (via FailoverWithExptimeRoute), we switch completely from our normal PoolRoute to the gutter PoolRoute. I proposed that instead we set the FailoverWithExptimeRoute's failure path to an AllFastestRoute that would send traffic both to the gutter pool and to the original host. That way the gutter pool continues to actually serve the traffic, but if the problem is that the main host can't handle the load, we keep knocking it over so that it remains in TKO until it's actually able to handle the traffic.
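
A rough sketch of the proposed wiring in mcrouter's JSON route-handle syntax (pool names are illustrative, not the actual production config):

    {
      "type": "FailoverWithExptimeRoute",
      "normal": { "type": "PoolRoute", "pool": "main-eqiad" },
      "failover": {
        "type": "AllFastestRoute",
        "children": [
          { "type": "PoolRoute", "pool": "gutter-eqiad" },
          { "type": "PoolRoute", "pool": "main-eqiad" }
        ]
      },
      "failover_exptime": 60
    }

As far as I understand, AllFastestRoute returns the fastest non-error reply and lets the other requests complete in the background, which is what would keep pressure on the original host while the gutter pool answers clients.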

Luca correctly pointed out two flaws in that approach:

  1. If the shard is TKO, mcrouter probably won't send it traffic anyway, even with the AllFastestRoute, so it would still recover and fail again as it currently does.
  2. Sending all that network traffic might cause harm beyond just the single memcache host.

To address flaw #1, I might dig into the code to see if there's a way to send traffic in this specific case anyway, e.g. a route handle that disregards the TKO. But flaw #2 is a lot more of a dealbreaker.

Change 607026 merged by Elukey:
[operations/puppet@production] profile::mediawiki::mcrouter_wancache: send probe after 60s

https://gerrit.wikimedia.org/r/607026

Mentioned in SAL (#wikimedia-operations) [2020-07-07T15:27:06Z] <elukey> root-tmux on cumin1001 - cumin 'c:profile::mediawiki::mcrouter_wancache' '/usr/local/sbin/restart-mcrouter' -b 2 -s 5 - roll restart of mw-mcrouter to pick up new settings - T255511

The change has been rolled out, but there is one use case whose behavior changed: the mcrouter proxies in codfw. All the eqiad mcrouters are configured to use 4 mw2* mcrouters (via TLS) in codfw as proxies to the mc2* memcached shards. For example:

A set of /*/mw-wan/somekey sent to an eqiad mcrouter will cause:

  • a set to one mc10xx shard (following consistent hashing of somekey)
  • a set to one mw2* host configured as a proxy, which in turn sends the set to a mc20xx shard (again following consistent hashing of somekey)

The main problem is that the mw2* proxies are treated like memcached shards by the eqiad mcrouters, so they are subject to the TKO policy as well, but they don't have a gutter pool (yet). This means that the change we applied, namely waiting a minute after a TKO before starting to probe a shard for availability/readiness, worsens the time to recover a bit. Since no gutter pool is configured for them, roughly a minute of "blackholed" traffic is likely for all the keys to be proxied when a TKO happens for a mw2* proxy.
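
To illustrate the difference (pool names are made up, and the real config is more involved), the write fan-out described above is conceptually something like a broadcast route whose second child is a plain PoolRoute over the codfw proxies, with no FailoverWithExptimeRoute and hence no gutter pool behind it:

    {
      "type": "AllSyncRoute",
      "children": [
        { "type": "PoolRoute", "pool": "eqiad-servers" },
        { "type": "PoolRoute", "pool": "codfw-proxies" }
      ]
    }

So when a mw2* proxy goes TKO, there is nothing to fail over to, and writes destined for codfw are dropped until the probes succeed again.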

Remaining things to do:

  1. think about a "proxy-gutter-pool" for codfw proxies (likely in another task)
  2. verify whether the one-minute delay works better than the previous value (3s), and decide whether we need to tune it further.

Today there has been a failover to the gutter pool:

https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&from=1595136404487&to=1595137240851

From the logs of one mcrouter instance:

Jul 19 05:30:35 mw1369 mcrouter[161684]: I0719 05:30:35.611946 161690 ProxyDestination.cpp:453] 10.64.48.158:11211 marked soft TKO. Total hard TKOs: 0; soft TKOs: 1. Reply: mc_res_timeout

Jul 19 05:31:51 mw1369 mcrouter[161684]: I0719 05:31:51.767637 161690 ProxyDestination.cpp:453] 10.64.48.158:11211 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok

The one-minute wait before probing the shard again seems to be working as expected: the shard was marked soft TKO at 05:30:35 and unmarked at 05:31:51, roughly 76 seconds later, which is consistent with the 60s initial probe delay plus the time for a probe to succeed. From the above graph it also seems that fewer isolated TKO spikes are registered, which is also good. I'm not sure what the best final value is (1 min is probably not it), but this seems like the right direction.

I am not very confident there is a "right" value, given that it will depend on the circumstances every time the gutter pool kicks in. Since mcrouter flapping is part of the equation, my opinion is that we should decide how much flapping we are OK with. If there are no objections, and since 1m worked well, we can consider increasing it to 5m until the next occurrence.

jijiki triaged this task as Medium priority. Jul 20 2020, 10:06 PM

I can't find any incident documentation for an incident on 2020-06-08, and I'm unclear on what problem was caused by mcrouter flapping. Was mc1029 slow, able to serve VERSION probes, but unable to serve a significant amount of real traffic? So while mc1029 was pooled, requests were handled slowly or gave errors?

The incident report was never published because certain details are still sensitive, but it's available here with a wikimedia.org login: https://docs.google.com/document/d/1SYwcIL9huhgb5JxCemFZk4OjOyE_0Hmu5qw5cZR_uhs/edit

mc1029 was TKO because it was being overloaded; the report doesn't say specifically what the bottleneck was, but my recollection is around that time it was typically network bandwidth. The gutter pool hosts had beefier NICs that could handle the traffic easily, but as soon as mc1029 was no longer being flooded, it was perfectly healthy -- hence the flapping until it was firewalled off.

Joe claimed this task.
Joe subscribed.

I think this task was completed. Feel free to reopen if that's not the case.