**Goals**
[] Test whether failover works and evaluate failover strategies.
First, check that mcrouter fails over to the gutter servers when a shard becomes unavailable. How does FailoverWithExptimeRoute work?
[] Check key integrity during and after a failover
Investigate what happens to the existing keys in a shard that was unavailable and is now back online. We would like to know whether it will serve stale keys, for instance. A way we can test this is by creating keys with either short o
[] Test how LRU behaves in buster
Memcached 1.5.x (buster) has a few changes, including how keys are evicted from memory. We would like to keep one (or more) shards down for a long period of time, have servers fail over to the gutter pool, gather metrics, and compare them with those of our memcached 1.4.x servers.
[] Test 'gutter proxies'
MediaWiki sets/gets some keys with the prefix `/*/mw-wan/`. Those keys are replicated from the primary to the secondary DC via mcrouter. To do so, we have defined a set of 4 mcrouter proxies located at the destination. We would like to have an extra set of "gutter proxies", i.e. another 4 mcrouter instances that a mcrouter in the primary DC can fail over to if one of the destination proxies is down. Note that each MediaWiki server runs one mcrouter instance.
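To make the first goal more concrete, here is a minimal sketch of what a failover route could look like. It is not the deployed config (that one is P10383). The field names (`normal`, `failover`, `failover_exptime`) follow our reading of the mcrouter route handle documentation and should be double-checked against it; the pool names, server addresses and the helper `build_failover_config` are made up for illustration.

```python
import json


def build_failover_config(main_servers, gutter_servers, failover_exptime=60):
    """Build a mcrouter-style config dict that fails over from 'main' to 'gutter'.

    Assumption: requests go to the 'normal' route first; on an error reply they
    are retried on 'failover', and update requests sent there get a TTL of
    'failover_exptime' seconds so gutter keys expire quickly.
    """
    return {
        "pools": {
            "main": {"servers": list(main_servers)},      # hypothetical pool name
            "gutter": {"servers": list(gutter_servers)},  # hypothetical pool name
        },
        "route": {
            "type": "FailoverWithExptimeRoute",
            "normal": "PoolRoute|main",
            "failover": "PoolRoute|gutter",
            "failover_exptime": failover_exptime,
        },
    }


config = build_failover_config(
    main_servers=["10.0.0.1:11211", "10.0.0.2:11211"],  # placeholder addresses
    gutter_servers=["10.0.1.1:11211"],                   # placeholder address
)
print(json.dumps(config, indent=2))
```

The short TTL on the failover side is what should limit how long a key lives in the gutter pool after the original shard is back, which is the behaviour the key integrity goal wants to verify.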
**Testing Environment**
* mwdebug1001: we have deployed a configuration where we instruct mcrouter to use the gutter pool when a shard fails, config.json: P10383
* We push iptables rules to block traffic to one specific or all memcached servers in the main pool, in order to cause connection errors
* mc-gp100[1-3]: gutter pool servers aka gutter pool cluster, running memcached 1.5.x on buster
* mediawiki-07 (beta): We generate traffic towards mwdebug1001 by going through a list of 90 URLs, 1 req/s
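The traffic generation described in the last item could be sketched roughly as below. This is not the script we actually run; `fetch_status` and `replay` are hypothetical names, and how requests get routed to mwdebug1001 is left out.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None on a connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx replies still carry a status code
    except OSError:
        return None      # DNS failure, refused connection, timeout, ...


def replay(urls, rounds=1, interval=1.0, fetch=fetch_status, sleep=time.sleep):
    """Request every URL in order, pacing requests to one per `interval` seconds.

    `fetch` and `sleep` are injectable so the pacing logic can be tested
    without network access.
    """
    results = []
    for _ in range(rounds):
        for url in urls:
            started = time.monotonic()
            results.append((url, fetch(url)))
            elapsed = time.monotonic() - started
            # Sleep only for the remainder of the interval, never a negative time.
            sleep(max(0.0, interval - elapsed))
    return results
```

Calling `replay(urls, rounds=10)` with the list of 90 URLs would produce about 15 minutes of traffic at 1 req/s; the returned `(url, status)` pairs can be compared between runs with and without blocked shards.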
We will be blocking traffic from mwdebug -> mc* and get metrics/data in the following cases:
* block random shards in random intervals
* block a shard for a long amount of time (e.g. 1 hour, 2 hours, 2 days)
* block multiple shards for a long amount of time (e.g. 1 hour, 2 hours, 2 days)
* block shards for a very long amount of time (1 week)
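For the "random shards in random intervals" case, a schedule could be generated along these lines. This is only a sketch: the function names are hypothetical, the rules assume memcached's default port 11211, and `REJECT` (immediate connection errors) is an assumption; `DROP` would produce timeouts instead. The commands are only built here, not executed.

```python
import random

MEMCACHED_PORT = 11211  # assumption: shards listen on memcached's default port


def block_cmd(ip, port=MEMCACHED_PORT):
    """iptables command (as an argv list) that rejects outgoing TCP to ip:port."""
    return ["iptables", "-A", "OUTPUT", "-p", "tcp", "-d", ip,
            "--dport", str(port), "-j", "REJECT"]


def unblock_cmd(ip, port=MEMCACHED_PORT):
    """iptables command that deletes the rule added by block_cmd."""
    return ["iptables", "-D", "OUTPUT", "-p", "tcp", "-d", ip,
            "--dport", str(port), "-j", "REJECT"]


def random_schedule(shard_ips, total_seconds, min_block, max_block,
                    min_gap, max_gap, rng=None):
    """Return non-overlapping (start, end, ip) blocks within total_seconds.

    One shard is blocked at a time; block lengths and the gaps between
    blocks are drawn uniformly from the given ranges.
    """
    if min_block <= 0:
        raise ValueError("min_block must be positive")
    rng = rng or random.Random()
    events = []
    t = rng.uniform(min_gap, max_gap)
    while True:
        duration = rng.uniform(min_block, max_block)
        if t + duration > total_seconds:
            break
        events.append((t, t + duration, rng.choice(shard_ips)))
        t += duration + rng.uniform(min_gap, max_gap)
    return events
```

Passing a seeded `random.Random` makes a schedule reproducible, which helps when lining up the block windows with the graphs afterwards.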
Additionally, we will run a similar test to observe how mcrouter behaves when failing over to a secondary set of proxies while replicating keys (aka gutter proxies).
Graphs and logs:
* https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached_gutter&var-instance=All
* https://logstash.wikimedia.org/goto/722d8bac06b235e38e36cdbeceeed92a
**Testing in Production Roadmap**
Initially, we want to test the failover function in production with minimal risk. The keys we can easily afford to lose without user impact are the keys we replicate from eqiad to codfw (`/*/mw-wan/` keys) via the proxies. We can then move forward with trying out the gutter pool cluster on the canary servers.
Current issues (non blocking):
[] the per-server TKO metric in the exporter does not seem to be working
[] investigate whether there are other failover metrics that we can use, and whether they have value