**Overview**
Our object caching service is based on mcrouter and memcached. Mcrouter is a memcached protocol router used to scale memcached deployments. Currently, each MediaWiki server runs its own mcrouter instance, and every instance is configured with the same pool of memcached servers.
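To illustrate why every router needs the same pool definition, here is a minimal sketch of key-to-shard mapping. It uses rendezvous hashing as a stand-in and is not mcrouter's actual hash function; the server names and the `shard_for` helper are hypothetical.

```python
import hashlib
from typing import Sequence

# Hypothetical server names; the real pool members are not listed on this page.
POOL = ["mc-a:11211", "mc-b:11211", "mc-c:11211", "mc-d:11211"]


def shard_for(cache_key: str, pool: Sequence[str]) -> str:
    """Return the server that owns cache_key, using rendezvous hashing."""

    def score(server: str) -> int:
        # Hash the (server, key) pair and read the first 8 bytes as an integer.
        digest = hashlib.sha256(f"{server}|{cache_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Every router that sees the same pool computes the same winner,
    # regardless of the order in which the servers are listed.
    return max(pool, key=score)
```

Because the mapping depends only on the key and the pool contents, any two routers with the same pool send a given key to the same shard. The flip side is that when a shard is unavailable, every key it owns is affected on every router at once.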
**Current Issues**
* When a shard becomes unavailable, we get TKOs, which cause latency problems (T203786, T208934, T239983)
* All memcached servers are on Debian Jessie, with its LTS support ending in June 2020
* Redis is co-located in the same set of servers
To address the above issues, we will initially introduce a secondary pool of memcached servers, called gutter servers, capable of temporarily replacing any unavailable servers. This functionality is provided by mcrouter. Once the gutter servers are in place and failover works, we can proceed with a rolling upgrade of all memcached servers to Debian Buster. Since there have been no major changes to the memcached protocol, we do not expect any major issues.
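The failover idea can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not mcrouter's implementation: `FakeMemcached`, `pick`, and `GutterRouter` are hypothetical names, the modulo hash is a stand-in, and no TTLs are modelled.

```python
import hashlib


class FakeMemcached:
    """In-memory stand-in for one memcached server (hypothetical, no TTLs)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

    def _check(self):
        if not self.up:
            raise ConnectionError(f"{self.name} is unavailable")

    def get(self, key):
        self._check()
        return self.data.get(key)

    def set(self, key, value):
        self._check()
        self.data[key] = value


def pick(key, servers):
    """Map a key to one server of a pool (simple modulo hash, for illustration)."""
    digest = hashlib.sha256(key.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class GutterRouter:
    """Send each request to the main pool, falling back to the gutter pool."""

    def __init__(self, main, gutter):
        self.main = main
        self.gutter = gutter

    def _route(self, op, key, *args):
        try:
            return getattr(pick(key, self.main), op)(key, *args)
        except ConnectionError:
            # The main shard is down: retry on the gutter pool.
            # If the gutter server is down too, the error propagates.
            return getattr(pick(key, self.gutter), op)(key, *args)

    def get(self, key):
        return self._route("get", key)

    def set(self, key, value):
        return self._route("set", key, value)
```

In this sketch, a key written before the failure is a miss on the gutter pool, which starts cold. Once the main shard comes back, it serves whatever value it held before the outage, even if a newer value was written to the gutter pool in the meantime. That behaviour is why key integrity during and after a failover needs testing.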
Another thing to take into account is the mcrouter proxies. We have 4 mw servers in each DC which are used to replicate specific keys (dictated by MediaWiki) from one datacentre to the other. We want to test the gutter pool functionality at the proxy level, i.e. define a secondary set of mcrouter proxy servers in each DC, to which mcrouter will fail over if a primary proxy is unavailable.
Lastly, developers are already working on completely retiring the use of Redis in MediaWiki, so there will be no need to worry about upgrading it.
**Action Plan**
[x] Test gutter pool servers in beta
[x] Test new memcached settings in beta
[x] Make relevant puppet changes to get gutter pool metrics
[x] Make relevant puppet changes to support memcached on Debian Buster
[] Make relevant puppet changes to support the gutter pool configuration
[x] Image 6 new gutter servers (3 in eqiad, 3 in codfw)
[] Enable and test the gutter pool in secondary datacentre (codfw)
[] Test proxy gutter pool in eqiad and/or codfw
[] Enable and test the gutter pool in primary datacentre (eqiad)
[] Roll upgrade to buster in secondary datacentre (codfw)
[] Roll upgrade to buster in primary datacentre (eqiad)
[] Test how LRU behaves in buster
**Testing Environment**
* mwdebug1001: we have deployed a configuration where we instruct mcrouter to use the gutter pool when a shard fails (TBA links to related tasks)
* config.json: P10383
* We push iptables rules to block traffic to a specific or all memcached servers from the main pool, so as to cause connection errors
* mc-gp100[1-3]: gutter pool servers, aka the gutter pool cluster, running memcached 1.5.x on buster
* mediawiki-07 (beta): we generate traffic towards mwdebug1001 by going through a list of 90 URLs at 1 req/s
**Goals**
* Test whether failover works, and evaluate failover strategies
* Check key integrity during and after a failover
Related tasks: T203786
Reads:
TBA
* https://github.com/facebook/mcrouter/wiki/Shadowing-setup