In T203786 we discovered that mcrouter does not behave as we expected when a shard is marked as temporarily down because too many timeouts were registered. We are used to nutcracker's behavior, namely removing the shard from consistent hashing transparently to MediaWiki. mcrouter does not do that, as confirmed in https://github.com/facebook/mcrouter/issues/271.
This is what happens with mcrouter:
- Three 1s timeouts are hit for a specific shard, so it is marked as TKO and no more traffic is routed to it as protection.
- GET/SET/etc. requests for keys handled by the failing shard are not re-hashed elsewhere, so they fail immediately with errors for as long as the shard is marked as TKO.
- mcrouter waits 3s before starting to send health checks to the shard, and it starts sending traffic back to it only when the first health check passes.
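
To make the difference concrete, here is a minimal Python sketch of the two behaviors described above. It is a toy model, not mcrouter or nutcracker code: the `Shard` class, the `route_*` functions, and the modulo hashing in `pick` are hypothetical stand-ins (real deployments use consistent hashing), and the only values taken from this task are the three-timeout threshold and the 3s delay before health checks.

```python
import hashlib

TKO_THRESHOLD = 3    # timeouts before a shard is marked TKO (from this task)
PROBE_DELAY_S = 3.0  # delay before health checks start (from this task)


class Shard:
    """Toy model of one memcached shard as seen by the proxy."""

    def __init__(self, name):
        self.name = name
        self.timeouts = 0
        self.tko_since = None  # time the shard was marked TKO, None if healthy

    @property
    def is_tko(self):
        return self.tko_since is not None

    def record_timeout(self, now):
        # Mark the shard TKO once enough timeouts have been seen.
        self.timeouts += 1
        if self.timeouts >= TKO_THRESHOLD and not self.is_tko:
            self.tko_since = now

    def probe(self, now, healthy):
        # Health checks only start after PROBE_DELAY_S; the first
        # successful one clears the TKO state.
        if self.is_tko and now - self.tko_since >= PROBE_DELAY_S and healthy:
            self.tko_since = None
            self.timeouts = 0


def pick(key, shards):
    # Assumption: plain modulo hashing as a stand-in for consistent hashing.
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return shards[digest % len(shards)]


def route_mcrouter_style(key, shards):
    # The key stays pinned to its shard; while the shard is TKO the
    # request fails immediately instead of being re-hashed.
    shard = pick(key, shards)
    if shard.is_tko:
        raise RuntimeError(f"{shard.name} is TKO, request for {key!r} fails")
    return shard.name


def route_nutcracker_style(key, shards):
    # The failing shard is dropped from the pool and the key is
    # re-hashed over the remaining shards.
    live = [s for s in shards if not s.is_tko]
    return pick(key, live).name
```

In this model, once a shard is TKO every key that hashes to it raises in `route_mcrouter_style` until a probe succeeds, whereas `route_nutcracker_style` keeps serving those keys from another shard, at the cost of cache misses and possibly stale data once the original shard comes back.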
Opening this task to discuss the solutions proposed in Facebook's GitHub issue, or to come up with a different strategy for handling memcached timeouts.