Since 16th Dec ~13:50, mw-mcrouter in eqiad has been logging errors connecting to memcached servers.
eqiad:
https://logstash.wikimedia.org/goto/cdcc1d4a5bf1e57824c455ddc81bdd6e
codfw:
https://logstash.wikimedia.org/goto/71141b2137a4edae3156b364e01438db
This is quite odd, given that codfw is handling a lot more traffic.
From mcrouter's PoV:
eqiad: https://grafana.wikimedia.org/goto/efg76kg9965tsa?orgId=1
codfw: https://grafana.wikimedia.org/goto/afg76m6hulblsd?orgId=1
It is also visible here that mcrouter is recording noticeably more TKOs.
The impact is that, due to those connection (?) errors, eqiad is switching to the gutter pool. Notably, eqiad is serving much less traffic than codfw, yet we will soon be switching to eqiad as part of the March DC switchover.
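For readers less familiar with the TKO and gutter pool interplay, here is a minimal, simplified sketch of the idea: after enough consecutive failures a destination is marked TKO and requests are routed to the gutter pool instead. This is not mcrouter's actual implementation; the class names, the `failures_until_tko` threshold, and the recovery rule (a single success clears the TKO) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Destination:
    """A hypothetical memcached destination tracked by the router."""
    name: str
    consecutive_failures: int = 0
    tko: bool = False


class FailoverRouter:
    """Toy model: route to the primary unless it is marked TKO."""

    def __init__(self, primary: Destination, gutter: Destination,
                 failures_until_tko: int = 3) -> None:
        # failures_until_tko is an illustrative threshold, not a real setting.
        self.primary = primary
        self.gutter = gutter
        self.failures_until_tko = failures_until_tko

    def record_result(self, ok: bool) -> None:
        """Update the primary's health after a request or probe."""
        if ok:
            # Simplification: one success clears the TKO state.
            self.primary.consecutive_failures = 0
            self.primary.tko = False
            return
        self.primary.consecutive_failures += 1
        if self.primary.consecutive_failures >= self.failures_until_tko:
            self.primary.tko = True

    def route(self) -> str:
        """Return the name of the destination that would serve the request."""
        return self.gutter.name if self.primary.tko else self.primary.name


router = FailoverRouter(Destination("mc-primary"), Destination("mc-gutter"))
for _ in range(3):
    router.record_result(ok=False)  # simulated connection errors
print(router.route())
```

In this model, a burst of connection errors is enough to push traffic to the gutter pool until the primary answers again, which matches the pattern described above.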
Note: I created a Grafana dashboard that combines various metrics that could help: https://grafana.wikimedia.org/goto/efggyhi41r18gf?orgId=1
Note 2: Some errors are due to T374366: Race condition in iptables rules during puppet runs on k8s nodes, while others are due to occasional OOMs.















