We had an increase in traffic on our application servers at ~19:40, which possibly overwhelmed memcached host mc1020. Since mcrouter was already flapping between the host and the gutter pool, I firewalled the host until the event slowed down. We had alerts for both memcached and MediaWiki errors.
Appservers GET latency
{F32187021}
Increased TKOs
{F32187027}
Network utilisation on mc1020
{F32187025}
mc-gp1002 traffic
{F32187019}
It appears that bandwidth saturation started a bit earlier on this host, but I am not sure whether that is related.
{F32187023}
Errors in Kibana:
https://logstash.wikimedia.org/goto/81369345cdbc9e9a34283ba26381df36
I also disabled puppet and downtimed memcached on icinga; those changes need to be reverted in the EU morning. Firewalling probably helped, but I suspect that the traffic went down as well.