
High traffic on mc1020 (18 Aug)
Closed, Resolved, Public

Assigned To
Authored By
jijiki
Aug 17 2020, 9:48 PM
Referenced Files
F32187019: image.png
Aug 17 2020, 10:39 PM
F32187027: image.png
Aug 17 2020, 10:39 PM
F32187025: image.png
Aug 17 2020, 10:39 PM
F32187023: image.png
Aug 17 2020, 10:39 PM
F32187021: image.png
Aug 17 2020, 10:39 PM

Description

We had an increase in application server traffic at ~19:40, which possibly overwhelmed memcached host mc1020. Since mcrouter was already flapping between the host and the gutter pool, I firewalled the host until the event slowed down. We had alerts for both memcached and MediaWiki errors.

Appservers GET latency

image.png (1×3 px, 323 KB)

Increased TKOs

image.png (906×3 px, 1 MB)

Network utilisation on mc1020

image.png (536×1 px, 83 KB)

mc-gp1002 traffic

image.png (526×1 px, 78 KB)

It appears that bandwidth saturation started a bit earlier on this host, but I am not sure whether it is related.

image.png (1×3 px, 381 KB)

Errors in Kibana:
https://logstash.wikimedia.org/goto/81369345cdbc9e9a34283ba26381df36

I also disabled puppet and downtimed memcached on icinga; those changes need to be reverted in the EU morning. Firewalling probably helped, but I suspect that the traffic went down as well.

Event Timeline

jijiki renamed this task from mc1020 traffic to High traffic on mc1020 (18 Aug). Aug 17 2020, 10:39 PM
jijiki claimed this task.
jijiki triaged this task as Medium priority.
jijiki updated the task description.
jijiki updated the task description.
jijiki added a subscriber: elukey.

Since no analysis on the incoming keys was done, there is no way to know what the problem was. I'll uncordon mc1020, and monitor the situation, but I assume there isn't much else to do at this point.
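For future incidents, a rough sketch of what such a key analysis could look like, assuming we had a capture of incoming memcached text-protocol command lines (for example `get <key> [<key> ...]` or `set <key> <flags> <exptime> <bytes>`). No such capture exists for this event; the function names `count_keys` and `key_prefix`, the `:` separator, and the sample keys are all hypothetical and only for illustration.

```python
from collections import Counter
from typing import Iterable

# Memcached text-protocol commands that accept one or more keys.
MULTI_KEY_COMMANDS = {"get", "gets"}
# Commands whose first argument is a single key.
SINGLE_KEY_COMMANDS = {
    "set", "add", "replace", "append", "prepend",
    "cas", "delete", "incr", "decr", "touch",
}


def count_keys(command_lines: Iterable[str]) -> Counter:
    """Count how often each key appears in captured command lines."""
    counts: Counter = Counter()
    for line in command_lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        command = parts[0].lower()
        if command in MULTI_KEY_COMMANDS:
            counts.update(parts[1:])
        elif command in SINGLE_KEY_COMMANDS and len(parts) > 1:
            counts[parts[1]] += 1
        # Anything else (data blocks, unknown commands) is ignored.
    return counts


def key_prefix(key: str, separator: str = ":") -> str:
    """Return the part of the key before the first separator (assumed ':')."""
    return key.split(separator, 1)[0]


# Made-up sample input, not real traffic from mc1020.
sample = [
    "get examplewiki:page:1 examplewiki:page:2",
    "get examplewiki:page:1",
    "set otherwiki:user:7 0 300 5",
]
key_counts = count_keys(sample)
prefix_counts = Counter()
for key, n in key_counts.items():
    prefix_counts[key_prefix(key)] += n

print(key_counts.most_common(5))
print(prefix_counts.most_common(5))
```

Grouping by prefix as well as by full key would help distinguish a single hot key from a whole class of keys becoming hot, which are two different failure modes for one shard.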

jijiki moved this task from Unused 3 to Incoming 🐫 on the serviceops board.

We can close this for now