Shortly after 18:30 today, mw-api-ext (eqiad) alerted for probe failures (page), php-fpm worker saturation, and elevated latency.
@CDanis prepared a timeline from the IRC log here: https://etherpad.wikimedia.org/p/2024-05-24
There are some interesting observations in there, including:
- Unusually high tx from a single memcache task in the 18:28-18:30 range, possibly due to a hot key (see the sketch after this list)
- OOM kills in the ~ 18:35 range, affecting both mw-api-ext and mw-jobrunner
- mwlog1002 alerted for packet loss during that window, possibly due to a large volume of logs
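For the hot-key hypothesis, a minimal sketch of how per-task tx could be pulled from Prometheus and ranked. The Thanos endpoint, metric name, and namespace label here are placeholders, not checked against our stack:

```
# Hedged sketch: rank memcache tasks by peak network tx over the incident
# window. Endpoint URL, metric, and label names below are assumptions.
import requests

PROM = "https://thanos-query.example.org/api/v1/query_range"  # placeholder

resp = requests.get(PROM, params={
    # per-task transmit rate at 1m resolution across the incident window
    "query": 'rate(container_network_transmit_bytes_total{namespace="memcached"}[1m])',
    "start": "2024-05-24T18:20:00Z",
    "end": "2024-05-24T18:45:00Z",
    "step": "60s",
}, timeout=30)
resp.raise_for_status()
series = resp.json()["data"]["result"]

def peak_tx(s):
    # Prometheus matrix values are [timestamp, "value-as-string"] pairs
    return max(float(v) for _, v in s["values"])

# A single task far above the rest supports the hot-key reading.
for s in sorted(series, key=peak_tx, reverse=True)[:5]:
    print(f"{peak_tx(s) / 1e6:8.1f} MB/s  {s['metric'].get('pod', s['metric'])}")
```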
We also see unusually high network tx from mw-api-ext in the 18:35-18:40 range (https://grafana.wikimedia.org/goto/jcAq9aPIg?orgId=1). Once sflow ingestion is done, that data might help identify what this traffic is. It doesn't look like memcache traffic (there's no corresponding rx jump), but it might be log volume.
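As a rough check on the "no corresponding rx jump" reasoning: memcache fetches should show rx roughly tracking tx, while locally generated bytes (e.g. logs) would not. A sketch assuming the same placeholder endpoint and cadvisor-style metrics as above; the namespace label is a guess, and the linked Grafana panel has the real query:

```
# Hedged sketch: compare mw-api-ext tx vs rx minute by minute. If tx spikes
# while rx stays flat, the extra bytes originate locally (e.g. log volume)
# rather than being pulled from a cache. Metric/label names are assumptions.
import requests

PROM = "https://thanos-query.example.org/api/v1/query_range"  # placeholder

def fetch(query):
    r = requests.get(PROM, params={
        "query": query,
        "start": "2024-05-24T18:25:00Z",
        "end": "2024-05-24T18:45:00Z",
        "step": "60s",
    }, timeout=30)
    r.raise_for_status()
    # sum() yields a single series; return its [timestamp, value] pairs
    return r.json()["data"]["result"][0]["values"]

tx = fetch('sum(rate(container_network_transmit_bytes_total{namespace="mw-api-ext"}[1m]))')
rx = fetch('sum(rate(container_network_receive_bytes_total{namespace="mw-api-ext"}[1m]))')

for (ts, t_val), (_, r_val) in zip(tx, rx):
    print(f"{ts}  tx={float(t_val) / 1e6:8.1f} MB/s  rx={float(r_val) / 1e6:8.1f} MB/s")
```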
One other potentially useful observation is a 5x bump in write ops to s4 (commons) correlated with the start of slowness (https://grafana.wikimedia.org/goto/fUvhQ-ESR?orgId=1). I say "potentially" here, as this doesn't seem to be a terribly unusual occurrence on its own.
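To put a number on that bump, one could pull write-command rates for s4 from the mysqld exporter. The shard-selection label (group="s4") and endpoint are guesses; the linked dashboard has the real ones:

```
# Hedged sketch: baseline vs peak write-command rates on s4 over the window.
# The group="s4" selector and endpoint are assumptions about our labeling.
import requests

PROM = "https://thanos-query.example.org/api/v1/query_range"  # placeholder

query = (
    'sum by (command) ('
    'rate(mysql_global_status_commands_total'
    '{command=~"insert|update|delete", group="s4"}[1m]))'
)
resp = requests.get(PROM, params={
    "query": query,
    "start": "2024-05-24T18:20:00Z",  # a few minutes of pre-incident baseline
    "end": "2024-05-24T18:45:00Z",
    "step": "60s",
}, timeout=30)
resp.raise_for_status()

for s in resp.json()["data"]["result"]:
    vals = [float(v) for _, v in s["values"]]
    print(f"{s['metric']['command']:8s} baseline={vals[0]:8.1f}/s  peak={max(vals):8.1f}/s")
```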
PHP slow-log events for mw-api-ext clearly ramp up in the 18:33 minute (https://logstash.wikimedia.org/app/dashboards#/view/74557260-a88f-11ed-96bb-4b4732aa077a?_g=h@cc7791f&_a=h@af8cf1d), but it's hard to separate cause from effect (there are many slow DB queries, but no single consistent culprit code path).
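One way to chip at the cause-vs-effect question would be to bucket slow-log events per minute and per entry point, so a dominant code path would stand out if there is one. A sketch against the Elasticsearch API behind logstash; the endpoint, index pattern, and field/channel names are all assumptions:

```
# Hedged sketch: slow-log events per minute, with the top request paths in
# each bucket. Endpoint, index pattern, and field/channel names are assumed.
import requests

ES = "https://logstash.example.org:9200/logstash-mediawiki-*/_search"  # placeholder

body = {
    "size": 0,
    "query": {"bool": {"filter": [
        {"term": {"channel": "slowlog"}},  # assumed channel name
        {"range": {"@timestamp": {
            "gte": "2024-05-24T18:25:00Z",
            "lte": "2024-05-24T18:45:00Z",
        }}},
    ]}},
    "aggs": {"per_minute": {
        "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
        "aggs": {"by_path": {"terms": {"field": "url.keyword", "size": 5}}},
    }},
}
resp = requests.post(ES, json=body, timeout=30)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["per_minute"]["buckets"]:
    top = [b["key"] for b in bucket["by_path"]["buckets"]]
    print(bucket["key_as_string"], bucket["doc_count"], top)
```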
I'll check back once turnilo has sflow data for the relevant window.
In any case, I'm opening this to get everything written down in one place, particularly in case folks have seen this kind of event before.