mw1304: Memcached error for key X on server A TIMEOUT OCCURRED
Since roughly 03:40 UTC, mw1304 has been emitting TIMEOUT OCCURRED memcached errors at a rate of roughly 400 events per minute.

(Attachment: mw1304_memcached.png)

It also does not show up in […], so the host may be broken somehow.

Mentioned in SAL (#wikimedia-operations) [2021-03-30T07:37:21Z] <elukey> restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734

The timeout errors have stopped. It is unclear, though, why the job runner would overload a single host with video transcoding jobs.

The root cause has not been addressed, but flushing the stuck PHP transcode jobs has made the server responsive again.

The same issue recurred later and is now tracked in an incident document.