
mw1304: Memcached error for key X on server A TIMEOUT OCCURRED
Closed, Resolved, Public


Since roughly 03:40 UTC, mw1304 has been emitting TIMEOUT OCCURRED memcached errors at a rate of roughly 400 events per minute.

mw1304_memcached.png (458×945 px, 63 KB)

Event Timeline

It also does not show up in [link missing], so maybe the host is broken somehow.

Mentioned in SAL (#wikimedia-operations) [2021-03-30T07:37:21Z] <elukey> restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734

The timeout errors have vanished. No idea why the job runner would overload a given host with video transcoding jobs, though.

hashar assigned this task to elukey.

The root cause has not been addressed, but flushing the stuck PHP transcode jobs has made the server responsive again.

The same issue recurred later and is now tracked in an incident document.