mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED
Closed, Resolved (Public)

Description

Since roughly 3:40 am UTC, mw1304 has been emitting TIMEOUT OCCURRED memcached errors at a rate of roughly 400 events per minute.

https://logstash.wikimedia.org/goto/925eb928dffdd7727e80eb95415feb76

mw1304_memcached.png (458×945 px, 63 KB)

Event Timeline

The host also does not show up in https://grafana.wikimedia.org/d/000000377/host-overview, so it may be broken in some way.

Mentioned in SAL (#wikimedia-operations) [2021-03-30T07:37:21Z] <elukey> restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734

The timeout errors have vanished. It is still unclear why the job runner would overload a single host with video transcoding jobs, though.

hashar assigned this task to elukey.

The root cause has not been addressed, but flushing the stuck PHP transcode jobs has made the server responsive again.

The same issue recurred later and is now tracked in an incident document.