mw1304: Memcached error for key X on server 127.0.0.1:11213: A TIMEOUT OCCURRED
Closed, Resolved · Public
Assigned To

Authored By

	hashar
	Mar 30 2021, 7:31 AM

Description

Since roughly 03:40 UTC, mw1304 has been emitting TIMEOUT OCCURRED memcached errors at a rate of roughly 400 events per minute.

https://logstash.wikimedia.org/goto/925eb928dffdd7727e80eb95415feb76

mw1304_memcached.png (458×945 px, 63 KB)

Event Timeline

hashar created this task. Mar 30 2021, 7:31 AM

Restricted Application added a subscriber: Aklapper. Mar 30 2021, 7:31 AM

mw1304 also does not show up in https://grafana.wikimedia.org/d/000000377/host-overview, so the host itself may be broken somehow.

Mentioned in SAL (#wikimedia-operations) [2021-03-30T07:37:21Z] <elukey> restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734

The timeout errors have vanished. No idea why the job runner would overload a single host with video transcoding jobs, though.

The root cause has not been addressed, but flushing the stuck PHP transcode jobs has made the server responsive again.

The same issue recurred later and is now tracked in an incident document.

RhinosF1 subscribed. Mar 30 2021, 4:53 PM
