Since roughly 03:40 UTC, mw1304 has been emitting TIMEOUT OCCURRED memcached errors at a rate of roughly 400 events per minute.
https://logstash.wikimedia.org/goto/925eb928dffdd7727e80eb95415feb76
mw1304 also does not show up in https://grafana.wikimedia.org/d/000000377/host-overview so maybe the host is broken somehow.
Mentioned in SAL (#wikimedia-operations) [2021-03-30T07:37:21Z] <elukey> restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734
The timeout errors have vanished. No idea why the job runner would overload a single host with video transcoding jobs, though.
The root cause is not addressed, but flushing the stuck PHP transcode jobs has made the server responsive again.