
Investigate intermittent uptime health check failures
Closed, Resolved · Public · BUG REPORT

Description

Event Timeline

CPU usage by production MW pods:

image.png (289×523 px, 15 KB)

I'm also suspicious about the number of errors reporting Redis storage failing because it is too slow:

image.png (288×1 px, 44 KB)

This is not a very educated guess, but a cheap thing to try may be to increase the replicaCount of these web pods. I put up https://github.com/wmde/wbaas-deploy/pull/1659

This has improved things but perhaps not fully resolved them. We have still had multiple restarts in the last 2 days.

Hard to make any guesses without seeing more logs.
What's the actual content of some of the Redis logs? Which keys are being accessed and are slow?
What do the CPU and memory usage of Redis look like? And the logs and uptime of Redis itself?
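
One way to answer the "which keys are slow" question might be to pull the Redis slowlog and group it by key prefix. Below is a rough sketch, not something taken from this deployment: it assumes the redis-py client is available, `key_prefix`, `summarize_slowlog` and `fetch_slowlog` are hypothetical helper names, the host and port are placeholders, and it assumes the first argument of each command is the key (true for most commands, not all).

```python
from collections import defaultdict


def key_prefix(command, depth=2):
    """Return the command name plus the first `depth` colon-separated
    parts of its first argument (assumed to be the key)."""
    if isinstance(command, bytes):
        command = command.decode("utf-8", errors="replace")
    parts = command.split()
    if not parts:
        return "(empty)"
    name = parts[0].upper()
    if len(parts) < 2:
        return name
    prefix = ":".join(parts[1].split(":")[:depth])
    return f"{name} {prefix}"


def summarize_slowlog(entries, depth=2):
    """Group slowlog entries by command and key prefix.

    Each entry is expected to be a dict with "command" and "duration"
    (microseconds), the shape redis-py's slowlog_get() returns.
    Returns (group, count, total_microseconds) tuples, slowest first.
    """
    totals = defaultdict(lambda: [0, 0])
    for entry in entries:
        group = key_prefix(entry["command"], depth)
        totals[group][0] += 1
        totals[group][1] += entry["duration"]
    rows = [(group, count, total) for group, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def fetch_slowlog(host="localhost", port=6379, count=128):
    """Fetch recent slowlog entries. Host and port are placeholders."""
    import redis  # lazy import so the summary helper works without redis-py

    client = redis.Redis(host=host, port=port)
    return client.slowlog_get(count)
```

Running `summarize_slowlog(fetch_slowlog(...))` against the Redis instance would show which command and key-prefix groups account for most of the slow time, which could then be compared against the CPU and memory graphs for the same period.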