toolforge: Set up alerting based on Kubernetes API response times
Open, Needs Triage, Public

Description

Today we had a disk issue on one of the localdisk hypervisors running Toolforge etcd nodes, which caused request times to increase substantially:

image.png (679×1 px, 64 KB)

The third node was online during that time, so our current Toolschecker etcd check didn't alert until several hours later, when etcd briefly went into an unhealthy state for an unknown reason. Since we already have metrics from the Kubernetes API server layer, we should consider setting up alerting based on those as well.
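
To make the proposal concrete, here is a rough sketch of the kind of check such an alert could perform. It assumes the input is the cumulative bucket counts of a latency histogram such as the API server's `apiserver_request_duration_seconds`. The function names, the 0.99 quantile, and the 1 second threshold are illustrative assumptions, not values anyone has agreed on. In practice this would more likely be a Prometheus alerting rule; the Python only shows the logic.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    ending with a math.inf bucket. Values are linearly interpolated inside
    the bucket that contains the requested rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no requests observed, nothing to estimate

    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into the +Inf bucket; report the
                # highest finite bound instead.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def should_alert(buckets, quantile=0.99, threshold_seconds=1.0):
    """Return True when the estimated latency quantile exceeds the threshold.

    Both defaults are hypothetical and would need tuning against real data.
    """
    latency = histogram_quantile(quantile, buckets)
    return not math.isnan(latency) and latency > threshold_seconds


# Made-up sample data, for illustration only.
healthy = [(0.1, 90), (0.5, 99), (1.0, 100), (math.inf, 100)]
degraded = [(0.1, 10), (0.5, 20), (1.0, 50), (5.0, 100), (math.inf, 100)]
```

With the made-up samples above, `should_alert(healthy)` is false and `should_alert(degraded)` is true. A latency-based check like this could fire while etcd still has quorum, which is the situation the existing health check did not catch.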