This should help detect a baseline level of issues.
Description
Status | Assigned | Task
---|---|---
Resolved | yuvipanda | T129309 Goal: Allow using k8s instead of GridEngine as a backend for webservices
Declined | None | T131929 Setup monitoring for kubernetes core components.
Open | None | T140249 [toolforge.infra] Run https://github.com/kubernetes/node-problem-detector on all our nodes
Event Timeline
Well, it's good enough for AKS, GCE and OpenShift. Since it allows custom scripts, it may even be able to set permanent marks for "Toolforge faults", as proposed in T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config.
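To make the custom-script idea concrete, here is a rough sketch of what such a check might look like. It assumes the custom plugin monitor's exit-code convention as I understand it (0 for OK, 1 for a problem, 2 for unknown); the function name and any paths passed to it are hypothetical, since the actual Toolforge checks are not defined in this task.

```python
import os

# Exit codes as I understand node-problem-detector's custom plugin convention.
OK, NON_OK, UNKNOWN = 0, 1, 2


def check_required_paths(paths):
    """Return (exit_code, message) for a list of paths that must exist.

    Hypothetical helper: the real Toolforge checks are not specified here.
    """
    if not paths:
        # Nothing to check, so we cannot say whether the node is healthy.
        return UNKNOWN, "no paths configured"
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        return NON_OK, "missing: " + ", ".join(missing)
    return OK, "all required paths present"
```

A wrapper script would print the message and exit with the returned code, so that the detector can surface it as a node condition.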
I wonder if an alert can turn into a taint or mark the node NotReady? Seems to merit some thought at least. I mean, what's one more exporter consuming resources to check the resources 😜
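As a thought experiment on the taint question, the mapping could be as simple as the sketch below. It only builds the taint entries from a node's conditions and does not talk to the API server; the `example.org/problem-` key prefix and the function name are made up for illustration, and whether NoSchedule is the right effect is an open question.

```python
def taints_for_conditions(conditions, key_prefix="example.org/problem-"):
    """Map node conditions that report a problem to NoSchedule taints.

    `conditions` is a list of dicts shaped like Kubernetes node conditions,
    e.g. {"type": "KernelDeadlock", "status": "True"}.
    The key prefix is a placeholder, not an agreed Toolforge name.
    """
    taints = []
    for cond in conditions:
        # "Ready" is healthy when True; other conditions signal a problem when True.
        if cond["type"] == "Ready" or cond["status"] != "True":
            continue
        taints.append({
            "key": key_prefix + cond["type"].lower(),
            "value": "true",
            "effect": "NoSchedule",
        })
    return taints
```

Something would still need to apply the result to the node spec and remove the taint once the condition clears, which is where most of the thought would have to go.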