[toolforge.infra] Run https://github.com/kubernetes/node-problem-detector on all our nodes
Open, Low, Public

Description

It should help detect a baseline level of node issues.

Event Timeline

Restricted Application added a subscriber: Zppix.

@Bstorm Does this seem like something interesting for the 2020 Kubernetes cluster?

Well, it's good enough for AKS, GCE and OpenShift. Since it allows custom scripts, it may even be able to set permanent marks for "Toolforge faults", like in T242637: Create a "health check" for Kubernetes worker nodes which validates local Toolforge config.
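
To make the "custom scripts" idea concrete, here is a minimal sketch of what such a check could look like. It assumes node-problem-detector's custom plugin convention as I understand it: the script's exit code reports the result (0 for OK, 1 for a detected problem, 2 for unknown) and a short line on stdout becomes the message (80 characters by default, to my knowledge). The path in `REQUIRED_PATHS` and the function names are hypothetical placeholders, not real Toolforge configuration.

```python
#!/usr/bin/env python3
"""Sketch of a node-problem-detector custom plugin check (assumptions labeled)."""
import os

# Exit codes assumed from the custom plugin convention: OK, problem, unknown.
OK, NONOK, UNKNOWN = 0, 1, 2
# Assumed default cap on the message length read from stdout.
MAX_MESSAGE_LEN = 80

# Hypothetical file that a healthy worker node is expected to have.
REQUIRED_PATHS = ["/etc/toolforge-example.conf"]


def check_paths(paths, exists=os.path.exists):
    """Return (status, message) describing whether every path is present."""
    missing = [p for p in paths if not exists(p)]
    if missing:
        message = "missing: " + ", ".join(missing)
        return NONOK, message[:MAX_MESSAGE_LEN]
    return OK, "all required paths present"


def main(argv):
    """Print the message for the detector and return the exit status."""
    paths = argv[1:] or REQUIRED_PATHS
    try:
        status, message = check_paths(paths)
    except OSError as exc:
        # Report "unknown" rather than a false fault if the check itself fails.
        status, message = UNKNOWN, str(exc)[:MAX_MESSAGE_LEN]
    print(message)
    return status
```

To run it as a plugin, the file would end with `raise SystemExit(main(sys.argv))` (plus `import sys`) and be referenced from a custom plugin monitor config. The check is kept as a pure function taking an injectable `exists` so it can be tested without touching a real node.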

I wonder if an alert can turn into a taint or mark the node NotReady? Seems to merit some thought at least. I mean, what's one more exporter consuming resources to check the resources 😜

dcaro lowered the priority of this task from Medium to Low. Feb 20 2024, 12:32 PM
dcaro renamed this task from Run https://github.com/kubernetes/node-problem-detector on all our nodes to [toolforge.infra] Run https://github.com/kubernetes/node-problem-detector on all our nodes. Feb 21 2024, 10:23 AM