[infra] Add alert when workers have a sustained large amount of D processes
Closed, Resolved, Public

Description

A sustained large number of processes in D state (uninterruptible sleep) usually indicates that the worker got stuck because NFS went away, and that it should probably be restarted.

Currently we don't have any alerts for this, so we only notice when a user complains or we happen to check the graphs.

This alert does not need to page, but it should at least show up so that we handle it when we are online.

Current dashboard: https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-30m&to=now&forceLogin=true
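For illustration, here is a minimal Python sketch of the condition described above. It is not the alert that was actually deployed: the names `parse_state`, `count_d_state` and `SustainedThreshold`, and the threshold and window values, are all hypothetical. It counts processes in D state by reading `/proc/<pid>/stat` on a Linux worker, and only fires once the count has stayed above the threshold for a full window of consecutive samples.

```python
from collections import deque
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_d_state(proc_root: str = "/proc") -> int:
    """Count processes currently in D (uninterruptible sleep) state."""
    count = 0
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a PID directory
        try:
            stat_line = (entry / "stat").read_text(errors="replace")
        except OSError:
            continue  # the process exited while we were scanning
        if parse_state(stat_line) == "D":
            count += 1
    return count


class SustainedThreshold:
    """Fire only when every sample in the window exceeds the threshold."""

    def __init__(self, threshold: int, window: int) -> None:
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, value: int) -> bool:
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)
```

A collector could call `check.observe(count_d_state())` once per scrape interval, with something like `check = SustainedThreshold(threshold=10, window=5)` (illustrative values), and raise a non-paging alert when it returns `True`. Requiring a full window avoids alerting on short bursts of I/O wait.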

Event Timeline

dcaro changed the task status from Open to In Progress. Apr 8 2024, 3:38 PM
dcaro triaged this task as High priority.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 08) board.
dcaro updated the task description.
dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 09) board.

@dcaro just curious, what happens when the alert goes off? Do you manually go and kill the worker?

Currently yes: it shows up in our alert dashboard, and whoever is on call (or anyone who sees it before that) reboots the worker node.

Ideally, someday the reboot would happen automatically and we would just get notified (:fingerscrossed:)