
[infra] Add alert when workers have a sustained large amount of D processes
Closed, Resolved | Public

Description

A sustained large number of processes in D state (uninterruptible sleep) on a worker usually indicates that the worker got stuck because NFS went away, and that the worker should probably be restarted.

Currently we don't have any alerts for this, so we only notice when a user complains or when we happen to check the graphs.

This alert does not need to page, but it should at least show up so that we can handle it when we are online.

Current dashboard: https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-30m&to=now&forceLogin=true
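As an illustration of the condition described above, here is a minimal Python sketch, not the alert that was actually implemented (that lives in the merge request listed under Details and is not reproduced here). It counts processes in D state by reading `/proc/<pid>/stat` and checks whether the count stayed above a threshold for several consecutive samples. The function names, the threshold, and the number of samples are all hypothetical.

```python
import os
from typing import Sequence


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_d_processes(proc_root: str = "/proc") -> int:
    """Count processes currently in D (uninterruptible sleep) state."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if parse_state(f.read()) == "D":
                    count += 1
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            # The process exited (or is unreadable) between listing and
            # reading; skip it.
            continue
    return count


def is_sustained(samples: Sequence[int], threshold: int, min_samples: int) -> bool:
    """True if the last `min_samples` samples all exceed `threshold`.

    Assumes min_samples >= 1. Both values are hypothetical and would need
    tuning against the dashboard graphs.
    """
    if len(samples) < min_samples:
        return False
    return all(s > threshold for s in samples[-min_samples:])
```

Requiring several consecutive samples above the threshold is meant to mirror the "sustained" wording: a brief spike of D processes during heavy I/O should not trigger anything, while a worker stuck on NFS should.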

Details

Title | Reference | Author | Source Branch | Dest Branch
kubernetes: add workers with D stuck processes alert | repos/cloud/toolforge/alerts!11 | dcaro | add_d_processes_alert | main

Event Timeline

dcaro changed the task status from Open to In Progress. Apr 8 2024, 3:38 PM
dcaro triaged this task as High priority.
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 08) board.
dcaro updated the task description.
dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 09) board.