
ToolsGridQueueProblem - Grid queue webgrid-lighttpd@tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud is in state E
Closed, Resolved (Public)

Description

From alertmanager (https://prometheus-alerts.wmcloud.org):

alertname: ToolsGridQueueProblem
summary: Grid queue webgrid-lighttpd@tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud is in state E
9 hours ago
host: tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud
instance: tools-sgegrid-master
job: node
queue: webgrid-lighttpd
severity: warn
state: E
@receiver: cloud-admin-feed
runbook

Event Timeline

dcaro changed the task status from Open to In Progress. Nov 1 2022, 8:41 AM
dcaro triaged this task as High priority.
dcaro created this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud-feed) [2022-11-01T09:37:27Z] <wm-bot2> cleaned up grid queue errors on tools-sgegrid-master (T322110) - cookbook ran by dcaro@vulcanus

Checked the status with:

dcaro@vulcanus$ cookbook wmcs.toolforge.grid.get_cluster_status --only-failed --project tools

(and using https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/841930)

That showed only one error, caused by the epilog issue, so I just cleaned up the queues and everything is back to normal.
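For reference, a minimal sketch of doing the same check by hand on the grid master, without the cookbook. This is not what the cookbook does internally (that is not shown here). It assumes the standard `qstat -f` column layout, where the queue instance is the first column and the state is the sixth. The helper name `queues_in_error` is made up for illustration.

```shell
# Hypothetical helper: read `qstat -f` style output on stdin and print
# every queue instance (queue@host) whose state column contains E.
# Assumes the usual layout: name, qtype, slots, load_avg, arch, states.
queues_in_error() {
    # Queue instance lines have "@" in the first field; header,
    # separator and job lines do not, so they are skipped. Lines with
    # an empty state column only have five fields and are skipped too.
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Usage, to be run on the grid master (not executed here):
#   qstat -f | queues_in_error
# To see why a queue is in error and then clear it:
#   qstat -f -explain E
#   qmod -cq '<queue>@<host>'
```

Clearing with `qmod -cq` only resets the error state, so it is worth reading the `-explain E` output first to confirm the cause is a known one.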

dcaro moved this task from Doing to Done on the User-dcaro board.