From alertmanager https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive:
alertname: ToolsGridQueueProblem
project: tools
summary: Grid queue webgrid-generic@tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud is in state E (24 hours ago)
host: tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
instance: tools-sgegrid-master
job: node
queue: webgrid-generic
severity: warn
state: E
Following the runbook, the cookbook output is:
dcaro@vulcanus$ cookbook wmcs.toolforge.grid.get_cluster_status --project tools --only-failed
START - Cookbook wmcs.toolforge.grid.get_cluster_status
tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud: !!python/object:cookbooks.wmcs.toolforge.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '0.04'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.8G
  name: tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    webgrid-generic: !!python/object:cookbooks.wmcs.toolforge.grid.GridQueueInfo
      name: webgrid-generic
      slots: '256'
      slots_resv: '0'
      slots_used: '11'
      statuses: !GridQueueStatesSet
      - !GridQueueState 'ERROR'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
  swap_total: 24.0M
  swap_used: '0.0'

The service sge_exec is up and running, the logs show:
root@tools-sgewebgen-10-2:~# tail -n 100 /var/spool/gridengine/execd/tools-sgewebgen-10-2/messages
02/10/2022 13:29:00| main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)
03/07/2022 20:10:41| main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
04/05/2022 11:03:16| main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
04/05/2022 12:14:51| main|tools-sgewebgen-10-2|E|commlib error: got read error (closing "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud/qmaster/1")
05/13/2022 03:28:05| main|tools-sgewebgen-10-2|I|controlled shutdown 8.1.9
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|unable to create file /var/run/gridengine/execd.pid: Bad file descriptor
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|fopen("/var/run/gridengine/execd.pid") failed: Permission denied
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|abnormal termination of shepherd for job 447484.1: "exit_status" file is empty
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|can't open usage file "active_jobs/447484.1/usage" for job 447484.1: No such file or directory
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|shepherd exited with exit status 19: before writing exit_status
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|abnormal termination of shepherd for job 387979.1: "exit_status" file is empty
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|can't open usage file "active_jobs/387979.1/usage" for job 387979.1: No such file or directory
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|shepherd exited with exit status 19: before writing exit_status
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|recursive rmdir(/tmp/387979.1.webgrid-generic): opendir(/tmp/387979.1.webgrid-generic) failed: No such file or directory
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|recursive rmdir(/tmp/447484.1.webgrid-generic): opendir(/tmp/447484.1.webgrid-generic) failed: No such file or directory
06/03/2022 17:18:39| main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
06/03/2022 17:18:39| main|tools-sgewebgen-10-2|E|commlib error: got select error (Connection refused)
06/27/2022 17:17:05| main|tools-sgewebgen-10-2|E|shepherd of job 838478.1 exited with exit status = 15

Will clear up the queue and see if it happens again.
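
For reference, a minimal sketch of what clearing the queue by hand could look like, assuming the standard Grid Engine qstat and qmod tools are run from a host with admin rights on the grid. The runbook or the wmcs cookbooks may do this differently. The helper names queue_instance and maybe_run and the RUN variable are made up for this example. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: print (or, with RUN=1, execute) the commands to inspect and
# clear the error state of one queue instance. Helper names are hypothetical.
set -euo pipefail

QUEUE="webgrid-generic"
HOST="tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud"

# Build the queue instance name in Grid Engine's queue@host form.
queue_instance() {
    printf '%s@%s' "$1" "$2"
}

# Echo the command; actually run it only when RUN=1 is set.
maybe_run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

QI="$(queue_instance "$QUEUE" "$HOST")"
maybe_run qstat -f -explain E -q "$QI"   # show why the queue is in state E
maybe_run qmod -cq "$QI"                 # clear the queue's error state
```

Running it as-is is a dry run. Setting RUN=1 would execute both commands, so the reason reported by qstat is on record before the error state is cleared.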