From alertmanager https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive:
alertname: ToolsGridQueueProblem
project: tools
summary: Grid queue webgrid-generic@tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud is in state E
24 hours ago
host: tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
instance: tools-sgegrid-master
job: node
queue: webgrid-generic
severity: warn
state: E
runbook
Following the runbook, this is the output of the cluster status cookbook:
dcaro@vulcanus$ cookbook wmcs.toolforge.grid.get_cluster_status --project tools --only-failed
START - Cookbook wmcs.toolforge.grid.get_cluster_status
tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud: !!python/object:cookbooks.wmcs.toolforge.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '0.04'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.8G
  name: tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    webgrid-generic: !!python/object:cookbooks.wmcs.toolforge.grid.GridQueueInfo
      name: webgrid-generic
      slots: '256'
      slots_resv: '0'
      slots_used: '11'
      statuses: !GridQueueStatesSet
      - !GridQueueState 'ERROR'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
  swap_total: 24.0M
  swap_used: '0.0'
The sge_exec service is up and running; the logs show:
root@tools-sgewebgen-10-2:~# tail -n 100 /var/spool/gridengine/execd/tools-sgewebgen-10-2/messages
02/10/2022 13:29:00| main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)
03/07/2022 20:10:41| main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
04/05/2022 11:03:16| main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
04/05/2022 12:14:51| main|tools-sgewebgen-10-2|E|commlib error: got read error (closing "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud/qmaster/1")
05/13/2022 03:28:05| main|tools-sgewebgen-10-2|I|controlled shutdown 8.1.9
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|unable to create file /var/run/gridengine/execd.pid: Bad file descriptor
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|fopen("/var/run/gridengine/execd.pid") failed: Permission denied
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|abnormal termination of shepherd for job 447484.1: "exit_status" file is empty
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|can't open usage file "active_jobs/447484.1/usage" for job 447484.1: No such file or directory
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|shepherd exited with exit status 19: before writing exit_status
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|abnormal termination of shepherd for job 387979.1: "exit_status" file is empty
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|can't open usage file "active_jobs/387979.1/usage" for job 387979.1: No such file or directory
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|shepherd exited with exit status 19: before writing exit_status
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|recursive rmdir(/tmp/387979.1.webgrid-generic): opendir(/tmp/387979.1.webgrid-generic) failed: No such file or directory
06/03/2022 11:05:17| main|tools-sgewebgen-10-2|E|recursive rmdir(/tmp/447484.1.webgrid-generic): opendir(/tmp/447484.1.webgrid-generic) failed: No such file or directory
06/03/2022 17:18:39| main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
06/03/2022 17:18:39| main|tools-sgewebgen-10-2|E|commlib error: got select error (Connection refused)
06/27/2022 17:17:05| main|tools-sgewebgen-10-2|E|shepherd of job 838478.1 exited with exit status = 15
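To pick out the most recent error from a messages file like the one above without reading it by eye, a small parser can help. This is only a rough sketch: it assumes every entry has the pipe-separated shape seen in this log (timestamp, thread, host, one-letter level, message) with MM/DD/YYYY dates, and the names `parse_messages` and `sample` are hypothetical helpers, not part of gridengine or the cookbooks.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape, taken from the log above:
#   MM/DD/YYYY HH:MM:SS| thread|host|LEVEL|message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|"
    r"\s*(?P<thread>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[A-Z])\|(?P<msg>.*)$"
)


def parse_messages(lines):
    """Yield one dict per line that matches the assumed format; skip the rest."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue
        yield {
            "time": datetime.strptime(match["ts"], "%m/%d/%Y %H:%M:%S"),
            "level": match["level"],
            "message": match["msg"].strip(),
        }


# A few entries copied from the log above, used here as sample input.
sample = [
    "02/10/2022 13:29:00| main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)",
    '04/05/2022 11:03:16| main|tools-sgewebgen-10-2|W|can\'t register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors',
    "06/03/2022 17:18:39| main|tools-sgewebgen-10-2|E|commlib error: got select error (Connection refused)",
    "06/27/2022 17:17:05| main|tools-sgewebgen-10-2|E|shepherd of job 838478.1 exited with exit status = 15",
]

entries = list(parse_messages(sample))
counts = Counter(entry["level"] for entry in entries)
latest_error = max(
    (entry for entry in entries if entry["level"] == "E"),
    key=lambda entry: entry["time"],
)
print(dict(counts))
print(latest_error["time"], latest_error["message"])
```

On the real file the same function could be fed `open(path)` instead of `sample`; lines that do not match the assumed format are silently skipped.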
I will clear the error state on the queue and see if it happens again.