
alertname: ToolsGridQueueProblem - tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud - Grid queue webgrid-generic@tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud is in state E
Closed, Resolved, Public

Description

From alertmanager https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive:

alertname: ToolsGridQueueProblem
project: tools
summary: Grid queue webgrid-generic@tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud is in state E
host: tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
instance: tools-sgegrid-master
job: node
queue: webgrid-generic
severity: warn
state: E
runbook

Following the runbook, the cluster status cookbook reports:

dcaro@vulcanus$ cookbook wmcs.toolforge.grid.get_cluster_status --project tools --only-failed
START - Cookbook wmcs.toolforge.grid.get_cluster_status
tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud: !!python/object:cookbooks.wmcs.toolforge.grid.GridNodeInfo
  arch_string: lx-amd64
  load_avg: '0.04'
  m_core: '4'
  m_socket: '4'
  m_thread: '4'
  mem_total: 7.8G
  mem_used: 1.8G
  name: tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud
  num_proc: '4'
  queues_info:
    webgrid-generic: !!python/object:cookbooks.wmcs.toolforge.grid.GridQueueInfo
      name: webgrid-generic
      slots: '256'
      slots_resv: '0'
      slots_used: '11'
      statuses: !GridQueueStatesSet
      - !GridQueueState 'ERROR'
      types: !GridQueueTypesSet
      - !GridQueueType 'BATCH'
  swap_total: 24.0M
  swap_used: '0.0'

The sge_exec service is up and running; the logs show:

root@tools-sgewebgen-10-2:~# tail -n 100 /var/spool/gridengine/execd/tools-sgewebgen-10-2/messages

02/10/2022 13:29:00|  main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)
03/07/2022 20:10:41|  main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
04/05/2022 11:03:16|  main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
04/05/2022 12:14:51|  main|tools-sgewebgen-10-2|E|commlib error: got read error (closing "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud/qmaster/1")
05/13/2022 03:28:05|  main|tools-sgewebgen-10-2|I|controlled shutdown 8.1.9
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|unable to create file /var/run/gridengine/execd.pid: Bad file descriptor
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|fopen("/var/run/gridengine/execd.pid") failed: Permission denied
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|abnormal termination of shepherd for job 447484.1: "exit_status" file is empty
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|can't open usage file "active_jobs/447484.1/usage" for job 447484.1: No such file or directory
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|shepherd exited with exit status 19: before writing exit_status
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|abnormal termination of shepherd for job 387979.1: "exit_status" file is empty
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|can't open usage file "active_jobs/387979.1/usage" for job 387979.1: No such file or directory
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|shepherd exited with exit status 19: before writing exit_status
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|recursive rmdir(/tmp/387979.1.webgrid-generic): opendir(/tmp/387979.1.webgrid-generic) failed: No such file or directory
06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|E|recursive rmdir(/tmp/447484.1.webgrid-generic): opendir(/tmp/447484.1.webgrid-generic) failed: No such file or directory
06/03/2022 17:18:39|  main|tools-sgewebgen-10-2|W|can't register at qmaster "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": abort qmaster registration due to communication errors
06/03/2022 17:18:39|  main|tools-sgewebgen-10-2|E|commlib error: got select error (Connection refused)
06/27/2022 17:17:05|  main|tools-sgewebgen-10-2|E|shepherd of job 838478.1 exited with exit status = 15
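
To make a log like the one above easier to scan, the pipe-separated lines can be split into fields and filtered by severity. The following is a minimal sketch, assuming each line has the shape `timestamp|thread|host|level|message` as seen in the excerpt; the names `ExecdMessage`, `parse_line` and `errors_only` are hypothetical helpers and not part of Grid Engine or the cookbooks.

```python
from typing import Iterable, List, NamedTuple, Optional


class ExecdMessage(NamedTuple):
    """One line of the execd messages file (field layout assumed from the excerpt)."""
    timestamp: str
    thread: str
    host: str
    level: str  # single letter as seen in the log: I, W or E
    text: str


def parse_line(line: str) -> Optional[ExecdMessage]:
    """Split a log line on the first four '|' characters; return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = parts
    return ExecdMessage(timestamp.strip(), thread.strip(), host.strip(), level.strip(), text)


def errors_only(lines: Iterable[str]) -> List[ExecdMessage]:
    """Keep only the entries whose level field is 'E'."""
    parsed = (parse_line(line) for line in lines)
    return [msg for msg in parsed if msg is not None and msg.level == "E"]


# Sample lines copied from the log excerpt in this task.
SAMPLE = [
    '06/03/2022 11:05:17|  main|tools-sgewebgen-10-2|I|starting up SGE 8.1.9 (lx-amd64)',
    '06/03/2022 17:18:39|  main|tools-sgewebgen-10-2|E|commlib error: got select error (Connection refused)',
    '06/27/2022 17:17:05|  main|tools-sgewebgen-10-2|E|shepherd of job 838478.1 exited with exit status = 15',
]

if __name__ == "__main__":
    for msg in errors_only(SAMPLE):
        print(msg.timestamp, msg.text)
```

Because the split is capped at four separators, a message that itself contains a `|` character stays intact in the last field.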

Will clear up the queue and see if it happens again.
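
This task does not say which command the cleanup cookbook runs. Outside the cookbook, the usual Grid Engine way to clear a queue instance's error state is `qmod -cq <queue>@<host>`, run by a user with admin rights on the grid. As a sketch only, under the assumption that this is equivalent to what the cookbook does, the command for this alert could be assembled as follows; `clear_queue_error_command` is a hypothetical helper, and the example only builds the argument list without executing anything.

```python
import shlex
from typing import List


def clear_queue_error_command(queue: str, host: str) -> List[str]:
    """Build the argv for clearing the error state of one queue instance.

    Assumption: `qmod -cq` is the appropriate command here; the task itself
    only mentions a cookbook and does not show the underlying call.
    """
    return ["qmod", "-cq", f"{queue}@{host}"]


if __name__ == "__main__":
    argv = clear_queue_error_command(
        "webgrid-generic",
        "tools-sgewebgen-10-2.tools.eqiad1.wikimedia.cloud",
    )
    # Print a shell-safe rendering for review instead of running it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the result to `subprocess.run(argv, check=True)` on a grid admin host would execute it; printing first keeps the step reviewable.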

Event Timeline

dcaro triaged this task as High priority. Jun 28 2022, 5:33 PM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud-feed) [2022-06-28T17:34:37Z] <wm-bot2> cleaned up grid queue errors on tools-sgegrid-master (T311538) - cookbook ran by dcaro@vulcanus

dcaro moved this task from To refine to Done on the User-dcaro board.