
tools-sgeexec-0916: ToolsGridQueueProblem, Grid queue continuous/task is in state auE
Closed, Resolved · Public

Description

From https://prometheus-alerts.wmcloud.org:

alertname: ToolsGridQueueProblem
project: tools
summary: Grid queue task@tools-sgeexec-0916.tools.eqiad.wmflabs is in state auE
31 minutes ago
queue: task

summary: Grid queue continuous@tools-sgeexec-0916.tools.eqiad.wmflabs is in state auE
31 minutes ago
queue: continuous

host: tools-sgeexec-0916.tools.eqiad.wmflabs
instance: tools-sgegrid-master
severity: warn
state: auE
@receiver: cloud-admin-feed
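
For context (not in the original alert): in gridengine queue state strings, "a" is a load alarm, "u" means the execution daemon on the host is unreachable, and "E" means the queue instance is in an error state. A rough way to inspect the error reason from the grid master (a hedged example using standard gridengine commands; the exact invocation was not recorded in this task):

root@tools-sgegrid-master:~# qstat -f -explain E -q '*@tools-sgeexec-0916.tools.eqiad.wmflabs'  # hypothetical; prints the reason for the E state of each queue instance on this host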

Event Timeline

dcaro changed the task status from Open to In Progress. Feb 28 2022, 9:15 AM
dcaro triaged this task as High priority.
dcaro created this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Change 766765 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/cookbooks@wmcs] wmcs.toolforge.grid.get_cluster_status: improve yaml output

https://gerrit.wikimedia.org/r/766765

Change 766766 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/cookbooks@wmcs] wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones

https://gerrit.wikimedia.org/r/766766

Mentioned in SAL (#wikimedia-cloud) [2022-03-01T11:37:51Z] <dcaro> Adding runbook url annotation to GridQueueProblem alert on DB at metricsinfra-crontroller-1 (T302702)

Mentioned in SAL (#wikimedia-cloud) [2022-03-01T11:38:17Z] <dcaro> Reloading alertmanager to refresh new config (T302702)

Mentioned in SAL (#wikimedia-cloud) [2022-03-01T12:11:10Z] <dcaro> Cleared error state queues for sgeexec-0916 (T302702)

Change 766765 merged by jenkins-bot:

[operations/cookbooks@wmcs] wmcs.toolforge.grid.get_cluster_status: improve yaml output

https://gerrit.wikimedia.org/r/766765

Change 766766 merged by jenkins-bot:

[operations/cookbooks@wmcs] wmcs.toolforg.grid.get_cluster_status: allow filtering the ok ones

https://gerrit.wikimedia.org/r/766766

Mentioned in SAL (#wikimedia-cloud) [2022-03-01T13:41:25Z] <dcaro> rebooting tools-sgeexec-0916 to clear any state (T302702)

It seems that the error is not with the queues themselves, but with the node failing to start the sge_execd daemon.
After trying to start it manually (the service showed as active but exited, that is, not actually running) and checking its strace output, I found that there's a log with some info:

root@tools-sgeexec-0916:~# export SGE_ROOT=/var/lib/gridengine
root@tools-sgeexec-0916:~# export SGE_CELL=default
root@tools-sgeexec-0916:~# strace -f -o trace /usr/lib/gridengine/sge_execd

root@tools-sgeexec-0916:~# grep spool trace
...
12307 open("/var/spool/gridengine/execd/tools-sgeexec-0916/messages", O_WRONLY|O_CREAT|O_APPEND, 0666) = 8
...

root@tools-sgeexec-0916:~# tail /var/spool/gridengine/execd/tools-sgeexec-0916/messages
03/02/2022 13:58:55|  main|tools-sgeexec-0916|I|starting up SGE 8.1.9 (lx-amd64)
03/02/2022 13:58:55|  main|tools-sgeexec-0916|E|abnormal termination of shepherd for job 7040827.1: "exit_status" file is empty
03/02/2022 13:58:55|  main|tools-sgeexec-0916|E|can't open usage file "active_jobs/7040827.1/usage" for job 7040827.1: No such file or directory
03/02/2022 13:58:55|  main|tools-sgeexec-0916|E|shepherd exited with exit status 19: before writing exit_status
03/02/2022 13:58:55|  main|tools-sgeexec-0916|C|malloc() failure

continuing...

Ok, next step: after trying to create the missing file, the service kept complaining about other files being missing or empty, and crashing:

root@tools-sgeexec-0916:~# tail /var/spool/gridengine/execd/tools-sgeexec-0916/messages
...
03/02/2022 14:37:09|  main|tools-sgeexec-0916|E|abnormal termination of shepherd for job 7040827.1: "exit_status" file is empty
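
(The exact commands used to create the missing file were not recorded here; presumably something along these lines, with the paths taken from the log messages above:)

root@tools-sgeexec-0916:~# cd /var/spool/gridengine/execd/tools-sgeexec-0916
root@tools-sgeexec-0916:/var/spool/gridengine/execd/tools-sgeexec-0916# touch active_jobs/7040827.1/usage  # assumed; recreate the missing usage file for the broken job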

So I moved the job's directory out of the way to /tmp (same device, so it does not free any space; the directory is 3.5GB in size):

root@tools-sgeexec-0916:/var/spool/gridengine/execd/tools-sgeexec-0916# mv active_jobs/7040827.1 /tmp/7040827.1.bkp

And started the service again:

root@tools-sgeexec-0916:/var/spool/gridengine/execd/tools-sgeexec-0916# /etc/init.d/gridengine-exec start

And this time it seems it was able to go through and clean up after the broken job without crashing:

root@tools-sgeexec-0916:/var/spool/gridengine/execd/tools-sgeexec-0916# tail -f -n 10 /var/spool/gridengine/execd/tools-sgeexec-0916/messages
03/02/2022 14:39:07|  main|tools-sgeexec-0916|E|recursive rmdir(/tmp/85368.1.continuous): opendir(/tmp/85368.1.continuous) failed: No such file or directory
...
03/02/2022 14:40:27|  main|tools-sgeexec-0916|E|can't remove directory "active_jobs/9999864.1": opendir(active_jobs/9999864.1) failed: No such file or directory
root@tools-sgeexec-0916:/var/spool/gridengine/execd/tools-sgeexec-0916# systemctl status gridengine-exec.service
● gridengine-exec.service - LSB: SGE Execution Daemon init script
   Loaded: loaded (/etc/init.d/gridengine-exec; generated; vendor preset: enabled)
   Active: active (running) since Wed 2022-03-02 14:39:06 UTC; 22s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 16218 ExecStart=/etc/init.d/gridengine-exec start (code=exited, status=0/SUCCESS)
    Tasks: 5 (limit: 4915)
   CGroup: /system.slice/gridengine-exec.service
           └─16226 /usr/lib/gridengine/sge_execd

Will repool and see what happens next.

Ok, so that was not enough, repooling:

root@tools-sgegrid-master:~# exec-manage repool tools-sgeexec-0916.tools.eqiad.wmflabs

triggered the host to send a couple of emails about the previously failed jobs, and left the queues showing only the error state (E):

root@tools-sgegrid-master:~# exec-manage status tools-sgeexec-0916.tools.eqiad.wmflabs
Count of jobs running on host tools-sgeexec-0916.tools.eqiad.wmflabs :
0

Jobs running on host tools-sgeexec-0916.tools.eqiad.wmflabs :
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
tools-sgeexec-0916.tools.eqiad.wmflabs lx-amd64        4    4    4    4  0.04    7.8G  451.0M   23.9G     0.0

Status of queues on this host (States - d = disabled) :
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@tools-sgeexec-0916.tools. BI    0/0/50         0.04     lx-amd64      E
---------------------------------------------------------------------------------
continuous@tools-sgeexec-0916. BC    0/0/50         0.04     lx-amd64      E

So I cleared them:

root@tools-sgegrid-master:~# sudo qmod -cq '*'
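
(Note: the '*' wildcard clears the error state of every queue instance in the cluster; a more targeted invocation, limited to this host, would presumably be the following, though that is an assumption and not what was run here:)

root@tools-sgegrid-master:~# sudo qmod -cq '*@tools-sgeexec-0916.tools.eqiad.wmflabs'  # assumed host-scoped variant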

And the host is now taking jobs (see the count of jobs at the start of the output):

root@tools-sgegrid-master:~# exec-manage status tools-sgeexec-0916.tools.eqiad.wmflabs
Count of jobs running on host tools-sgeexec-0916.tools.eqiad.wmflabs :
1

Jobs running on host tools-sgeexec-0916.tools.eqiad.wmflabs :
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
tools-sgeexec-0916.tools.eqiad.wmflabs lx-amd64        4    4    4    4  0.17    7.8G  475.1M   23.9G     0.0

job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
----------------------------------------------------------------------------------------------
1416775 0.25000 fc_co_depi tools.fist   r     03/02/2022 14:47:14 task@tools MASTER

Status of queues on this host (States - d = disabled) :
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@tools-sgeexec-0916.tools. BI    0/1/50         0.17     lx-amd64
---------------------------------------------------------------------------------
continuous@tools-sgeexec-0916. BC    0/0/50         0.17     lx-amd64

Will try to reproduce with the other node and document the process.

For tools-sgeexec-0913, all that was needed was to stop and start the service again; starting alone did not do the trick because it is an init-script service and is considered active even after the daemon has exited:

root@tools-sgeexec-0913:~# systemctl status sge_execd.service
● gridengine-exec.service - LSB: SGE Execution Daemon init script
   Loaded: loaded (/etc/init.d/gridengine-exec; generated; vendor preset: enabled)
   Active: active (exited) since Wed 2022-01-19 17:38:36 UTC; 1 months 12 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 682 ExecStart=/etc/init.d/gridengine-exec start (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/gridengine-exec.service

root@tools-sgeexec-0913:~# systemctl stop sge_execd.service
root@tools-sgeexec-0913:~# systemctl start sge_execd.service
root@tools-sgeexec-0913:~# systemctl status sge_execd.service
● gridengine-exec.service - LSB: SGE Execution Daemon init script
   Loaded: loaded (/etc/init.d/gridengine-exec; generated; vendor preset: enabled)
   Active: active (running) since Thu 2022-03-03 13:14:55 UTC; 1s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 11805 ExecStop=/etc/init.d/gridengine-exec stop (code=exited, status=0/SUCCESS)
  Process: 11821 ExecStart=/etc/init.d/gridengine-exec start (code=exited, status=0/SUCCESS)
    Tasks: 41 (limit: 4915)
   CGroup: /system.slice/gridengine-exec.service
   ...
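
(Presumably a single restart would have had the same effect, since systemd runs the init script's stop and start actions in sequence; not tested here:)

root@tools-sgeexec-0913:~# systemctl restart sge_execd.service  # assumed one-step equivalent of the stop/start above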

Now the node is up and running; I will close this task and update the docs.