
Cannot delete job in status 'dr' in Tool Labs
Closed, Resolved (Public)

Description

I usually run a tool on a Tomcat webserver:

https://tools.wmflabs.org/replacer/

Today I tried to stop the server in order to do a fresh redeployment, but the job is stuck:

tools.replacer@tools-bastion-02:~$ webservice tomcat stop
Stopping webservice...............
tools.replacer@tools-bastion-02:~$ webservice tomcat status
Your webservice of type tomcat is running

I have even tried the qdel -f command, but qstat still shows the job in state dr:

tools.replacer@tools-bastion-02:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
6952476 0.35652 tomcat-rep tools.replac dr    06/06/2018 18:04:02 webgrid-generic@tools-webgrid-     1
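
For anyone reading along who is not familiar with Grid Engine state codes: the `dr` in the state column is two letters, `d` (deletion requested) plus `r` (running), so the delete was accepted but the job has not been confirmed gone. The following is only a small Python sketch for expanding such codes. The helper name `describe_state` is made up for illustration, and the table covers only the letters that appear in this task.

```python
# Single-letter Grid Engine job states seen in this task
# (meanings as documented for the qstat state column).
STATE_LETTERS = {
    "d": "deletion requested",
    "r": "running",
    "R": "restarted",
    "t": "transferring",
    "E": "error",
    "q": "queued",
    "w": "waiting",
}


def describe_state(code: str) -> list[str]:
    """Expand a combined state code such as 'dr' into its parts."""
    return [STATE_LETTERS.get(letter, f"unknown ({letter})") for letter in code]


print(describe_state("dr"))
```

Letters not in the table are reported as unknown rather than guessed.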

Event Timeline

root@tools-bastion-05:~# qstat -u tools.replacer -xml
<?xml version='1.0'?>
<job_info  xmlns:xsd="http://gridengine.sunsource.net/source/browse/*checkout*/gridengine/source/dist/util/resources/schemas/qstat/qstat.xsd?revision=1.11">
  <queue_info>
    <job_list state="running">
      <JB_job_number>6952476</JB_job_number>
      <JAT_prio>0.35658</JAT_prio>
      <JB_name>tomcat-replacer</JB_name>
      <JB_owner>tools.replacer</JB_owner>
      <state>dr</state>
      <JAT_start_time>2018-06-06T18:04:02</JAT_start_time>
      <queue_name>webgrid-generic@tools-webgrid-generic-1402.eqiad.wmflabs</queue_name>
      <slots>1</slots>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>
root@tools-bastion-05:~# ping tools-webgrid-generic-1402
PING tools-webgrid-generic-1402.tools.eqiad.wmflabs (10.68.18.50) 56(84) bytes of data.
From tools-bastion-05.tools.eqiad.wmflabs (10.68.23.74) icmp_seq=1 Destination Host Unreachable
From tools-bastion-05.tools.eqiad.wmflabs (10.68.23.74) icmp_seq=2 Destination Host Unreachable
From tools-bastion-05.tools.eqiad.wmflabs (10.68.23.74) icmp_seq=3 Destination Host Unreachable
^C
--- tools-webgrid-generic-1402.tools.eqiad.wmflabs ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3012ms
pipe 3

Is tools-webgrid-generic-1402 down?
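
The check above (find the job's state, then look at which exec host it sits on) could also be scripted against the `qstat -xml` output. This is a rough Python sketch, not something that was run on the bastion: `SAMPLE` is a trimmed copy of the XML above with the namespace attribute left out, and `jobs_pending_deletion` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the `qstat -xml` output (namespace attribute omitted).
SAMPLE = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>6952476</JB_job_number>
      <JB_name>tomcat-replacer</JB_name>
      <state>dr</state>
      <queue_name>webgrid-generic@tools-webgrid-generic-1402.eqiad.wmflabs</queue_name>
    </job_list>
  </queue_info>
  <job_info>
  </job_info>
</job_info>"""


def jobs_pending_deletion(xml_text):
    """Return (job number, name, exec host) for jobs whose state contains 'd'."""
    root = ET.fromstring(xml_text)
    stuck = []
    for job in root.iter("job_list"):
        state = job.findtext("state", default="")
        if "d" in state:
            queue = job.findtext("queue_name", default="")
            # queue_name has the form "<queue>@<host>"; keep the host part.
            host = queue.split("@", 1)[-1]
            stuck.append(
                (job.findtext("JB_job_number"), job.findtext("JB_name"), host)
            )
    return stuck


print(jobs_pending_deletion(SAMPLE))
```

The host returned for each job is the one to ping or inspect next, as was done by hand here.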

Affected jobs:

root@tools-bastion-05:~# qstat -q '*'@tools-webgrid-generic-1402.eqiad.wmflabs -u '*'
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 544477 0.44357 uwsgi-pyth tools.cvrmin dRr   06/06/2018 17:58:51 webgrid-generic@tools-webgrid-     1        
6948715 0.35664 uwsgi-pyth tools.navlin r     06/06/2018 16:18:46 webgrid-generic@tools-webgrid-     1        
6948716 0.35664 uwsgi-pyth tools.cobot  r     06/06/2018 16:18:50 webgrid-generic@tools-webgrid-     1        
6948719 0.35664 uwsgi-pyth tools.orpheu r     06/06/2018 16:18:58 webgrid-generic@tools-webgrid-     1        
6948743 0.35664 uwsgi-pyth tools.test-w r     06/06/2018 16:19:42 webgrid-generic@tools-webgrid-     1        
6948750 0.35664 uwsgi-pyth tools.conten dr    06/06/2018 16:20:02 webgrid-generic@tools-webgrid-     1        
6948864 0.35663 uwsgi-pyth tools.wdcat  r     06/06/2018 16:20:35 webgrid-generic@tools-webgrid-     1        
6952088 0.35659 uwsgi-plai tools.wlm-de r     06/06/2018 17:59:42 webgrid-generic@tools-webgrid-     1        
6952448 0.35659 uwsgi-pyth tools.canary r     06/06/2018 18:02:31 webgrid-generic@tools-webgrid-     1        
6952452 0.35659 generic-su tools.sugges dr    06/06/2018 18:02:58 webgrid-generic@tools-webgrid-     1        
6952453 0.35659 uwsgi-pyth tools.blogco dr    06/06/2018 18:03:02 webgrid-generic@tools-webgrid-     1        
6952468 0.35659 uwsgi-pyth tools.ldap   dr    06/06/2018 18:03:36 webgrid-generic@tools-webgrid-     1        
6952469 0.35659 uwsgi-pyth tools.contac dr    06/06/2018 18:03:40 webgrid-generic@tools-webgrid-     1        
6952473 0.35659 uwsgi-pyth tools.clicks r     06/06/2018 18:03:52 webgrid-generic@tools-webgrid-     1        
6952476 0.35659 tomcat-rep tools.replac dr    06/06/2018 18:04:02 webgrid-generic@tools-webgrid-     1        
6952488 0.35659 nodejs-nee tools.neecha r     06/06/2018 18:04:46 webgrid-generic@tools-webgrid-     1        
9284809 0.54394 nodejs-sit tools.sit    Rr    06/06/2018 17:58:51 webgrid-generic@tools-webgrid-     1        
1367770 0.33754 generic-wd tools.wd-con t     07/04/2018 10:21:04 webgrid-generic@tools-webgrid-     1        
1367831 0.33753 generic-wd tools.wd-con t     07/04/2018 10:24:52 webgrid-generic@tools-webgrid-     1        
1367976 0.33753 generic-wd tools.wd-con t     07/04/2018 10:23:30 webgrid-generic@tools-webgrid-     1        
1368114 0.33753 generic-wd tools.wd-con t     07/04/2018 10:21:42 webgrid-generic@tools-webgrid-     1        
1368289 0.33753 generic-wd tools.wd-con t     07/04/2018 10:20:19 webgrid-generic@tools-webgrid-     1        
1368330 0.33753 generic-wd tools.wd-con t     07/04/2018 10:23:27 webgrid-generic@tools-webgrid-     1        
 394500 0.30000 test       tools.botwik Eqw   06/15/2018 08:02:11                                    1 1-5:1

Mentioned in SAL (#wikimedia-cloud) [2018-08-27T22:36:14Z] <zhuyifei1999_> # exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs T202932

root@tools-bastion-05:~# qhost -j -h tools-webgrid-generic-1402.eqiad.wmflabs
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-webgrid-generic-1402.eqiad.wmflabs lx26-amd64      4     -    7.8G       -   23.9G       -
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID 
   ----------------------------------------------------------------------------------------------
    544477 0.44357 uwsgi-pyth tools.cvrmin dRr   06/06/2018 17:58:51 webgrid-ge MASTER        
   6948715 0.35664 uwsgi-pyth tools.navlin r     06/06/2018 16:18:46 webgrid-ge MASTER        
   6948716 0.35664 uwsgi-pyth tools.cobot  r     06/06/2018 16:18:50 webgrid-ge MASTER        
   6948719 0.35664 uwsgi-pyth tools.orpheu r     06/06/2018 16:18:58 webgrid-ge MASTER        
   6948743 0.35664 uwsgi-pyth tools.test-w r     06/06/2018 16:19:42 webgrid-ge MASTER        
   6948750 0.35664 uwsgi-pyth tools.conten dr    06/06/2018 16:20:02 webgrid-ge MASTER        
   6948864 0.35664 uwsgi-pyth tools.wdcat  r     06/06/2018 16:20:35 webgrid-ge MASTER        
   6952088 0.35659 uwsgi-plai tools.wlm-de r     06/06/2018 17:59:42 webgrid-ge MASTER        
   6952448 0.35659 uwsgi-pyth tools.canary r     06/06/2018 18:02:31 webgrid-ge MASTER        
   6952452 0.35659 generic-su tools.sugges dr    06/06/2018 18:02:58 webgrid-ge MASTER        
   6952453 0.35659 uwsgi-pyth tools.blogco dr    06/06/2018 18:03:02 webgrid-ge MASTER        
   6952468 0.35659 uwsgi-pyth tools.ldap   dr    06/06/2018 18:03:36 webgrid-ge MASTER        
   6952469 0.35659 uwsgi-pyth tools.contac dr    06/06/2018 18:03:40 webgrid-ge MASTER        
   6952473 0.35659 uwsgi-pyth tools.clicks r     06/06/2018 18:03:52 webgrid-ge MASTER        
   6952476 0.35659 tomcat-rep tools.replac dr    06/06/2018 18:04:02 webgrid-ge MASTER        
   6952488 0.35659 nodejs-nee tools.neecha r     06/06/2018 18:04:46 webgrid-ge MASTER        
   9284809 0.54394 nodejs-sit tools.sit    Rr    06/06/2018 17:58:51 webgrid-ge MASTER        
   1367770 0.33754 generic-wd tools.wd-con t     07/04/2018 10:21:04 webgrid-ge MASTER        
   1367831 0.33753 generic-wd tools.wd-con t     07/04/2018 10:24:52 webgrid-ge MASTER        
   1367976 0.33753 generic-wd tools.wd-con t     07/04/2018 10:23:30 webgrid-ge MASTER        
   1368114 0.33753 generic-wd tools.wd-con t     07/04/2018 10:21:42 webgrid-ge MASTER        
   1368289 0.33753 generic-wd tools.wd-con t     07/04/2018 10:20:19 webgrid-ge MASTER        
   1368330 0.33753 generic-wd tools.wd-con t     07/04/2018 10:23:27 webgrid-ge MASTER        
root@tools-bastion-05:~# which exec-manage
/usr/local/sbin/exec-manage
root@tools-bastion-05:~# less `which exec-manage`
root@tools-bastion-05:~# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs
root@tools-bastion-05.tools.eqiad.wmflabs changed state of "webgrid-generic@tools-webgrid-generic-1402.eqiad.wmflabs" (disabled)
This exec node has been depooled, and jobs that were running               prior have been rescheduled (if rerunable). Current status: 
    544477 0.44357 uwsgi-pyth tools.cvrmin dRr   06/06/2018 17:58:51 webgrid-ge MASTER        
   6948750 0.35664 uwsgi-pyth tools.conten dr    06/06/2018 16:20:02 webgrid-ge MASTER        
   6952452 0.35659 generic-su tools.sugges dr    06/06/2018 18:02:58 webgrid-ge MASTER        
   6952453 0.35659 uwsgi-pyth tools.blogco dr    06/06/2018 18:03:02 webgrid-ge MASTER        
   6952468 0.35659 uwsgi-pyth tools.ldap   dr    06/06/2018 18:03:36 webgrid-ge MASTER        
   6952469 0.35659 uwsgi-pyth tools.contac dr    06/06/2018 18:03:40 webgrid-ge MASTER        
   6952476 0.35659 tomcat-rep tools.replac dr    06/06/2018 18:04:02 webgrid-ge MASTER        
root@tools-bastion-05:~# qhost -j -h tools-webgrid-generic-1402.eqiad.wmflabs
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-webgrid-generic-1402.eqiad.wmflabs lx26-amd64      4     -    7.8G       -   23.9G       -
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID 
   ----------------------------------------------------------------------------------------------
    544477 0.44357 uwsgi-pyth tools.cvrmin dRr   06/06/2018 17:58:51 webgrid-ge MASTER        
   6948750 0.35664 uwsgi-pyth tools.conten dr    06/06/2018 16:20:02 webgrid-ge MASTER        
   6952452 0.35659 generic-su tools.sugges dr    06/06/2018 18:02:58 webgrid-ge MASTER        
   6952453 0.35659 uwsgi-pyth tools.blogco dr    06/06/2018 18:03:02 webgrid-ge MASTER        
   6952468 0.35659 uwsgi-pyth tools.ldap   dr    06/06/2018 18:03:36 webgrid-ge MASTER        
   6952469 0.35659 uwsgi-pyth tools.contac dr    06/06/2018 18:03:40 webgrid-ge MASTER        
   6952476 0.35659 tomcat-rep tools.replac dr    06/06/2018 18:04:02 webgrid-ge MASTER
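
Comparing the two `qhost -j` listings by hand shows that only the jobs with `d` in their state stayed on the host after the depool. A small Python sketch of that comparison is below. The listings are abbreviated to a few rows from the output above, and `job_states` is a hypothetical helper; it assumes the state is always the fifth whitespace-separated column, which holds for the rows shown here.

```python
def job_states(listing):
    """Map job-ID -> state from the job rows of `qstat` / `qhost -j` output."""
    states = {}
    for line in listing.splitlines():
        fields = line.split()
        # Job rows start with a numeric job-ID; the state is the fifth column.
        if len(fields) >= 5 and fields[0].isdigit():
            states[fields[0]] = fields[4]
    return states


# Abbreviated rows from before and after `exec-manage depool`.
BEFORE = """
    544477 0.44357 uwsgi-pyth tools.cvrmin dRr   06/06/2018 17:58:51 webgrid-ge MASTER
   6948715 0.35664 uwsgi-pyth tools.navlin r     06/06/2018 16:18:46 webgrid-ge MASTER
   6952476 0.35659 tomcat-rep tools.replac dr    06/06/2018 18:04:02 webgrid-ge MASTER
"""
AFTER = """
    544477 0.44357 uwsgi-pyth tools.cvrmin dRr   06/06/2018 17:58:51 webgrid-ge MASTER
   6952476 0.35659 tomcat-rep tools.replac dr    06/06/2018 18:04:02 webgrid-ge MASTER
"""

before, after = job_states(BEFORE), job_states(AFTER)
moved = sorted(set(before) - set(after))
still_deleting = sorted(job for job, state in after.items() if "d" in state)
print("no longer on host:", moved)
print("still pending deletion:", still_deleting)
```

Header, separator, and host summary lines are skipped because their first field is not numeric.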

Mentioned in SAL (#wikimedia-cloud) [2018-08-27T23:39:58Z] <bd808> # exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs T202932

I have been able to restart the Tomcat server. I think this issue can be closed.

Benjavalero claimed this task.