
toolforge: gridengine: a case of apparently orphaned jobs running (jarbot)
Closed, Invalid (Public)

Description

Today @jijiki asked me to stop the JarBot tool running in Toolforge.

I disabled all the cronjobs and deleted all the jobs by hand:

tools.jarbot@tools-sgebastion-08:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 357141 0.26591 MYFSQL     tools.jarbot r     01/20/2020 16:59:19 task@tools-sgeexec-0915.tools.     1        
4055387 0.29936 NMYRDV5    tools.jarbot r     01/04/2020 09:00:57 task@tools-sgeexec-0912.tools.     1        
 268515 0.25006 NMYRDV5ALS tools.jarbot r     01/28/2020 10:30:28 task@tools-sgeexec-0921.tools.     1        
 269036 0.25003 APRSVWB    tools.jarbot r     01/28/2020 10:45:13 task@tools-sgeexec-0937.tools.     1        
 269205 0.25003 nmt500n14t tools.jarbot r     01/28/2020 10:50:28 task@tools-sgeexec-0929.tools.     1        
 269268 0.25003 nentoarv2  tools.jarbot r     01/28/2020 10:51:12 task@tools-sgeexec-0928.tools.     1        
 269278 0.25003 newtmv3.8t tools.jarbot r     01/28/2020 10:51:13 task@tools-sgeexec-0926.tools.     1        
 269291 0.25002 nmt500ts   tools.jarbot r     01/28/2020 10:52:12 task@tools-sgeexec-0933.tools.     1        
 269320 0.25002 AWD        tools.jarbot r     01/28/2020 10:54:12 task@tools-sgeexec-0904.tools.     1        
 269339 0.25002 nentoarv3  tools.jarbot r     01/28/2020 10:54:12 task@tools-sgeexec-0920.tools.     1        
 269409 0.25002 nmt500ts   tools.jarbot r     01/28/2020 10:56:12 task@tools-sgeexec-0930.tools.     1        
 269432 0.25002 nentoarv4  tools.jarbot r     01/28/2020 10:57:12 task@tools-sgeexec-0936.tools.     1        
 269581 0.25001 nmt500ts   tools.jarbot r     01/28/2020 11:09:12 task@tools-sgeexec-0930.tools.     1        
 269645 0.25001 n500refs   tools.jarbot r     01/28/2020 11:09:12 task@tools-sgeexec-0931.tools.     1        
 269646 0.25001 NMYRLN     tools.jarbot r     01/28/2020 11:09:12 task@tools-sgeexec-0925.tools.     1        
 269647 0.25001 acn500ts   tools.jarbot r     01/28/2020 11:09:12 task@tools-sgeexec-0918.tools.     1        
 269652 0.00000 AACv5      tools.jarbot qw    01/28/2020 11:00:18                                    1        
 269670 0.00000 ADRTV3ts   tools.jarbot qw    01/28/2020 11:00:18                                    1        
 269676 0.00000 ACOMSVWB   tools.jarbot qw    01/28/2020 11:00:18                                    1        
 269689 0.00000 n500refsV2 tools.jarbot qw    01/28/2020 11:00:19                                    1        
 269690 0.00000 addpsv4ts  tools.jarbot qw    01/28/2020 11:00:19                                    1        
 269691 0.00000 nentoarv10 tools.jarbot qw    01/28/2020 11:00:19                                    1        
 269732 0.00000 mtv3v1ts   tools.jarbot qw    01/28/2020 11:01:02                                    1        
 269736 0.00000 addreftagt tools.jarbot qw    01/28/2020 11:01:02                                    1        
 269742 0.00000 commonsv5t tools.jarbot qw    01/28/2020 11:01:02                                    1        
 269747 0.00000 addsts     tools.jarbot qw    01/28/2020 11:01:02                                    1        
 269771 0.00000 AWD        tools.jarbot qw    01/28/2020 11:03:02                                    1        
 269790 0.00000 nmt500ts   tools.jarbot qw    01/28/2020 11:04:01                                    1        
 269865 0.00000 AWD        tools.jarbot qw    01/28/2020 11:06:02                                    1        
 269909 0.00000 nmt500ts   tools.jarbot qw    01/28/2020 11:08:02                                    1        
 269924 0.00000 AWD        tools.jarbot qw    01/28/2020 11:09:01                                    1     
tools.jarbot@tools-sgebastion-08:~$ qdel 357141
tools.jarbot has registered the job 357141 for deletion
tools.jarbot@tools-sgebastion-08:~$ qdel 4055387
tools.jarbot has registered the job 4055387 for deletion
tools.jarbot@tools-sgebastion-08:~$ qdel 269432
[...]
tools.jarbot@tools-sgebastion-08:~$ qstat
tools.jarbot@tools-sgebastion-08:~$

Eventually all grid jobs were deleted and qstat reported nothing. No webservices were running.

However, the SRE team reported that this bot was still active on the MediaWiki API:

2020-01-28 12:28:18 [XjAo4gpAIDwAADrCUswAAAAH] mw1348 urwiki 1.35.0-wmf.15 api INFO: API GET JarBot 172.16.1.232 T=13ms action=query format=json maxlag=5 titles=%D8%A8%D8%A7%D9%84%D8%AA%D8%B3%DB%8C redirects= meta=userinfo rawcontinue= uiprop=blockinfo%7Chasmsg

I could still see processes running on tools-sgeexec-0919, but only with htop. This command returned nothing:

aborrero@tools-sgeexec-0919:~$ ps -U tools.jarbot -u tools.jarbot u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

It seems to me there are some orphaned processes somehow.

Event Timeline

aborrero renamed this task from toolforge: gridengine: a case of apaprently orphaned jobs running (jarbot) to toolforge: gridengine: a case of apparently orphaned jobs running (jarbot).Jan 28 2020, 12:46 PM

I used this to try to detect the procs:

aborrero@tools-sgeexec-0919:~$ for i in $(ls /proc | grep [0-9]) ; do sudo grep "USER=tools.jarbot" /proc/${i}/environ | grep matches >/dev/null 2>/dev/null && ps -p $i l ; done
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4 53467 11498 11496  20   0 1049012 284728 -    Rsl  ?        799:00 /usr/bin/python3 core/pwb.py core/scripts/userscripts/WikiProjectQuarry.py
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4 53468 15141 15139  20   0 229268 79668 -      Ss   ?        134:56 /usr/bin/python3 core/pwb.py core/scripts/userscripts/uralltmv3.py
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
4 53467 23110 23108  20   0 1048152 251372 -    Ss   ?         42:34 /usr/bin/python3 core/pwb.py core/scripts/userscripts/WikiProjectQuarry.py
aborrero@tools-sgeexec-0919:~$ sudo ps aux | grep tools.j
tools.j+ 11498 22.7  3.5 1054436 290004 ?      Ss   Jan26 799:54 /usr/bin/python3 core/pwb.py core/scripts/userscripts/WikiProjectQuarry.py
tools.j+ 15141  6.3  0.9 229268 79668 ?        Ss   Jan27 135:08 /usr/bin/python3 core/pwb.py core/scripts/userscripts/uralltmv3.py
tools.j+ 23110 16.3  3.0 1048152 251372 ?      Ss   08:39  42:38 /usr/bin/python3 core/pwb.py core/scripts/userscripts/WikiProjectQuarry.py
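
The `ps aux` output above truncates the user column to `tools.j+`, so accounts that share a prefix look identical, while the `l` format shows two distinct UIDs (53467 and 53468). A minimal sketch for listing full user names, assuming procps `ps` (which accepts a column width such as `user:32`); `list_procs_wide` and `only_user` are hypothetical helper names, not existing Toolforge tooling:

```shell
# Print every process with an untruncated user name and the numeric UID.
# "user:32" widens the USER column so long names are not cut to "tools.j+".
list_procs_wide() {
    ps -eo user:32,uid,pid,args
}

# Keep only rows whose first column is exactly the given user name.
# This is an exact string comparison, so "tools.jarbot" does not match
# "tools.jarbot-ii".
only_user() {
    awk -v u="$1" '$1 == u'
}

list_procs_wide | only_user tools.jarbot
```

Comparing the whole first column, rather than grepping for a prefix, keeps accounts with similar names apart.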
aborrero triaged this task as Medium priority.Jan 28 2020, 1:05 PM
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-01-28T13:35:05Z] <arturo> aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux | grep [t]ools.j | awk -F" " "{print \$2}") ; do echo "killing $i" ; sudo kill $i ; done || true' (T243831)

I discovered there are several tools with similar names (jarbot-ii, jarbot-iii) apparently doing the same thing. So these were not orphaned procs, but actual separate tools!
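
This also fits the earlier detection loop: `grep "USER=tools.jarbot"` is a substring match, so it would also match `USER=tools.jarbot-ii` in `/proc/<pid>/environ`. A minimal sketch of an exact match on the NUL-separated environ file; `env_user_is` is a hypothetical helper name. Note that the environment only reflects what the process was started with, so the process UID remains the authoritative owner:

```shell
# Succeed only if the NUL-separated environment file contains exactly USER=<name>.
# tr turns the NUL separators into newlines; grep -x requires a whole-line match
# and -F treats the pattern as a literal string (so "." is not a wildcard).
env_user_is() {
    environ_file=$1
    user=$2
    tr '\0' '\n' < "$environ_file" | grep -qxF "USER=$user"
}

# Print the PIDs whose environment says USER is exactly tools.jarbot.
# Reading other users' /proc/<pid>/environ needs root; unreadable files are skipped.
for dir in /proc/[0-9]*; do
    if env_user_is "$dir/environ" tools.jarbot 2>/dev/null; then
        echo "${dir#/proc/}"
    fi
done
```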