On tools-exec-1202:
init(1)- ...
  |-perl(595)
  |-perl(608)
  |-perl(3123)
  |-perl(3171)
  |-perl(3174)
  |-perl(4337)
  |-perl(4382)
  |-perl(5389)
  |-perl(6593)
  |-perl(8924)
  |-perl(8933)
  |-perl(9390)
  |-perl(9921)
  |-perl(12606)
  |-perl(17345)
  |-perl(19282)
  |-perl(19310)
  |-perl(19343)
  |-perl(19345)
  |-perl(19415)
  |-perl(19471)
  |-perl(19473)
  |-perl(19489)
  |-perl(19548)
  |-perl(19563)
  |-perl(19630)
all of these are various linkwatcher scripts:
valhallasw@tools-exec-1202:~$ ps aux | grep Link
51230   608  0.7  1.0 234148  85092 ?     SN Jan05 30:53 perl LinkAnalyser.pl LiWa3 15
51230  3171  0.1  0.3 178992  31708 ?     SN Jan05  9:16 perl LinkReporter.pl LiWa3 2 2
51230  3174  0.1  0.3 178992  31876 ?     SN Jan05  9:11 perl LinkReporter.pl LiWa3 3 3
51230  4382  0.1  0.3 178992  31604 ?     SN Jan05  8:32 perl LinkReporter.pl LiWa3 6 1
51230  5389  3.8  1.1 250392  96924 ?     SN 09:00 19:19 perl LinkAnalyser.pl LiWa3 32
51230  6593  1.0  0.6 197032  50696 ?     SN 09:17  4:54 perl LinkAnalyser.pl LiWa3 33
1092  15571  0.0  0.0  33340    948 pts/0 S+ 17:22  0:00 grep --color=auto Link
51230 17345  0.5  0.5 183440  41188 ?     SN 11:30  1:54 perl LinkAnalyser.pl LiWa3 34
51230 19282  8.9  1.3 174480 111716 ?     SN 11:39 30:37 perl LinkParser.pl LiWa3 295
51230 19310  8.8  1.3 172324 109484 ?     SN 11:41 30:04 perl LinkParser.pl LiWa3 296
51230 19343  9.0  1.3 172132 108484 ?     SN 11:43 30:34 perl LinkParser.pl LiWa3 297
51230 19345  9.0  1.3 177168 114380 ?     SN 11:43 30:33 perl LinkParser.pl LiWa3 298
51230 19415  9.1  1.4 178900 114704 ?     RN 11:48 30:29 perl LinkParser.pl LiWa3 299
51230 19471  8.8  1.3 170764 107820 ?     SN 11:52 29:13 perl LinkParser.pl LiWa3 300
51230 19473  8.9  1.3 175928 112784 ?     SN 11:52 29:38 perl LinkParser.pl LiWa3 301
51230 19489  9.0  1.3 173916 111056 ?     SN 11:53 29:44 perl LinkParser.pl LiWa3 302
51230 19548  8.6  1.2 168600 105656 ?     SN 11:57 28:17 perl LinkParser.pl LiWa3 303
51230 19563  9.0  1.3 177424 113312 ?     SN 11:58 29:14 perl LinkParser.pl LiWa3 304
51230 19630  8.9  1.3 171756 108752 ?     SN 11:59 28:48 perl LinkParser.pl LiWa3 305
51230 19632  9.0  1.3 171948 108820 ?     SN 11:59 29:07 perl LinkParser.pl LiWa3 306
51230 19855  9.0  1.3 169528 106708 ?     SN 11:59 29:23 perl LinkParser.pl LiWa3 307
51230 20462  8.9  1.3 174068 110140 ?     SN 12:00 28:42 perl LinkParser.pl LiWa3 308
51230 21750  8.8  1.2 168640 105428 ?     SN 12:01 28:22 perl LinkParser.pl LiWa3 309
51230 21768  8.6  1.2 164576 100880 ?     SN 12:03 27:30 perl LinkParser.pl LiWa3 310
51230 21801  8.7  1.2 170160 105996 ?     SN 12:04 28:00 perl LinkParser.pl LiWa3 311
51230 21819  8.7  1.3 172372 108356 ?     SN 12:05 27:41 perl LinkParser.pl LiWa3 312
51230 21822  8.9  1.3 169696 106492 ?     SN 12:05 28:20 perl LinkParser.pl LiWa3 313
51230 21837  8.7  1.3 173524 110732 ?     SN 12:06 27:48 perl LinkParser.pl LiWa3 314
51230 21887  8.9  1.2 169928 105692 ?     SN 12:09 28:01 perl LinkParser.pl LiWa3 315
51230 21902  8.7  1.2 163948 100924 ?     SN 12:10 27:21 perl LinkParser.pl LiWa3 316
51230 21917  8.8  1.2 163268 100404 ?     RN 12:11 27:34 perl LinkParser.pl LiWa3 317
51230 22102  9.0  1.2 163640 100604 ?     SN 12:24 27:06 perl LinkParser.pl LiWa3 318
51230 22104  9.0  1.2 167164 104344 ?     RN 12:24 26:52 perl LinkParser.pl LiWa3 319
51230 24354  9.0  1.2 163740 100720 ?     SN 12:33 26:09 perl LinkParser.pl LiWa3 320
51230 26992  9.2  1.1 153988  90796 ?     SN 13:08 23:35 perl LinkParser.pl LiWa3 321
51230 27068  9.3  1.0 151480  87912 ?     SN 13:13 23:17 perl LinkParser.pl LiWa3 322
51230 29534  9.0  1.0 147752  84796 ?     SN 13:36 20:30 perl LinkParser.pl LiWa3 323
51230 29694  9.1  0.9 144920  81320 ?     SN 13:47 19:38 perl LinkParser.pl LiWa3 324
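To put a number on the load, the listing above can be totalled per script. The following is only a rough sketch: the function name `summarise` and the `sample` list are made up for illustration (the sample rows are copied from the listing), and it assumes the standard `ps aux` column order (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND).

```python
from collections import defaultdict


def summarise(ps_rows):
    """Sum process count, %CPU and RSS (KiB) per perl script from `ps aux`-style rows."""
    totals = defaultdict(lambda: {"count": 0, "cpu": 0.0, "rss_kib": 0})
    for row in ps_rows:
        # The first ten columns are fixed; the eleventh is the full command line.
        fields = row.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10].split()
        if len(command) < 2 or command[0] != "perl":
            continue  # skips non-perl rows, e.g. the grep process itself
        entry = totals[command[1]]
        entry["count"] += 1
        entry["cpu"] += float(fields[2])   # %CPU column
        entry["rss_kib"] += int(fields[5])  # RSS column, in KiB
    return dict(totals)


# Hypothetical sample: a few rows copied from the listing above.
sample = [
    "51230 608 0.7 1.0 234148 85092 ? SN Jan05 30:53 perl LinkAnalyser.pl LiWa3 15",
    "51230 19282 8.9 1.3 174480 111716 ? SN 11:39 30:37 perl LinkParser.pl LiWa3 295",
    "51230 19310 8.8 1.3 172324 109484 ? SN 11:41 30:04 perl LinkParser.pl LiWa3 296",
    "1092 15571 0.0 0.0 33340 948 pts/0 S+ 17:22 0:00 grep --color=auto Link",
]

for script, t in sorted(summarise(sample).items()):
    print(script, t["count"], round(t["cpu"], 1), t["rss_kib"])
```

Feeding it the full listing would give the per-script process count, summed %CPU and summed resident memory, which is what matters when comparing against the 4 cores and 7.8G the host has.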
Because it spawns a large number of processes that are not under SGE control, LiWa is effectively overloading tools-exec-1202. I have rescheduled the other continuous jobs on that host:
$ qhost -h 'tools-exec-1202' -j
tools-exec-1202.eqiad.wmflabs lx26-amd64 4 5.77 7.8G 4.6G 23.9G 0.0
 job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
 ----------------------------------------------------------------------------------------------
     314 0.32354 BCBot4     tools.betaco r     12/30/2015 03:59:30 continuous MASTER
    5580 0.80000 vandalstat tools.cluest Rr    12/30/2015 03:54:44 continuous MASTER
   41876 0.35471 ghaher69   tools.dexbot Rr    12/30/2015 03:54:44 continuous MASTER
  165405 0.54393 analytics- tools.morebo Rr    12/30/2015 03:54:44 continuous MASTER
  287009 0.36718 rmiw.w1    tools.yifeib Rr    12/30/2015 03:54:44 continuous MASTER
 1518701 0.43664 foo        tools.pirsqu Rr    12/30/2015 03:54:44 continuous MASTER
 1967777 0.30862 linkwatche tools.linkwa r     01/05/2016 05:25:51 continuous MASTER
 1808912 0.41793 gpy        tools.gpy    r     11/21/2015 20:19:43 task@tools MASTER

$ qmod -rj 314 5580 41876 165405 287009 1518701
Pushed rescheduling of job 314 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 5580 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 41876 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 165405 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 287009 on host tools-exec-1202.eqiad.wmflabs
Pushed rescheduling of job 1518701 on host tools-exec-1202.eqiad.wmflabs

valhallasw@tools-bastion-02:~$ qhost -h 'tools-exec-1202' -j
HOSTNAME                      ARCH       NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global                        -          -    -    -      -      -      -
tools-exec-1202.eqiad.wmflabs lx26-amd64 4    6.24 7.8G   4.6G   23.9G  0.0
 job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
 ----------------------------------------------------------------------------------------------
 1967777 0.30862 linkwatche tools.linkwa r     01/05/2016 05:25:51 continuous MASTER
 1808912 0.41793 gpy        tools.gpy    r     11/21/2015 20:19:43 task@tools MASTER
so it's now effectively a linkwatcher-only host. I haven't killed any processes.
@Beetstra, can you make sure this doesn't happen?