Got these weird emails:
From: Cron Daemon <root@tools.wmflabs.org>
To: prometheus@tools.wmflabs.org
Subject: Cron <prometheus@tools-sgeexec-0908> /usr/local/bin/prometheus-local-crontabs
Date: Sat, 28 Nov 2020 22:39:20 +0000

/usr/local/bin/prometheus-local-crontabs: line 27: 10211 Segmentation fault /usr/bin/sudo -u root /bin/ls -1 /var/spool/cron/crontabs/

To: root@tools.wmflabs.org
Subject: *** SECURITY information for tools-sgeexec-0908.tools.eqiad.wmflabs ***
From: "Prometheus daemon,,," <prometheus@tools.wmflabs.org>
Date: Sat, 28 Nov 2020 22:39:20 +0000

tools-sgeexec-0908.tools.eqiad.wmflabs : Nov 28 22:39:20 : prometheus : problem with defaults entries ; TTY=unknown ; PWD=/var/lib/prometheus ; USER=root ;

To: root@tools.wmflabs.org
Subject: *** SECURITY information for tools-sgeexec-0908.tools.eqiad.wmflabs ***
From: "Prometheus daemon,,," <prometheus@tools.wmflabs.org>
Date: Sat, 28 Nov 2020 22:39:20 +0000

tools-sgeexec-0908.tools.eqiad.wmflabs : Nov 28 22:39:19 : prometheus : problem with defaults entries ; TTY=unknown ; PWD=/var/lib/prometheus ; USER=root ;
I can't SSH in to tools-sgeexec-0908: not from my laptop (via bastion), not via cumin, and not via clush. The connection just hangs, even as root. tools-sgeexec-0907 and tools-sgeexec-0909 are fine.
The hypervisor is cloudvirt1031.eqiad.wmnet. I checked, and I can SSH to other instances running there okay.
Prometheus data has stopped abruptly: https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1d&g0.end_input=2020-11-28%2023%3A27&g0.expr=node_cpu_seconds_total%7Bproject%3D%22tools%22%2C%20instance%3D%22tools-sgeexec-0908%22%7D&g0.tab=0
krenair@tools-sgegrid-master:~$ qhost -j -h tools-sgeexec-0908.tools.eqiad1.wikimedia.cloud
HOSTNAME                                ARCH      NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global                                  -         -    -    -    -    -    -      -      -      -
tools-sgeexec-0908.tools.eqiad.wmflabs  lx-amd64  4    4    4    4    -    7.8G   -      23.9G  -
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   835100  0.31545 start      tools.usrd-t Rr    09/18/2020 00:50:04 continuous MASTER
   85368   0.31456 musescore- tools.archiv r     09/19/2020 00:05:12 continuous MASTER
   556629  0.30394 AnomieBOT- tools.anomie r     09/30/2020 16:11:53 continuous MASTER
   1302687 0.28747 KTtrwiki   tools.ket-bo Rr    10/29/2020 17:25:15 continuous MASTER
   3229812 0.25007 ewimage    aka          r     11/28/2020 21:06:08 task@tools MASTER
krenair@tools-sgegrid-master:~$
I rescheduled the 4 continuous jobs with qmod -rj <jid> on tools-sgegrid-master.
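For anyone who has to repeat this on another stuck exec node, here is a rough sketch of how that step could be scripted. It is not what I actually ran (I ran qmod -rj by hand for each job). The function names continuous_jids and reschedule_continuous are made up for illustration, and the awk filter assumes the job lines look like the qhost -j output pasted above: numeric job ID first, queue name second to last, no ja-task-ID column populated.

```shell
#!/bin/sh
# Hypothetical helper: read `qhost -j -h <host>` output on stdin and print
# the IDs of jobs whose queue column is "continuous".
# Assumption: job lines start with a numeric job ID and the queue name is
# the second-to-last field (true only when ja-task-ID is empty).
continuous_jids() {
  awk '$1 ~ /^[0-9]+$/ && $(NF-1) == "continuous" { print $1 }'
}

# Hypothetical wrapper: dry run that prints the qmod commands for one host.
# Drop the `echo` to actually reschedule the jobs.
reschedule_continuous() {
  host="$1"
  qhost -j -h "$host" | continuous_jids | while read -r jid; do
    echo qmod -rj "$jid"
  done
}
```

Usage would be something like reschedule_continuous tools-sgeexec-0908.tools.eqiad1.wikimedia.cloud on the grid master. Keeping the echo in by default means nothing is rescheduled until the list has been checked by eye.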
qstat -f -explain a shows:
task@tools-sgeexec-0908.tools. BI 0/1/50 -NA- lx-amd64 au
	error: no value for "np_load_avg" because execd is in unknown state
continuous@tools-sgeexec-0908. BC 0/0/50 -NA- lx-amd64 au
	error: no value for "np_load_avg" because execd is in unknown state
(No other queue instance has a state set or shows -NA- for load average.)
I'll leave the instance running instead of attempting a reboot, because there's a non-continuous user job there by @Aka that seems to be functioning in some way (it is still writing to NFS). I'll leave this ticket for someone to look at on Monday. I figure SGE won't try to schedule anything new there while the queues are in the alarm/unreachable state.