tools-sgeexec-0908.tools.eqiad1.wikimedia.cloud is misbehaving
Open, Needs Triage, Public

Description

Got these weird emails:

From: Cron Daemon <root@tools.wmflabs.org>
To: prometheus@tools.wmflabs.org
Subject: Cron <prometheus@tools-sgeexec-0908> /usr/local/bin/prometheus-local-crontabs
Date: Sat, 28 Nov 2020 22:39:20 +0000

/usr/local/bin/prometheus-local-crontabs: line 27: 10211 Segmentation fault      /usr/bin/sudo -u root /bin/ls -1 /var/spool/cron/crontabs/

To: root@tools.wmflabs.org
Subject: *** SECURITY information for tools-sgeexec-0908.tools.eqiad.wmflabs ***
From: "Prometheus daemon,,," <prometheus@tools.wmflabs.org>
Date: Sat, 28 Nov 2020 22:39:20 +0000

tools-sgeexec-0908.tools.eqiad.wmflabs : Nov 28 22:39:20 : prometheus : problem with defaults entries ; TTY=unknown ; PWD=/var/lib/prometheus ; USER=root ;

To: root@tools.wmflabs.org
Subject: *** SECURITY information for tools-sgeexec-0908.tools.eqiad.wmflabs ***
From: "Prometheus daemon,,," <prometheus@tools.wmflabs.org>
Date: Sat, 28 Nov 2020 22:39:20 +0000

tools-sgeexec-0908.tools.eqiad.wmflabs : Nov 28 22:39:19 : prometheus : problem with defaults entries ; TTY=unknown ; PWD=/var/lib/prometheus ; USER=root ;

I can't SSH in from my laptop (via the bastion), via cumin, or via clush; the connection just hangs, even as root. 0907 and 0909 are fine.
The hypervisor is cloudvirt1031.eqiad.wmnet. I checked, and I can SSH to other instances running there without problems.
Prometheus data has stopped abruptly: https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1d&g0.end_input=2020-11-28%2023%3A27&g0.expr=node_cpu_seconds_total%7Bproject%3D%22tools%22%2C%20instance%3D%22tools-sgeexec-0908%22%7D&g0.tab=0

krenair@tools-sgegrid-master:~$ qhost -j -h tools-sgeexec-0908.tools.eqiad1.wikimedia.cloud
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
tools-sgeexec-0908.tools.eqiad.wmflabs lx-amd64        4    4    4    4     -    7.8G       -   23.9G       -
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID 
   ----------------------------------------------------------------------------------------------
    835100 0.31545 start      tools.usrd-t Rr    09/18/2020 00:50:04 continuous MASTER        
     85368 0.31456 musescore- tools.archiv r     09/19/2020 00:05:12 continuous MASTER        
    556629 0.30394 AnomieBOT- tools.anomie r     09/30/2020 16:11:53 continuous MASTER        
   1302687 0.28747 KTtrwiki   tools.ket-bo Rr    10/29/2020 17:25:15 continuous MASTER        
   3229812 0.25007 ewimage    aka          r     11/28/2020 21:06:08 task@tools MASTER        
krenair@tools-sgegrid-master:~$

I rescheduled the 4 continuous jobs with qmod -rj <jid> on tools-sgegrid-master.
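
For anyone repeating this on another host, here is a rough Python sketch of how the job IDs could be pulled out of the qhost listing above. It assumes the column layout shown in that output (queue name in the eighth whitespace-separated field) and job names without spaces; the function name continuous_job_ids is made up for illustration. It only prints the qmod -rj commands and does not run them.

```python
# Sketch: pick the job IDs in the "continuous" queue out of `qhost -j` output
# and print the matching `qmod -rj` commands (printed only, nothing is executed).
# Assumes the column layout pasted in this ticket and job names without spaces.
QHOST_OUTPUT = """\
    835100 0.31545 start      tools.usrd-t Rr    09/18/2020 00:50:04 continuous MASTER
     85368 0.31456 musescore- tools.archiv r     09/19/2020 00:05:12 continuous MASTER
    556629 0.30394 AnomieBOT- tools.anomie r     09/30/2020 16:11:53 continuous MASTER
   1302687 0.28747 KTtrwiki   tools.ket-bo Rr    10/29/2020 17:25:15 continuous MASTER
   3229812 0.25007 ewimage    aka          r     11/28/2020 21:06:08 task@tools MASTER
"""


def continuous_job_ids(text):
    """Return the job IDs whose queue column is 'continuous' (hypothetical helper)."""
    ids = []
    for line in text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; the queue name is the 8th field.
        if len(fields) >= 8 and fields[0].isdigit() and fields[7] == "continuous":
            ids.append(int(fields[0]))
    return ids


for jid in continuous_job_ids(QHOST_OUTPUT):
    print(f"qmod -rj {jid}")
```

The task@tools job is skipped on purpose, matching the decision to leave the non-continuous job alone.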

qstat -f -explain a shows

task@tools-sgeexec-0908.tools. BI    0/1/50         -NA-     lx-amd64      au
 error: no value for "np_load_avg" because execd is in unknown state
continuous@tools-sgeexec-0908. BC    0/0/50         -NA-     lx-amd64      au
 error: no value for "np_load_avg" because execd is in unknown state

(No other queue instance shows a state flag or -NA- for load average.)
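
A quick way to run the same check without reading the whole listing by eye might look like the Python sketch below. It assumes the qstat -f column layout pasted above (slots in the third field, load average in the fourth, an optional states column in the sixth); the helper name flagged_queues is hypothetical, and the sample input is just the two queue lines from this ticket.

```python
import re

# Sketch: list queue instances from `qstat -f -explain a` output that carry a
# state flag or report -NA- as load average. Column positions are assumed from
# the output pasted in this ticket.
QSTAT_OUTPUT = """\
task@tools-sgeexec-0908.tools. BI    0/1/50         -NA-     lx-amd64      au
 error: no value for "np_load_avg" because execd is in unknown state
continuous@tools-sgeexec-0908. BC    0/0/50         -NA-     lx-amd64      au
 error: no value for "np_load_avg" because execd is in unknown state
"""

SLOTS = re.compile(r"^\d+/\d+/\d+$")  # resv/used/total slots column


def flagged_queues(text):
    """Return (queue, states) pairs for queue instances that look unhealthy."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Skip "error:" explanation lines and anything that is not a queue row.
        if len(fields) < 5 or not SLOTS.match(fields[2]):
            continue
        states = fields[5] if len(fields) > 5 else ""
        if states or fields[3] == "-NA-":
            flagged.append((fields[0], states))
    return flagged


for queue, states in flagged_queues(QSTAT_OUTPUT):
    print(queue, states or "(no state flag)")
```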
I'll leave it running rather than attempt a reboot, since there's a non-continuous user job there by @Aka that seems to be functioning in some way (it is writing to NFS), and I'll leave this ticket for someone to look at on Monday. I figure SGE won't try to schedule anything new there while it's in the alarm/unreachable state.

Event Timeline

Some of the continuous jobs that were stopped (all except anomie's) have sent failure emails to root@ with errors like: can't get password entry for user "tools.ket-bot". I imagine that, given how broken this instance is, LDAP connectivity is one of the issues. There were also a couple more 'problem with defaults entries' emails. All of these arrived around 01:12.

I just checked on this again and found that I can SSH in, tasks are running on it, and SGE has cleared the alarm/unreachable flags. Based on the Prometheus data it came back at 02:12:25 (after having stopped at 21:36:25), and according to uptime it hasn't been restarted.
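
For the record, the length of the gap in the Prometheus data can be worked out as in the Python sketch below. It assumes the data stopped on 2020-11-28 and resumed on 2020-11-29, and that both times are in the same time zone; neither is stated explicitly above.

```python
from datetime import datetime

# Assumed dates: stop on 28 Nov 2020, return on 29 Nov 2020, same time zone.
stopped = datetime(2020, 11, 28, 21, 36, 25)
resumed = datetime(2020, 11, 29, 2, 12, 25)

gap = resumed - stopped
print(gap)  # 4:36:00
```

So the instance was not reporting metrics for roughly four and a half hours.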