On Mar 23rd the NFS VM serving toolsbeta was deleted by mistake.
This ticket is to do everything needed to get toolsbeta to the previous functional state.
Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:32:40Z] <dcaro> restarting the sge-master (T304672)
Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:43:47Z] <dcaro> restarting the sge-shadow (T304672)
Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:55:04Z] <dcaro> force restarting all the other nfs-bound VMs one by one (T304672)
Mentioned in SAL (#wikimedia-cloud) [2022-03-25T11:31:12Z] <dcaro> All alerting VMs rebooted, checking that everything is "working" (T304672)
K8s looks ok now, the grid has a few queues in alert (the ones with a/u), will debug:
root@toolsbeta-sgegrid-master:~# qhost -q
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
toolsbeta-sgeexec-0901.toolsbeta.eqiad.wmflabs lx-amd64 2 2 2 2 - 3.9G - 511.0M -
    continuous BC 0/0/50 au
    task BI 0/0/50 au
toolsbeta-sgeexec-0902.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 0.01 7.8G 317.1M 0.0 0.0
    continuous BC 0/0/50
    task BI 0/0/50
toolsbeta-sgeexec-10-5.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 0.00 7.8G 399.8M 0.0 0.0
    continuous BC 0/0/50
    task BI 0/0/50
toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 - 7.8G - 24.0M -
    webgrid-generic B 0/0/256 adu
toolsbeta-sgewebgen-10-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 0.00 7.8G 377.3M 24.0M 0.0
    webgrid-generic B 0/0/256
toolsbeta-sgewebgen-10-2.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 0.00 7.8G 390.7M 24.0M 0.0
    webgrid-generic B 0/0/256
toolsbeta-sgewebgen-10-3.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 0.00 7.8G 377.5M 24.0M 0.0
    webgrid-generic B 0/0/256
toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad.wmflabs lx-amd64 2 2 2 2 - 3.9G - 511.0M -
    webgrid-lighttpd B 0/0/256 au
toolsbeta-sgeweblight-10-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 0.13 7.8G 378.9M 24.0M 0.0
    webgrid-lighttpd B 0/0/256
Sorry that I didn't see this ticket before. I did build a new working NFS server yesterday but (as @dcaro discovered) lots of VMs were frozen and probably needed rebooting.
For the two hosts that had queues in 'au' status (Alert + Unknown), the solution was to restart the gridengine-exec service on them:
root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl status gridengine-exec.service
● gridengine-exec.service - LSB: SGE Execution Daemon init script
   Loaded: loaded (/etc/init.d/gridengine-exec; generated; vendor preset: enabled)
   Active: active (exited) since Fri 2022-03-25 11:19:04 UTC; 3h 22min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 713 ExecStart=/etc/init.d/gridengine-exec start (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/gridengine-exec.service

Mar 25 11:19:03 toolsbeta-sgewebgrid-lighttpd-0901 systemd[1]: Starting LSB: SGE Execution Daemon init script...
Mar 25 11:19:03 toolsbeta-sgewebgrid-lighttpd-0901 gridengine-exec[713]: chown: invalid user: ‘sgeadmin:sgeadmin’
Mar 25 11:19:04 toolsbeta-sgewebgrid-lighttpd-0901 systemd[1]: Started LSB: SGE Execution Daemon init script.
root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl stop gridengine-exec.service
root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl start gridengine-exec.service
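For future checks, a rough sketch of how the hosts needing attention could be picked out of `qhost -q` output automatically. This is not an existing tool: the function name `queues_in_problem_state` is made up for illustration, and the parsing assumes the layout seen in this ticket (a host line followed by queue lines of the form `<queue> <type> <resv/used/total> [<states>]`).

```python
import re

# State letters as described in this ticket: a = alert, u = unknown, d = disabled.
PROBLEM_STATES = set("aud")

# Slot column of a queue line, e.g. "0/0/50".
SLOTS_RE = re.compile(r"\d+/\d+/\d+")


def queues_in_problem_state(qhost_output):
    """Return (host, queue, states) tuples for queues flagged with a, u or d.

    Hypothetical helper: assumes the `qhost -q` layout shown in this ticket.
    """
    problems = []
    host = None
    for line in qhost_output.splitlines():
        tokens = line.split()
        if len(tokens) >= 3 and SLOTS_RE.fullmatch(tokens[2]):
            # Queue line; the state column is absent when the queue is healthy.
            states = tokens[3] if len(tokens) > 3 else ""
            if host is not None and PROBLEM_STATES & set(states):
                problems.append((host, tokens[0], states))
        elif len(tokens) > 1:
            # Any other multi-column line is treated as a host line. The header
            # and "global" rows also land here, but no queue lines follow them.
            host = tokens[0]
    return problems
```

Fed the output above, this should report the two 'au' hosts and the 'adu' one; each of those then still needs the manual checks described in this task.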
Looking into the other one, in 'adu' (Alert + Disabled + Unknown): I tried repooling it, but that was not enough, and I can't ssh to it directly:
root@toolsbeta-sgegrid-master:~# exec-manage repool toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud
...
root@toolsbeta-sgegrid-master:~# qhost -q
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
...
toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64 4 4 4 4 - 7.8G - 24.0M -
    webgrid-generic B 0/0/256 au
will try the console next.
It was stuck on NFS, so I rebooted it and then restarted the gridengine-exec service, and it's back up and running.
Will close this task.