toolsbeta: restore nfs that was mistakenly deleted
Closed, Resolved (Public)

Description

On Mar 23rd the NFS VM serving toolsbeta was deleted by mistake:

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/abc5541611661e3b011908cf02af5ca647867b4d%5E%21/#F0

This ticket is to do everything needed to get toolsbeta to the previous functional state.

Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:32:40Z] <dcaro> restarting the sge-master (T304672)

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:43:47Z] <dcaro> restarting the sge-shadow (T304672)

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:55:04Z] <dcaro> force restarting all the other nfs-bound VMs one by one (T304672)

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T11:31:12Z] <dcaro> All alerting VMs rebooted, checking that everything is "working" (T304672)

K8s looks OK now. The grid has a few queues in alert (the ones flagged a/u); will debug:

root@toolsbeta-sgegrid-master:~# qhost -q
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
toolsbeta-sgeexec-0901.toolsbeta.eqiad.wmflabs lx-amd64        2    2    2    2     -    3.9G       -  511.0M       -
   continuous           BC    0/0/50        au
   task                 BI    0/0/50        au
toolsbeta-sgeexec-0902.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.01    7.8G  317.1M     0.0     0.0
   continuous           BC    0/0/50
   task                 BI    0/0/50
toolsbeta-sgeexec-10-5.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  399.8M     0.0     0.0
   continuous           BC    0/0/50
   task                 BI    0/0/50
toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4     -    7.8G       -   24.0M       -
   webgrid-generic      B     0/0/256       adu
toolsbeta-sgewebgen-10-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  377.3M   24.0M     0.0
   webgrid-generic      B     0/0/256
toolsbeta-sgewebgen-10-2.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  390.7M   24.0M     0.0
   webgrid-generic      B     0/0/256
toolsbeta-sgewebgen-10-3.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  377.5M   24.0M     0.0
   webgrid-generic      B     0/0/256
toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad.wmflabs lx-amd64        2    2    2    2     -    3.9G       -  511.0M       -
   webgrid-lighttpd     B     0/0/256       au
toolsbeta-sgeweblight-10-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.13    7.8G  378.9M   24.0M     0.0
   webgrid-lighttpd     B     0/0/256
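For spotting the flagged queues without reading the whole table, here is a minimal sketch that pulls out only the queue lines carrying a state. It assumes the `qhost -q` layout shown above (host lines start at column 0, queue lines are indented, and the state is an optional fourth field); the function name `queues_with_state` and the trimmed `SAMPLE` text are illustrative, not part of any existing tooling.

```python
# Sketch only: assumes the `qhost -q` layout pasted in this ticket.
SAMPLE = """\
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
toolsbeta-sgeexec-0901.toolsbeta.eqiad.wmflabs lx-amd64        2    2    2    2     -    3.9G       -  511.0M       -
   continuous           BC    0/0/50        au
   task                 BI    0/0/50        au
toolsbeta-sgeexec-0902.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.01    7.8G  317.1M     0.0     0.0
   continuous           BC    0/0/50
   task                 BI    0/0/50
toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4     -    7.8G       -   24.0M       -
   webgrid-generic      B     0/0/256       adu
"""


def queues_with_state(qhost_output):
    """Return (host, queue, state) for each queue line that has a state column."""
    flagged = []
    host = None
    for line in qhost_output.splitlines():
        # Skip blank lines, the column header, and the dashed separator.
        if not line.strip() or line.startswith(("HOSTNAME", "-")):
            continue
        fields = line.split()
        if not line[0].isspace():
            # Unindented line: a host entry (including the "global" pseudo-host).
            host = fields[0]
            continue
        # Indented line: queue name, type, slots, and optionally a state.
        if host is not None and len(fields) >= 4:
            flagged.append((host, fields[0], fields[3]))
    return flagged


if __name__ == "__main__":
    for host, queue, state in queues_with_state(SAMPLE):
        print(f"{host}\t{queue}\t{state}")
```

In practice the input would be the captured output of `qhost -q` on the grid master rather than the hard-coded sample.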

Sorry that I didn't see this ticket before. I did build a new working NFS server yesterday but (as @dcaro discovered) lots of VMs were frozen and probably needed rebooting.

For the two hosts that had queues in 'au' status (Alert + Unknown), the solution was to restart the gridengine-exec service on them:

root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl status gridengine-exec.service
● gridengine-exec.service - LSB: SGE Execution Daemon init script
   Loaded: loaded (/etc/init.d/gridengine-exec; generated; vendor preset: enabled)
   Active: active (exited) since Fri 2022-03-25 11:19:04 UTC; 3h 22min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 713 ExecStart=/etc/init.d/gridengine-exec start (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/gridengine-exec.service

Mar 25 11:19:03 toolsbeta-sgewebgrid-lighttpd-0901 systemd[1]: Starting LSB: SGE Execution Daemon init script...
Mar 25 11:19:03 toolsbeta-sgewebgrid-lighttpd-0901 gridengine-exec[713]: chown: invalid user: ‘sgeadmin:sgeadmin’
Mar 25 11:19:04 toolsbeta-sgewebgrid-lighttpd-0901 systemd[1]: Started LSB: SGE Execution Daemon init script.
root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl stop gridengine-exec.service
root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl start gridengine-exec.service
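As a reading aid for the state column, a small sketch that expands the state letters using only the meanings given on this ticket (a = Alert, d = Disabled, u = Unknown). Grid Engine defines more state letters than these three; they are deliberately left as "unrecognized" here rather than guessed, and the names `KNOWN_STATES` and `describe_state` are made up for illustration.

```python
# Sketch only: covers just the three state letters explained in this ticket.
KNOWN_STATES = {
    "a": "Alert",
    "d": "Disabled",
    "u": "Unknown",
}


def describe_state(state):
    """Expand a queue state string such as 'au' into readable words."""
    return " + ".join(
        KNOWN_STATES.get(letter, f"unrecognized({letter})") for letter in state
    )


if __name__ == "__main__":
    for state in ("au", "adu"):
        print(state, "=>", describe_state(state))
```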

Looking into the other host, the one in 'adu' (Alert + Disabled + Unknown): tried repooling it but that was not enough, and I can't ssh to it directly:

root@toolsbeta-sgegrid-master:~# exec-manage repool toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud
...
root@toolsbeta-sgegrid-master:~# qhost -q
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
...
toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4     -    7.8G       -   24.0M       -
   webgrid-generic      B     0/0/256       au

Will try the console next.

It was stuck on NFS, so I rebooted it and then restarted the gridengine-exec service, and it's back up and running.

Will close this task.