
toolsbeta: restore nfs that was mistakenly deleted
Closed, Resolved, Public

Description

On Mar 23rd the NFS VM serving toolsbeta was deleted by mistake:

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/abc5541611661e3b011908cf02af5ca647867b4d%5E%21/#F0

This ticket tracks everything needed to restore toolsbeta to its previous functional state.

Event Timeline

dcaro changed the task status from Open to In Progress. Mar 25 2022, 9:57 AM
dcaro triaged this task as High priority.
dcaro created this task.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:32:40Z] <dcaro> restarting the sge-master (T304672)

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:43:47Z] <dcaro> restarting the sge-shadow (T304672)

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T10:55:04Z] <dcaro> force restarting all the other nfs-bound VMs one by one (T304672)

Mentioned in SAL (#wikimedia-cloud) [2022-03-25T11:31:12Z] <dcaro> All alerting VMs rebooted, checking that everything is "working" (T304672)

K8s looks ok now, the grid has a few queues in alert (the ones with a/u), will debug:

root@toolsbeta-sgegrid-master:~# qhost -q
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
toolsbeta-sgeexec-0901.toolsbeta.eqiad.wmflabs lx-amd64        2    2    2    2     -    3.9G       -  511.0M       -
   continuous           BC    0/0/50        au
   task                 BI    0/0/50        au
toolsbeta-sgeexec-0902.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.01    7.8G  317.1M     0.0     0.0
   continuous           BC    0/0/50
   task                 BI    0/0/50
toolsbeta-sgeexec-10-5.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  399.8M     0.0     0.0
   continuous           BC    0/0/50
   task                 BI    0/0/50
toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4     -    7.8G       -   24.0M       -
   webgrid-generic      B     0/0/256       adu
toolsbeta-sgewebgen-10-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  377.3M   24.0M     0.0
   webgrid-generic      B     0/0/256
toolsbeta-sgewebgen-10-2.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  390.7M   24.0M     0.0
   webgrid-generic      B     0/0/256
toolsbeta-sgewebgen-10-3.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.00    7.8G  377.5M   24.0M     0.0
   webgrid-generic      B     0/0/256
toolsbeta-sgewebgrid-lighttpd-0901.toolsbeta.eqiad.wmflabs lx-amd64        2    2    2    2     -    3.9G       -  511.0M       -
   webgrid-lighttpd     B     0/0/256       au
toolsbeta-sgeweblight-10-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4  0.13    7.8G  378.9M   24.0M     0.0
   webgrid-lighttpd     B     0/0/256
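
To pick out only the queues that carry a state flag from output like the above, a small filter can help. This is a rough sketch, not something that was run as part of this task: `list_flagged_queues` is a made-up helper name, and it assumes the `qhost -q` layout shown here (host lines starting in the first column, indented queue lines with the state as an optional fourth column).

```shell
# Print "host queue state" for every queue line that has a state column
# (e.g. au, adu). Reads `qhost -q` output on stdin.
# NOTE: hypothetical helper; assumes the column layout seen in this ticket.
list_flagged_queues() {
    awk '
        # Lines starting in column 1 are the header, the separator,
        # the "global" pseudo-host, or a real host name.
        /^[^ ]/ {
            host = ($1 == "HOSTNAME" || $1 == "global" || $1 ~ /^-+$/) ? "" : $1
            next
        }
        # Indented lines are queues: name, type, slot counts, optional state.
        host != "" && NF >= 4 { print host, $1, $4 }
    '
}

# Intended use on the grid master (not executed here):
#   qhost -q | list_flagged_queues
```

With the output above, this would list the queues on the 0901 hosts and on toolsbeta-sgewebgen-09-1, and skip the healthy ones.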

Sorry that I didn't see this ticket before. I did build a new working NFS server yesterday but (as @dcaro discovered) lots of VMs were frozen and probably needed rebooting.

For the two hosts that had queues in 'au' status (Alert + Unknown), the solution was to restart the gridengine-exec service on them:

root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl status gridengine-exec.service
● gridengine-exec.service - LSB: SGE Execution Daemon init script
   Loaded: loaded (/etc/init.d/gridengine-exec; generated; vendor preset: enabled)
   Active: active (exited) since Fri 2022-03-25 11:19:04 UTC; 3h 22min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 713 ExecStart=/etc/init.d/gridengine-exec start (code=exited, status=0/SUCCESS)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/gridengine-exec.service

Mar 25 11:19:03 toolsbeta-sgewebgrid-lighttpd-0901 systemd[1]: Starting LSB: SGE Execution Daemon init script...
Mar 25 11:19:03 toolsbeta-sgewebgrid-lighttpd-0901 gridengine-exec[713]: chown: invalid user: ‘sgeadmin:sgeadmin’
Mar 25 11:19:04 toolsbeta-sgewebgrid-lighttpd-0901 systemd[1]: Started LSB: SGE Execution Daemon init script.
root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl stop gridengine-exec.service
root@toolsbeta-sgewebgrid-lighttpd-0901:~# systemctl start gridengine-exec.service
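
For reading the state column at a glance, the flags can be expanded into the names used in this ticket. A minimal sketch, assuming only the three flags seen here (a, d, u); `describe_queue_state` is a hypothetical helper and any other flag is passed through marked as unknown rather than guessed at.

```shell
# Expand a queue state string such as "au" or "adu" into readable names.
# NOTE: hypothetical helper; only covers the flags that appear in this ticket.
# Written for plain POSIX sh, so its variables are global.
describe_queue_state() {
    rest=$1
    out=
    while [ -n "$rest" ]; do
        flag=${rest%"${rest#?}"}   # first character of what is left
        rest=${rest#?}             # drop that character
        case $flag in
            a) name=Alert ;;
            d) name=Disabled ;;
            u) name=Unknown ;;
            *) name="?($flag)" ;;  # not seen in this ticket, left undecoded
        esac
        out="${out:+$out + }$name"
    done
    printf '%s\n' "$out"
}

# Example: describe_queue_state adu
```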

Looking into the other host, the one with its queue in 'adu' status (Alert + Disabled + Unknown): I tried repooling it, but that was not enough, and I can't ssh to it directly:

root@toolsbeta-sgegrid-master:~# exec-manage repool toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud
...
root@toolsbeta-sgegrid-master:~# qhost -q
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
...
toolsbeta-sgewebgen-09-1.toolsbeta.eqiad1.wikimedia.cloud lx-amd64        4    4    4    4     -    7.8G       -   24.0M       -
   webgrid-generic      B     0/0/256       au

Will try the console next.

It was stuck on NFS, so I rebooted it and then restarted the gridengine-exec service; it's back up and running.

Will close this task.