Toolschecker alerted at Thu Sep 17 20:03:58 UTC 2020 for a grid cron job, and many other grid crons and services died around that time.
The reason is clear from the dmesg output on tools-sgegrid-master:
NFS issues began at:
[Thu Sep 17 19:53:40 2020] INFO: task sge_qmaster:2128 blocked for more than 120 seconds.
[Thu Sep 17 19:53:40 2020] Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[Thu Sep 17 19:53:40 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Sep 17 19:53:40 2020] sge_qmaster D 0 2128 1 0x00000000
[Thu Sep 17 19:53:40 2020] ffff9f45314eda80 0000000000000000 ffff9f4532253000 ffff9f453fc98980
[Thu Sep 17 19:53:40 2020] ffff9f4536343080 ffffbea781adfb90 ffffffffafc144b9 ffffffffc075f3c0
[Thu Sep 17 19:53:40 2020] 0000000000000000 ffff9f453fc98980 ffffffffafc19364 ffff9f4532253000
[Thu Sep 17 19:53:40 2020] Call Trace:
[Thu Sep 17 19:53:40 2020] [<ffffffffafc144b9>] ? __schedule+0x239/0x6f0
...snip...
[Thu Sep 17 19:54:09 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:10 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
[Thu Sep 17 19:54:12 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, still trying
and resolved at:
[Thu Sep 17 20:04:10 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet not responding, timed out
[Thu Sep 17 20:04:10 2020] nfs: server nfs-tools-project.svc.eqiad.wmnet OK
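To compare this window across many NFS clients, the start and end times can be pulled out of the kernel log mechanically. The following is a minimal sketch, not the tooling used during the incident. It assumes log lines in the human-readable timestamp format shown above (as printed by `dmesg -T`) and an English/C locale for day and month names. The function name `nfs_outage_window` is hypothetical. It only looks at the `nfs: server ...` client messages, so the earlier hung-task warning at 19:53:40 is not counted as the start.

```python
import re
from datetime import datetime

# Matches kernel NFS client messages with a human-readable timestamp, e.g.
# [Thu Sep 17 19:54:09 2020] nfs: server HOST not responding, still trying
# [Thu Sep 17 20:04:10 2020] nfs: server HOST OK
LINE_RE = re.compile(
    r"\[(?P<ts>[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\] "
    r"nfs: server (?P<server>\S+) (?P<state>not responding|OK)"
)


def nfs_outage_window(lines):
    """Return (first_not_responding, first_ok_afterwards) as datetimes.

    Either element is None if the corresponding message was not found.
    """
    first_down = None
    recovered = None
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an NFS client status message
        ts = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
        if match.group("state") == "not responding":
            if first_down is None:
                first_down = ts  # remember only the first failure
        elif first_down is not None:
            recovered = ts  # first "OK" after the failure started
            break
    return first_down, recovered


if __name__ == "__main__":
    server = "nfs-tools-project.svc.eqiad.wmnet"
    sample = [
        "[Thu Sep 17 19:53:40 2020] INFO: task sge_qmaster:2128 blocked for more than 120 seconds.",
        f"[Thu Sep 17 19:54:09 2020] nfs: server {server} not responding, still trying",
        f"[Thu Sep 17 19:54:10 2020] nfs: server {server} not responding, still trying",
        f"[Thu Sep 17 20:04:10 2020] nfs: server {server} not responding, timed out",
        f"[Thu Sep 17 20:04:10 2020] nfs: server {server} OK",
    ]
    down, up = nfs_outage_window(sample)
    print("down:", down, "up:", up, "duration:", up - down)
```

Running the same function over the saved `dmesg -T` output of each client VM would give one window per host, which makes it easier to check whether the pattern is confined to a single hypervisor.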
This precise pattern happened on all NFS client VMs running on cloudvirt1036 at that time and nowhere else.
tools-k8s-worker-74 was moved to that host during that window and was still spinning up at the time, so its logs show no NFS disconnect. We don't see any significant metrics or errors on that hypervisor (so far) except a drop in CPU/RAM usage, probably because the grid ceased to function and the VMs on that host were all grid nodes.