Tracking task for investigating and fixing the recurring issue of "NFS periodically gets stuck in tools". The issue has caused, and continues to cause, varying degrees of grief, from "operators have to manually reboot stuck workers" (the automation task is T348662) to "the Toolforge bastion is down" (cf. T404047).
Below is sort of a braindump of where I (Filippo) am at with understanding this issue and its potential fixes. Also worth noting: the scope of this issue is NFS k8s workers getting stuck on read-write (rw) NFS mountpoints (tools project/home data). A separate and potentially related issue is better resiliency of read-only (ro) NFS mounts when NFS servers go down (e.g. T391369).
The symptoms are:
- NFS workers getting processes stuck in D state and not recovering
- This is a consequence of losing connectivity to the NFS server ("nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying" in journalctl --dmesg | grep 'server tools')
- In some cases the kernel reports the NFS server coming back ("nfs: server ... OK" in dmesg)
- In some cases the workers are able to recover by themselves, apparently regardless of whether the server is reported as OK
My understanding is that the kernel logs the "not responding, still trying" message once it has retried retrans times, with each RPC attempt timing out after timeo deciseconds.
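To check which values are actually in play, the effective mount options can be read straight from /proc/mounts; a sketch (nothing here is specific to our setup beyond the option names, which come from nfs(5)):

```shell
# Sketch: print the timeo/retrans options in effect on the current NFS
# mounts, reading only /proc (on a worker, 'nfsstat -m' from nfs-common
# shows the same per-mount options). Prints nothing if no NFS is mounted.
awk '$3 ~ /^nfs4?$/ { print $4 }' /proc/mounts \
    | tr ',' '\n' | grep -E '^(timeo|retrans)=' || true

# With the common TCP defaults timeo=600 (i.e. 60s) and retrans=2, the
# "not responding, still trying" line is logged only after all the
# retransmissions have timed out; the exact backoff between attempts
# depends on the transport (see the TIMEOUT section of nfs(5)).
```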
My recollection is that we used to have this problem more often when instances would periodically lose connectivity (cf. T400223), although I couldn't find any data to back this claim up (the metricsinfra data is gone, and I'm not sure where to find the alert history).
Having said that, we do still have the problem from time to time, for example: https://grafana.wmcloud.org/goto/y6WZnvjHg?orgId=1
Most recently I took a stack dump of the stuck processes on nfs-66 (echo w > /proc/sysrq-trigger); the result is in P83304. Since pids are mentioned there, the ps dump is at P83314, a dump of /proc/<pid>/fd for the affected pids is at P83318, and a dump of /proc/locks is at P83319.
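For reference, a hedged sketch of the collection steps behind those pastes (the /tmp output paths are illustrative; run as root on the affected worker):

```shell
# Sketch of the data collection for the pastes above; output paths are
# illustrative. Requires root on the affected worker.
if [ "$(id -u)" -eq 0 ] && [ -w /proc/sysrq-trigger ]; then
    # 'w' asks the kernel to dump stack traces of all blocked (D state) tasks
    echo w > /proc/sysrq-trigger
    dmesg > /tmp/stuck-stacks.txt             # cf. P83304
else
    echo "skipping sysrq dump: need root and writable /proc/sysrq-trigger" >&2
fi
ps auxww > /tmp/ps-dump.txt                   # pid-to-command mapping, cf. P83314
cat /proc/locks > /tmp/locks-dump.txt         # kernel file-lock table, cf. P83319
# For each affected pid, 'ls -l /proc/<pid>/fd' lists its open files (cf. P83318)
```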
Some things I noticed:
- Once a particular directory gets stuck (I suspect on a write), reading it gets stuck as well, which is why for example lsof also gets stuck (ignoring for a second the fact that its -e option, meant to exclude NFS mount points, doesn't work as intended)
- nfs-tools-2 is running Bullseye and its kernel. A relatively low-effort test, which we have to do anyway, would be to get the NFS server onto Trixie and its kernel; this is T387005: [infra] Toolforge: migrate to Debian Bookworm or later, and specifically T401812: Migrate WMCS-managed NFS servers off of Bullseye.
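Related to the lsof point above: D-state tasks can be enumerated without touching the stuck mountpoints at all, since /proc/<pid>/stat is served by procfs. A sketch (field layout per proc(5)):

```shell
# List uninterruptible-sleep (D state) tasks by reading only /proc, so the
# enumeration itself cannot block on a hung NFS mount (unlike lsof, which
# stat()s the files it reports on). Prints "pid (comm)" per stuck task.
for f in /proc/[0-9]*/stat; do
    line=$(cat "$f" 2>/dev/null) || continue     # task may have exited; skip
    # /proc/<pid>/stat is "pid (comm) state ..."; comm can contain spaces,
    # so take the state as the first character after the last ") "
    case "${line##*) }" in
        D*) echo "${line%%)*})" ;;
    esac
done
```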