PAWS server not starting
Closed, Resolved (Public)

Description

What happened?

I am unable to start the PAWS server for my bot, OctraBot. The problem is similar to T400542 but with a different message. The event logs are as follows:

Server requested
2025-10-02T01:29:58.105910Z [Normal] Successfully assigned prod/jupyter--4fctra-42ot to paws-127c-uwce57bvcgrt-node-4
2025-10-02T01:32:01Z [Warning] Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Spawn failed: pod prod/jupyter--4fctra-42ot did not start in 300 seconds!

What should have happened?

The server should start normally.

Event Timeline

Same issue here.

Event log
Server requested
2025-10-02T08:43:15.734058Z [Normal] Successfully assigned prod/jupyter--4aohanbenjamin to paws-127c-uwce57bvcgrt-node-4
2025-10-02T08:45:18Z [Warning] Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
Spawn failed: pod prod/jupyter--4aohanbenjamin did not start in 300 seconds!
dcaro changed the task status from Open to In Progress. Oct 2 2025, 8:58 AM
dcaro claimed this task.
dcaro triaged this task as High priority.

I can reproduce this. I can see many events like:

prod        56m         Warning   FailedMount   pod/jupyter--4collovand                              Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        55m         Warning   FailedMount   pod/jupyter--44rkirstyross                           Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        53m         Warning   Unhealthy     pod/hub-678888b5d9-bjzh8                             Readiness probe failed: Get "http://10.100.58.153:8081/hub/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
prod        42m         Warning   FailedMount   pod/jupyter--4collovand                              Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        37m         Warning   FailedMount   pod/jupyter--4aklamo                                 Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        27m         Warning   FailedMount   pod/jupyter--4collovand                              Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        25m         Warning   FailedMount   pod/jupyter--4dineproness-2d2                        Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        24m         Warning   FailedMount   pod/jupyter--4aohanbenjamin                          Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        22m         Warning   FailedMount   pod/jupyter--4aklamo                                 Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        22m         Warning   FailedMount   pod/jupyter--53yrus257                               Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        17m         Warning   FailedMount   pod/jupyter--4aohanbenjamin                          Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        15m         Warning   FailedMount   pod/jupyter--53yrus257                               Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
metrics     14m         Warning   Unhealthy     pod/prometheus-kube-state-metrics-7f979f5c55-bn6fv   Liveness probe failed: HTTP probe failed with statuscode: 503
prod        12m         Warning   FailedMount   pod/jupyter--4collovand                              Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        11m         Warning   FailedMount   pod/jupyter--4aohanbenjamin                          Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        4m39s       Warning   FailedMount   pod/jupyter--53yrus257                               Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        4m33s       Normal    Pulled        pod/hub-678888b5d9-bjzh8                             Container image "quay.io/wikimedia-paws-prod/paws-hub:pr-499" already present on machine
prod        4m33s       Normal    Started       pod/hub-678888b5d9-bjzh8                             Started container hub
prod        4m33s       Normal    Created       pod/hub-678888b5d9-bjzh8                             Created container hub
prod        4m33s       Warning   Unhealthy     pod/hub-678888b5d9-bjzh8                             Readiness probe failed: Get "http://10.100.58.153:8081/hub/health": read tcp 172.16.17.20:56210->10.100.58.153:8081: read: connection reset by peer
prod        4m31s       Warning   Unhealthy     pod/hub-678888b5d9-bjzh8                             Readiness probe failed: Get "http://10.100.58.153:8081/hub/health": dial tcp 10.100.58.153:8081: connect: connection refused
prod        4m30s       Normal    Killing       pod/jupyter--53almanwelah                            Stopping container notebook
prod        36s         Warning   FailedMount   pod/jupyter--53yrus257                               Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
prod        8s          Warning   FailedMount   pod/jupyter--44-43aro-20-28-57-4d-46-29              Unable to attach or mount volumes: unmounted volumes=[home], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
root@bastion:~#
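
A listing like the one above is easier to scan when the FailedMount warnings are counted per pod. The following is a minimal sketch, assuming the text keeps the column layout shown above (namespace, age, type, reason, object, message). The function name `count_failed_mounts` is hypothetical, and the sample messages are abbreviated from the listing.

```python
from collections import Counter


def count_failed_mounts(events_text: str) -> Counter:
    """Count FailedMount warnings per pod in an events listing.

    Assumes whitespace-separated columns in the order
    namespace, age, type, reason, object, message.
    """
    counts = Counter()
    for line in events_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank lines and stray shell prompts
        namespace, _age, _type, reason, obj = fields[:5]
        if reason == "FailedMount" and obj.startswith("pod/"):
            counts[f"{namespace}/{obj[len('pod/'):]}"] += 1
    return counts


# Rows abbreviated from the listing above (messages truncated with "...").
sample = """\
prod        56m         Warning   FailedMount   pod/jupyter--4collovand      Unable to attach or mount volumes: ...
prod        53m         Warning   Unhealthy     pod/hub-678888b5d9-bjzh8     Readiness probe failed: ...
prod        42m         Warning   FailedMount   pod/jupyter--4collovand      Unable to attach or mount volumes: ...
prod        37m         Warning   FailedMount   pod/jupyter--4aklamo         Unable to attach or mount volumes: ...
"""

for pod, n in count_failed_mounts(sample).most_common():
    print(f"{n:3d}  {pod}")
```

Pods that show up repeatedly are the ones whose spawn keeps retrying the same mount.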

I'll restart the worker nodes, starting with node-4, which is the one I saw failing when I tried.

After cordoning node-4, things seem to be OK. I'll monitor for a bit and investigate the issue.

So far I can see that node-4 has a lot of processes stuck in uninterruptible sleep (state D) on NFS mounts, all from the paws-nfs.svc.paws.eqiad1.wikimedia.cloud server:

[root@paws-127c-uwce57bvcgrt-node-4 ~]# ps -eo stat,cmd | grep ^D
D    /sbin/umount.nfs4 /var/lib/kubelet/pods/7128f253-5512-4e2f-9213-e6c6febe9e2d/volumes/kubernetes.io~nfs/pawshomes
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13401018 /var/lib/kubelet/pods/cc54e562-47b9-478b-b90e-a63e5a6c0df8/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13401018 /var/lib/kubelet/pods/e7eb783c-f9c6-4b88-9e76-262e4c52853b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73187022 /var/lib/kubelet/pods/91a3ead5-6038-4ff0-93c1-76217a102c07/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73187022 /var/lib/kubelet/pods/7de7cd8f-fb42-4417-8a8a-241738d8d4de/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/b19e36ce-3106-4123-8d11-7199917e22a5/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/efd4a0e1-6af8-4054-85b8-49eba5c88810/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/e9b1f791-56bb-40be-89e5-741e49c2c5eb/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73187022 /var/lib/kubelet/pods/eb129364-2033-42e1-86bd-cb0cb0285956/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/79747912 /var/lib/kubelet/pods/9237a4e7-b810-425b-92f7-693da7524dc9/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/36956696 /var/lib/kubelet/pods/560277a0-f5b2-4775-b01f-e2a23c3059ba/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78588867 /var/lib/kubelet/pods/683121d2-f1e7-4de4-91c4-ae95a974e4cf/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78588867 /var/lib/kubelet/pods/0c36311a-6dda-42dd-9f27-9c9a709432a0/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78706686 /var/lib/kubelet/pods/8f5ebdca-a3f4-45a6-8950-7d6dca93e2d5/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78707135 /var/lib/kubelet/pods/ca9f0923-2462-47bd-b0b5-c0a377d38add/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78707185 /var/lib/kubelet/pods/501e80c4-288b-4a86-b6a9-515fcaa036ca/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78702039 /var/lib/kubelet/pods/298f7bfa-6025-420a-ab32-bee0596e8933/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78706686 /var/lib/kubelet/pods/a34ddfdc-83d7-4d7a-900d-abdc450ccc88/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/6d73357c-e1ca-4241-9a7c-28ab0e4ab97e/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/e5b8e99f-c8f7-44fd-ab46-9bcdb5f36c8c/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/41b35230-04a5-4316-a66a-4d8b44942f18/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13977878 /var/lib/kubelet/pods/bf9a9d35-fe93-43ad-84f5-86b028e62193/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13977878 /var/lib/kubelet/pods/8e1a3e6e-1eef-40db-a98b-aa8ee7720891/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/76527893 /var/lib/kubelet/pods/6110d736-119d-40ca-b16b-205abca5ffea/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/888268e6-0761-414a-bec0-607a7a8df04b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13090523 /var/lib/kubelet/pods/1b82c482-5d33-40e1-ab51-cefa6265e775/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/76527893 /var/lib/kubelet/pods/3b9a862d-08ad-41aa-a929-cf6532f61a74/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13090523 /var/lib/kubelet/pods/62507fbd-4339-4efe-8866-13a4a28fc46b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13090523 /var/lib/kubelet/pods/91d50613-b2d2-4fc1-8396-a1ed1ea48297/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/48969430 /var/lib/kubelet/pods/31eaa32c-b341-490c-9a4c-e2872f95883b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/48969430 /var/lib/kubelet/pods/8b7a5fac-2ad6-47eb-a18a-9f4b4032368c/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/25d049f2-0a78-437d-9378-3f7dfeddd018/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78324083 /var/lib/kubelet/pods/7622fdbf-23d4-49b8-86fd-a1e4cef64462/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78324083 /var/lib/kubelet/pods/96676a53-0b1c-42c6-8f51-89a39e5c2a0c/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78588202 /var/lib/kubelet/pods/91ddcebb-c2f4-4e1b-a04e-337fbd44d48a/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78324083 /var/lib/kubelet/pods/4843ce83-76b1-4ce1-8b25-25ebb6add7f6/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/4b0a9930-1756-479a-b845-a4d3c54cfd1b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78605609 /var/lib/kubelet/pods/37d32ea2-dec1-4319-b177-34e03f4a5876/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78836729 /var/lib/kubelet/pods/b14584b6-cd84-46d4-b85a-bdfd9402be00/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/5d0aeec3-bbe9-433b-8644-856005a0e7c9/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/8b2c6a85-d37b-4ff5-a7ca-27e3361bbc2f/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/54598627 /var/lib/kubelet/pods/0f070300-def2-4c47-95a9-6465ac2872da/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/55827452 /var/lib/kubelet/pods/077cbed0-bcdc-4d0f-8233-422b7617374b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/43193344-9f68-4039-bd68-c409b1bac697/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/55827452 /var/lib/kubelet/pods/41011ab3-a26d-4ff9-81ce-cb0f028c9015/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/53f1ae09-5d37-4334-aa26-085d7c1a8608/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/138240 /var/lib/kubelet/pods/3786a616-2323-4cfe-8dc9-55e77019b19e/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/42f4a258-13d7-494f-ba54-a74868794e3e/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/79290973 /var/lib/kubelet/pods/1fc0c564-59f6-4ad0-b4f2-91d4074ec8a0/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/75469353 /var/lib/kubelet/pods/918d60aa-0996-43b8-87b0-464558806e60/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/138240 /var/lib/kubelet/pods/a975f336-787e-466a-a551-793651dc242f/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73156902 /var/lib/kubelet/pods/95c7e945-f0eb-46ba-ad4f-8425b93df654/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/75469353 /var/lib/kubelet/pods/1adaa17a-b119-4554-bf0f-2f708f215774/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73156902 /var/lib/kubelet/pods/3c8c4f7b-ed5a-4d35-9820-fc1b96bdb176/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/18783750-7e9f-4d09-b79b-3eabc34eb9c2/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/75469353 /var/lib/kubelet/pods/f27bfb77-7dfc-4e1f-8ac8-028a1152a04d/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73156902 /var/lib/kubelet/pods/061338fc-1a9e-4cad-b535-2ce3c8aa4ac3/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73156902 /var/lib/kubelet/pods/8262d5f2-318e-4421-8275-c2fea8334a06/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/64396529 /var/lib/kubelet/pods/00c1baf7-abf7-47dd-b53f-078d988d52e4/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/75469353 /var/lib/kubelet/pods/45ccc5af-73cb-4e10-ac7a-593cdc8f0f9b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/78710147 /var/lib/kubelet/pods/e3fec59b-4a5e-4847-a501-f8d6258c5bf3/volumes/kubernetes.io~nfs/home -o rw
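
Many of the stuck mounts above target the same user home, presumably because each retry of a pod spawn starts another mount.nfs that also hangs. A minimal sketch for grouping them, assuming the input is `ps -eo stat,cmd` output as shown above; the function name `stuck_mounts_per_home` and the regular expression are illustrative only.

```python
import re
from collections import Counter

# Captures the numeric home directory ID from a mount.nfs command line.
# The umount.nfs4 line does not match because "mount.nfs" must be followed by whitespace.
USERHOME_RE = re.compile(r"mount\.nfs\s+\S+/userhomes/(\d+)\s")


def stuck_mounts_per_home(ps_text: str) -> Counter:
    """Count D-state mount.nfs processes per user home ID."""
    counts = Counter()
    for line in ps_text.splitlines():
        if not line.startswith("D"):
            continue  # only uninterruptible-sleep processes are of interest
        match = USERHOME_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# First four rows of the ps output above.
sample = """\
D    /sbin/umount.nfs4 /var/lib/kubelet/pods/7128f253-5512-4e2f-9213-e6c6febe9e2d/volumes/kubernetes.io~nfs/pawshomes
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13401018 /var/lib/kubelet/pods/cc54e562-47b9-478b-b90e-a63e5a6c0df8/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13401018 /var/lib/kubelet/pods/e7eb783c-f9c6-4b88-9e76-262e4c52853b/volumes/kubernetes.io~nfs/home -o rw
D    /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73187022 /var/lib/kubelet/pods/91a3ead5-6038-4ff0-93c1-76217a102c07/volumes/kubernetes.io~nfs/home -o rw
"""

for home_id, n in stuck_mounts_per_home(sample).most_common():
    print(f"{n:3d}  userhomes/{home_id}")
```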

Some of them have been stuck since yesterday:

[root@paws-127c-uwce57bvcgrt-node-4 ~]# ps aux | grep ' D ' | head
root     3371070  0.0  0.0   5860  2944 ?        D    Oct01   0:00 /sbin/umount.nfs4 /var/lib/kubelet/pods/7128f253-5512-4e2f-9213-e6c6febe9e2d/volumes/kubernetes.io~nfs/pawshomes
root     3401280  0.0  0.0   5860  3200 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13401018 /var/lib/kubelet/pods/cc54e562-47b9-478b-b90e-a63e5a6c0df8/volumes/kubernetes.io~nfs/home -o rw
root     3408237  0.0  0.0   5860  3200 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/13401018 /var/lib/kubelet/pods/e7eb783c-f9c6-4b88-9e76-262e4c52853b/volumes/kubernetes.io~nfs/home -o rw
root     3461930  0.0  0.0   5860  3072 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73187022 /var/lib/kubelet/pods/91a3ead5-6038-4ff0-93c1-76217a102c07/volumes/kubernetes.io~nfs/home -o rw
root     3468582  0.0  0.0   5860  3200 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73187022 /var/lib/kubelet/pods/7de7cd8f-fb42-4417-8a8a-241738d8d4de/volumes/kubernetes.io~nfs/home -o rw
root     3469873  0.0  0.0   5860  3200 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/b19e36ce-3106-4123-8d11-7199917e22a5/volumes/kubernetes.io~nfs/home -o rw
root     3472428  0.0  0.0   5860  3072 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/efd4a0e1-6af8-4054-85b8-49eba5c88810/volumes/kubernetes.io~nfs/home -o rw
root     3474873  0.0  0.0   5860  3328 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/66092386 /var/lib/kubelet/pods/e9b1f791-56bb-40be-89e5-741e49c2c5eb/volumes/kubernetes.io~nfs/home -o rw
root     3476261  0.0  0.0   5860  3328 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/73187022 /var/lib/kubelet/pods/eb129364-2033-42e1-86bd-cb0cb0285956/volumes/kubernetes.io~nfs/home -o rw
root     3477876  0.0  0.0   5860  3200 ?        D    Oct01   0:00 /sbin/mount.nfs paws-nfs.svc.paws.eqiad1.wikimedia.cloud:/srv/paws/project/paws/userhomes/79747912 /var/lib/kubelet/pods/9237a4e7-b810-425b-92f7-693da7524dc9/volumes/kubernetes.io~nfs/home -o rw

It seems the umount was the first to get stuck (it has the lowest PID in the list).

I'm not seeing much in the logs :/
There were a couple of OOM events yesterday:

[Wed Oct  1 16:31:06 2025] qemu-system-x86 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=978
...
[Wed Oct  1 16:31:06 2025] Out of memory: Killed process 2294857 (mygame) total-vm:2472304kB, anon-rss:2427496kB, file-rss:128kB, shmem-rss:0kB, UID:52771 pgtables:4824kB oom_score_adj:978
...
[Wed Oct  1 17:43:28 2025] mygame invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=978
...
[Wed Oct  1 17:43:28 2025] Out of memory: Killed process 2305284 (mygame) total-vm:3498524kB, anon-rss:2426752kB, file-rss:256kB, shmem-rss:0kB, UID:52771 pgtables:4952kB oom_score_adj:978

That matches the time the first umount got stuck:

[root@paws-127c-uwce57bvcgrt-node-4 ~]# ps -eo stat,cmd,start | grep ^D
D    /sbin/umount.nfs4 /var/lib/ 17:44:10
...

So I'm starting to suspect that the OOM event somehow interferes with the NFS mounts (the same correlation also shows up in Toolforge).
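
The timing match can be checked with a little arithmetic. This is a minimal sketch, assuming the kernel log stamp has the human-readable form shown above, and that it and the `ps` start time are in the same time zone and on the same day. The helper names `parse_dmesg_time` and `seconds_after_oom` are hypothetical.

```python
from datetime import datetime


def parse_dmesg_time(line: str) -> datetime:
    """Parse the leading '[Wed Oct  1 17:43:28 2025]' stamp of a kernel log line."""
    stamp = line[line.index("[") + 1 : line.index("]")]
    # Collapse the double space that pads single-digit days before parsing.
    return datetime.strptime(" ".join(stamp.split()), "%a %b %d %H:%M:%S %Y")


def seconds_after_oom(oom_line: str, start_hms: str) -> float:
    """Seconds between an OOM kill and a process start time on the same day."""
    oom = parse_dmesg_time(oom_line)
    hour, minute, second = (int(part) for part in start_hms.split(":"))
    started = oom.replace(hour=hour, minute=minute, second=second)
    return (started - oom).total_seconds()


# Second OOM kill from the log above (tail abbreviated) and the umount start time.
oom_line = "[Wed Oct  1 17:43:28 2025] Out of memory: Killed process 2305284 (mygame) ..."
print(seconds_after_oom(oom_line, "17:44:10"))  # prints 42.0
```

A gap of 42 seconds is consistent with the umount being triggered by cleanup of the killed process's pod, though it does not prove causation.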

I tried creating a process that writes to the NFS mount (journalctl -f | tee -a outfile) and killing it while it was writing, but that did not reproduce the issue.

I'll close this for now. I may investigate ways to keep the system from running out of memory, and to always leave some memory reserved so that sssd/NFS can keep functioning correctly.