
[tools,nfs,infra] Address tools NFS getting stuck with processes in D state
Closed, Resolved · Public

Description

Tracking task for investigation and fixes for the recurring issue of "NFS periodically gets stuck in tools". The issue has caused, and continues to cause, varying amounts of grief, ranging from "operators have to manually reboot stuck workers" (automation task: T348662) to "toolforge bastion is down" (cfr T404047).

Below is sort of a braindump of where I (Filippo) am at with understanding this issue and its potential fixes. Also worth noting: the scope of this issue is NFS k8s workers getting stuck on rw NFS mountpoints (tools project/home data). A separate and potentially related issue is improving the resiliency of ro NFS mounts when NFS servers go down (e.g. T391369)

The symptoms are:

  • NFS workers get processes stuck in D state and do not recover
  • This is a consequence of losing connectivity to the NFS server (nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying in journalctl --dmesg | grep 'server tools')
  • In some cases the kernel reports the NFS server coming back (nfs: server ... OK in dmesg)
  • In some cases the workers are able to recover by themselves, AFAICT regardless of whether the NFS server is reported as OK

My understanding is that the kernel logs the "still trying" message once it has retried retrans times, with each RPC timing out after timeo deciseconds.
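As a back-of-the-envelope illustration of how timeo and retrans combine, here is a simplistic model that ignores any backoff between retries; the values used are the common NFS-over-TCP defaults, not necessarily what our workers mount with (check the real values with nfsstat -m):

```shell
# Assumed defaults for NFS over TCP; verify on a worker with: nfsstat -m
timeo=600      # deciseconds per RPC attempt, i.e. 60s
retrans=2      # retransmissions before a "major timeout"
# Simplistic model: initial attempt + retrans retries, each waiting timeo
total_s=$(( timeo / 10 * (retrans + 1) ))
echo "kernel would log 'not responding, still trying' after ~${total_s}s"
```

Under these assumptions the first "still trying" message would show up roughly three minutes after the server stops responding, which is consistent with the gaps seen in the journalctl output below.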

My recollection is that we used to have this problem more often when instances would periodically lose connectivity (cfr T400223) although I couldn't find any data to back this claim up (metricsinfra data is gone, and I'm not sure where to find alert history).

Having said that, we do still have the problem from time to time, for example: https://grafana.wmcloud.org/goto/y6WZnvjHg?orgId=1

2025-09-15-115611_2838x1302_scrot.png (screenshot, 227 KB)

Most recently I took a stack dump of stuck processes on nfs-66 (echo w > /proc/sysrq-trigger) in P83304. Since pids are mentioned, the ps dump is at P83314, while a dump of /proc/<pid>/fd for the affected pids is at P83318. A dump of /proc/locks is at P83319.
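For future reference, the diagnostics above can be collected in one go; this is a consolidated sketch rather than the exact commands used (the sysrq write needs root, so it is guarded here to make the script a no-op otherwise):

```shell
# Dump stacks of blocked (D state) tasks into the kernel ring buffer (root only)
if [ "$(id -u)" -eq 0 ]; then
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 500                 # read the stack dumps back
fi
# List D-state processes and their open file descriptors
ps axo pid=,stat= | awk '$2 ~ /^D/ {print $1}' | while read -r pid; do
  echo "== pid $pid =="
  ls -l "/proc/$pid/fd" 2>/dev/null   # open fds per stuck pid
done
cat /proc/locks 2>/dev/null           # current file locks
```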

Some things I noticed:

Related Objects

Mentioned In
rCCKB77ef275f12f8: nfs: fix the migrate_service attach + prepare volume race
T407206: replica.my.cnf missing for tool glamspore
rCCKBae388eb98911: nfs: force volume state available during migration
T406688: Cinder volumes getting stuck on 'reserved' after detach
T403927: Jobs failing with no logs
T401812: Migrate WMCS-managed NFS servers off of Bullseye
rCCKB3b4375bc4e5b: wmcs_libs: remove 'delete' special case in _get_full_command
rCCKB9ed0ff5502ed: wmcs_libs: add network_id to NeutronPort
rCCKB7446d01ba1bd: nfs: make add_server idempotent for service IP
rCCKB5150cc42d0cf: wmcs_libs: add get_ip_address_family
T400957: Job not restarting despite liveness probe failures
T404833: [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet
T404322: wmf-auto-restart can get wedged on nfs4 mounts even when the filesystem is excluded
Mentioned Here
T404833: [infra,puppet,nfs] 2025-09-17 tools-k8s-worker-nfs-17 failing to run puppet
T365772: Configure cache store for Gitlab WMCS runners
T403028: toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere
T362093: [infra] Add alert when workers have a sustained large amount of D processes
T362690: [infra] NFS hangs in some workers until the worker is rebooted
T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14)
T383238: [nfs] 2025-01-08 tools-nfs outage
T388965: 2025-03-15 Tools NFS hiccup
T387005: [infra] Toolforge: migrate to Debian Bookworm or later
T401812: Migrate WMCS-managed NFS servers off of Bullseye
T391369: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge
T348662: [wmcs-cookbooks,toolforge,nfs] automate cleanup of D state webservices by deleting the stuck pod
T400223: Investigate daily disconnections of IRC bots hosted in Toolforge
T404047: toolforge ssh login hangs right before prompt
P83304 (An Untitled Masterwork)
P83314 (An Untitled Masterwork)
P83318 (An Untitled Masterwork)
P83319 (An Untitled Masterwork)

Event Timeline

dcaro changed the task status from Open to In Progress.Sep 15 2025, 1:03 PM
dcaro assigned this task to fgiunchedi.
dcaro edited projects, added Toolforge (Toolforge iteration 24); removed Toolforge.
dcaro triaged this task as High priority.Sep 15 2025, 1:07 PM
fgiunchedi raised the priority of this task from High to Needs Triage.Sep 15 2025, 1:07 PM
fgiunchedi updated the task description.

Re: the NFS server update, I'm reading https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Runbooks/Create_an_NFS_server which covers both creating a new server and failing over. I'll be testing the procedure in toolsbeta first

Also, as pointed out by @taavi, we're looking at changing the VIP address, as opposed to doing a VIP failover, because the new servers will be in the new dual-stack network

I did some tests in testlabs today:

  1. Created an nfs-client-2 instance with Trixie for client testing. Mounts are present via mount_nfs: true in prefix puppet.
  2. The nfs.add_server cookbook worked as expected per the wikitech docs
    1. The exact invocation: cookbook wmcs.nfs.add_server --project testlabs --flavor 41f0ea41-75ca-44a6-be66-bb56b2a90721 --image cce38072-04a6-48ce-bf38-21bad96e042b --network d8a16ddf-c01f-4f22-8b67-8ed18b4b1b45 testlabs-nfs
  3. The service IP was not created by the cookbook since we're adding a new server, as instructed and expected
  4. I didn't run nfs.migrate_service because it is bound to fail anyway due to the network change
    • I manually validated that the call to wmcs-prepare-cinder-volume made by nfs.migrate_service will not format a block device that already has a filesystem

To deal with the network migration I was thinking of the following strategy:

  1. Create a service IP in the new network, if one does not already exist, and attach it to the new server, likely as part of nfs.add_server. Do not change the DNS records at this stage.
  2. Make sure the service IP DNS record has a low TTL
  3. Within nfs.migrate_service during the "move service IP" phase:
    1. detect that the network is changing between the source and destination hosts
    2. flip DNS to the new service IP instead of detaching/attaching the old service IP
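The "move service IP" decision in step 3 can be sketched as follows; this is hypothetical illustration code, not the actual cookbook implementation (the function and network names are made up):

```shell
# Hypothetical sketch of the nfs.migrate_service "move service IP" decision:
# when source and destination networks differ, flip the low-TTL DNS record
# instead of detaching/attaching the service IP.
move_service_ip() {
  local src_net="$1" dst_net="$2"
  if [ "$src_net" != "$dst_net" ]; then
    echo "flip-dns"    # networks differ: point DNS at the new service IP
  else
    echo "move-vip"    # same network: detach/attach the existing service IP
  fi
}

move_service_ip net-legacy net-dualstack   # -> flip-dns
```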

That plan looks good to me. I haven't tested the add_server cookbook in a long time so I'm glad it's still working.

This is definitely going to cause noticeable downtime for clients, and will probably cause clients (tools nfs workers) to lock up. So we'll need to do this in a scheduled outage window.

Today we got a few more nodes stuck; a quick look into tools-k8s-worker-nfs-17.tools.eqiad1.wikimedia.cloud showed that there were a couple of OOM events ~5h before these log entries:

Sep 16 21:44:08 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Sep 16 21:44:12 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying

The behavior of the processes was similar to what we had seen before: at some point some tools get stuck, and then the lsof runs from wmf-auto-restart start getting stuck on every invocation, piling up.

There was also a single kubelet thread in D state (seen with htop, not ps aux); lsof on it did not show any open files under the NFS directories. Restarting kubelet brought it back online, though the other processes were still stuck.

That reminded me of some old but related tasks:

From today also, on tools-k8s-worker-nfs-17:

root@tools-k8s-worker-nfs-17:~# journalctl --boot -1 | grep tools-nfs
Aug 23 13:48:17 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Aug 23 13:48:17 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Aug 23 13:48:23 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Aug 24 16:14:16 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
Aug 24 16:14:17 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
Aug 24 16:14:17 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
Sep 05 00:33:53 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Sep 06 00:19:14 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
Sep 12 18:09:08 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Sep 12 18:09:10 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Sep 12 18:09:11 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Sep 14 12:08:58 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Sep 16 21:44:08 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Sep 16 21:44:12 tools-k8s-worker-nfs-17 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying

It's odd that sometimes only the "not responding" messages are reported and other times the "OK" ones as well, and the intervals between them are irregular too.

Change #1189791 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] wmcs_libs: add get_address_family

https://gerrit.wikimedia.org/r/1189791

Change #1189792 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] nfs: make add_server idempotent for service IP

https://gerrit.wikimedia.org/r/1189792

Change #1189866 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] wmcs_libs: add network_id to NeutronPort

https://gerrit.wikimedia.org/r/1189866

Change #1189867 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] wmcs_libs: add optional ttl to recordset_create

https://gerrit.wikimedia.org/r/1189867

dcaro renamed this task from Address tools NFS getting stuck with processes in D state to [tools,nfs,infra] Address tools NFS getting stuck with processes in D state.Sep 21 2025, 9:13 AM

Change #1189791 merged by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] wmcs_libs: add get_ip_address_family

https://gerrit.wikimedia.org/r/1189791

Change #1189792 merged by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] nfs: make add_server idempotent for service IP

https://gerrit.wikimedia.org/r/1189792

Change #1189867 abandoned by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] wmcs_libs: add optional ttl to recordset_create

Reason:

See my comment re: tofu

https://gerrit.wikimedia.org/r/1189867

Change #1189866 merged by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] wmcs_libs: add network_id to NeutronPort

https://gerrit.wikimedia.org/r/1189866

Change #1190582 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] wmcs_libs: remove 'delete' special case in _get_full_command

https://gerrit.wikimedia.org/r/1190582

Change #1190662 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudnfs: install python3-netifaces

https://gerrit.wikimedia.org/r/1190662

Change #1190666 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] cloudnfs: add Trixie support

https://gerrit.wikimedia.org/r/1190666

Change #1190582 merged by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] wmcs_libs: remove 'delete' special case in _get_full_command

https://gerrit.wikimedia.org/r/1190582

wmcs.nfs cookbooks are ready and I've tested the DNS-based flip (via designate, not opentofu) on testlabs and it worked. From my testing so far, clients will need to be rebooted upon changing DNS, which we were expecting.

Note that special provisions have had to be made for projects with their DNS managed by opentofu, currently tools and toolsbeta. In such cases the user will be prompted to do the opentofu bits themselves.

Change #1190662 merged by Filippo Giunchedi:

[operations/puppet@production] cloudnfs: install python3-netifaces

https://gerrit.wikimedia.org/r/1190662

Change #1190666 merged by Filippo Giunchedi:

[operations/puppet@production] cloudnfs: add Trixie support

https://gerrit.wikimedia.org/r/1190666

The toolsbeta NFS server upgrade happened today, not without issues; below are the notes I took as I went along

  1. opentofu pipelines can (and will) fail at random. I got particularly unlucky and could not merge my review without a pipeline failure; this is of course serious when the plan pipelines block the apply pipelines. A band-aid was implemented in T403028: toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere, though it is not as effective because of T365772: Configure cache store for Gitlab WMCS runners
  2. There was a stray/extra testlabs-nfs volume in opentofu. Now gone thanks to @dcaro and https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82
  3. The expected volume got stuck in the reserved state; it is unclear yet why. The fix was wmcs-openstack volume set --state available <volume-uuid>.
  4. The migrate_service cookbook is not idempotent, so when it failed to reattach the volume due to the above reserved status, I had to finish the steps manually
  5. The cookbook can't operate unattended on tofu-managed projects. It was decided to not have tofu manage dns records that are supposed to be managed by cookbooks, like nfs. Fix attempt at https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/87
  6. After the flip was done, it was unnecessarily hard to get back to a working state, specifically:
    1. the wmcs.toolforge.k8s.reboot_stuck_workers cookbook fails to talk to prometheus.*.toolforge.org, rendering the cookbook inoperable. This needs further investigation because I'm pretty sure it used to work; the workaround is to go through the production proxy when running via cookbooks
    2. It wasn't easy to get a list of "stuck" VMs other than a prometheus query or look at panels like https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-30m&to=now&timezone=utc&var-cluster_datasource=P6466A70779AF0C39&var-cluster=toolsbeta&viewPanel=panel-2
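On the "list of stuck VMs" point, the Prometheus side can be expressed as a simple instant query over node_exporter's gauge of uninterruptible-sleep processes; the threshold here is an arbitrary assumption, not a tuned alert value (cfr T362093 for the actual alerting work):

```
# node_procs_blocked counts processes currently in D state on each host
node_procs_blocked > 10
```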

A note on the behavior of workers stuck on NFS: some of them may get out of it by themselves with time, like tools-k8s-worker-nfs-61 did in the early morning today:

...
[Tue Sep 30 00:37:01 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Tue Sep 30 00:37:02 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Tue Sep 30 00:37:04 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Tue Sep 30 00:37:04 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
...
[Tue Sep 30 01:05:03 2025] uwsgi invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=985
...
[Tue Sep 30 03:15:05 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
[Tue Sep 30 03:15:05 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
[Tue Sep 30 03:15:08 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK
[Tue Sep 30 03:15:09 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK

image.png (graph screenshot, 115 KB)
(from https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview)

Mentioned in SAL (#wikimedia-cloud) [2025-10-07T08:35:30Z] <godog> move nfs from toolsbeta-nfs-4 to toolsbeta-nfs-5 - T404584

Change #1194135 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] nfs: force volume state available during migration

https://gerrit.wikimedia.org/r/1194135

Change #1194135 merged by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] nfs: force volume state available during migration

https://gerrit.wikimedia.org/r/1194135

Mentioned in SAL (#wikimedia-cloud) [2025-10-13T08:01:54Z] <godog> switch NFS from tools-nfs-2 to tools-nfs-3 - T404584

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-13T08:03:22Z] <filippo@cloudcumin1001> START - Cookbook wmcs.nfs.migrate_service (T404584)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-13T08:03:31Z] <filippo@cloudcumin1001> END (FAIL) - Cookbook wmcs.nfs.migrate_service (exit_code=99) (T404584)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-13T08:03:57Z] <filippo@cloudcumin1001> START - Cookbook wmcs.nfs.migrate_service (T404584)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-13T08:04:09Z] <filippo@cloudcumin1001> END (FAIL) - Cookbook wmcs.nfs.migrate_service (exit_code=99) (T404584)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-13T08:04:58Z] <wmbot~godog@r5> START - Cookbook wmcs.nfs.migrate_service (T404584)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-13T08:05:18Z] <wmbot~godog@r5> END (FAIL) - Cookbook wmcs.nfs.migrate_service (exit_code=99) (T404584)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-10-13T08:10:42Z] <filippo@cloudcumin1001> START - Cookbook wmcs.nfs.migrate_service (T404584)

The flip happened today, with two snags:

  1. roles were applied via the tools-nfs- puppet prefix, which migrate_service is unable to cope with. @taavi fixed it by adding the role to the tools-nfs-3 instance instead
  2. there's a race between the cookbook and wmcs-prepare-cinder-volume where /dev/sdb was not available yet. The simplest fix I can think of is to wait for the volume to appear in the VM, either in wmcs-prepare-cinder-volume or in the cookbook
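The proposed wait from point 2 could be a simple polling loop; this is a sketch under the assumption that checking for the block device node is sufficient (the device path and timeout are illustrative, and this is not the actual cookbook fix that was later merged):

```shell
# Poll until the attached cinder volume shows up as a block device, or give up
# after $timeout seconds. Returns 0 once the device node exists.
wait_for_device() {
  local dev="$1" timeout="${2:-60}" waited=0
  until [ -b "$dev" ] || [ "$waited" -ge "$timeout" ]; do
    sleep 1
    waited=$(( waited + 1 ))
  done
  [ -b "$dev" ]
}

# Illustrative usage before formatting/mounting the volume:
# wait_for_device /dev/sdb 60 && wmcs-prepare-cinder-volume ...
```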

We also had to issue wmcs.vps.instance.force_reboot for stuck nfs workers (i.e. hard reboot) as opposed to wmcs.toolforge.k8s.reboot because the latter would take too long waiting for drain. Doing so also required expiring/kicking nfs clients from clouddumps1001 with https://phabricator.wikimedia.org/T404833#11267850

I also missed the fact that tools-nfs-2 is a big instance (g4.cores16.ram64.disk20.10xiops) whereas for tools-nfs-3 I defaulted to g4.cores1.ram2.disk20. I'll be resizing the instance either later today or tomorrow to at least 4 cores and 16G RAM.

OK, the resize and reboot went well; NFS clients came back by themselves and processes recovered from D state. Total unavailability for Toolforge was ~3m as reported by haproxy

2025-10-15-103030_933x951_scrot.png (951×933 px, 111 KB)

2025-10-15-103048_885x747_scrot.png (747×885 px, 46 KB)

Change #1196687 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[cloud/wmcs-cookbooks@main] nfs: fix the migrate_service attach + prepare volume race

https://gerrit.wikimedia.org/r/1196687

Change #1196687 merged by Filippo Giunchedi:

[cloud/wmcs-cookbooks@main] nfs: fix the migrate_service attach + prepare volume race

https://gerrit.wikimedia.org/r/1196687

I'm cautiously optimistic that the issue is not coming back; the two main differences between tools-nfs-2 and tools-nfs-3 are:

  • Debian Bullseye vs Trixie
  • VM flavor g4.cores16.ram64.disk20.10xiops vs g4.cores4.ram8.disk20

As a result of the VM flavor difference the cinder volume is now getting more reads (~250/s) than before (https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&from=now-24h&to=now&timezone=utc&var-project=tools&var-instance=tools-nfs-2&var-instance=tools-nfs-3).

To remove any doubt, and since it is a quick operation, I'll be switching the flavor of tools-nfs-3 to g4.cores16.ram64.disk20.10xiops tomorrow, with the understanding that we really don't need that many resources to run the NFS server.

Mentioned in SAL (#wikimedia-cloud) [2025-10-21T07:49:19Z] <godog> resize tools-nfs-3 to match tools-nfs-2 (g4.cores16.ram64.disk20.10xiops) - T404584

I'm optimistically and cautiously resolving this task since we haven't seen a recurrence of this particular issue with tools NFS.