If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge
Open, High, Public, Bug Report

Description

Steps to replicate the issue (include links if applicable):

What happens?:

  • Puppet starts failing across Cloud VPS with the error below
  • Toolforge starts misbehaving (see graphs below)

What should have happened instead?:

Cloud VPS and Toolforge should continue to work fine, ignoring the inactive clouddumps host.

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information

2025-04-08T14:51:12.007185+00:00 tools-k8s-worker-nfs-75 puppet-agent[3770568]: '/usr/bin/timeout -k 5s 20s /bin/mkdir -p /mnt/nfs/dumps-clouddumps1001.wikimedia.org' returned 124 instead of one of [0]
2025-04-08T14:51:12.013842+00:00 tools-k8s-worker-nfs-75 puppet-agent[3770568]: (/Stage[main]/Profile::Wmcs::Nfsclient/Labstore::Nfs_mount[clouddumps1001.wikimedia.org]/Exec[create-/mnt/nfs/dumps-clouddumps1001.wikimedia.org]/returns) change from 'notrun' to ['0'] failed: '/usr/bin/timeout -k 5s 20s /bin/mkdir -p /mnt/nfs/dumps-clouddumps1001.wikimedia.org' returned 124 instead of one of [0] (corrective)
2025-04-08T14:51:12.016681+00:00 tools-k8s-worker-nfs-75 puppet-agent[3770568]: (/Stage[main]/Profile::Wmcs::Nfsclient/Labstore::Nfs_mount[clouddumps1001.wikimedia.org]/Exec[ensure-nfs-clouddumps1001.wikimedia.org]) Dependency Exec[create-/mnt/nfs/dumps-clouddumps1001.wikimedia.org] has failures: true
2025-04-08T14:51:12.016889+00:00 tools-k8s-worker-nfs-75 puppet-agent[3770568]: (/Stage[main]/Profile::Wmcs::Nfsclient/Labstore::Nfs_mount[clouddumps1001.wikimedia.org]/Exec[ensure-nfs-clouddumps1001.wikimedia.org]) Skipping because of failed dependencies
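
Side note: the exit code 124 comes from timeout(1) and means /bin/mkdir was killed after the 20s limit, i.e. the call hung on the dead NFS mount rather than failing outright. A minimal sketch for probing a possibly-hung mount from an affected VM (path taken from the log above):

# Probe the mount point without blocking indefinitely; 124 = timed out.
timeout -k 5s 20s stat -t /mnt/nfs/dumps-clouddumps1001.wikimedia.org
case $? in
  0)   echo "mount responding" ;;
  124) echo "mount hung: server not responding" ;;
  *)   echo "other failure" ;;
esac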

Toolforge graphs (clouddumps1001 was down approximately from 14:10 UTC to 15:10 UTC):

Screenshot 2025-04-08 at 17.38.51.png

Screenshot 2025-04-08 at 17.39.49.png

Screenshot 2025-04-08 at 17.40.41.png

Event Timeline

fnegri claimed this task.
fnegri triaged this task as High priority.

@Andrew do you have any thoughts on this? I think ideally we would find a way to umount the inactive NFS server from Puppet.

This is because we keep the primary and secondary hosts mounted at the same time, right? Is that something we need to do at all with read-only NFS mounts, or could we just have Puppet mount the active server and actively disconnect the passive one based on a hiera switch?

(This is assuming that the VMs didn't otherwise freak out when they lost the clouddumps access.)

This is because we keep the primary and secondary hosts mounted at the same time, right?

I believe that's the main source of trouble, but I'm not sure it's the only one. :)

could we just have Puppet mount the active server and actively disconnect the passive one based on a hiera switch?

I think this could work, except some tools NFS workers could have active connections, so I'm not sure what would happen there if Puppet tries to umount.
We can probably do some testing in codfw.
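
A lazy unmount might sidestep the open-handles concern; a minimal sketch of what such a Puppet-driven switch could run on clients (the path matches the existing mounts, the flags are standard umount(8)):

# Detach the passive mount point immediately; cleanup is deferred until
# the last open file handle goes away, so active readers don't block it.
umount -l /mnt/nfs/dumps-clouddumps1002.wikimedia.org
# If the server is already unreachable, force can be combined with lazy:
umount -f -l /mnt/nfs/dumps-clouddumps1002.wikimedia.org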

My recollection is that we hard-mount nfs servers to prevent data corruption but it causes VMs to freak out on disconnection. If we're doing that with r/o mounts then that's probably just wrong.

Will have to double-check how the NFS options behave, but it looks like we are trying to use soft:

tools-nfs.svc.tools.eqiad1.wikimedia.cloud:/srv/tools/project   /mnt/nfs/labstore-secondary-tools-project       nfs     vers=4.2,bg,intr,sec=sys,proto=tcp,noatime,lookupcache=all,nofsc,rw,hard        0       0
tools-nfs.svc.tools.eqiad1.wikimedia.cloud:/srv/tools/home      /mnt/nfs/labstore-secondary-tools-home  nfs     vers=4.2,bg,intr,sec=sys,proto=tcp,noatime,lookupcache=all,nofsc,rw,hard        0       0
clouddumps1001.wikimedia.org:   /mnt/nfs/dumps-clouddumps1001.wikimedia.org     nfs     vers=4.2,bg,intr,sec=sys,proto=tcp,noatime,lookupcache=all,nofsc,ro,soft,timeo=300,retrans=3    0       0
clouddumps1002.wikimedia.org:   /mnt/nfs/dumps-clouddumps1002.wikimedia.org     nfs     vers=4.2,bg,intr,sec=sys,proto=tcp,noatime,lookupcache=all,nofsc,ro,soft,timeo=300,retrans=3    0       0
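
For reference, timeo is expressed in tenths of a second, so timeo=300,retrans=3 on a soft mount should return an error to the application once the retransmissions are exhausted (a few minutes at most) instead of retrying forever; only the tools mounts above are hard. A quick way to confirm the options the kernel actually negotiated, which can differ from what fstab requests (intr, for example, has been ignored since kernel 2.6.25):

# Negotiated mount options per NFS mount:
findmnt -t nfs,nfs4 -o TARGET,OPTIONS
# Per-mount details including timeo/retrans as the kernel applies them:
nfsstat -m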

They did get into some sort of retry loop though; there are many messages in dmesg:

[Tue Apr  8 14:13:27 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:13:29 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:14:00 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:14:09 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:15:40 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:15:44 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:15:58 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:16:09 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:16:15 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:16:34 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:17:50 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:18:00 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:18:10 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:18:12 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out
[Tue Apr  8 14:18:15 2025] nfs: server clouddumps1001.wikimedia.org not responding, timed out

Hmm... it would have been interesting to see where the containers were getting stuck, as there's only a handful of tools that actually use the dumps. I suspect the cri-o runtime does some check on the mounts before starting the container, or similar (if so, mounting the parent directory instead might avoid the issue).

My favorite fix for this would be to mount the dumps as a 'multi-attach' RO cinder volume rather than NFS. I wasn't able to get multi-attach to work properly last time I tried but Cinder is quite a bit more mature now so it might be worth another go.
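
For the record, a rough sketch of what that could look like with the OpenStack CLI; the type/volume names and size are illustrative, and multi-attach needs a volume type carrying the multiattach property:

# Hypothetical multi-attach volume type and read-only dumps volume.
openstack volume type create --property multiattach="<is> True" multiattach
openstack volume create --size 20000 --type multiattach dumps-ro
# Mark the volume read-only so every attachment is RO:
openstack volume set --read-only dumps-ro
# Attach the same volume to multiple VMs:
openstack server add volume tools-k8s-worker-nfs-75 dumps-ro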

Hmm... it would have been interesting to see where the containers were getting stuck, as there's only a handful of tools that actually use the dumps. I suspect the cri-o runtime does some check on the mounts before starting the container, or similar (if so, mounting the parent directory instead might avoid the issue).

We're back in the same situation: clouddumps1002 is currently unreachable after some network maintenance (moving to a new switch in T411025: eqiad row C/D cloud hosts pending migration).

At least I now have the answer to @dcaro's question above: pods are failing to start with FailedMount:

Warning  FailedMount  2s (x4 over 6m8s)  kubelet            MountVolume.SetUp failed for volume "dumpsrc-clouddumps1002" : hostPath type check failed: /mnt/nfs/dumps-clouddumps1002.wikimedia.org is not a directory
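
So the kubelet's hostPath type check (the error implies type: Directory) fails while the NFS server is gone, and the pods stay Pending; mounting the parent directory instead, as suggested above, might make the check hit a local directory. A sketch for listing the affected pods, assuming jq is available (the path pattern is illustrative):

# Pods that reference a clouddumps hostPath volume:
kubectl get pods -A -o json | jq -r '
  .items[]
  | select([.spec.volumes[]?.hostPath.path // "" | test("dumps-clouddumps")] | any)
  | .metadata.namespace + "/" + .metadata.name'
# Then inspect FailedMount events on a given pod:
# kubectl describe pod -n <namespace> <pod>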

clouddumps1002 is back and pending pods recovered automatically.

Unassigning myself: this remains important, but I am doing too many other things and I'd rather have somebody else look at it.

I briefly looked at the clouddumps1002 downtime from yesterday, and of course there was ~30m of downtime for dumps.w.o, since 1002 serves those:

2025-11-28-122447_3690x746_scrot.png

Toolforge also experienced pending pods, as mentioned by @fnegri, due to the double NFS mount.

2025-11-28-122641_2769x1731_scrot.png

though I can't find traces of Toolforge unavailability during that period in terms of 5xx served (i.e. currently running workloads kept running and serving requests):

2025-11-28-122903_3766x1063_scrot.png

There might be other signals of Toolforge unavailability that I wasn't able to find, though!

I wrote down a possible plan for making clouddumps handle host downtime better in T411248: Plan to make clouddumps more resilient and easier to operate. To be clear: I'm not suggesting we take any action; I just wanted to put on record what's possible.

Adding a note that PAWS can also be affected by the unavailability of one of the two clouddumps hosts: T413428: PAWS failing to mount volumes.