NFS had a hiccup at ~14:00 UTC that left a bunch of the workers with stuck
processes. We noticed because tools-static-15 got stuck too and raised an alert
right away.
Will fill in details of the debugging.
It might have been triggered by the draining of cloudcephosd1012, which happened around that time:
164 2025-01-08 14:04:42,975 dcaro 2046235 [DEBUG _cookbook.py:511 in main] Executing cookbook wmcs.ceph.osd.depool_and_destroy with args: ['--osd-hostname', 'cloudcephosd1012', '--cluster-name', 'eqiad1', '--task-id', 'T309789', '--all-osds']
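To check how close a cookbook run was to the NFS hiccup, a log line like the one above can be parsed and compared against the incident time. This is only a rough sketch: the field layout in the regex is inferred from that single line, the timestamp is assumed to be UTC, "~14:00 UTC" is approximated as exactly 14:00:00, and the helper names `parse_cookbook_line` and `within_window` and the 15-minute window are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, inferred from one sample line:
# "<timestamp>,<ms> <user> <pid> [<LEVEL> <file>:<line> in <func>] <message>"
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"(?P<user>\S+) (?P<pid>\d+) "
    r"\[(?P<level>\w+) (?P<where>[^\]]+)\] (?P<msg>.*)"
)


def parse_cookbook_line(line):
    """Return the start time, user and message of a log line, or None if it does not match."""
    match = LOG_RE.search(line)  # search() skips any leading prefix such as "164 "
    if match is None:
        return None
    started = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
    return {"started": started, "user": match["user"], "message": match["msg"]}


def within_window(event_time, incident_time, minutes=15):
    """True if the two timestamps are at most `minutes` apart (hypothetical threshold)."""
    return abs(event_time - incident_time) <= timedelta(minutes=minutes)


line = (
    "164 2025-01-08 14:04:42,975 dcaro 2046235 [DEBUG _cookbook.py:511 in main] "
    "Executing cookbook wmcs.ceph.osd.depool_and_destroy with args: "
    "['--osd-hostname', 'cloudcephosd1012', '--cluster-name', 'eqiad1', "
    "'--task-id', 'T309789', '--all-osds']"
)

event = parse_cookbook_line(line)
incident = datetime(2025, 1, 8, 14, 0, 0)  # "~14:00 UTC", approximated
print(event["started"], within_window(event["started"], incident))
```

The same helper could be run over the whole cookbook log to list every drain round near the incident, which would help tell the first (harmless) round apart from the second.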
Some investigation: there were two rounds of draining for the host (2 OSD daemons each time). The first one went OK, with no issues on the network or NFS. The second saturated the switches a bit more and triggered extra drops; you can see the switch traffic (and dropped packets) here:
There were lost pings between OSD nodes too: there were some during the first rebalance, but the second produced a bigger cluster of them:
Now, ceph did not report any slow operations or similar, so my hunch is that the noisy traffic from the rebalance was affecting the VMs' traffic and making NFS flaky, rather than ceph failing to reply and the NFS server timing out replies because it had trouble reading from or writing to disk.

