
[nfs] 2025-01-08 tools-nfs outage
Closed, Resolved, Public

Description

NFS had a hiccup at ~14:00 UTC that left a bunch of the workers with stuck
processes. It was noticed because tools-static-15 got stuck too and an alert
showed up right away.

Will fill in the details of the debugging.

It might have been triggered by the draining of cloudcephosd1012, which happened around that time:

164 2025-01-08 14:04:42,975 dcaro 2046235 [DEBUG _cookbook.py:511 in main] Executing cookbook wmcs.ceph.osd.depool_and_destroy with args: ['--osd-hostname', 'cloudcephosd1012', '--cluster-name', 'eqiad1', '--task-id', 'T309789', '--all-osds']

Some investigation: there were two rounds of draining for the host (2 OSD daemons each time). The first one went OK, with no issues on the network or NFS; the second saturated the switches a bit more and triggered extra drops. You can see the switch traffic (and dropped packets) here:

image.png (1,712×641 px, 198 KB)

There were lost pings between OSD nodes too: there were some during the first rebalance, but the second created a bigger cluster of them:

image.png (1,556×594 px, 83 KB)

Now, ceph did not report any slow operations or similar, so my hunch is that the noisy traffic from the rebalance might be affecting the VMs' traffic and making NFS flaky, rather than ceph not replying and the NFS server timing out replies because it had issues writing to or reading from disk.

Event Timeline

dcaro triaged this task as High priority.
dcaro changed the task status from Open to In Progress. Jan 8 2025, 3:14 PM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 17) board.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T15:55:00Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-58 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:00:39Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-58 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:20:14Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-35 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:25:38Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-35 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:33:23Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-72 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:38:45Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-72 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:45:47Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-65 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:51:09Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-65 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:52:11Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-57 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T16:57:32Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-57 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:01:08Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-48 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:06:33Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-48 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:11:23Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-12 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:14:16Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-12 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:22:09Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-44 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:27:31Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-44 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:28:39Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-76 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:33:59Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-76 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:35:55Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-26 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:41:14Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-26 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:43:07Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-37 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:48:29Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-37 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:48:33Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-67 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:53:50Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-67 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:53:55Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-27 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:59:11Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-27 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T17:59:15Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-8 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:04:31Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-8 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:04:35Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-41 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:06:09Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-41 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:06:12Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-47 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:12:22Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-47 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:12:26Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-1 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:14:10Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-1 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:14:12Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:19:44Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:26:13Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-32 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:34:18Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-32 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:34:22Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-43 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-08T18:39:32Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-43 (T383238)

I finished rebalancing the ceph node the morning after while taking snapshots of the processes on both the tools-nfs server and one of the workers, and I was unable to reproduce the issue. We are moving to QoS (done on Thursday after the rebalance), so this might not be an issue in the future (fingers crossed). I'm closing this one, but will create a new one if it happens again and try to keep digging.
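
For reference, a minimal sketch of one way a process snapshot like the ones mentioned above could be taken on a Linux host. This is not the tooling actually used during the investigation; the function names `parse_stat` and `stuck_processes` are hypothetical. It assumes that processes stuck on NFS show up in state `D` (uninterruptible sleep) in `/proc/<pid>/stat`, which is the usual symptom but was not confirmed in this task.

```python
import os
from typing import List, Tuple


def parse_stat(stat_line: str) -> Tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")
    pid = int(stat_line[:lpar].strip())
    comm = stat_line[lpar + 1:rpar]
    state = stat_line[rpar + 1:].split()[0]
    return pid, comm, state


def stuck_processes(proc_root: str = "/proc") -> List[Tuple[int, str]]:
    """List (pid, comm) for processes currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited (or is unreadable) while scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling `stuck_processes()` periodically on the NFS server and on a worker during a rebalance would give a rough picture of whether, and when, processes start piling up in uninterruptible sleep.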

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 17) board.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:38:28Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-1 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:42:02Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-1 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:42:07Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-20 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:42:11Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-20 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:47:51Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-58 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:53:13Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-58 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:53:16Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:58:37Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-16 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T20:58:40Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-13 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:03:31Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-13 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:03:35Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-35 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:08:53Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-35 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:08:57Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-2 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:13:02Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-75 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:14:04Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-75 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:14:14Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-2 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:14:17Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-21 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:18:36Z] <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-21 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:24:36Z] <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-19 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-13T21:29:56Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-19 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T18:32:33Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-10 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T18:32:42Z] <wmbot~dcaro@acme> END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-10 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T18:36:06Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-10 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T18:37:51Z] <wmbot~dcaro@acme> END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-10 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T18:41:00Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-10 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T18:42:07Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-10 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T19:00:02Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-57 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T19:01:36Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-57 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-18T09:57:36Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-9 (T383238)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-18T10:03:18Z] <wmbot~dcaro@acme> END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-9 (T383238)