2025-03-15 Tools NFS hiccup
Closed, Resolved, Public

Description

dcaro got paged at ~12:40 UTC about NFS home not being writable.

A quick look at the tools-nfs-2 logs shows one error:

[Sat Mar 15 12:34:50 2025] rpc-srv/tcp: nfsd: got error -32 when sending 160 bytes - shutting down socket
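
For reference, the kernel logs errors as negative errno values, and on Linux errno 32 is EPIPE (broken pipe), i.e. nfsd tried to send on a socket whose peer had already gone away. This says nothing about why the peer went away. A minimal sketch for decoding such a line; the helper name `decode_kernel_error` is made up for illustration:

```python
import errno
import os
import re

# Log line copied from the tools-nfs-2 output above.
LOG_LINE = (
    "[Sat Mar 15 12:34:50 2025] rpc-srv/tcp: nfsd: got error -32 "
    "when sending 160 bytes - shutting down socket"
)


def decode_kernel_error(line):
    """Return (errno name, description) for a 'got error -N' log line, or None.

    Hypothetical helper, not part of any existing tooling.
    """
    match = re.search(r"got error -(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


print(decode_kernel_error(LOG_LINE))
```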

Looking into it.

Event Timeline

dcaro triaged this task as High priority.

Logging in to the bastion as my user and as the wm-lol tool looks OK, so this is not a general breakdown:

tools.wm-lol@tools-bastion-13:~$ toolforge jobs run --continuous --command 'while date; do sleep 1; done' --image python3.11 test

tools.wm-lol@tools-bastion-13:~$ tail -f test.out
Sat Mar 15 12:48:58 PM UTC 2025
Sat Mar 15 12:48:59 PM UTC 2025
Sat Mar 15 12:49:00 PM UTC 2025
Sat Mar 15 12:49:01 PM UTC 2025
Sat Mar 15 12:49:02 PM UTC 2025
Sat Mar 15 12:49:03 PM UTC 2025
Sat Mar 15 12:49:04 PM UTC 2025
Sat Mar 15 12:49:05 PM UTC 2025
Sat Mar 15 12:49:06 PM UTC 2025
Sat Mar 15 12:49:07 PM UTC 2025
Sat Mar 15 12:49:08 PM UTC 2025
Sat Mar 15 12:49:09 PM UTC 2025
Sat Mar 15 12:49:10 PM UTC 2025
^C

Some of the workers recovered:

(screenshot attachment: image.png)

And the page resolved by itself.

tools-static-15 also got stuck, probably due to the NFS hiccup:

root@tools-static-15:~# ps aux | grep D
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
www-data  391144  0.3  0.2  65128  9196 ?        D    Jan30 202:16 nginx: worker process
www-data  391145  0.9  0.2  66096 11028 ?        D    Jan30 586:17 nginx: worker process
root     1464327  0.0  0.1  15552  7332 ?        Ss   Feb19   2:42 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
Debian-+ 2729295  0.0  0.3  38208 14692 ?        Ss   Mar14   0:08 /usr/sbin/exim4 -bd -q1m
root     2771699  0.0  0.0   3876  1668 pts/0    S+   12:52   0:00 grep --color=auto D
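
Note that `grep D` matches any line containing a capital D, which is why sshd (`-D`), exim4 (`Debian-+`) and grep itself show up above; only the two nginx workers are actually in uninterruptible sleep (STAT `D`). A rough sketch of filtering on the STAT column instead, using the output above as sample data; the function name `uninterruptible` is an assumption for illustration:

```python
# Sample `ps aux` output copied from tools-static-15 above.
PS_OUTPUT = """\
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
www-data  391144  0.3  0.2  65128  9196 ?        D    Jan30 202:16 nginx: worker process
www-data  391145  0.9  0.2  66096 11028 ?        D    Jan30 586:17 nginx: worker process
root     1464327  0.0  0.1  15552  7332 ?        Ss   Feb19   2:42 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
Debian-+ 2729295  0.0  0.3  38208 14692 ?        Ss   Mar14   0:08 /usr/sbin/exim4 -bd -q1m
root     2771699  0.0  0.0   3876  1668 pts/0    S+   12:52   0:00 grep --color=auto D
"""


def uninterruptible(ps_aux_output):
    """Return (pid, command) for every process whose STAT starts with 'D'."""
    stuck = []
    for line in ps_aux_output.splitlines()[1:]:  # skip the header row
        # `ps aux` has 11 columns; COMMAND is last and may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        pid, stat, command = fields[1], fields[7], fields[10]
        if stat.startswith("D"):
            stuck.append((int(pid), command))
    return stuck


for pid, command in uninterruptible(PS_OUTPUT):
    print(pid, command)
```

With the sample above this lists only PIDs 391144 and 391145.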

root@tools-static-15:~# dmesg -T | grep nfs
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 30 18:17:11 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Thu Jan 30 18:17:13 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Thu Feb 20 12:18:56 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Sat Mar 15 12:37:49 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
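
The `dmesg -T` output shows that this client has logged the same "not responding" message on earlier dates too. A small sketch for pulling just those events out with parsed timestamps, using a subset of the lines above as sample data; the name `nfs_stalls` is hypothetical, and the timestamp parsing assumes an English (C) locale:

```python
import re
from datetime import datetime

# Sample `dmesg -T` lines copied from tools-static-15 above.
DMESG = """\
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 30 18:17:11 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Thu Jan 30 18:17:13 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Thu Feb 20 12:18:56 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Sat Mar 15 12:37:49 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
"""

PATTERN = re.compile(
    r"^\[(?P<ts>[^\]]+)\] nfs: server (?P<server>\S+) not responding"
)


def nfs_stalls(text):
    """Return (timestamp, server) for each 'not responding' line."""
    events = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            when = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
            events.append((when, match.group("server")))
    return events


for when, server in nfs_stalls(DMESG):
    print(when.isoformat(), server)
```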

It got unstuck by itself:

root@tools-static-15:~# ps aux | grep D
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     1464327  0.0  0.1  15552  7332 ?        Ss   Feb19   2:42 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
Debian-+ 2729295  0.0  0.3  38208 14692 ?        Ss   Mar14   0:08 /usr/sbin/exim4 -bd -q1m
root     2771749  0.0  0.0   3876  1812 pts/0    S+   12:53   0:00 grep --color=auto D

And the alert went away :), so I'll keep this task open until Monday, but the incident seems resolved (for the most part). I'll check the NFS workers again in a bit to see if any are still stuck, and reboot the ones that did not heal.

Mentioned in SAL (#wikimedia-cloud) [2025-03-15T12:55:54Z] <dcaro> there was an NFS hiccup that made the NFS checks fail for a second and some workers get stuck for a bit T388965

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:14:10Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16,tools-k8s-worker-nfs-34,tools-k8s-worker-nfs-77 (T388965)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:14:23Z] <wmbot~dcaro@urcuchillay> END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-16,tools-k8s-worker-nfs-34,tools-k8s-worker-nfs-77 (T388965)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:14:31Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-77 (T388965)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:31:55Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-77 (T388965)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T14:51:01Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-75 (T388965)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T14:52:39Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-75 (T388965)

taavi subscribed.

Anything left to do here?

Nope, closing