Dcaro got paged at ~12:40 UTC about NFS home not being writable.
A quick look at the tools-nfs-2 logs shows one error:
[Sat Mar 15 12:34:50 2025] rpc-srv/tcp: nfsd: got error -32 when sending 160 bytes - shutting down socket
Looking
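For reference when reading that log line: on Linux, errno 32 is EPIPE (broken pipe), which usually means the other end of the TCP connection was already gone when nfsd tried to send. The helper below is a minimal sketch for decoding such lines; `nfsd_errno_name` is a hypothetical name, and the table only covers a handful of common Linux errno values.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tooling): extract the errno
# from an nfsd "got error -N" kernel log line and print its symbolic name.
# Only a small subset of Linux errno values is mapped here.
nfsd_errno_name() {
    # Capture the digits after "got error -"; empty if the line does not match.
    num=$(printf '%s\n' "$1" | sed -n 's/.*got error -\([0-9][0-9]*\).*/\1/p')
    case "$num" in
        11)  echo "EAGAIN" ;;
        32)  echo "EPIPE" ;;
        104) echo "ECONNRESET" ;;
        110) echo "ETIMEDOUT" ;;
        "")  echo "no-errno-found" ;;
        *)   echo "errno-$num" ;;
    esac
}

nfsd_errno_name '[Sat Mar 15 12:34:50 2025] rpc-srv/tcp: nfsd: got error -32 when sending 160 bytes - shutting down socket'
```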
dcaro, Mar 15 2025, 12:43 PM
F58834978: image.png (Mar 15 2025, 12:50 PM)
F58834956: image.png (Mar 15 2025, 12:45 PM)
Logging in to the bastion as my user and as the wm-lol tool looks OK, so this is not a general breakdown:
tools.wm-lol@tools-bastion-13:~$ toolforge jobs run --continuous --command 'while date; do sleep 1; done' --image python3.11 test
tools.wm-lol@tools-bastion-13:~$ tail -f test.out
Sat Mar 15 12:48:58 PM UTC 2025
Sat Mar 15 12:48:59 PM UTC 2025
Sat Mar 15 12:49:00 PM UTC 2025
Sat Mar 15 12:49:01 PM UTC 2025
Sat Mar 15 12:49:02 PM UTC 2025
Sat Mar 15 12:49:03 PM UTC 2025
Sat Mar 15 12:49:04 PM UTC 2025
Sat Mar 15 12:49:05 PM UTC 2025
Sat Mar 15 12:49:06 PM UTC 2025
Sat Mar 15 12:49:07 PM UTC 2025
Sat Mar 15 12:49:08 PM UTC 2025
Sat Mar 15 12:49:09 PM UTC 2025
Sat Mar 15 12:49:10 PM UTC 2025
^C
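A quicker way to probe writability from a shell, without starting a job, is to try creating a file under a timeout. The function below is only a sketch: `nfs_write_check` and the probe file name are made up for illustration, and `-s KILL` is used on the assumption that a process blocked on a hung NFS mount may only react to SIGKILL.

```shell
#!/bin/sh
# Hypothetical probe: create and remove a file in a directory, giving up after
# a timeout so a hung NFS mount does not block the shell indefinitely.
# Usage: nfs_write_check DIR [SECONDS]
nfs_write_check() {
    dir=$1
    secs=${2:-5}
    probe="$dir/.write-probe.$$"
    # timeout(1) is from GNU coreutils; -s KILL because a task blocked on NFS
    # I/O may ignore SIGTERM (assumption about how the mount behaves).
    if timeout -s KILL "$secs" touch "$probe"; then
        timeout -s KILL "$secs" rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable (or timed out): $dir"
        return 1
    fi
}

# Demo against a local directory; on a bastion one would pass "$HOME" instead.
nfs_write_check "${TMPDIR:-/tmp}"
```

A non-zero return covers both a failed write (for example a read-only mount) and a write that did not finish within the timeout, so the two cases are not distinguished.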
tools-static-15 also got stuck, probably due to the NFS hiccup:
root@tools-static-15:~# ps aux | grep D
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
www-data  391144  0.3  0.2  65128  9196 ?        D    Jan30 202:16 nginx: worker process
www-data  391145  0.9  0.2  66096 11028 ?        D    Jan30 586:17 nginx: worker process
root     1464327  0.0  0.1  15552  7332 ?        Ss   Feb19   2:42 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
Debian-+ 2729295  0.0  0.3  38208 14692 ?        Ss   Mar14   0:08 /usr/sbin/exim4 -bd -q1m
root     2771699  0.0  0.0   3876  1668 pts/0    S+   12:52   0:00 grep --color=auto D
root@tools-static-15:~# dmesg -T | grep nfs
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 23 14:11:35 2025] nfs: Deprecated parameter 'intr'
[Thu Jan 30 18:17:11 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Thu Jan 30 18:17:13 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Thu Feb 20 12:18:56 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
[Sat Mar 15 12:37:49 2025] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
It got unstuck by itself:
root@tools-static-15:~# ps aux | grep D
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     1464327  0.0  0.1  15552  7332 ?        Ss   Feb19   2:42 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
Debian-+ 2729295  0.0  0.3  38208 14692 ?        Ss   Mar14   0:08 /usr/sbin/exim4 -bd -q1m
root     2771749  0.0  0.0   3876  1812 pts/0    S+   12:53   0:00 grep --color=auto D
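As the output above shows, `grep D` also matches any line containing an uppercase D (the header, `sshd -D`, `Debian-+`, and the grep itself). A stricter filter can match on the process state column only. This is a sketch, with `filter_d_state` as a made-up helper name; it expects rows in the `pid,stat,user,args` layout.

```shell
#!/bin/sh
# Hypothetical helper: print only processes whose state starts with "D"
# (uninterruptible sleep). It reads `ps -eo pid,stat,user,args` style rows
# on stdin, so it can also be run against captured output.
filter_d_state() {
    # NR > 1 skips the header row; $2 is the STAT column in this layout.
    awk 'NR > 1 && $2 ~ /^D/'
}

# Live usage: prints nothing when no process is in D state.
ps -eo pid,stat,user,args | filter_d_state
```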
And the alert went away :). I'll keep this task open until Monday, but the incident seems resolved (for the most part). I'll check the NFS workers again in a bit to see whether any are still stuck, and reboot the ones that did not heal.
Mentioned in SAL (#wikimedia-cloud) [2025-03-15T12:55:54Z] <dcaro> there was an NFS hiccup that made the NFS checks fail for a second and some workers get stuck for a bit T388965
Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:14:10Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16,tools-k8s-worker-nfs-34,tools-k8s-worker-nfs-77 (T388965)
Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:14:23Z] <wmbot~dcaro@urcuchillay> END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-16,tools-k8s-worker-nfs-34,tools-k8s-worker-nfs-77 (T388965)
Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:14:31Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-77 (T388965)
Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-15T15:31:55Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-16, tools-k8s-worker-nfs-34, tools-k8s-worker-nfs-77 (T388965)
Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T14:51:01Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-75 (T388965)
Mentioned in SAL (#wikimedia-cloud-feed) [2025-03-17T14:52:39Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-75 (T388965)