
[infra] NFS hangs in some workers until the worker is rebooted (2024-05-14)
Closed, Resolved, Public

Description

This time tools-k8s-worker-nfs-9 is the one having issues.
From the alert:

image.png (352×446 px, 42 KB)

Previous task: https://phabricator.wikimedia.org/T362690

Number of processes in D state (uninterruptible sleep) per worker:

image.png (345×878 px, 59 KB)

There are two workers with high numbers but very different patterns: one is spiky (nfs-52) and the other is continuous (nfs-9, the one that triggered the alert).
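
For reference, the per-worker D process count can be approximated by hand on a worker. This is a minimal sketch assuming procps `ps`; the helper name `count_d_state` is made up for illustration, and this is not necessarily how the alert computes its number:

```shell
# Hypothetical helper: counts processes in D state (uninterruptible sleep).
# Reads lines of the form "<state> <command>" on stdin, for example the
# output of `ps -eo state=,comm=`.
count_d_state() {
    awk '$1 == "D" { n++ } END { print n + 0 }'
}

# Usage on a worker (not run here):
#   ps -eo state=,comm= | count_d_state
```

Sampling this every few seconds should make the difference between a spiky and a continuous pattern visible without the dashboard.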

Event Timeline

dcaro triaged this task as High priority. May 14 2024, 7:54 AM
dcaro raised the priority of this task from High to Needs Triage.
dcaro triaged this task as High priority.
dcaro moved this task from Ready to be worked on to Toolforge iteration 09 on the Toolforge board.
dcaro edited projects, added Toolforge (Toolforge iteration 09); removed Toolforge.

nfs-9 has been drained without issues, so I can start debugging on it.

Looking into nfs-52...

The load on nfs-52 seems to come from osm4wiki processes, which have been killed several times for going over the memory limit. NFS is responsive, just slow when the load becomes high, so this is probably unrelated.

On nfs-9, there are 3 errors that repeat over and over in the logs, right up until the moment when the number of D processes starts to rise:

root@tools-k8s-worker-nfs-9:~# journalctl --since "2024-05-14 04:00:00" | grep kubelet | grep 'Error syncing pod'
...
May 14 05:13:37 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:37.341891  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:13:47 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:47.340454  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:13:47 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:47.340979  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"job\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=job pod=k8s-20170515.signature-check.wiktionary-5669b58c56-hbt5s_tool-signature-checker(fa1e050d-3002-40a1-9672-0eed6956e925)\"" pod="tool-signature-checker/k8s-20170515.signature-check.wiktionary-5669b58c56-hbt5s" podUID=fa1e050d-3002-40a1-9672-0eed6956e925
May 14 05:13:49 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:49.339518  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:13:59 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:59.340072  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:00 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:00.339848  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:13 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:13.340330  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:14 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:14.339403  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:25 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:25.340147  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:27 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:27.342502  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:36 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:36.340171  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:51 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:51.340674  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:15:04 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:15:04.341238  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
##### stops here, no more errors after this
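
To see at a glance which pods account for the repeated errors, the same journal output can be grouped per pod. This is a rough sketch: `summarize_sync_errors` is a hypothetical helper, and the `sed` pattern assumes the kubelet line format shown above (a quoted `pod="namespace/name"` field near the end of the line).

```shell
# Hypothetical helper: reads journal lines on stdin and prints
# "<count> <namespace/pod>" for every pod with "Error syncing pod" entries,
# most frequent first.
summarize_sync_errors() {
    grep 'Error syncing pod' \
        | sed -n 's/.* pod="\([^"]*\)".*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Usage on the worker (not run here):
#   journalctl --since "2024-05-14 04:00:00" | grep kubelet | summarize_sync_errors
```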

Then there's a big gap in activity until I purged the node:

## some minor activity between 05:15:00 and 05:40:00
May 14 05:40:16 tools-k8s-worker-nfs-9 containerd[601]: time="2024-05-14T05:40:16.534134054Z" level=info msg="StartContainer for \"dd65176e6230044ab19f6bca499bd2e304016b5692a61e0acf727a64207a70f1\" returns successfully"                                                                                                                                                                   
May 14 05:40:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.yndrid.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:41:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.z3H1eM.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:46:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.AIixdZ.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:49:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.RDrLXL.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:49:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.WeyeaP.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:51:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.BnRiyl.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:51:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.3xalYO.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:51:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.rAFWuJ.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:52:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.upJ7K5.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:54:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.38SPCl.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 05:59:31 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.92zrfe.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:02:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.zS9hbp.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:02:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.ee5qnH.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:02:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.c9SUGQ.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:03:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.AK1Tl9.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:05:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.lnu7Fy.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:05:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.dJt9i8.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:09:31 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.1ASzir.mount: Deactivated successfully.                                                                                                                                                                                                   
May 14 06:11:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.kZJNrZ.mount: Deactivated successfully.
May 14 06:11:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.g7Xvaa.mount: Deactivated successfully.
May 14 06:12:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.s0gSpI.mount: Deactivated successfully.
May 14 06:20:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.zyZYZN.mount: Deactivated successfully.
May 14 06:23:31 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.GFGN0k.mount: Deactivated successfully.
May 14 06:26:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.BYMpHG.mount: Deactivated successfully.
May 14 06:28:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.ZoXeCU.mount: Deactivated successfully.
May 14 06:30:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.XIkKG2.mount: Deactivated successfully.
May 14 06:35:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.U5tDXd.mount: Deactivated successfully.
May 14 06:36:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.dI9qdz.mount: Deactivated successfully.
May 14 06:37:31 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.Pc9TCM.mount: Deactivated successfully.
May 14 06:38:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.FlKvtI.mount: Deactivated successfully.
May 14 06:39:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.ggdZeD.mount: Deactivated successfully.
May 14 06:45:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.SMoCrc.mount: Deactivated successfully.
May 14 06:47:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.zK528H.mount: Deactivated successfully.
May 14 06:48:31 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.4bkJLa.mount: Deactivated successfully.
May 14 06:56:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.tbUxBD.mount: Deactivated successfully.
May 14 06:57:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.SKB7US.mount: Deactivated successfully.
May 14 06:59:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.gFdoED.mount: Deactivated successfully.
May 14 07:00:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.SGI9zs.mount: Deactivated successfully.
May 14 07:00:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.dIFn9T.mount: Deactivated successfully.
May 14 07:05:31 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.8yGNCt.mount: Deactivated successfully.
May 14 07:06:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.o8eVkM.mount: Deactivated successfully.
May 14 07:12:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.LA0iRM.mount: Deactivated successfully.
May 14 07:12:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.TIWp7Y.mount: Deactivated successfully.
May 14 07:13:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.gfYFdc.mount: Deactivated successfully.
May 14 07:15:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.o7v0Dc.mount: Deactivated successfully.
May 14 07:19:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.3XsY0E.mount: Deactivated successfully.
May 14 07:20:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.ndOh6B.mount: Deactivated successfully.
May 14 07:22:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.1tKLLa.mount: Deactivated successfully.
May 14 07:27:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.BQ6LyA.mount: Deactivated successfully.
May 14 07:28:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.yAZnDj.mount: Deactivated successfully.
May 14 07:28:21 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.7k3nIL.mount: Deactivated successfully.
May 14 07:29:51 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.mFIWs8.mount: Deactivated successfully.
May 14 07:32:11 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.sz61hE.mount: Deactivated successfully.
May 14 07:33:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.q41u5r.mount: Deactivated successfully.
May 14 07:38:01 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.QFHl59.mount: Deactivated successfully.
May 14 07:38:31 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.n6UqBU.mount: Deactivated successfully.
May 14 07:47:41 tools-k8s-worker-nfs-9 systemd[1]: run-containerd-runc-k8s.io-7c084195c72e6356f812b21c7e914b6430e6cb152f4a9493066e9ec8eb4caa84-runc.LqajjL.mount: Deactivated successfully.
May 14 07:48:14 tools-k8s-worker-nfs-9 containerd[601]: time="2024-05-14T07:48:14.686321859Z" level=info msg="StopContainer for \"2df36be67beb0b2c641c4336ea25064897814acf0d821fc9e0196485a69feb19\" with timeout 1 (s)"   <- purge started here

dmesg also shows frequent activity every few minutes, and then at 05:07 there's a 30-minute gap until 05:37, which is the last log entry:

[Tue May 14 04:57:37 2024] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue May 14 04:57:37 2024] IPv6: ADDRCONF(NETDEV_CHANGE): cali4e0ab78cff4: link becomes ready
[Tue May 14 04:57:37 2024] audit: type=1400 audit(1715662808.050:106479): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali4e0ab78cff4/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 04:57:37 2024] audit: type=1400 audit(1715662808.054:106480): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali4e0ab78cff4/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 04:57:45 2024] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue May 14 04:57:45 2024] IPv6: ADDRCONF(NETDEV_CHANGE): cali721102f8203: link becomes ready
[Tue May 14 04:57:45 2024] audit: type=1400 audit(1715662815.546:106481): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali721102f8203/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 04:57:45 2024] audit: type=1400 audit(1715662815.546:106482): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali721102f8203/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:02:31 2024] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue May 14 05:02:31 2024] IPv6: ADDRCONF(NETDEV_CHANGE): cali6b0fedab198: link becomes ready
[Tue May 14 05:02:31 2024] audit: type=1400 audit(1715663102.464:106483): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali6b0fedab198/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:02:31 2024] audit: type=1400 audit(1715663102.464:106484): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali6b0fedab198/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:02:34 2024] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue May 14 05:02:34 2024] IPv6: ADDRCONF(NETDEV_CHANGE): cali84126633007: link becomes ready
[Tue May 14 05:02:34 2024] audit: type=1400 audit(1715663104.748:106485): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali84126633007/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:02:34 2024] audit: type=1400 audit(1715663104.748:106486): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali84126633007/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:03:34 2024] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue May 14 05:03:34 2024] IPv6: ADDRCONF(NETDEV_CHANGE): cali940e6d5fee7: link becomes ready
[Tue May 14 05:03:34 2024] audit: type=1400 audit(1715663164.664:106487): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali940e6d5fee7/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:03:34 2024] audit: type=1400 audit(1715663164.664:106488): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali940e6d5fee7/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:07:35 2024] IPv6: ADDRCONF(NETDEV_CHANGE): cali7e588644315: link becomes ready
[Tue May 14 05:07:35 2024] audit: type=1400 audit(1715663405.502:106489): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/cali7e588644315/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:37:45 2024] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Tue May 14 05:37:45 2024] IPv6: ADDRCONF(NETDEV_CHANGE): calif449e73454d: link becomes ready
[Tue May 14 05:37:45 2024] audit: type=1400 audit(1715665215.883:106490): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/calif449e73454d/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[Tue May 14 05:37:45 2024] audit: type=1400 audit(1715665215.883:106491): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/sys/devices/virtual/net/calif449e73454d/type" pid=1446675 comm="sssd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
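
A gap like the one between 05:07 and 05:37 can also be spotted mechanically instead of by eye. This is a sketch, assuming the kernel log is read with epoch timestamps via `journalctl -k -o short-unix`; `find_gaps` and its default threshold of 600 seconds are made up for illustration.

```shell
# Hypothetical helper: reads lines whose first field is an epoch timestamp
# (as printed by `journalctl -o short-unix`) and prints
# "<previous_ts> <next_ts> <gap_seconds>" for every gap of at least
# $1 seconds (default 600). Fractions of a second are truncated.
find_gaps() {
    awk -v min="${1:-600}" '
        $1 !~ /^[0-9]+(\.[0-9]+)?$/ { next }   # skip lines without a timestamp
        seen && $1 - prev >= min { printf "%d %d %d\n", prev, $1, $1 - prev }
        { prev = $1; seen = 1 }
    '
}

# Usage on the worker (not run here):
#   journalctl -k -o short-unix --since "2024-05-14 04:00:00" | find_gaps 600
```

Fed the audit timestamps from the excerpt above (1715663405 and 1715665215), this would report a gap of 1810 seconds, which matches the roughly 30 minutes of silence.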

That seems to point to k8s no longer scheduling pods on the node, so no more container activity happened after NFS got stuck. Still looking...

dcaro changed the task status from Open to In Progress. May 14 2024, 1:53 PM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 09) board.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-15T14:10:56Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-9 (T364822)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-15T14:11:50Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-9 (T364822)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-15T14:16:51Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-9 (T364822)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-15T14:17:41Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-9 (T364822)

Just restarted the node and put it back into the pool; I will try to debug further the next time it happens.

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 10) board.