It's not showing up in Grafana or in the alerts, but that node is stuck on NFS and has many processes in state D (uninterruptible sleep):
root@tools-k8s-worker-nfs-24:~# curl --silent http://127.0.0.1:9100/metrics | grep node_processes_state
# HELP node_processes_state Number of processes in each state.
# TYPE node_processes_state gauge
node_processes_state{state="D"} 31
node_processes_state{state="I"} 81
node_processes_state{state="R"} 1
node_processes_state{state="S"} 153
I suspect this has been happening for a while, and that it is the cause of the stuck pods during the upgrade (and of a few of the currently stuck ones).
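Since this isn't caught by the alerts, here is a rough sketch of how the node_processes_state output above could be checked automatically. It only parses the text that node_exporter already returns. The function names and the `max_d` threshold of 10 are made up for illustration and are not an existing alert or tool; the regex also assumes `state` is the only label on the metric, as in the output above.

```python
import re

# Matches lines such as: node_processes_state{state="D"} 31
# Assumption: "state" is the only label, as in the output pasted above.
_STATE_LINE = re.compile(r'^node_processes_state\{state="([^"]+)"\}\s+(\S+)$')


def parse_process_states(metrics_text):
    """Return a {state: count} dict from node_exporter text output."""
    states = {}
    for line in metrics_text.splitlines():
        match = _STATE_LINE.match(line.strip())
        if match:
            # Values are floats in the exposition format; counts are whole numbers.
            states[match.group(1)] = int(float(match.group(2)))
    return states


def has_stuck_processes(states, max_d=10):
    """True if more than max_d processes are in state D (hypothetical threshold)."""
    return states.get("D", 0) > max_d


# Sample taken from the output of tools-k8s-worker-nfs-24 above.
SAMPLE = """\
# HELP node_processes_state Number of processes in each state.
# TYPE node_processes_state gauge
node_processes_state{state="D"} 31
node_processes_state{state="I"} 81
node_processes_state{state="R"} 1
node_processes_state{state="S"} 153
"""

if __name__ == "__main__":
    states = parse_process_states(SAMPLE)
    print(states)
    print("stuck:", has_stuck_processes(states))
```

On a live worker the same functions could be fed the body of http://127.0.0.1:9100/metrics instead of `SAMPLE`. A sensible threshold would need to be picked by looking at what healthy NFS workers normally report.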