
[infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state
Closed, Resolved · Public

Assigned To: dcaro
Authored By: dcaro, Nov 6 2024, 9:38 AM

Description

It's not showing up in Grafana or in the alerts, but that node is stuck on NFS and has many processes stuck in D state:

root@tools-k8s-worker-nfs-24:~# curl --silent http://127.0.0.1:9100/metrics | grep node_processes_state
# HELP node_processes_state Number of processes in each state.
# TYPE node_processes_state gauge
node_processes_state{state="D"} 31
node_processes_state{state="I"} 81
node_processes_state{state="R"} 1
node_processes_state{state="S"} 153
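
To see which processes are actually stuck in D state (uninterruptible sleep, typically blocked on NFS I/O), a quick check on the node itself is something like the following (standard ps/awk, output omitted here):

# list D-state processes and the kernel function they are blocked in
root@tools-k8s-worker-nfs-24:~# ps -eo state,pid,wchan:30,cmd | awk '$1 ~ /^D/'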

I suspect this has been happening for a while, and that it is the cause of the pods that got stuck during the upgrade (and a few of the currently stuck ones).

Event Timeline

Oh, I think it might be because the VM's status is pending confirmation of a resize/migrate operation:

image.png (471×455 px, 57 KB)

That started on the 20th of Sept:

image.png (246×1 px, 29 KB)

Manually confirmed the migration, and the server is back in the 'active' state; let's see if it now shows up in Prometheus.
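
For the record, the manual confirmation can be done from the OpenStack CLI roughly as below (a sketch only; older openstackclient versions use the flag form 'openstack server resize --confirm' instead):

# a pending resize/cold-migration shows up as status VERIFY_RESIZE
openstack server show tools-k8s-worker-nfs-24 -f value -c status
# confirm the pending operation so the server returns to ACTIVE
openstack server resize confirm tools-k8s-worker-nfs-24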

There you go, waiting for the first scrape:

image.png (50×2 px, 12 KB)
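
Once the target is back, the same series can also be checked straight from the Prometheus query API; the URL below is a placeholder, not the actual Toolforge Prometheus endpoint:

# instant query for the D-state process count on this node
curl --silent 'http://prometheus.example.org/api/v1/query' \
  --data-urlencode 'query=node_processes_state{instance=~"tools-k8s-worker-nfs-24.*", state="D"}'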

The alert should trigger soon:

image.png (391×3 px, 94 KB)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-06T10:13:24Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-24 (T379139)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-06T10:14:08Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-24 (T379139)

dcaro claimed this task.
dcaro triaged this task as High priority.
dcaro moved this task from Next Up to Done on the Toolforge (Toolforge iteration 16) board.