This morning we became aware that the VMs on ganeti1025 were unresponsive. Two VMs were running on it:

moscovium.eqiad.wmnet
logstash1024.eqiad.wmnet
As reported by @tappof on IRC:

It seems that on ganeti1025 there is an md check operation running on /dev/md2 (stuck at 39%). Also, the load average increased around the same time that the logstash1024/moscovium issue was reported.
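The progress of an md check like this can be inspected directly on the host; a minimal sketch, assuming the standard mdadm userland is installed:

```shell
# Overall md state; a running check shows up as "check = NN%" under the array
cat /proc/mdstat
# Per-array detail for the device mentioned in the report
mdadm --detail /dev/md2
```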
The load average spiked at about 10:45 according to Grafana and then flatlined at about 15. Consoles to the VMs appeared to open from the master node, but no output was shown at all. I think the issue may be similar to T348730: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye; we see these logs starting at the same time:
Nov 1 10:45:01 ganeti1025 CRON[1478566]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov 1 10:45:01 ganeti1025 CRON[1478567]: (root) CMD ([ -x /usr/sbin/ganeti-watcher ] && /usr/sbin/ganeti-watcher)
Nov 1 10:45:05 ganeti1025 kernel: [12360956.583238] block drbd4: We did not send a P_BARRIER for 42072ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Nov 1 10:45:06 ganeti1025 kernel: [12360957.212642] drbd resource6: meta connection shut down by peer.
Nov 1 10:45:06 ganeti1025 kernel: [12360957.218792] drbd resource6: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Nov 1 10:45:06 ganeti1025 kernel: [12360957.421439] drbd resource2: meta connection shut down by peer.
After this the "drbd kernel thread blocked?" messages repeat regularly, along with others reporting "task md2_raid5:566 blocked for more than 120 seconds" (see full logs).
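For reference, the DRBD connection state and the blocked-task traces can be pulled from the host with something like the following (a sketch; the exact tooling depends on the DRBD userland version installed):

```shell
# Connection/role/disk state for each DRBD resource
drbdsetup status
# Kernel log entries for hung tasks and DRBD events, with readable timestamps
dmesg -T | grep -E 'blocked for more than|drbd'
```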
I didn't immediately try to drain the node, as I feared the issues would prevent the snapshot/move process from working, so instead I tried to shut down the instances. This did not work: the jobs were submitted but went into "error" state:
Result:
OpExecError
Could not shutdown instance 'logstash1024.eqiad.wmnet': Error 28: Operation timed out after 900568 milliseconds with 0 bytes received
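For the record, the shutdown attempt was of roughly this shape (a hedged sketch, run from the cluster master; the timeout value and the job-id placeholder are illustrative):

```shell
# Ask Ganeti to cleanly shut the instance down, with an explicit timeout in seconds
gnt-instance shutdown --timeout=300 logstash1024.eqiad.wmnet
# Inspect the resulting job and its OpExecError
gnt-job list
gnt-job info <job-id>
```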
Eventually I rebooted the node (using the cookbook). It didn't seem to shut down cleanly: after a few minutes a console on the host showed only a flashing cursor, so I issued a power cycle via the iDRAC. After that the node booted OK, and the VMs showed as being on it but in 'admin down' state (presumably due to the earlier shutdown command). That allowed me to issue gnt-instance failover for the two VMs, after which I started them and they seem to be OK.
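The recovery steps amounted to something like the following (a sketch, run from the cluster master):

```shell
# Move each instance to its DRBD secondary, then bring it back up
gnt-instance failover logstash1024.eqiad.wmnet
gnt-instance start logstash1024.eqiad.wmnet
gnt-instance failover moscovium.eqiad.wmnet
gnt-instance start moscovium.eqiad.wmnet
```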
Currently the high-level health stats for ganeti1025 look OK, but it's best we keep it drained for now while we try to determine exactly what happened here.
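Keeping the node drained can be done explicitly with standard Ganeti tooling (a sketch; the usual operational wrappers/cookbooks may do the equivalent):

```shell
# Mark ganeti1025 as drained so the allocator won't place new instances on it
gnt-node modify --drained=yes ganeti1025.eqiad.wmnet
# Confirm the flag is set
gnt-node list -o name,drained ganeti1025.eqiad.wmnet
```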