
ganeti1025 VMs unresponsive Nov 1 2024
Closed, ResolvedPublic

Description

We became aware this morning that the VMs on ganeti1025 were unresponsive. Two VMs were running on it:

moscovium.eqiad.wmnet
logstash1024.eqiad.wmnet

As reported by @tappof on IRC:

It seems that on ganeti1025 there is a check operation running on /dev/md2 (stuck at 39%). Also, the load average increased around the same time the logstash1024/moscovium issue was reported.

The load average spiked at about 10:45 according to Grafana and then flatlined at about 15. Consoles to the VMs appeared to open from the master node, but no output was shown at all. I think the issue may be similar to T348730: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye; we see these logs starting at the same time:

Nov  1 10:45:01 ganeti1025 CRON[1478566]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Nov  1 10:45:01 ganeti1025 CRON[1478567]: (root) CMD ([ -x /usr/sbin/ganeti-watcher ] && /usr/sbin/ganeti-watcher)
Nov  1 10:45:05 ganeti1025 kernel: [12360956.583238] block drbd4: We did not send a P_BARRIER for 42072ms > ko-count (7) * timeout (60 * 0.1s); drbd kernel thread blocked?
Nov  1 10:45:06 ganeti1025 kernel: [12360957.212642] drbd resource6: meta connection shut down by peer.
Nov  1 10:45:06 ganeti1025 kernel: [12360957.218792] drbd resource6: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
Nov  1 10:45:06 ganeti1025 kernel: [12360957.421439] drbd resource2: meta connection shut down by peer.

After this, the "kernel thread blocked?" messages repeat regularly, along with others saying "task md2_raid5:566 blocked for more than 120 seconds" (see full logs).
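For reference, the threshold in the P_BARRIER log line above comes straight from the numbers DRBD prints: ko-count (7) multiplied by the timeout (60 deciseconds, i.e. 6 s) gives 42 s, and the reported 42072 ms stall just crossed it. A small sketch of that arithmetic (values taken from the log line itself):

```shell
# Reproduce DRBD's stall check from the log line above.
# ko-count=7 and timeout=60 (in units of 0.1 s) are the values in the message.
ko_count=7
timeout_ds=60                                    # 60 * 0.1 s = 6 s
threshold_ms=$(( ko_count * timeout_ds * 100 ))  # 7 * 6000 ms = 42000 ms
stall_ms=42072                                   # stall duration from the log
echo "threshold=${threshold_ms}ms stall=${stall_ms}ms"
if [ "$stall_ms" -gt "$threshold_ms" ]; then
  echo "stall exceeds threshold: DRBD suspects the kernel thread is blocked"
fi
```

So the message fires as soon as no write barrier has been sent for longer than ko-count * timeout, which matches the ~42 s gap reported here.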

I didn't immediately try to drain the node, as I feared the issues would prevent the snapshot/move process from working, so instead I tried to shut down the instances. This did not work: the jobs were added but went into "error" state:

Result: 
  - OpExecError
  - - Could not shutdown instance 'logstash1024.eqiad.wmnet': Error 28: Operation timed out after 900568 milliseconds with 0 bytes received

Eventually I rebooted the node (using the cookbook). That didn't seem to shut down cleanly: after a few minutes a console on the host just showed a flashing cursor, so I issued a power cycle via the iDRAC. After that the node seemed to boot ok, and the VMs showed as being on it but in 'admin down' state (presumably due to the previous shutdown command). That allowed me to issue gnt-instance failover for the two VMs, after which I started them and they seem to be ok.
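The recovery steps described above can be sketched roughly as follows. This is a hedged illustration, not the exact commands run: it only assembles and prints the `gnt-instance failover` / `gnt-instance startup` invocations (both standard Ganeti commands, which would have to be run on the cluster master) rather than executing them.

```shell
# Sketch of the recovery sequence: fail each instance over to its DRBD
# secondary, then start it (it was left in 'admin down' state).
# We only print the commands here, since they need a Ganeti master node.
recover() {
  vm=$1
  echo "gnt-instance failover ${vm}"
  echo "gnt-instance startup ${vm}"
}
out=$(recover moscovium.eqiad.wmnet; recover logstash1024.eqiad.wmnet)
echo "$out"
```

Failover works here because each instance's DRBD secondary held an up-to-date copy of the disks on another node.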

Currently the high-level health stats for ganeti1025 seem ok, but it's best we keep it drained for now and try to find out exactly what happened here.

Event Timeline

cmooney triaged this task as Medium priority.

I'm pretty confident this is the same as T348730, and I think it would be okay to return ganeti1025 to service and close this task as a dup

cmooney claimed this task.

Ok, yes, from our discussion on IRC that seems fine. In terms of service the node is part of the cluster; only the primary instances that were on it were moved. So I'm not sure we need to do anything in particular to bring it back into service. I'll mention it to Moritz in case he wants to do a manual rebalance.

Thanks for handling these. It is in fact the DRBD freezes we've seen before. When the current Ganeti refreshes are completed (codfw is mostly done, eqiad just started) we'll have roughly half of Ganeti eqiad/codfw on Bookworm. Following that I'll start migrating the remaining Bullseye nodes to Bookworm, so hopefully we won't see these too often anymore.

A rebalance is currently not needed: with the ongoing refreshes, the clusters are implicitly rebalanced anyway as I move VMs off old nodes for decom.