Widespread cloud ceph and hypervisor issues possible with reconfiguration of Eqiad Row B
Closed, Declined (Public)

Description

There is a lot of concern and conjecture here. Things we know:
cloud-hosts1-eqiad is routed through row B for east-west traffic, which includes the entire cloud ceph cluster and the entire set of hypervisors.

A network interruption of possibly 10 seconds on the hypervisors would previously not have been terribly concerning, except for some very sensitive services on VMs. Now that we have remote ceph storage, though, there is a possibility of disk corruption and even ceph cluster collapse.

Things to do:

  • Test if we can put together a good "freeze" procedure for ceph and cloud in codfw (@dcaro)
  • Get a freeze script ready
  • Send communication to our user community once we have a better idea of what is going to happen
  • Make sure the cloud doesn't go pear-shaped 🍐
  • profit!!!

Event Timeline

Did a quick test on codfw, for different scenarios:

  • When pausing the cluster (no reads, no writes) -> VMs hang on syncing disk (cached writes work), but no major issues happen at libvirt level, at least for 30s
    • When unpausing everything goes back to normal (syncs go through)
  • When pausing the cluster, and taking down the monitors -> same behavior as before
    • When bringing up the monitors, waiting for quorum (<1s), and unpausing, everything goes back to normal (same as before)
  • When pausing the VMs, then the cluster, and taking down the monitors -> No issues on libvirt, at least for 1min
    • When bringing up the monitors, waiting for quorum (<1s), and unpausing the cluster and VMs, everything goes back to normal (though external connections to the VM time out, e.g. ssh)
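
The cluster side of the scenarios above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used in the test: it assumes that "pausing the cluster" corresponds to `ceph osd pause` / `ceph osd unpause`, and that quorum is checked via `ceph quorum_status`. The helper names (`pause_cluster`, `wait_for_quorum`, `unpause_cluster`) and the `expected_mons` parameter are hypothetical.

```python
import json
import subprocess
import time
from typing import Callable, List

# A runner takes an argv list and returns the command's stdout.
Runner = Callable[[List[str]], str]


def run_cli(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def pause_cluster(run: Runner = run_cli) -> None:
    # Assumption: "no reads, no writes" is achieved with `ceph osd pause`.
    run(["ceph", "osd", "pause"])


def wait_for_quorum(run: Runner, expected_mons: int,
                    timeout_s: float = 30.0, poll_s: float = 0.5) -> List[str]:
    """Poll until at least `expected_mons` monitors are in quorum."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            out = run(["ceph", "quorum_status", "--format", "json"])
            names = json.loads(out).get("quorum_names", [])
            if len(names) >= expected_mons:
                return names
        except (subprocess.CalledProcessError, ValueError):
            # Monitors unreachable or output not parseable yet; retry below.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("monitors did not reach quorum in time")
        time.sleep(poll_s)


def unpause_cluster(run: Runner = run_cli, expected_mons: int = 3) -> None:
    # Only unpause once the monitors are back in quorum.
    wait_for_quorum(run, expected_mons)
    run(["ceph", "osd", "unpause"])
```

Passing the runner in as a parameter keeps the ordering logic testable without a live cluster; bringing the monitors down and up is deliberately left out, since how that is done depends on the deployment.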

I have not yet tried leaving it paused for longer (long enough for timeouts to kick in), but will try and report back.

I'll prepare a script to pause/unpause all VMs
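
A minimal sketch of what such a pause/unpause script might look like, run per hypervisor. It assumes the VMs are managed directly through `virsh` (`virsh list`, `virsh suspend`, `virsh resume`); the real script may well go through the OpenStack API instead, and the function names here are hypothetical.

```python
import subprocess
from typing import Callable, List

# A runner takes an argv list and returns the command's stdout.
Runner = Callable[[List[str]], str]


def run_cli(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def running_domains(run: Runner = run_cli) -> List[str]:
    """Names of the domains currently running on this hypervisor."""
    out = run(["virsh", "list", "--name", "--state-running"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def pause_all(run: Runner = run_cli) -> List[str]:
    """Suspend every running domain and return the names that were suspended."""
    paused = []
    for name in running_domains(run):
        run(["virsh", "suspend", name])
        paused.append(name)
    return paused


def resume_all(paused: List[str], run: Runner = run_cli) -> None:
    """Resume only the domains this script suspended earlier."""
    for name in paused:
        run(["virsh", "resume", name])
```

Returning the list of suspended domains from `pause_all` and feeding it back into `resume_all` avoids resuming VMs that were already paused or stopped before the freeze.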

We decided to ride it out instead and won.