There is a lot of concern and conjecture here. Things we know:
The cloud-hosts1-eqiad is routed through row B for east-west traffic, and that will include the entire cloud ceph cluster and the entire set of hypervisors.
A possibly 10 second network issue for the hypervisors would not have been terribly concerning except for some very sensitive services on VMs, but since we now have remote ceph storage, there is a possibility of corruption of disks and even ceph cluster collapse.
Things to do:
- Test if we can put together a good "freeze" procedure for ceph and cloud in codfw (@dcaro)
- Get a freeze script ready
- Send communication to our user community once we have a better idea of what is going to happen
- Make sure the cloud doesn't go pear-shaped 🍐
- profit!!!