
Changing the IPs of cloudcephmons should not require VM reboots
Open, Medium, Public

Description

As we discovered in T383583: VM nova records attached to incorrect cloudcephmon IPs and T385264: VM live migration failing for many/most VMs, changing the IPs of cloudcephmon hosts (for example, when replacing them with new hardware) causes a lot of trouble that can only be fixed by mass-rebooting VMs.

We should find a way to improve this situation, so that we can be more confident when the time comes to replace the current cloudcephmon hosts.

Event Timeline


What about using both service IPs and service FQDNs? That way the clients may not notice if we assign an IP to a different host.

They may notice anyway because of how the RBD protocol works, who knows.

Yes, I think service IPs/FQDNs are the solution to this. This discussion post states that a reboot is necessary for any actual change: https://www.reddit.com/r/openstack/comments/11ynmy0/best_process_for_replacing_ceph_monitors_in/


I would not be surprised if we can't transparently replace a mon, given that they use cookies/crypto sessions. It would probably require removing the mon from the cluster and re-adding it with the same name/IP. We can try, though it's not the "designed" way of adding/removing mons, so here be dragons (https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address).

It's interesting, though, that the VMs work well as long as they are not rebooted: while they are connected to the cluster, the mons list gets updated live. It's only when starting from the "stored state" on the hypervisor (running the old qemu command) that they fail to start.

We can for sure add an alert/warning detecting that situation though, so we don't forget it's happening.
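As a rough idea of what such a check could look like, here is a minimal sketch. It assumes the stale monitor addresses show up as <host> entries under the RBD disk <source> in the libvirt domain XML on the hypervisor, and that the list of current mon IPs is fed in from somewhere else. The function names, the sample XML and the 192.0.2.x addresses are all hypothetical, not taken from our setup.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML fragment: one RBD-backed disk and one file-backed disk.
SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <disk type='network' device='disk'>
      <source protocol='rbd' name='pool/image'>
        <host name='192.0.2.1' port='6789'/>
        <host name='192.0.2.2' port='6789'/>
      </source>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/tmp/config.iso'/>
    </disk>
  </devices>
</domain>
"""


def rbd_mon_hosts(domain_xml):
    """Return the set of monitor addresses referenced by RBD disks in the XML."""
    root = ET.fromstring(domain_xml)
    hosts = set()
    # Only look at disk sources that use the rbd protocol.
    for source in root.iterfind("./devices/disk/source[@protocol='rbd']"):
        for host in source.iterfind("host"):
            name = host.get("name")
            if name:
                hosts.add(name)
    return hosts


def stale_mon_hosts(domain_xml, current_mons):
    """Return monitor addresses in the domain XML that are not current mons."""
    return rbd_mon_hosts(domain_xml) - set(current_mons)


if __name__ == "__main__":
    # Assumed current mon IPs; in practice these would come from the cluster.
    stale = stale_mon_hosts(SAMPLE_XML, {"192.0.2.2", "192.0.2.3"})
    if stale:
        print("VM references stale mons:", sorted(stale))
```

An alert could fire whenever the stale set is non-empty for any VM, which would flag the VMs that are likely to fail on their next start even though they are running fine right now.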