
Create a spicerack cookbook to empty a ganeti node from VMs
Closed, Resolved, Public

Description

After having successfully completed T203963, we could build on the work done for that task to create a cookbook that empties a node for maintenance:

  • Connects to the ganeti cluster master (figures it out via gnt-cluster getmaster if necessary)
  • Live migrates all running VMs on said node (gnt-node evacuate -p $node should be good)
  • Verifies the above has completed successfully
  • Fails over VMs for which the host is primary but are not running (gnt-node failover)
  • Moves secondary VMs (if requested) from the host (gnt-node evacuate -s)
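As a rough illustration, the steps above could be expressed as an ordered command plan. This is only a sketch, not the code of the merged cookbook: the function name `drain_plan` and its parameters are hypothetical, and it uses only the gnt-node invocations named in the list. The commands are built but not executed; they would be run on the cluster master (as reported by gnt-cluster getmaster), with a verification step after the first one.

```python
from typing import List


def drain_plan(node: str, move_secondaries: bool = False) -> List[List[str]]:
    """Return the gnt-node commands from the task description, in order.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    plan = [
        # Live migrate all running VMs for which the node is primary.
        ["gnt-node", "evacuate", "-p", node],
        # (Verify here that the evacuation completed successfully.)
        # Fail over primary VMs that are not running.
        ["gnt-node", "failover", node],
    ]
    if move_secondaries:
        # Optionally move the secondary VMs off the node as well.
        plan.append(["gnt-node", "evacuate", "-s", node])
    return plan
```

Returning the plan as data rather than running it directly makes it easy to log or dry-run the sequence before touching the cluster.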

A variation of the above that would also be helpful (probably as its own cookbook) would be:

  • Do the above to a machine
  • Reboot it
  • Wait for it to come back online
  • Run gnt-cluster verify-disks to force DRBD pair syncing with the rest of the cluster
  • Proceed to the next node

In other words, a rolling reboot.
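The rolling reboot loop could be sketched as follows. All names here are hypothetical: the four callables stand in for whatever the cookbook would use to drain a node, reboot it, wait for it to come back, and run gnt-cluster verify-disks. Nodes are handled strictly one at a time, so the next node is only touched after the previous one has been resynced.

```python
from typing import Callable, Iterable, List


def rolling_reboot(
    nodes: Iterable[str],
    drain: Callable[[str], None],
    reboot: Callable[[str], None],
    wait_online: Callable[[str], None],
    verify_disks: Callable[[], None],
) -> List[str]:
    """Process nodes sequentially and return the nodes that were completed.

    Any exception raised by a step aborts the loop, so a failure on one
    node prevents the remaining nodes from being rebooted.
    """
    completed = []
    for node in nodes:
        drain(node)          # empty the node as described above
        reboot(node)         # reboot it
        wait_online(node)    # wait for it to come back online
        verify_disks()       # e.g. gnt-cluster verify-disks on the master
        completed.append(node)
    return completed
```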

Event Timeline

Dzahn triaged this task as Medium priority. Oct 12 2018, 5:46 PM
Dzahn subscribed.
crusnov moved this task from Up next to Backlog on the SRE-tools board.

Change 924498 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] Add a cookbook to drain a Ganeti node

https://gerrit.wikimedia.org/r/924498

Change 924498 merged by Muehlenhoff:

[operations/cookbooks@master] Add a cookbook to drain a Ganeti node

https://gerrit.wikimedia.org/r/924498

Change 932167 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] sre.ganeti.drain-node: Add the option to reboot the drained node

https://gerrit.wikimedia.org/r/932167

Change 932167 merged by Muehlenhoff:

[operations/cookbooks@master] sre.ganeti.drain-node: Add the option to reboot the drained node

https://gerrit.wikimedia.org/r/932167

Change 932237 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] Fix migration when "plain" instances are involved

https://gerrit.wikimedia.org/r/932237

Change 932237 merged by Muehlenhoff:

[operations/cookbooks@master] Fix migration when "plain" instances are involved

https://gerrit.wikimedia.org/r/932237

Change 933482 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] Don't reboot Ganeti master nodes

https://gerrit.wikimedia.org/r/933482

Change 933482 merged by Muehlenhoff:

[operations/cookbooks@master] Don't reboot Ganeti master nodes

https://gerrit.wikimedia.org/r/933482

Change 933852 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] sre.ganeti.drain-node: Pass -f to evacuate command

https://gerrit.wikimedia.org/r/933852

Change 933852 merged by Muehlenhoff:

[operations/cookbooks@master] sre.ganeti.drain-node: Pass -f to evacuate command

https://gerrit.wikimedia.org/r/933852

Change 934248 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] sre.ganeti.drain-vm: Sync DRBD after reboot

https://gerrit.wikimedia.org/r/934248

Change 934248 merged by Muehlenhoff:

[operations/cookbooks@master] sre.ganeti.drain-vm: Sync DRBD after reboot

https://gerrit.wikimedia.org/r/934248

This has been implemented with the new sre.ganeti.drain-node cookbook, which I've used for the latest round of reboots.

By default only primary instances are moved away. This can be used for reboots and similar short-term maintenance. If a host is going away for a longer time (or if all data will be lost in a reimage), the --full option also moves the secondary instances to other nodes.

By default all Ganeti nodes use replicated DRBD storage, but for latency-sensitive services (currently only needed by etcd) the overhead of DRBD may cause visible latency issues. These hosts use local disk storage instead (called "plain").

If only primary instances are drained, such instances are ignored (since they are inherently non-redundant). If a node is fully drained, such instances need to be temporarily switched to DRBD using the sre.ganeti.changedisk cookbook first.
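The handling of "plain" instances described above amounts to a small decision rule. The sketch below is illustrative only: the function name and the returned labels ("move", "skip", "convert-first") are made up for this example and are not taken from the cookbook.

```python
def plain_instance_action(disk_template: str, full_drain: bool) -> str:
    """Decide what to do with an instance on the node being drained.

    Hypothetical labels:
      "move"          - DRBD-backed instance, can be moved away
      "skip"          - plain instance, ignored in a primary-only drain
      "convert-first" - plain instance on a fully drained node; it must be
                        switched to DRBD (sre.ganeti.changedisk) beforehand
    """
    if disk_template != "plain":
        return "move"
    return "convert-first" if full_drain else "skip"
```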

With the --reboot option the cookbook also calls the sre.hosts.reboot-single cookbook to directly initiate a reboot. I've also added a sanity check which prevents reboots of the current master node.
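A minimal sketch of such a sanity check, assuming the name of the current master has already been obtained (for example from gnt-cluster getmaster). The function name and the error message are hypothetical and not copied from the cookbook.

```python
def ensure_not_master(node: str, master: str) -> None:
    """Refuse to continue if the node to reboot is the current master."""
    if node == master:
        raise RuntimeError(
            f"{node} is the current Ganeti master; refusing to reboot it"
        )
```

Calling this before the reboot is triggered means a drained master node is left running rather than rebooted.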