We have several Cassandra clusters in production, and once in a while we need to do a rolling restart of all their JVMs for security upgrades. Ideally this would be done by a cookbook rather than manually.
What I usually do for the AQS cluster (6 nodes, two Cassandra instances each) is:
- select one host
- check nodetool-a status and nodetool-b status: each should return a list of 12 IPs, all in UN state (no errors, e.g. an instance bootstrapping or down)
- nodetool-a drain + systemctl restart cassandra-a, then nodetool-b drain + systemctl restart cassandra-b
- wait until nodetool-a status and nodetool-b status return 12 IPs with UN state
- proceed with the next host
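The steps above could be sketched roughly as follows. This is only an illustration, not an existing cookbook: ssh_run, all_up and rolling_restart are hypothetical names, the ssh transport is an assumption (as is the user having enough privileges for systemctl), and the timeout and polling interval are placeholder defaults.

```python
import subprocess
import time

INSTANCES = ("a", "b")  # two Cassandra instances per host, as described above
EXPECTED_UN = 12        # 6 hosts x 2 instances


def ssh_run(host, command):
    """Hypothetical helper: run a command on a host over ssh, return stdout."""
    result = subprocess.run(
        ["ssh", host, command], check=True, capture_output=True, text=True
    )
    return result.stdout


def count_un(status_output):
    """Count the lines of nodetool status output that start with UN."""
    return sum(1 for line in status_output.splitlines() if line.startswith("UN"))


def all_up(host, run):
    """True if every instance on the host reports EXPECTED_UN nodes as UN."""
    try:
        return all(
            count_un(run(host, f"nodetool-{i} status")) == EXPECTED_UN
            for i in INSTANCES
        )
    except subprocess.CalledProcessError:
        # nodetool fails while the instance is still starting up
        return False


def rolling_restart(hosts, run=ssh_run, timeout=600, interval=10, sleep=time.sleep):
    """Restart both instances on each host, one host at a time."""
    for host in hosts:
        if not all_up(host, run):
            raise RuntimeError(f"cluster not healthy before restarting {host}")
        for i in INSTANCES:
            run(host, f"nodetool-{i} drain")
            run(host, f"systemctl restart cassandra-{i}")
        waited = 0
        while not all_up(host, run):
            if waited >= timeout:
                raise RuntimeError(f"{host} not back to UN within {timeout}s")
            sleep(interval)
            waited += interval
```

The run and sleep arguments are injectable so the loop can be exercised against canned nodetool output without touching a real cluster.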
A couple of notes:
- nodetool drain is probably not needed, but it seems like a good step to add anyway.
- Step 4 (the wait) could in theory be simplified to something like "wait 5 minutes, run nodetool-a status | egrep '^UN' | wc -l and check that it is 12, fail otherwise". The sleep time of course depends on the cluster's data, so it should be configurable (with a sane default).
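A minimal sketch of that simplified check, assuming the caller supplies the nodetool status output through a callable. The names check_after_sleep and get_status are made up for illustration, and the 300 second default simply mirrors the "5 minutes" above.

```python
import time


def count_un(status_output):
    """Count the lines of nodetool status output that start with UN."""
    return sum(1 for line in status_output.splitlines() if line.startswith("UN"))


def check_after_sleep(get_status, expected=12, sleep_seconds=300, sleep=time.sleep):
    """Sleep for a configurable time, then check the UN count exactly once."""
    sleep(sleep_seconds)
    found = count_un(get_status())
    if found != expected:
        raise RuntimeError(f"expected {expected} UN nodes, found {found}")
    return found
```

Compared with polling, this is simpler but always waits the full sleep time, even when the instance comes back quickly.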
Suggestions are welcome!