In the current configuration, restbase nodes have a separate /var partition. This presents a risk for the functioning of Cassandra, since the same partition holds both Cassandra's data and all of the locally collected logs. We need to separate the two so that a full partition cannot cause Cassandra to malfunction.
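A quick way to confirm whether a host is exposed is to check that the data and log directories resolve to the same mount point. This is a minimal sketch; the helper names `mount_of` and `same_filesystem` are made up for illustration, and it assumes mount points contain no spaces.

```shell
# Print the mount point of the filesystem that holds the given path.
mount_of() {
    df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Succeed (exit 0) when both paths live on the same mounted filesystem.
same_filesystem() {
    [ "$(mount_of "$1")" = "$(mount_of "$2")" ]
}
```

On an affected host, `same_filesystem /var/lib/cassandra /var/log && echo "shared partition"` would be expected to print the warning.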
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
install_server: cassandra to /srv for 2 ssd hosts | operations/puppet | production | +70 -1
Status | Assigned | Task
---|---|---
Resolved | mobrovac | T112648 enable restbase syslog/file logging
Resolved | fgiunchedi | T113714 Separate /var on restbase
Resolved | fgiunchedi | T121575 Expand SSD space in Cassandra cluster
Unknown Object (Task) | |
Unknown Object (Task) | |
Unknown Object (Task) | |
Resolved | fgiunchedi | T127333 install SSDs in restbase2001-restbase2006
Unknown Object (Task) | |
Resolved | RobH | T125842 normalize eqiad restbase cluster - replace restbase1001-1006
Unknown Object (Task) | |
Unknown Object (Task) | |
Resolved | RobH | T126626 3x additional SSD for restbase hp hardware
Unknown Object (Task) | |
Resolved | Cmjohnson | T128107 install restbase1010-restbase1015
Resolved | fgiunchedi | T127951 expand raid0 in restbase200[1-6]
Event Timeline
Change 242098 had a related patch set uploaded (by Filippo Giunchedi):
install_server: cassandra to /srv for 2 ssd hosts
We're going to piggyback on the multi-instance work for this too. The plan is to start with the restbase-test2* machines and convert them to multi-instance (2 instances per machine, since they have only 32 GB of RAM).
Change 242098 merged by Filippo Giunchedi:
install_server: cassandra to /srv for 2 ssd hosts
eqiad is done; codfw still has restbase200[356] to convert to multi-instance, which will resolve this too.
Supposedly, just moving Cassandra's data directory to a different path and using the cassandra.replace_address option should be enough to move from /var/lib/cassandra to /srv/cassandra-a.
Proposed steps:
- systemctl mask cassandra
- puppet agent --disable
- nodetool drain
- reboot in single user mode (root password required)
- systemctl stop nfs-common
- umount /var
- mount /dev/mapper/<lv> /mnt
- rsync -vaz /mnt/ --exclude lib/cassandra /var
- mv /mnt/lib/cassandra /mnt/cassandra-a
- rm -r /mnt/{backups,cache,lib,local,lock,log,lost+found,mail,opt,run,spool,tmp,userarchive}
- rsync -vaz /srv/ /mnt/
- rm -rf /srv
- install -d -o root -g root /srv
- lvrename <HOSTNAME>-var <HOSTNAME>-srv
- change fstab to reflect /var vs /srv change
- reboot
- ls -d /srv/deployment/ /srv/cassandra-a
- add instance -a to puppet
- puppet agent --enable
- puppet agent --test
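The fstab step is the only one in the list without a command. A minimal sketch of one way to do it, assuming the entry to change is identified by its device path and its /var mount point; the function name `fstab_var_to_srv` and its arguments are hypothetical, and the real device names depend on the host's volume group. It keeps a `.bak` copy next to the file.

```shell
# Rewrite the fstab entry that mounts OLD_DEV on /var so that it
# mounts NEW_DEV on /srv instead. All other lines are left untouched.
# Usage: fstab_var_to_srv FSTAB OLD_DEV NEW_DEV
fstab_var_to_srv() {
    fstab=$1
    old_dev=$2
    new_dev=$3

    # Keep a backup so the change can be reverted from single-user mode.
    cp -p "$fstab" "$fstab.bak"

    # Only the line whose first two fields match is modified; awk rejoins
    # that line's fields with single spaces, which fstab accepts.
    awk -v old="$old_dev" -v new="$new_dev" '
        $1 == old && $2 == "/var" { $1 = new; $2 = "/srv" }
        { print }
    ' "$fstab.bak" > "$fstab"
}
```

Running it against a copy of /etc/fstab first and diffing the result is a cheap check before touching the real file.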
I'm not sure about this part; did you read something that suggests using cassandra.replace_address is necessary?
Proposed steps:
- systemctl mask cassandra
- puppet agent --disable
- nodetool drain
- reboot in single user mode
- mount /dev/mapper/<lv> /mnt
I guess /dev/mapper/<lv> isn't mounted at /var in single-user mode (i.e. no unmount is required)?
- rsync -vaz /mnt/ --exclude /mnt/lib/cassandra /var
- mv /mnt/lib/cassandra /mnt/cassandra-a
- rm -r /mnt/{backups,cache,lib,local,lock,log,lost+found,mail,opt,run,spool,tmp,userarchive}
- rsync -vaz /srv/ /mnt/
- rm -rf /srv
- install -d -o root -g root /srv
- lvrename <HOSTNAME>-var <HOSTNAME>-srv
- change fstab to reflect /var vs /srv change
- reboot
- ls -d /srv/deployment/ /srv/cassandra-a
- add instance -a to puppet
- puppet agent --enable
- systemctl mask cassandra-a
- launch cassandra with replace-address
- puppet agent --test
I assumed it would be necessary, since we're moving from the machine's main IP address to its -a instance address without going through a decommission.
Proposed steps:
- systemctl mask cassandra
- puppet agent --disable
- nodetool drain
- reboot in single user mode
- mount /dev/mapper/<lv> /mnt
I guess /dev/mapper/<lv> isn't mounted at /var in single-user mode (i.e. no unmount is required)?
I'm not sure but good point, I'll amend the list
AFAIK, cassandra.replace_address is only for the case where you are bootstrapping a new node into the ring using the IP address of a previous, dead node. In this case, it should be enough to start the node up with a different IP from the one it previously had (and that did work for me in testing).
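For reference, a sketch of how the option is usually passed in the case where it does apply, i.e. a fresh node taking over the address of a dead one. The address is a placeholder from the documentation range, and appending to JVM_OPTS is an assumption about how the instance is launched; none of this is needed for the /var to /srv move discussed here.

```shell
# Placeholder: address of the dead node being replaced (not a real host).
DEAD_NODE_IP="192.0.2.10"

# Append the system property to the JVM options used to start Cassandra.
# Assumption: the startup script honours JVM_OPTS, as cassandra-env.sh does.
JVM_OPTS="${JVM_OPTS:-} -Dcassandra.replace_address=${DEAD_NODE_IP}"

echo "$JVM_OPTS"
```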
Sounds good, thanks for the clarification. I've updated the list of steps at https://phabricator.wikimedia.org/T113714#2296627; if that looks good, I think we can try it today.
Mentioned in SAL [2016-05-23T14:44:40Z] <godog> reboot restbase2005 in single user mode for T113714
@fgiunchedi completed the conversion of 2005 to 2005-a (in what looks like ~15 minutes). Everything looks perfect.
Good work @fgiunchedi!
Mentioned in SAL [2016-05-24T08:45:57Z] <godog> reboot restbase2006 for multi-instance conversion T113714
Mentioned in SAL [2016-05-24T09:18:19Z] <godog> reboot restbase2003 for multi-instance conversion T113714