Install etcd in multiple rows/racks
Closed, Resolved · Public

Description

Etcd is currently installed on VMs in eqiad. We want to move it to, or extend it with, 3 servers in different rack rows.

Those are going to be the 3 eqiad zookeeper hosts.

What we need is:

  • Move the ZK servers out of the analytics VLAN and reinstall them with Jessie. This will likely require us to do one server at a time.
  • Add one server at a time to the etcd cluster
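
A minimal sketch of why doing one server at a time matters, assuming the ZooKeeper ensemble has exactly the three hosts named in this task and needs a strict majority of its members online to keep serving. The function names are illustrative only and not part of any tooling mentioned here.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def stays_available(members: int, offline: int) -> bool:
    """True if the members still online can form a majority."""
    return members - offline >= majority(members)


ENSEMBLE_SIZE = 3  # assumption: the three eqiad ZooKeeper hosts

for offline in range(ENSEMBLE_SIZE + 1):
    state = "available" if stays_available(ENSEMBLE_SIZE, offline) else "unavailable"
    print(f"{offline} host(s) offline: ensemble {state}")
```

Under these assumptions, reinstalling a single host leaves two of three members online, which is still a majority; taking two down at once would not be.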

Event Timeline

Joe created this task. Jun 8 2015, 4:04 PM
Joe raised the priority of this task to High.
Joe updated the task description.
Joe added a subscriber: Joe.
Restricted Application added a subscriber: Aklapper. Jun 8 2015, 4:04 PM

@Joe, can I work on this next week when I am no longer in the office, or do you need to get it done sooner?

I made a little migration plan here:
http://etherpad.wikimedia.org/p/Jessie-Zookeeper

I think I can handle the new IP assignment as part of the reinstall. We will need coordination for network switch ACL and port VLAN changes.

TBD: Should we remove the $ANALYTICS_NETWORKS ferm restriction on zookeeper ports?
https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/zookeeper.pp#L81

@Cmjohnson FYI, we will be renaming analytics1023, analytics1024, analytics1025 to conf1001, conf1002, conf1003 respectively.

Change 218634 had a related patch set uploaded (by Ottomata):
Prep for reinstall and rename of analytics zookeeper hosts

https://gerrit.wikimedia.org/r/218634

Change 218639 had a related patch set uploaded (by Ottomata):
New IPs for conf100[123]

https://gerrit.wikimedia.org/r/218639

Change 218639 merged by Alexandros Kosiaris:
New IPs for conf100[123]

https://gerrit.wikimedia.org/r/218639

Ok, got a partman recipe working (I think). I had to merge it in order to test with d-i-test. @Joe, could you post-review it?

https://github.com/wikimedia/operations-puppet/blob/production/modules/install-server/files/autoinstall/partman/raid1-lvm-conf.cfg

I *think* it is working, although I do get an error on d-i-test near the end of the install process:

Unable to install GRUB in /dev/sda.  Executing 'grub-install /dev/sda' failed.  This is a fatal error.

Not sure if that is just an artifact of the fact that d-i-test is virtual and has /dev/vda. Or...do I need a /boot partition?

Restricted Application added a subscriber: Matanya. Jun 16 2015, 9:02 PM

Change 218634 merged by Ottomata:
Prep for reinstall and rename of analytics zookeeper hosts

https://gerrit.wikimedia.org/r/218634

I am starting the ZK node reinstall now, coordinating network changes with Alex.

Change 218991 had a related patch set uploaded (by Ottomata):
analytics1023 -> conf1001

https://gerrit.wikimedia.org/r/218991

Change 218991 merged by Ottomata:
analytics1023 -> conf1001

https://gerrit.wikimedia.org/r/218991

Status report!

conf1001 has been installed and is running zookeeper. I will do conf1002 and conf1003 tomorrow.

DONE! conf1001-conf1003 are now Zookeeper hosts running Jessie. I hand them to you!

I have a few more post-migration tasks to do (documentation, Jessie jmxtrans package, etc.), but nothing that affects production operation.

Change 219379 had a related patch set uploaded (by Ottomata):
Remove site.pp references to analytics102[345]. They have been renamed conf100[123]

https://gerrit.wikimedia.org/r/219379

Change 219379 merged by Ottomata:
Remove site.pp references to analytics102[345]. They have been renamed conf100[123]

https://gerrit.wikimedia.org/r/219379

Change 219380 had a related patch set uploaded (by Ottomata):
Rename analytics102[345] mgmt entries to conf100[123]

https://gerrit.wikimedia.org/r/219380

Change 219380 merged by Ottomata:
Rename analytics102[345] mgmt entries to conf100[123]

https://gerrit.wikimedia.org/r/219380

jmxtrans package updated and installed.

RobH set Security to None.
Joe added a comment. Jun 26 2015, 2:31 PM

I added conf1001 to the cluster this morning. I will add the remainder in the next couple of hours.

Joe added a comment. Jun 29 2015, 11:46 AM

I added conf1002 and conf1003 too, and removed etcd1003, so the Ganeti-based hosts can no longer form a quorum by themselves.
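
The quorum arithmetic behind this can be sketched as follows. The sketch assumes there were three Ganeti-based etcd VMs before the change (the thread only names etcd1003) and that a quorum is a strict majority of cluster members; the helper names are hypothetical.

```python
def quorum(members: int) -> int:
    """Strict majority of the cluster membership."""
    return members // 2 + 1


def group_has_quorum(group: int, total: int) -> bool:
    """True if `group` members alone can reach quorum in a cluster of `total`."""
    return group >= quorum(total)


PHYSICAL = 3       # conf1001, conf1002, conf1003
GANETI_BEFORE = 3  # assumption: three VMs before etcd1003 was removed
GANETI_AFTER = GANETI_BEFORE - 1

layouts = [
    ("all six members", GANETI_BEFORE),
    ("after removing etcd1003", GANETI_AFTER),
]
for label, ganeti in layouts:
    total = ganeti + PHYSICAL
    print(
        f"{label}: quorum={quorum(total)}, "
        f"ganeti alone={group_has_quorum(ganeti, total)}, "
        f"physical alone={group_has_quorum(PHYSICAL, total)}"
    )
```

Under these assumptions, the five-member layout needs three votes: the three conf hosts can reach that on their own, while the two remaining VMs cannot.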

I may remove them altogether at a later point.

Joe closed this task as Resolved. Jun 29 2015, 11:46 AM
Joe claimed this task.
Joe moved this task from In progress to Done on the discovery-system board.