
rack/setup/install restbase-dev100[456]
Closed, ResolvedPublic

Description

This task will track the receiving, racking, and setup of three new restbase-dev hosts ordered on T161534.

Please note that these will reuse the SSDs from restbase-dev100[123] and can simply take their place in the racks, though they can go in any other racks if needed. Please keep all three hosts in different racks and rows from one another. Their placement relative to restbase-dev100[123] is immaterial, since the older systems will go offline when these new hosts arrive so that their SSDs can be moved over. Please note the SSD item in the notes in racktables for these hosts.

restbase-dev1004:

  • - receive in system on procurement task T161534 (They'll come with empty drive sleds).
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - confirm offlining of restbase-dev1001 with Services, then move its SSDs into this system.
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - production dns entries added (internal subnet)
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update https://gerrit.wikimedia.org/r/#/c/366572/
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

restbase-dev1005:

  • - receive in system on procurement task T161534 (They'll come with empty drive sleds).
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - confirm offlining of restbase-dev1002 with Services, then move its SSDs into this system.
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - production dns entries added (internal subnet)
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update https://gerrit.wikimedia.org/r/#/c/366572/
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

restbase-dev1006:

  • - receive in system on procurement task T161534 (They'll come with empty drive sleds).
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - confirm offlining of restbase-dev1003 with Services, then move its SSDs into this system.
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - production dns entries added (internal subnet)
  • - network port setup (description, enable, internal vlan)
  • - operations/puppet update https://gerrit.wikimedia.org/r/#/c/366572/
  • - OS installation
  • - puppet/salt accept/initial run
  • - handoff for service implementation

Event Timeline

We've discussed this in the ops/services sync-up today. Since the SSDs will be moved as-is from the old hardware, the simplest plan I proposed is to reimage the machines but change the partman recipe to keep the RAID0, so the Cassandra data is not wiped.

To summarize some discussion of this that took place in the ops-services-syncup meeting today:

  • Services has a couple weeks' worth of data sampled from production that we'd ideally be able to preserve
  • One option discussed would be to transplant the drives to the new machines without any reimaging, and then reassign hostnames and IPs accordingly
  • Another option discussed was to transplant the disks to the new machines and reinstall the OS while preserving the RAID configuration (the data we are interested in lives in a RAID0 mounted as /srv)
  • If the machines were done one by one, an existing machine could have its instances decommissioned before moving the disks. The host receiving the new disks, once complete, would then bootstrap its instances before progressing to the next host.
  • Finally, if there were somewhere on the network with sufficient disk space, the existing data files could be rsynced prior to cannibalizing the disks (at the time of this writing, we'd need about 600G); see the sizing sketch after this list.
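
For a rough sense of the volume involved, here is a minimal sketch of how that estimate might be scripted, assuming the hostnames below and that the data lives under /srv as described above. This is illustrative only, not actual WMF tooling:

```python
#!/usr/bin/env python3
"""Illustrative sketch: use `rsync --dry-run --stats` to measure how much data
would need to be staged before the disks are pulled. Hostnames and the /srv
path are assumptions based on the discussion above."""
import re
import subprocess

# Assumed FQDNs for the hosts whose disks would be cannibalized first.
HOSTS = ["restbase-dev1001.eqiad.wmnet", "restbase-dev1002.eqiad.wmnet"]

total_bytes = 0
for host in HOSTS:
    # --dry-run walks the remote tree without copying anything;
    # --stats reports "Total file size" for everything rsync would send.
    result = subprocess.run(
        ["rsync", "-a", "--dry-run", "--stats", f"{host}:/srv/", "/tmp/rsync-sizing/"],
        capture_output=True, text=True, check=True,
    )
    match = re.search(r"Total file size: ([\d,]+)", result.stdout)
    if match:
        total_bytes += int(match.group(1).replace(",", ""))

print(f"~{total_bytes / 1e9:.0f} GB would need staging")
```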

> We've discussed this in the ops/services sync-up today. Since the SSDs will be moved as-is from the old hardware, the simplest plan I proposed is to reimage the machines but change the partman recipe to keep the RAID0, so the Cassandra data is not wiped.

If we can do it this way, this would be great.

fgiunchedi renamed this task from rack/setup/install resetbase-dev100[456] to rack/setup/install restbase-dev100[456]. May 25 2017, 4:16 PM
> Finally, if there were somewhere on the network with sufficient disk space, the existing data files could be rsynced prior to cannibalizing the disks (at the time of this writing, we'd need about 600G).

I didn't realize the data volume would be this manageable, so yes, the rsync route is certainly more reliable, though slower, since it is all 1G interfaces AFAIK. I believe e.g. lithium would have enough space for this.

@Eevans, is there anything left to do here?

GWicke moved this task from Backlog to watching on the Services board.
GWicke edited projects, added Services (watching); removed Services.

Change 364792 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding production dns for restbase-dev100[4-6] T166181

https://gerrit.wikimedia.org/r/364792

Change 364792 merged by Cmjohnson:
[operations/dns@master] Adding production dns for restbase-dev100[4-6] T166181

https://gerrit.wikimedia.org/r/364792

RobH added a subscriber: Cmjohnson.

I've asked in #wikimedia-services, but IRC is not a permanent medium, so I'll also ask here. @Eevans already responded to this thread about the SSD move, so I've assigned this to him for feedback. If this isn't correct, I apologize; I'm just trying to get this figured out! Please reassign to whoever can confirm the downtime of restbase-dev100[123] for 2017-07-20 or 2017-07-21.

@Cmjohnson has all three of these new systems racked and remotely accessible. However, none can be installed until restbase-dev100[123] are taken offline and their SSDs migrated into restbase-dev100[456].

Ideally, this happens all at the same time. Since this has a slight time-frame constraint (I imagine Services doesn't want to go for more than a day or two without their restbase-dev cluster), Chris advises he can schedule this for 2017-07-20 or 2017-07-21.

Please advise, and assign back to @Cmjohnson with feedback.

Thanks in advance!

> I've asked in #wikimedia-services, but IRC is not a permanent medium, so I'll also ask here. @Eevans already responded to this thread about the SSD move, so I've assigned this to him for feedback. If this isn't correct, I apologize; I'm just trying to get this figured out! Please reassign to whoever can confirm the downtime of restbase-dev100[123] for 2017-07-20 or 2017-07-21.

That time frame is OK with me.

As mentioned earlier, we would really like to preserve the data if at all possible. As @RobH suggests, it would be enough to sync the data from two of the hosts to one via rsync::quickdatacopy (example of that). We could then move the disks from those two hosts to set up two of the new machines, and rsync the data to them from the third before reclaiming its disks.

Or an external USB disk would work too (it's ~750G total).

The rsync should technically be easier, so I'd like to try it out first. Can you detail what data has to be backed up from the hosts, and where is best to shove it on the third host?

For now, I'm assuming we copy all the data from restbase-dev100[12] to restbase-dev1003. Then we can migrate the SSDs from restbase-dev100[12] into restbase-dev100[45] and reimage. Once the reimage is done, the data can be copied from restbase-dev1003 (with its backup data of 100[12]) to the new hosts, and 1003 can be taken down for SSD migration and reimage.

> The rsync should technically be easier, so I'd like to try it out first. Can you detail what data has to be backed up from the hosts...

We need the full contents of /srv/cassandra-{a,b}/.

> ..., and where is best to shove it on the third host?

On restbase-dev1003, this will probably need to be copied somewhere under /srv/. Obviously, we'll need it separated by machine accordingly, so maybe something like /srv/backups/restbase-dev100[1-2]/cassandra-{a,b}?
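
A minimal sketch of what that pull-based copy might look like if driven from restbase-dev1003, using the directory layout suggested above (hostnames, FQDNs, and rsync flags are assumptions, not an agreed procedure):

```python
#!/usr/bin/env python3
"""Illustrative sketch: run on restbase-dev1003 to pull the per-instance
Cassandra data directories from restbase-dev100[12] into
/srv/backups/<host>/<instance>/. Assumes root SSH access to the source hosts
and enough free space under /srv."""
import pathlib
import subprocess

SOURCE_HOSTS = ["restbase-dev1001", "restbase-dev1002"]  # hosts losing their SSDs first
INSTANCE_DIRS = ["cassandra-a", "cassandra-b"]           # per-instance dirs under /srv

for host in SOURCE_HOSTS:
    for inst in INSTANCE_DIRS:
        dest = pathlib.Path("/srv/backups") / host / inst
        dest.mkdir(parents=True, exist_ok=True)
        # -a preserves ownership, permissions, and mtimes so the files can be
        # restored as-is; the trailing slash copies the directory *contents*.
        subprocess.run(
            ["rsync", "-a", f"{host}.eqiad.wmnet:/srv/{inst}/", f"{dest}/"],
            check=True,
        )
```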

> For now, I'm assuming we copy all the data from restbase-dev100[12] to restbase-dev1003. Then we can migrate the SSDs from restbase-dev100[12] into restbase-dev100[45] and reimage. Once the reimage is done, the data can be copied from restbase-dev1003 (with its backup data of 100[12]) to the new hosts, and 1003 can be taken down for SSD migration and reimage.

Yup, this sounds workable; thanks @RobH!

@RobH, @Cmjohnson are we still on track for offlining restbase-dev100[1-3] and rsyncing the data/moving the disks to restbase-dev100[4-6] tomorrow and Friday?

> @RobH, @Cmjohnson are we still on track for offlining restbase-dev100[1-3] and rsyncing the data/moving the disks to restbase-dev100[4-6] tomorrow and Friday?

I'm still good with this; I've left @Cmjohnson a PM via IRC to confirm.

@Eevans We are definitely good for tomorrow. Do we need to do both days?

@Cmjohnson we didn't set up the rsync in advance, so we'll have to use a USB HDD/SSD to copy over some data before the migration.

We want to leave 1003 online until after 1004 and 1005 are fully back online with all of their data restored; then we can take down 1003. That is why it will likely take two days.

@RobH @Eevans let's move the 2nd day to Monday or Tuesday. Migration from a disk will be slow, and we want to make sure everything is working correctly before moving on to the next host.

To summarize an IRC discussion: if preserving the data will complicate matters such that we won't have the cluster back online until sometime next week, then let's not do that; please go ahead without preserving the data.

Mentioned in SAL (#wikimedia-operations) [2017-07-20T13:29:42Z] <cmjohnson1> downtimed restbase-dev100[1-3] to power off and move ssds to newly racked restbase-dev100[4-6] phab task: T166181

Change 366572 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] restbase-dev100[456] replacing restbase-dev100[123]

https://gerrit.wikimedia.org/r/366572

Change 366572 merged by RobH:
[operations/puppet@production] restbase-dev100[456] replacing restbase-dev100[123]

https://gerrit.wikimedia.org/r/366572

Change 366596 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting restbase-dev to role:spare

https://gerrit.wikimedia.org/r/366596

Change 366596 merged by RobH:
[operations/puppet@production] setting restbase-dev to role:spare

https://gerrit.wikimedia.org/r/366596

RobH updated the task description. (Show Details)

Assigned to @Eevans for followup. These are ready to be used by services, and this task can be resolved once acknowledged.

Thanks!

Change 366604 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] adding additional IPs for cassandra instances on restbase-dev100[456]

https://gerrit.wikimedia.org/r/366604

Change 366604 merged by RobH:
[operations/dns@master] adding additional IPs for cassandra instances on restbase-dev100[456]

https://gerrit.wikimedia.org/r/366604

We're good on the Services side of things; Thanks for the help!