
Fix restbase1017's physical rack
Closed, ResolvedPublic

Description

When working on T219404: rack/setup/install restbase10[19-27].eqiad.wmnet, @Eevans and I discovered that the current restbase eqiad hosts physically located in row C have their Cassandra rack set to B (!). This discrepancy has always been there, but somehow we've never run into big problems because of it. Since we're going to decommission most of the misplaced hosts anyway as part of the parent task, we'll live with this discrepancy a little while longer.

When the parent task is finished, the only misplaced host will be restbase1017, which we'll have to decommission in Cassandra, physically move to row B, reimage with new IPs, and let rejoin the cluster.
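For context, the rack a Cassandra node advertises is configured per host in `cassandra-rackdc.properties` (read at startup by the GossipingPropertyFileSnitch), which is why a physically moved host needs its configuration updated before rejoining. A minimal sketch with illustrative values (the exact datacenter/rack names used on these hosts are assumptions, not taken from this task):

```properties
# conf/cassandra-rackdc.properties -- read at startup by
# GossipingPropertyFileSnitch to place this node in the topology
dc=eqiad
# Must match the host's physical location; the mismatch behind this
# task was row-C hosts advertising rack "b"
rack=b
```

Changing the rack of a live node that holds data would violate replica placement, which is why the safe path is the one described above: decommission, move, reimage, and bootstrap as a fresh node.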

Event Timeline

Restricted Application added a project: Operations. · View Herald TranscriptMay 31 2019, 1:59 PM
Joe added a subscriber: Joe.May 31 2019, 2:02 PM

I think DC-Ops should sync with you all on a schedule for this move. According to @Eevans it would be desirable to schedule the physical move on a monday, so that the cassandra decommission can be started before the weekend.

> I think DC-Ops should sync with you all on a schedule for this move. According to @Eevans it would be desirable to schedule the physical move on a monday, so that the cassandra decommission can be started before the weekend.

FTR, I think this suggestion was based on how long decommissioning would take (that we could work on that over a weekend, and be ready by Monday). On our end, we're just hoping to minimize downtime of the host by making sure everyone else is ready to proceed before we do the decommission. Whatever you setup, I'm sure we can make work. Thanks!

jijiki triaged this task as Normal priority.Jun 18 2019, 9:24 AM
jijiki added a subscriber: Cmjohnson.

@Eevans Do you still want to move this server? Let's coordinate a day/time

> @Eevans Do you still want to move this server? Let's coordinate a day/time

Yes please!

If we're taking the approach of re-imaging after the move (which AFAIK is consensus), then we'll need a few days lead-time to decommission Cassandra, then we'll need someone to handle the re-imaging before handing it back off to us for bootstrap. If a Monday (say this one, or the next) worked for you (and whomever will do the re-image), then we could start the decommission on a Friday and have it done over the weekend.

@Eevans - can you reach out to Chris on IRC to schedule specific days for this? It's a short week because of July 4 and we have a data center conference next week, so I want to be sure you guys have something set aside on the calendar. Much appreciated. Thanks, Willy

Eevans added a comment.Jul 3 2019, 9:58 PM

From IRC, 2019-07-03T16:57:04-05:00:

4:51 PM <urandom> in this case, once it's moved, we need new IPs, DNS updated, and a Puppet changeset that reflects the new IPs, and then finally a reimage
4:51 PM <cmjohnson1> ok, I do that as well
4:52 PM <urandom> we'd been blocking on someone who could do that, and presumably coordinate the move with you, and then we were told to get with you
4:52 PM <urandom> oh!  cool, all of that?
4:52 PM <cmjohnson1> yep! Is it something that can be done anytime?
4:52 PM <urandom> sweeet
4:52 PM <urandom> we need to decommission first
4:53 PM <cmjohnson1> okay, Tuesday would be the first available day I can do it...will that work for you?
4:53 PM <urandom> it takes a day, two max
4:53 PM <urandom> that would work; we can do that
4:53 PM <cmjohnson1> I would like to do it 10/11am my time (eastern) 
4:54 PM <urandom> cmjohnson1: wfm, is that like...official?  shall I update the ticket?
4:54 PM <cmjohnson1> Yes, let's make that official

TL;DR CPT will have the machine decommissioned by Tuesday morning, and we'll carry out the move at that time.

Mentioned in SAL (#wikimedia-operations) [2019-07-07T17:25:55Z] <urandom> decommissioning restbase1017-a -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-07T20:13:19Z] <urandom> decommissioning restbase1017-b -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-08T14:43:28Z] <urandom> decommissioning restbase1017-c -- T222960

Eevans added a comment.Jul 8 2019, 6:18 PM

All 3 Cassandra instances are decommissioned; we are ready to begin.

Change 521519 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Moving production dns entries for restbase1017

https://gerrit.wikimedia.org/r/521519

Change 521519 merged by Cmjohnson:
[operations/dns@master] Moving production dns entries for restbase1017

https://gerrit.wikimedia.org/r/521519

restbase1017 has been moved to rack B5
network port updated
DNS updated

Change 521525 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] restbase: update rb1017 Cassandra instances for rack move

https://gerrit.wikimedia.org/r/521525

@Eevans We did a test run for an install and the server was able to reach the installer without an issue. I did see on IRC something about stretch. I will leave that up to you if you like and the server can be installed whenever you need it.

Dzahn added a subscriber: Dzahn.Jul 9 2019, 7:25 PM

restbase1017 is shown as down in Icinga and has no downtime or comment. It would be appreciated if you could schedule downtimes for planned maintenance. Thanks.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=2097162

> restbase1017 is shown as down in Icinga and has no downtime or comment. It would be appreciated if you could schedule downtimes for planned maintenance. Thanks.
> https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=2097162

It was put under planned maintenance; we weren't expecting it to be down this long. I'll update Icinga.

> @Eevans We did a test run for an install and the server was able to reach the installer without an issue. I did see on IRC something about stretch. I will leave that up to you if you like and the server can be installed whenever you need it.

Stretch would be preferable.

> Stretch would be preferable.

I've merged a patch to our netboot.cfg so that the next reimage will install Stretch.

Any word on when we'll be imaging this machine?

Change 521525 merged by Dzahn:
[operations/puppet@production] restbase: update rb1017 Cassandra instances for rack move

https://gerrit.wikimedia.org/r/521525

Dzahn edited projects, added serviceops; removed Patch-For-Review.Jul 10 2019, 8:53 PM

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907102053_dzahn_114135_restbase1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase1017.eqiad.wmnet']

Of which those FAILED:

['restbase1017.eqiad.wmnet']

I have removed the ops-eqiad tag; if you have an issue that requires DC-Ops, please add the ops-eqiad tag back to the task.

Dzahn added a comment.Jul 11 2019, 5:49 PM

> the server can be installed whenever you need it.

Yeah, actually this still needs an OS on it. It was in a broken state, and the reimage script failed as well.

@Dzahn, I don't know what that means. What does DC-Ops need to troubleshoot? Thanks

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907112124_dzahn_140714_restbase1017_eqiad_wmnet.log.

Dzahn added a comment.EditedJul 11 2019, 11:31 PM

> @Dzahn, I don't know what that means. What does DC-Ops need to troubleshoot? Thanks

I meant that I tried using "wmf-auto-reimage-host" to install an OS. It failed on me yesterday, and I had run it because I saw this box was sitting in a busybox shell, as if a previous install had failed. Today I just repeated it, and though the script again failed to detect that Puppet had finished running, I was able to SSH to it this time. We are good for now.

@Eevans has a shell account again and the system is now on Stretch. Currently Puppet fails with an error executing /usr/bin/scap deploy-local, but I think that will be fixed by some manual steps that he can handle.

Dzahn assigned this task to Eevans.Jul 11 2019, 11:32 PM

This should be good to use now so you can take it back into service. Let us know if you need more merges.

Completed auto-reimage of hosts:

['restbase1017.eqiad.wmnet']

Of which those FAILED:

['restbase1017.eqiad.wmnet']

Mentioned in SAL (#wikimedia-operations) [2019-07-11T23:48:21Z] <eevans@deploy1001> Started deploy [cassandra/logstash-logback-encoder@d085ffa]: deploy logback to restbase1017 (T222960)

Mentioned in SAL (#wikimedia-operations) [2019-07-11T23:49:08Z] <eevans@deploy1001> Finished deploy [cassandra/logstash-logback-encoder@d085ffa]: deploy logback to restbase1017 (T222960) (duration: 00m 47s)

Change 522218 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/software/logstash-logback-encoder@master] Updated list of RESTBase hosts

https://gerrit.wikimedia.org/r/522218

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:01:25Z] <eevans@deploy1001> Started deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:01:51Z] <eevans@deploy1001> Finished deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960) (duration: 00m 25s)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:03:15Z] <eevans@deploy1001> Started deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:03:20Z] <eevans@deploy1001> Finished deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960) (duration: 00m 03s)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:58:58Z] <urandom> bootstrapping restbase1017-a -- T222960

Current snafu: none of the data volumes are mounted (entries are missing from fstab). @fgiunchedi I seem to (vaguely) remember this being a thing; was the solution to add them manually?

> Current snafu: none of the data volumes are mounted (entries are missing from fstab). @fgiunchedi I seem to (vaguely) remember this being a thing; was the solution to add them manually?

I thought we fixed partman to add filesystems to fstab in T214166: Improve cassandra JBOD integration post-reimage, but obviously not; I've reopened that task. In the meantime, yes, the bandaid is to add the filesystems manually.
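Adding the filesystems manually means writing the missing entries into /etc/fstab and then mounting them. The device names, mount points, and filesystem type below are illustrative assumptions for a JBOD Cassandra host; the actual layout on restbase1017 is not recorded in this task:

```
# /etc/fstab -- hypothetical JBOD data-disk entries for a Cassandra host;
# the real devices/mountpoints on restbase1017 are assumptions here
/dev/sdc1  /srv/sdc1  ext4  defaults,noatime  0  2
/dev/sdd1  /srv/sdd1  ext4  defaults,noatime  0  2
```

After adding the entries, `mount -a` mounts everything listed in fstab that isn't already mounted, so the Cassandra instances see their data directories before bootstrap.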

Mentioned in SAL (#wikimedia-operations) [2019-07-12T16:32:08Z] <urandom> bootstrapping restbase1017-a -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-12T18:02:04Z] <urandom> bootstrapping restbase1017-b -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-12T19:15:08Z] <urandom> bootstrapping restbase1017-c -- T222960

Eevans closed this task as Resolved.Jul 15 2019, 6:49 PM

All instances are bootstrapped, and cleanups in the corresponding rack are complete; closing.