
Fix restbase1017's physical rack
Closed, ResolvedPublic

Description

When working on T219404: rack/setup/install restbase10[19-27].eqiad.wmnet, @Eevans and I discovered that the current restbase eqiad hosts physically located in row C have their Cassandra rack set to B (!). This discrepancy has always been there, but somehow we've never run into big problems because of it. Since we're going to decommission most of the misplaced hosts anyway as part of the parent task, we'll live with this discrepancy a little while longer.

When the parent task is finished, the only misplaced host will be restbase1017, which we'll have to decommission in Cassandra, physically move to row B, reimage with new IPs, and let rejoin the cluster.
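For context, the rack a Cassandra node advertises is configured per host in `cassandra-rackdc.properties` (read at startup by the GossipingPropertyFileSnitch), which is why a physically moved host needs its configuration updated before rejoining. A minimal sketch with illustrative values (the exact datacenter/rack names used on these hosts are assumptions, not taken from this task):

```properties
# conf/cassandra-rackdc.properties -- read at startup by
# GossipingPropertyFileSnitch to place this node in the topology
dc=eqiad
# Must match the host's physical location; the mismatch behind this
# task was row-C hosts advertising rack "b"
rack=b
```

Changing the rack of a live node that holds data would violate replica placement, which is why the safe path is the one described above: decommission, move, reimage, and bootstrap as a fresh node.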

Event Timeline

Restricted Application added a project: Operations. · View Herald TranscriptMay 31 2019, 1:59 PM
Joe added a subscriber: Joe.May 31 2019, 2:02 PM

I think DC-Ops should sync with you all on a schedule for this move. According to @Eevans it would be desirable to schedule the physical move on a monday, so that the cassandra decommission can be started before the weekend.

> I think DC-Ops should sync with you all on a schedule for this move. According to @Eevans it would be desirable to schedule the physical move on a monday, so that the cassandra decommission can be started before the weekend.

FTR, I think this suggestion was based on how long decommissioning would take (that we could work on that over a weekend, and be ready by Monday). On our end, we're just hoping to minimize downtime of the host by making sure everyone else is ready to proceed before we do the decommission. Whatever you setup, I'm sure we can make work. Thanks!

jijiki triaged this task as Normal priority.Jun 18 2019, 9:24 AM
jijiki added a subscriber: Cmjohnson.

@Eevans Do you still want to move this server? Let's coordinate a day/time

> @Eevans Do you still want to move this server? Let's coordinate a day/time

Yes please!

If we're taking the approach of re-imaging after the move (which AFAIK is consensus), then we'll need a few days lead-time to decommission Cassandra, then we'll need someone to handle the re-imaging before handing it back off to us for bootstrap. If a Monday (say this one, or the next) worked for you (and whomever will do the re-image), then we could start the decommission on a Friday and have it done over the weekend.

@Eevans - can you reach out to Chris on IRC to schedule specific days for this? It's a short week because of July 4 and we have a data center conference next week, so I want to be sure you guys have something set aside on the calendar. Much appreciated. Thanks, Willy

Eevans added a comment.Jul 3 2019, 9:58 PM

From IRC, 2019-07-03T16:57:04-05:00:

4:51 PM <urandom> in this case, once it's moved, we need new IPs, DNS updated, and a Puppet changeset that reflects the new IPs, and then finally a reimage
4:51 PM <cmjohnson1> ok, I do that as well
4:52 PM <urandom> we'd been blocking on someone who could do that, and presumably coordinate the move with you, and then we were told to get with you
4:52 PM <urandom> oh!  cool, all of that?
4:52 PM <cmjohnson1> yep! Is it something that can be done anytime?
4:52 PM <urandom> sweeet
4:52 PM <urandom> we need to decommission first
4:53 PM <cmjohnson1> okay, Tuesday would be the first available day I can do it...will that work for you?
4:53 PM <urandom> it takes a day, two max
4:53 PM <urandom> that would work; we can do that
4:53 PM <cmjohnson1> I would like to do it 10/11am my time (eastern) 
4:54 PM <urandom> cmjohnson1: wfm, is that like...official?  shall I update the ticket?
4:54 PM <cmjohnson1> Yes, let's make that official

TL;DR CPT will have the machine decommissioned by Tuesday morning, and we'll carry out the move at that time.

Mentioned in SAL (#wikimedia-operations) [2019-07-07T17:25:55Z] <urandom> decommissioning restbase1017-a -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-07T20:13:19Z] <urandom> decommissioning restbase1017-b -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-08T14:43:28Z] <urandom> decommissioning restbase1017-c -- T222960

Eevans added a comment.Jul 8 2019, 6:18 PM

All 3 Cassandra instances are decommissioned; we are ready to begin.

Change 521519 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Moving production dns entries for restbase1017

https://gerrit.wikimedia.org/r/521519

Change 521519 merged by Cmjohnson:
[operations/dns@master] Moving production dns entries for restbase1017

https://gerrit.wikimedia.org/r/521519

restbase1017 has been moved to rack B5
network port updated
DNS updated

Change 521525 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] restbase: update rb1017 Cassandra instances for rack move

https://gerrit.wikimedia.org/r/521525

@Eevans We did a test run for an install and the server was able to reach the installer without an issue. I did see on IRC something about stretch. I will leave that up to you if you like and the server can be installed whenever you need it.

Dzahn added a subscriber: Dzahn.Jul 9 2019, 7:25 PM

restbase1017 is shown as down in Icinga and has no downtime or comment. It would be appreciated if you could schedule downtimes for planned maintenance. Thanks.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=2097162

> restbase1017 is shown as down in Icinga and has no downtime or comment. It would be appreciated if you could schedule downtimes for planned maintenance. Thanks.
> https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&style=hostdetail&hoststatustypes=4&hostprops=2097162

It was put under planned maintenance; we weren't expecting it to be down this long. I'll update Icinga.

> @Eevans We did a test run for an install and the server was able to reach the installer without an issue. I did see on IRC something about stretch. I will leave that up to you if you like and the server can be installed whenever you need it.

Stretch would be preferable.

> Stretch would be preferable.

I've merged a patch to our netboot.cfg so that the next reimage will install Stretch.

Any word on when we'll be imaging this machine?

Change 521525 merged by Dzahn:
[operations/puppet@production] restbase: update rb1017 Cassandra instances for rack move

https://gerrit.wikimedia.org/r/521525

Dzahn edited projects, added serviceops; removed Patch-For-Review.Jul 10 2019, 8:53 PM

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907102053_dzahn_114135_restbase1017_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['restbase1017.eqiad.wmnet']

Of which those FAILED:

['restbase1017.eqiad.wmnet']

I have removed the ops-eqiad tag; if you have an issue that requires DC-Ops, please add the ops-eqiad tag back to the task.

Dzahn added a comment.Jul 11 2019, 5:49 PM

> the server can be installed whenever you need it.

Yeah, actually this still needs an OS on it. It was in a broken state, and the reimage script failed as well.

@Dzahn, I don't know what that means. What does DC-Ops need to troubleshoot? Thanks

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

restbase1017.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201907112124_dzahn_140714_restbase1017_eqiad_wmnet.log.

Dzahn added a comment.EditedJul 11 2019, 11:31 PM

> @Dzahn, I don't know what that means. What does DC-Ops need to troubleshoot? Thanks

I meant that I tried using "wmf-auto-reimage-host" to install an OS. It failed on me yesterday, and I had run it because I saw this box was sitting in a busybox shell, as if a previous install had failed. Today I just repeated it, and though the script again failed to detect that Puppet had finished running, I was able to SSH to it this time. We are good for now.

@Eevans has a shell account again and the system is now on Stretch. Currently Puppet fails with an error executing /usr/bin/scap deploy-local, but I think that will be fixed by some manual steps that he can handle.

Dzahn assigned this task to Eevans.Jul 11 2019, 11:32 PM

This should be good to use now so you can take it back into service. Let us know if you need more merges.

Completed auto-reimage of hosts:

['restbase1017.eqiad.wmnet']

Of which those FAILED:

['restbase1017.eqiad.wmnet']

Mentioned in SAL (#wikimedia-operations) [2019-07-11T23:48:21Z] <eevans@deploy1001> Started deploy [cassandra/logstash-logback-encoder@d085ffa]: deploy logback to restbase1017 (T222960)

Mentioned in SAL (#wikimedia-operations) [2019-07-11T23:49:08Z] <eevans@deploy1001> Finished deploy [cassandra/logstash-logback-encoder@d085ffa]: deploy logback to restbase1017 (T222960) (duration: 00m 47s)

Change 522218 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/software/logstash-logback-encoder@master] Updated list of RESTBase hosts

https://gerrit.wikimedia.org/r/522218

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:01:25Z] <eevans@deploy1001> Started deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:01:51Z] <eevans@deploy1001> Finished deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960) (duration: 00m 25s)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:03:15Z] <eevans@deploy1001> Started deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:03:20Z] <eevans@deploy1001> Finished deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960) (duration: 00m 03s)

Mentioned in SAL (#wikimedia-operations) [2019-07-12T00:58:58Z] <urandom> bootstrapping restbase1017-a -- T222960

Current snafu: none of the data volumes are mounted (entries are missing from fstab). @fgiunchedi I seem to (vaguely) remember this being a thing; was the solution to add them manually?

> Current snafu: none of the data volumes are mounted (entries are missing from fstab). @fgiunchedi I seem to (vaguely) remember this being a thing; was the solution to add them manually?

I thought we fixed partman to add filesystems to fstab in T214166: Improve cassandra JBOD integration post-reimage, but obviously not; I've reopened that task. In the meantime, yes, the bandaid is to add the filesystems manually.
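Adding the filesystems manually means writing the missing entries into /etc/fstab and then mounting them. The device names, mount points, and filesystem type below are illustrative assumptions for a JBOD Cassandra host; the actual layout on restbase1017 is not recorded in this task:

```
# /etc/fstab -- hypothetical JBOD data-disk entries for a Cassandra host;
# the real devices/mountpoints on restbase1017 are assumptions here
/dev/sdc1  /srv/sdc1  ext4  defaults,noatime  0  2
/dev/sdd1  /srv/sdd1  ext4  defaults,noatime  0  2
```

After adding the entries, `mount -a` mounts everything listed in fstab that isn't already mounted, so the Cassandra instances see their data directories before bootstrap.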

Mentioned in SAL (#wikimedia-operations) [2019-07-12T16:32:08Z] <urandom> bootstrapping restbase1017-a -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-12T18:02:04Z] <urandom> bootstrapping restbase1017-b -- T222960

Mentioned in SAL (#wikimedia-operations) [2019-07-12T19:15:08Z] <urandom> bootstrapping restbase1017-c -- T222960

Eevans closed this task as Resolved.Jul 15 2019, 6:49 PM

All instances are bootstrapped, and cleanups in the corresponding rack are complete; closing.