eqiad: move non WMCS servers out of rack D5
Closed, Resolved (Public)

Assigned To

	Cmjohnson

Authored By

	ayounsi
	May 13 2022, 3:11 PM

Description

eqiad D5 was dedicated to WMCS some time ago but still contains a few prod servers whose refresh dates are far in the future. https://netbox.wikimedia.org/dcim/racks/39/

db1137 (@Marostegui)
druid1008 (@BTullis)
elastic1065 (@RKemper)
ganeti1020 (@MoritzMuehlenhoff)
restbase1026 (@hnowlan)
scandium (@hnowlan or @Dzahn)

Moving them out of this rack (to any row D rack) will have two main advantages:

Freeing up space and power for WMCS servers
Allowing asw2-d5-eqiad to be decommissioned (and freeing up 40G ports)

As the hosts stay in the same vlan, the move should be along the lines of:

Downtime/depool
Power down
Physically move the server
Run the move server script https://netbox.wikimedia.org/extras/scripts/interface_automation.MoveServer/
Run Homer
Power the server back up
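
The ordering of the steps above can be written down as a small checklist generator. This is only an illustrative sketch: `MOVE_STEPS`, `move_checklist` and `powered_off_steps` are hypothetical names invented here, and nothing in it calls Netbox, Homer or any cookbook. It prints the steps for one host and shows which of them fall between power down and power up.

```python
# Hypothetical sketch: the step names mirror the list in the task description,
# the function and constant names are made up for illustration.
MOVE_STEPS = (
    "Downtime/depool",
    "Power down",
    "Physically move the server",
    "Run the Netbox move server script",
    "Run Homer",
    "Power the server back up",
)


def move_checklist(host, destination):
    """Return the numbered checklist for moving one host within its row."""
    return [
        f"{number}. [{host} -> {destination}] {step}"
        for number, step in enumerate(MOVE_STEPS, start=1)
    ]


def powered_off_steps(steps=MOVE_STEPS):
    """Return the steps that happen while the server is powered off."""
    down = steps.index("Power down")
    up = steps.index("Power the server back up")
    return list(steps[down + 1:up])


if __name__ == "__main__":
    # Example host and destination only; substitute the real ones.
    for line in move_checklist("db1137", "row D rack"):
        print(line)
```

Keeping the steps in one ordered tuple makes it easy to see that the Netbox and Homer updates are expected to run while the host is still off.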

Details

	Subject	Repo	Branch	Lines +/-
	db1137: Disable notifications	operations/puppet	production	+1 -0


Related Objects

Status	Assigned	Task
Resolved	cmooney	T291627 Packet Drops on Eqiad ASW -> CR uplinks
Resolved	Jclark-ctr	T313463 eqiad: upgrade row C and D uplinks from 4x10G to 1x40G
Resolved	Cmjohnson	T313115 Move asw2-d5-eqiad to spares
Resolved	Cmjohnson	T308331 eqiad: move non WMCS servers out of rack D5

Event Timeline

ayounsi created this task. May 13 2022, 3:11 PM

Please let me know when you'd like to get the database depooled and powered off.

No problem, with a few days of advance warning to drain the node we can easily move ganeti1020 any time.

Maintenance_bot added a project: SRE. May 13 2022, 3:29 PM

ayounsi mentioned this in T308339: eqiad: move non WMCS servers out of rack C8. May 13 2022, 3:45 PM

ayounsi mentioned this in T304712: eqiad: Move links to new MPC7E linecard. May 13 2022, 4:36 PM

Regarding scandium: that just needs a heads-up to @ssastry when the move is planned to happen. Nothing much from my side here (as long as it comes back as before). Thanks!

Oh wait, does moving racks and running that cookbook mean IP addresses will change?

Edit: as Rhinos points out, you already said they stay in the same VLAN, so never mind. I was only asking because of the MySQL grants in this case.

As far as I know, the IPs won't change.

restbase1026 can be moved with a few minutes' notice without impact. The only requirement is that it stays in a row D rack, as stated.

Marostegui triaged this task as Medium priority. May 17 2022, 8:17 AM

RKemper updated the task description. May 18 2022, 4:45 AM

For elastic1065 we just need some advance notice (~24h) so we can depool & ban [at the elasticsearch cluster level] the host. I'd do that now but it's not clear from the ticket if these hosts will be moved soon or not, so I'll wait for further heads-up.

RKemper added a subscriber: bking. May 18 2022, 7:40 PM

Cmjohnson moved this task from Backlog to Lower Priority Items on the ops-eqiad board. May 31 2022, 4:41 PM

ayounsi mentioned this in T291627: Packet Drops on Eqiad ASW -> CR uplinks. Jul 4 2022, 8:47 AM

RhinosF1 subscribed. Jul 4 2022, 9:07 AM

@RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry Do any of your servers require 10G? I should be able to keep them all in row D. This would only be an in-row move, would not require new IPs, and would have limited downtime: 15-20 minutes per server. I would like to schedule all servers for next Tuesday at 1530 UTC. Please let me know if this works for you and whether you have a 10G requirement.

1530-1550 db1137 -> D3 U13
1550-1620 elastic1065 -> D3 U14
1620-1650 scandium -> D3 U8
1650-1710 ganeti1020 -> D6 U2
1710-1730 restbase1026 -> D3 U5
1730-1750 druid1008 -> D6 U7
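
As a quick sanity check on a schedule like the one above, the slots can be parsed to confirm that they run back to back and that no two hosts are assigned the same rack unit. This is a minimal sketch, assuming the `HHMM-HHMM host -> rack unit` line format used in this comment; `parse_slot`, `non_contiguous` and `duplicate_positions` are hypothetical helper names, not part of any Wikimedia tooling.

```python
from collections import Counter
from datetime import datetime

# Slots copied from the comment above (times are UTC).
SCHEDULE = """\
1530-1550 db1137 -> D3 U13
1550-1620 elastic1065 -> D3 U14
1620-1650 scandium -> D3 U8
1650-1710 ganeti1020 -> D6 U2
1710-1730 restbase1026 -> D3 U5
1730-1750 druid1008 -> D6 U7
"""


def parse_slot(line):
    """Parse one 'HHMM-HHMM host -> rack unit' line into a dict."""
    window, host, _arrow, rack, unit = line.split()
    start_text, end_text = window.split("-")
    start = datetime.strptime(start_text, "%H%M")
    end = datetime.strptime(end_text, "%H%M")
    return {
        "host": host,
        "rack": rack,
        "unit": unit,
        "start": start,
        "end": end,
        "minutes": int((end - start).total_seconds() // 60),
    }


def non_contiguous(slots):
    """Return (host, next_host) pairs whose slots leave a gap or overlap."""
    return [
        (a["host"], b["host"])
        for a, b in zip(slots, slots[1:])
        if a["end"] != b["start"]
    ]


def duplicate_positions(slots):
    """Return (rack, unit) positions assigned to more than one host."""
    counts = Counter((s["rack"], s["unit"]) for s in slots)
    return [position for position, count in counts.items() if count > 1]


slots = [parse_slot(line) for line in SCHEDULE.splitlines() if line.strip()]

if __name__ == "__main__":
    for slot in slots:
        print(slot["host"], slot["rack"], slot["unit"], slot["minutes"], "min")
    print("gaps/overlaps:", non_contiguous(slots))
    print("duplicate positions:", duplicate_positions(slots))
```

Both checks return empty lists for this schedule, and the per-slot lengths come out at 20 or 30 minutes.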

@ayounsi confirmed they're all 1G. I added the racks and U numbers to the time slots.

@Cmjohnson db1137 does not need 10G and can be moved Tuesday 15:30 UTC - I will get the host ready for you.

@Cmjohnson Newly procured Ganeti servers use 10G, but ganeti1020 still has 1G only. I'll get it ready by Tuesday.

Mentioned in SAL (#wikimedia-operations) [2022-07-07T07:07:34Z] <moritzm> drain ganeti1020 T308331

I am hoping @Dzahn can answer the question for me for scandium since I don't know what the 10G requirement means.

Cmjohnson claimed this task. Jul 7 2022, 7:55 PM

In T308331#8061619, @ssastry wrote:

I am hoping @Dzahn can answer the question for me for scandium since I don't know what the 10G requirement means.

scandium doesn't use/need 10G, it can also be moved.

ganeti1020 is now emptied of VMs and can be moved.

Change 813208 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1137: Disable notifications

https://gerrit.wikimedia.org/r/813208

gerritbot added a project: Patch-For-Review. Jul 12 2022, 10:12 AM

Mentioned in SAL (#wikimedia-operations) [2022-07-12T10:12:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1137 for onsite maintenance T308331', diff saved to https://phabricator.wikimedia.org/P31017 and previous config saved to /var/cache/conftool/dbconfig/20220712-101211-root.json

Change 813208 merged by Marostegui:

[operations/puppet@production] db1137: Disable notifications

https://gerrit.wikimedia.org/r/813208

@Cmjohnson db1137 is now off and ready to be moved anytime.

Maintenance_bot removed a project: Patch-For-Review. Jul 12 2022, 10:30 AM

Mentioned in SAL (#wikimedia-operations) [2022-07-12T12:01:54Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Rack move, T308331

Mentioned in SAL (#wikimedia-operations) [2022-07-12T12:02:09Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Rack move, T308331

@RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry I am beginning to move servers in a few minutes, please ping me in IRC if you have any questions.

Mentioned in SAL (#wikimedia-operations) [2022-07-12T14:46:50Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on druid1008.eqiad.wmnet with reason: T308331 btullis

Mentioned in SAL (#wikimedia-operations) [2022-07-12T14:47:04Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on druid1008.eqiad.wmnet with reason: T308331 btullis

Thanks @Cmjohnson - I've added 3 hours of downtime for druid1008 - but feel free to add more if appropriate.

@RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry All your servers are moved.

@MoritzMuehlenhoff I am not able to ssh into yours, and I am not sure if that is expected. Can you please verify?

@hnowlan @Eevans You were not included in the original ping. I would like to move your restbase server as soon as possible. Can I do this Wednesday 7/13 at 1530 UTC?

All servers are back up. @MoritzMuehlenhoff I had to make the private1 VLAN the native VLAN.

In T308331#8072918, @Cmjohnson wrote:

@RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry All your servers are moved.

@MoritzMuehlenhoff I am not able to ssh into yours, and I am not sure if that is expected. Can you please verify?

@hnowlan @Eevans You were not included in the original ping. I would like to move your restbase server as soon as possible. Can I do this Wednesday 7/13 at 1530 UTC?

Works for me!

Thank you Chris - just started db1137 again.

https://netbox.wikimedia.org/dcim/devices/2612/ and https://netbox.wikimedia.org/dcim/devices/2252/ still show up as being in rack D5 but cabled to different ToR switches, so I guess it's just a matter of updating Netbox with the new rack/U location.

ayounsi mentioned this in T313115: Move asw2-d5-eqiad to spares. Jul 15 2022, 8:19 AM

ayounsi added a parent task: T313115: Move asw2-d5-eqiad to spares.

ayounsi mentioned this in T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G. Jul 21 2022, 6:39 AM

ayounsi added a parent task: T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G.