eqiad: move non WMCS servers out of rack D5
Closed, Resolved (Public)

Description

eqiad rack D5 was dedicated to WMCS some time ago but still contains a few prod servers whose refresh dates are far in the future. https://netbox.wikimedia.org/dcim/racks/39/

db1137 (@Marostegui)
druid1008 (@BTullis)
elastic1065 (@RKemper)
ganeti1020 (@MoritzMuehlenhoff)
restbase1026 (@hnowlan)
scandium (@hnowlan or @Dzahn)

Moving them out of this rack (to any row D rack) will have two main advantages:

  • Freeing up space and power for WMCS servers
  • Allowing asw2-d5-eqiad to be decommissioned (and freeing up 40G ports)

As the hosts stay in the same vlan, the move should be along the lines of:

  1. Downtime/depool
  2. Power down
  3. Physically move the server
  4. Run the move server script https://netbox.wikimedia.org/extras/scripts/interface_automation.MoveServer/
  5. Run Homer
  6. Power the server back up
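
As a rough illustration, the six steps above can be written down as an ordered checklist per host. The sketch below is only a planning aid under stated assumptions: `RackMove` and `checklist` are hypothetical names, nothing here calls Netbox, Homer, or any cookbook, and the example destination is taken from the schedule proposed later in this task.

```python
from dataclasses import dataclass

# Steps copied from the task description, in the order they must run.
MOVE_STEPS = (
    "Downtime/depool",
    "Power down",
    "Physically move the server",
    "Run the Netbox MoveServer script",
    "Run Homer",
    "Power the server back up",
)


@dataclass(frozen=True)
class RackMove:
    """One planned in-row move (hypothetical helper type, not a Wikimedia tool)."""
    host: str
    destination: str  # rack and unit, e.g. "D3 U13"


def checklist(move: RackMove) -> list[str]:
    """Return the ordered, numbered steps for a single host."""
    return [
        f"{move.host} -> {move.destination}: {n}. {step}"
        for n, step in enumerate(MOVE_STEPS, start=1)
    ]


if __name__ == "__main__":
    for line in checklist(RackMove("db1137", "D3 U13")):
        print(line)
```

Keeping the steps in a single ordered tuple makes the ordering explicit: the Netbox and Homer steps come after the physical move and before power-on, as in the description.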

Event Timeline

Marostegui moved this task from Triage to Blocked on the DBA board.

Please let me know when you'd like to get the database depooled and powered off.

No problem, with a few days of advance warning to drain the node we can easily move ganeti1020 any time.

Regarding scandium: that just needs a heads-up to @ssastry when the move is planned to happen. Nothing much from my side here (if it just comes back as before). Thanks!

Oh wait, does moving racks and running that cookbook mean IP addresses will change?

Edit: as Rhinos points out, you already said they stay in the same VLAN, so never mind. I was just asking because of MySQL grants in this case.

restbase1026 can be moved with a few minutes' notice without impact. The only requirement is that it stays in a row D rack, as stated.

For elastic1065 we just need some advance notice (~24h) so we can depool & ban [at the elasticsearch cluster level] the host. I'd do that now but it's not clear from the ticket if these hosts will be moved soon or not, so I'll wait for further heads-up.

@RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry Do any of your servers require 10G? I should be able to keep them all in row D, so this would only be an in-row move: no new IPs and limited downtime of 15-20 minutes per server. I would like to schedule all servers for next Tuesday at 1530 UTC. Please let me know whether this works for you and whether you have a 10G requirement.

1530-1550 db1137 -> D3 U13
1550-1620 elastic1065 -> D3 U14
1620-1650 scandium -> D3 U8
1650-1710 ganeti1020 -> D6 U2
1710-1730 restbase1026 -> D3 U5
1730-1750 druid1008 -> D6 U7
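
The proposed slots can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of any Wikimedia tooling: `parse_slots`, `slot_minutes` and `gaps` are hypothetical helper names, and all times are treated as same-day UTC.

```python
from datetime import datetime

# Slots exactly as posted above.
SCHEDULE = """\
1530-1550 db1137 -> D3 U13
1550-1620 elastic1065 -> D3 U14
1620-1650 scandium -> D3 U8
1650-1710 ganeti1020 -> D6 U2
1710-1730 restbase1026 -> D3 U5
1730-1750 druid1008 -> D6 U7
"""


def parse_slots(text):
    """Parse 'HHMM-HHMM host -> RACK UNIT' lines into (host, location, start, end)."""
    slots = []
    for line in text.splitlines():
        window, host, _arrow, rack, unit = line.split()
        start_s, end_s = window.split("-")
        start = datetime.strptime(start_s, "%H%M")
        end = datetime.strptime(end_s, "%H%M")
        slots.append((host, f"{rack} {unit}", start, end))
    return slots


def slot_minutes(slots):
    """Map each host to the length of its slot in minutes."""
    return {
        host: int((end - start).total_seconds() // 60)
        for host, _loc, start, end in slots
    }


def gaps(slots):
    """Return hosts whose slot does not start exactly when the previous one ends."""
    return [cur[0] for prev, cur in zip(slots, slots[1:]) if cur[2] != prev[3]]


if __name__ == "__main__":
    slots = parse_slots(SCHEDULE)
    for host, minutes in slot_minutes(slots).items():
        print(f"{host}: {minutes} min")
    print("gaps or overlaps:", gaps(slots) or "none")
```

With the slots as posted, the windows run back to back from 1530 to 1750 UTC (140 minutes in total); the elastic1065 and scandium slots are 30 minutes each and the rest are 20.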

@ayounsi confirmed they're all 1G. I added the racks and U numbers to the time slots.

@Cmjohnson db1137 does not need 10G and can be moved Tuesday 15:30 UTC - I will get the host ready for you.

@Cmjohnson Newly procured Ganeti servers use 10G, but ganeti1020 still has 1G only. I'll get it ready by Tuesday.

I am hoping @Dzahn can answer the question for me for scandium since I don't know what the 10G requirement means.

scandium doesn't use/need 10G, it can also be moved.

ganeti1020 is now emptied of VMs and can be moved.

Change 813208 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1137: Disable notifications

https://gerrit.wikimedia.org/r/813208

Mentioned in SAL (#wikimedia-operations) [2022-07-12T10:12:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1137 for onsite maintenance T308331', diff saved to https://phabricator.wikimedia.org/P31017 and previous config saved to /var/cache/conftool/dbconfig/20220712-101211-root.json

Change 813208 merged by Marostegui:

[operations/puppet@production] db1137: Disable notifications

https://gerrit.wikimedia.org/r/813208

@Cmjohnson db1137 is now off and ready to be moved anytime.

Mentioned in SAL (#wikimedia-operations) [2022-07-12T12:01:54Z] <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Rack move, T308331

Mentioned in SAL (#wikimedia-operations) [2022-07-12T12:02:09Z] <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1020.eqiad.wmnet with reason: Rack move, T308331

@RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry I am beginning to move servers in a few minutes, please ping me in IRC if you have any questions.

Mentioned in SAL (#wikimedia-operations) [2022-07-12T14:46:50Z] <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on druid1008.eqiad.wmnet with reason: T308331 btullis

Mentioned in SAL (#wikimedia-operations) [2022-07-12T14:47:04Z] <btullis@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on druid1008.eqiad.wmnet with reason: T308331 btullis

Thanks @Cmjohnson - I've added 3 hours of downtime for druid1008 - but feel free to add more if appropriate.

@RKemper @Marostegui @Dzahn @MoritzMuehlenhoff @BTullis @ssastry All your servers are moved.

@MoritzMuehlenhoff I am not able to ssh into yours, and I am not sure if that is expected. Can you please verify?

@hnowlan @Eevans You were not included in the original ping. I would like to move your restbase server as soon as possible. Can I do this Wednesday 7/13 at 1530 UTC?

All servers are back up. @MoritzMuehlenhoff I had to make the private1 VLAN the native VLAN.

@hnowlan @Eevans You were not included in the original ping. I would like to move your restbase server as soon as possible. Can I do this Wednesday 7/13 at 1530 UTC?

Works for me!

Thank you Chris - just started db1137 again.

https://netbox.wikimedia.org/dcim/devices/2612/ and https://netbox.wikimedia.org/dcim/devices/2252/ still show up as being in rack D5 but are cabled to different ToR switches, so I guess it's just a matter of updating Netbox with the new rack/U location.