eqiad: move non WMCS servers out of rack C8
Closed, ResolvedPublic

Description

Similar to T308331

eqiad C8 was dedicated to WMCS some time ago but still contains a few prod servers whose refresh dates are far in the future. https://netbox.wikimedia.org/dcim/racks/24/

  • an-tool1010 (@BTullis)
  • db1131 (marostegui) - 5 years old and being replaced in Q2 via T344036. - T308339#9033969 - decommission via T350141
  • db1135
  • dbproxy1021
  • deploy1002 (akosiaris) - 2023-07-21 update Rob/Alex chat : If this waits for sept 20th switchover then this can move without notice. Currently primary deploy host in eqiad and has long running users in screen sessions.
  • elastic1059
  • ganeti1012
  • mw1408
  • mw1409
  • mw1410
  • mw1411
  • mw1412
  • mw1413

Moving them out of this rack (to any row C rack) will have two main advantages:

  • Freeing up space and power for WMCS servers
  • Allowing the decommissioning of asw2-c8-eqiad (less urgent than D5 as we have free 40G ports)

As the hosts stay in the same vlan, the move should be along the lines of:

  1. Downtime/depool
  2. Power down
  3. Physically move the server
  4. Run the move server script https://netbox.wikimedia.org/extras/scripts/interface_automation.MoveServer/
  5. Run Homer
  6. Power the server back up
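
The ordering of the steps above matters: the Netbox script and Homer run only after the physical move, and the host is powered back up last. A minimal sketch of that ordering as a checklist runner follows. It is illustrative only: `MOVE_STEPS`, `MoveResult`, `run_move` and the `do_step` callback are hypothetical names, and nothing in it talks to Netbox, Homer, or a real host.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Step labels mirror the numbered list in the task description.
# They are plain strings, not real commands.
MOVE_STEPS: List[str] = [
    "downtime/depool",
    "power down",
    "physically move the server",
    "run the Netbox move server script",
    "run Homer",
    "power the server back up",
]


@dataclass
class MoveResult:
    host: str
    completed: List[str]
    failed_step: Optional[str]  # None when every step succeeded


def run_move(host: str, do_step: Callable[[str, str], bool]) -> MoveResult:
    """Run the steps in order and stop at the first one that fails.

    `do_step(host, step)` is a hypothetical callback supplied by the
    caller (for example, a prompt asking an operator to confirm the step).
    """
    completed: List[str] = []
    for step in MOVE_STEPS:
        if not do_step(host, step):
            # Stop here so later steps never run out of order.
            return MoveResult(host, completed, step)
        completed.append(step)
    return MoveResult(host, completed, None)


if __name__ == "__main__":
    # Dry run: every step is simply reported as done.
    result = run_move("mw1408", lambda host, step: True)
    print(result.host, "failed at:", result.failed_step)
```

Stopping at the first failed step keeps a host from being powered back up before the switch configuration has been updated.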

Event Timeline

Please let me know when you'd like to get the databases and dbproxy depooled and powered off.

Self note: dbproxy1021 isn't active

Marostegui triaged this task as Medium priority. May 17 2022, 8:17 AM

For elastic1059 we just need some advance notice (~24h) so we can depool & ban [at the elasticsearch cluster level] the host. I'd do that now but it's not clear from the ticket if these hosts will be moved soon or not, so I'll wait for further heads-up.

deploy1002 will need to be scheduled well in advance and/or failed over to deploy2002 as it is the canonical deployment host.

the mw* hosts can be done at any time I guess.

What's the timeline for this?

The current step is to gather limitations that can impact scheduling. Then it will depend on DCops. I'd like D5 to happen in the next 3 months. Less urgency for C8.

Please let us know before proceeding with this, as db1131 is now a master, so we'd need to switch it back to being a single replica. A heads-up 2-3 days beforehand would let us schedule it.

@Marostegui row C is very tight and I can only move 2 of the several servers that need to go. I would like to move these 2 of yours first. Can we schedule this for tomorrow, 14 July @ 1700 UTC?

db1135 (@Marostegui)
dbproxy1021 (@Marostegui)

@Cmjohnson that works for me. I will get those two hosts ready for you

Change 813835 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1135,dbproxy1021: Disable notifications

https://gerrit.wikimedia.org/r/813835

Change 813835 merged by Marostegui:

[operations/puppet@production] db1135,dbproxy1021: Disable notifications

https://gerrit.wikimedia.org/r/813835

@Cmjohnson db1135 and dbproxy1021 are now off and ready for the move.

The remainder of these server moves can happen once we are able to resolve T306162. That will free up space in rack d6. Currently, this row is at maximum capacity for 1G servers.

ayounsi mentioned this in Unknown Object (Task). Jul 20 2022, 8:12 AM
ayounsi mentioned this in Unknown Object (Task). Nov 17 2022, 4:21 PM

Mentioned in SAL (#wikimedia-operations) [2022-11-18T10:34:45Z] <moritzm> draining ganeti1012 in preparation of server move to a new rack T308339

@Cmjohnson - Let me know when you're ready to move an-tool1010 please. I'll schedule a maintenance window for Superset and shut it down for you.
Am I right in assuming that you'll want to finish T306162 first for this server as well? Thanks.

++@Jclark-ctr, since @Cmjohnson will be out for a while

ganeti1012 can be powered down for the rack move; the remaining three VMs are redundant and have been silenced in monitoring.

I am removing the DBA tag from this task as there are no more databases pending here. I will remain subscribed in case I am needed.

Oh, never mind, db1131 is still to be moved.

Yep, the list of servers on the task description is up to date.

@akosiaris As we're in the DC switchover and 2002 is the active one, should we tackle 1002 sooner than later?

@RobH the mw hosts are 3 api servers and 3 appservers. You can do them anytime. All it requires is a downtime and a poweroff, per the description.

Mentioned in SAL (#wikimedia-sre) [2023-07-19T14:07:07Z] <robh> mw141[23] downtimes and relocating per T308339

Mentioned in SAL (#wikimedia-sre) [2023-07-19T14:48:43Z] <robh> mw141[34] returned to service per T308339

Mentioned in SAL (#wikimedia-sre) [2023-07-19T14:49:33Z] <robh> mw141[23] returned to service per T308339. ignore typo of mw1414 it is uninvolved

Mentioned in SAL (#wikimedia-sre) [2023-07-19T15:11:54Z] <robh> mw141[01] returned to service per T308339

Mentioned in SAL (#wikimedia-sre) [2023-07-19T15:14:30Z] <robh> mw140[89] downtime for relocation per T308339

For db1131, it is a master. When do you plan to do this? I'd need a couple of days to remove its master role.

So when I checked the list of servers, any server over 5 years I deemed not worth trying to rush this week (as I was only on-site this single week) to get migrated. As we tend to replace most servers at 5 years, I noticed that db1131 is slated for replacement with the order of T341269.

IMO, it isn't worth the maint time to migrate a server that will go away sometime in Q2 (as its replacement will arrive and become available by end of Q1.)

Sound reasonable?

RobH updated the task description.

Sounds good to me Rob :-)
If for any other reason we end up switching that host's role before Q2, I'll comment on this task

@RobH db1131 is no longer a master, it can be moved if we want to. We just need to depool it and stop mariadb beforehand.

@Jclark-ctr When would you like to move an-tool1010 ?
It is the single host behind superset.wikimedia.org so I'd like to give our users a little bit of notice if it's going to be down for more than 10 minutes or so. Thanks.

@RobH, Switchover was done yesterday, we are now in codfw for the next 6 months, deploy1002 is no longer used. It can be powered off and moved whenever ops-eqiad feels like it.

@RobH I just want to make sure you saw Alex's message to you above. iirc you took care of some of the other moves.

@Jclark-ctr or @VRiley-WMF - can one of you follow up on Ben's question above on an-tool1010, along with Alex's comment on deploy1002? Thanks, Willy

The following has been re-racked

deploy1002 - C 3, U 34, CableID 3750, port 40

Ran script and powered the unit on.

Relocated an-tool1010 to rack C3

Jclark-ctr updated the task description.