codfw: Relocate servers to make space for new switches in rowA and rowB
Closed, Resolved · Public

Description

We are planning to refresh all the access switches in rows A and B. To have the new switches up and running before moving servers to them, we will have to rack them in parallel with the old switches so that there is no disruption on the production network. The best racking space for the new switches is U48 and U47; we are also planning to move the mgmt switch to the top of the rack, so U46 and U45 are needed as well, in each of the racks in rows A and B except rack A1. Unfortunately, some of those racks have servers using that U space, so we have to relocate those servers to another U space within the same rack or in another rack. Please see below for the list of servers to relocate. Thanks.

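| Done | Server | Rack | U space | Relocation rack/U space | Old switch port | New switch port | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YES | lvs2007 | A2 | 47 | A2/U44 | same switch ports | no changes | |
| YES | mw2401 | A3 | 45 | A5/U37 | ge-3/0/38 | ge-5/0/35 | |
| YES | mw2411 | A5 | 45 | A5/U38 | ge-5/0/38 | ge-5/0/39 | |
| YES | lvs2008 | B2 | 47 | B2/U44 | same switch ports | no changes | |
| YES | mw2324 | B3 | 46 | B3/U1 | ge-3/0/40 | ge-3/0/0 | |
| YES | mw2323 | B3 | 45 | B3/U2 | ge-3/0/39 | ge-3/0/1 | |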
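
As a quick cross-check of the list above, the sketch below flags servers whose current U position falls in the space wanted for the new switches and the mgmt switch (U45 to U48, as I read the description), and confirms that no two servers are sent to the same target slot. The rows are copied from the table; `RESERVED_U`, `Move`, and the helper functions are made-up names for illustration, not part of any Wikimedia tooling, and the check assumes every server is 1U.

```python
from dataclasses import dataclass

# U positions the description sets aside: U48/U47 for the new switches,
# U46/U45 for the relocated mgmt switch. Assumption: same in every rack.
RESERVED_U = {45, 46, 47, 48}


@dataclass(frozen=True)
class Move:
    server: str
    rack: str
    u: int          # current U position (assumes a 1U server)
    new_rack: str
    new_u: int


# Rows copied from the relocation table.
MOVES = [
    Move("lvs2007", "A2", 47, "A2", 44),
    Move("mw2401", "A3", 45, "A5", 37),
    Move("mw2411", "A5", 45, "A5", 38),
    Move("lvs2008", "B2", 47, "B2", 44),
    Move("mw2324", "B3", 46, "B3", 1),
    Move("mw2323", "B3", 45, "B3", 2),
]


def needs_relocation(move: Move) -> bool:
    """True if the server currently sits in a reserved U position."""
    return move.u in RESERVED_U


def destination_is_clear(move: Move) -> bool:
    """True if the target U position is outside the reserved range."""
    return move.new_u not in RESERVED_U


def destinations_unique(moves) -> bool:
    """True if no two servers are assigned the same rack/U target."""
    targets = [(m.new_rack, m.new_u) for m in moves]
    return len(targets) == len(set(targets))


if __name__ == "__main__":
    for m in MOVES:
        print(m.server, needs_relocation(m), destination_is_clear(m))
    print("unique targets:", destinations_unique(MOVES))
```

This only checks the listed servers against each other; it does not know what else is already racked at the target positions.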

Event Timeline

Restricted Application added a subscriber: Aklapper.
Papaul triaged this task as Medium priority. Jan 10 2023, 1:35 AM
Papaul moved this task from Backlog to Racking Tasks on the ops-codfw board.

@ayounsi @cmooney I have 2 questions:
1- I have received a total of 17 switches, so 1 is going to be used as the cloudsw in rack B1. Since B1 will be the WMCS rack, are we going to put a leaf switch in that rack or keep 1 switch as a spare?
2- I know it is not planned yet, but if you can tell me in which rack each of the 2 spine switches will be racked, that would be great.

Thanks.

1/ 1 ToR per rack = 8x2 + 1 spare = 17, so indeed 1 dedicated to WMCS

2/ A1 and B1 would make sense, and would match eqiad E/F. But if there are cabling constraints we can revisit. Let me know what works best for you.
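
For reference, the switch count in 1/ spelled out as arithmetic. This assumes 8 racks per row (racks 1 through 8) across the two rows being refreshed; the variable names are only illustrative.

```python
RACKS_PER_ROW = 8  # assumption: racks 1 through 8 in each row
ROWS = 2           # row A and row B
SPARES = 1

tor_switches = RACKS_PER_ROW * ROWS  # one ToR switch per rack
total = tor_switches + SPARES

print(tor_switches, total)  # prints: 16 17
```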

@ayounsi since A1 and A8 are supposed to be our network racks, I would prefer, if possible, to put one spine in A1 and the other spine in A8. This makes the cabling much better: all the cables go to a single row, rather than some cables going to row A and others to row B.

I see. What would be best later on for the row C and D spines: C1/C8 or C1/D1?
Is using A1/A8 better for eqiad as well?

Papaul renamed this task from codfw: Relocate servers racked in U27 in all racks in rowA and rowB to codfw: Relocate servers to make space for new switches in rowA and rowB. Jan 18 2023, 1:23 AM
Papaul updated the task description.

We discussed this during today's meeting: we are going to put 1 spine in A1 and the other spine in A8. When we upgrade rows C and D, we will add 2 other spines, which will go in D1 and D8.

@BBlack Do you think you will have time for us to move lvs2007 this Thursday the 26th at 9:45am CT 2:45 pm UTC?

Thank you.

@Papaul - I can't make that slot for LVS, I have meetings a bit later that might get run over. @ssingh might be able to though!

Happy to take care of it!

@BBlack @ssingh thank you. So the process is: you depool the server and power it down, I move it, and we power it back on. No changes to the network or the cabling, just the U space.
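
The sequence described above can be sketched as a small ordered checklist that refuses out-of-order steps. This is only an illustration: the `Step` and `Relocation` names are hypothetical, and the downtime and repool steps are not in the comment itself but are taken from the SAL entries in this task.

```python
from enum import Enum


class Step(Enum):
    DOWNTIME = "set monitoring downtime"
    DEPOOL = "depool the server"
    POWER_OFF = "power the server down"
    MOVE = "move the server to its new U space"
    POWER_ON = "power the server back on"
    REPOOL = "repool the server"


# Enum members iterate in definition order, which is the required order.
ORDER = list(Step)


class Relocation:
    """Tracks one host through the steps, rejecting out-of-order steps."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.completed: list[Step] = []

    @property
    def next_step(self):
        """The step that must happen next, or None when all are done."""
        if len(self.completed) == len(ORDER):
            return None
        return ORDER[len(self.completed)]

    def complete(self, step: Step) -> None:
        if step is not self.next_step:
            raise ValueError(
                f"{self.host}: expected {self.next_step}, got {step}"
            )
        self.completed.append(step)


if __name__ == "__main__":
    job = Relocation("lvs2007")
    for step in ORDER:
        job.complete(step)
        print(f"{job.host}: {step.value}")
```

Calling `complete(Step.MOVE)` before the host has been depooled and powered off raises `ValueError`, which is the point of the sketch: the physical move only happens once the host is out of service.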

Mentioned in SAL (#wikimedia-operations) [2023-01-26T15:09:06Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on lvs2007.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-01-26T15:09:21Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on lvs2007.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-01-26T16:48:14Z] <sukhe> correcting earlier log: pooling lvs2007 after T326564

Papaul updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-03-20T14:29:44Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2008.codfw.wmnet with reason: T326564

Mentioned in SAL (#wikimedia-operations) [2023-03-20T14:30:10Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2008.codfw.wmnet with reason: T326564

I will be working with @Clement_Goubert today at 10am CT to relocate those mw nodes.

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:51:49Z] <cgoubert@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on mw2401.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:52:02Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2401.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:52:07Z] <cgoubert@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on mw2411.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:52:20Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2411.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:52:34Z] <cgoubert@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on mw2324.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:52:47Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2324.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:52:52Z] <cgoubert@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on mw2323.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:53:15Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw2323.codfw.wmnet with reason: powering off for T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:53:48Z] <claime> Depooling mw2401 mw2411 mw2324 mw2323 as invalid for powerdown - T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T14:55:18Z] <claime> Powering down mw2401 mw2411 mw2324 mw2323 - T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T15:21:55Z] <claime> mw2401.codfw.wmnet repooled following T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T15:27:25Z] <claime> mw2411.codfw.wmnet repooled following T326564

Papaul updated the task description.
Papaul added a subscriber: ssingh.

This is complete, thanks to @ssingh and @Clement_Goubert

Mentioned in SAL (#wikimedia-operations) [2023-06-15T15:43:30Z] <claime> mw2324.codfw.wmnet repooled following T326564

Mentioned in SAL (#wikimedia-operations) [2023-06-15T15:44:50Z] <claime> mw2323.codfw.wmnet repooled following T326564
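
The SAL entries above are enough to estimate how long each mw host was out of service on 2023-06-15. The sketch below parses the relevant lines (copied from this task, with the "Mentioned in SAL" prefix dropped) and measures from the shared "Powering down" entry to each host's "repooled" entry. Treating those two entries as the start and end of the outage is an assumption; the log does not record per-host power-off or power-on times. The function and pattern names are illustrative.

```python
import re
from datetime import datetime, timedelta

LOG = """\
[2023-06-15T14:55:18Z] <claime> Powering down mw2401 mw2411 mw2324 mw2323 - T326564
[2023-06-15T15:21:55Z] <claime> mw2401.codfw.wmnet repooled following T326564
[2023-06-15T15:27:25Z] <claime> mw2411.codfw.wmnet repooled following T326564
[2023-06-15T15:43:30Z] <claime> mw2324.codfw.wmnet repooled following T326564
[2023-06-15T15:44:50Z] <claime> mw2323.codfw.wmnet repooled following T326564
"""

# "[timestamp] <user> message"
ENTRY = re.compile(r"^\[(?P<ts>[^\]]+)\] <(?P<user>[^>]+)> (?P<msg>.*)$")


def parse(log: str):
    """Yield (timestamp, user, message) for each well-formed log line."""
    for line in log.splitlines():
        m = ENTRY.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
            yield ts, m["user"], m["msg"]


def time_out_of_service(log: str) -> dict[str, timedelta]:
    """Map short hostname -> time from 'Powering down' to 'repooled'."""
    power_down = None
    result = {}
    for ts, _user, msg in parse(log):
        if msg.startswith("Powering down"):
            power_down = ts
        elif " repooled " in msg and power_down is not None:
            host = msg.split(".", 1)[0]  # "mw2401.codfw.wmnet ..." -> "mw2401"
            result[host] = ts - power_down
    return result


if __name__ == "__main__":
    for host, delta in time_out_of_service(LOG).items():
        print(host, delta)
```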