
Move servers off asw2-a5-eqiad
Closed, Resolved (Public)

Description

asw2-a5-eqiad is very old. As it's a 10G switch, and the 10G racks are now A2, A4, and A7, some servers will have to physically move, while others will only have their uplink moved to the new A5 switch.

We need @Cmjohnson to confirm we have enough rack space in racks A2, A4, A7 for the servers that need to physically move.

Ideally, tackle the servers in the spare::system role before they start being used.

By the look of it, I'm not sure we need to schedule this work in a special window, but I'm fine either way.

Need to physically move from A7 to A5:

Name | Status | Note
wtp1028-1030 | TODO | Can be moved anytime
analytics1071 | TODO | Sync up with @elukey
ores1002 | TODO | Empty the node of live VMs first? No, can be moved anytime
ganeti1008 | TODO | Empty the node of live VMs first (@akosiaris)

Need to physically move out of A5:

Interface | Name | Status | Note
xe-0/0/1 | sodium | TODO | Can probably be moved anytime to any row; @faidon might know
xe-0/0/3 | cloudelastic1001 | TODO | Can be moved anytime (role spare::system) - T194186
xe-0/0/5 | cp1076 | TODO | Can be moved anytime after depool
xe-0/0/15 | cp1078 | TODO | Can be moved anytime after depool
xe-0/0/6 | ms-be1040 | TODO | Can be moved anytime [confirmed]
xe-0/0/17 | ms-be1028 | TODO | Can be moved anytime [confirmed]
xe-0/0/29 | ms-fe1005 | TODO | Can be moved anytime after depool; depool ms-fe100[56] one at a time
xe-0/0/30 | ms-fe1006 | TODO | Can be moved anytime after depool; depool ms-fe100[56] one at a time
xe-0/0/34 | ms-be1029 | TODO | Can be moved anytime [confirmed]
xe-0/0/35 | ms-be1030 | TODO | Can be moved anytime [confirmed]
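For the hosts marked "after depool" above, depooling at WMF is normally done via conftool; as a rough sketch only (the host name is taken from the table above, and this is not a confirmed step of this plan):

sudo -i confctl select name=ms-fe1005.eqiad.wmnet set/pooled=no
(move the server, bring it back up, then repool:)
sudo -i confctl select name=ms-fe1005.eqiad.wmnet set/pooled=yes

On hosts that have them, the depool/pool wrapper scripts achieve the same thing.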

No need to physically move the server (uplink move only):

Interface | Name | Status | Note
xe-0/0/4 | lvs1016:enp4s0f1 | TODO | Move to different member after sync with Traffic - T184293
ge-0/0/7 | dbproxy1012 | TODO | Can be moved to asw2:fpc5 anytime (role spare::system)
ge-0/0/14 | cloudstore1008 | TODO | Can be moved to asw2:fpc5 anytime (role spare::system)

To be decommissioned:

Interface | Name | Status | Note
xe-0/0/8 | lvs1007 | TODO | To be decom - T208586
xe-0/0/11 | lvs1010:eth1 | TODO | To be decom - T208586
xe-0/0/12 | lvs1011:eth1 | TODO | To be decom - T208586
xe-0/0/13 | lvs1012:eth1 | TODO | To be decom - T208586
xe-0/0/18 | cp1058 | TODO | To be decom - T208584
xe-0/0/19 | cp1059 | TODO | To be decom - T208584
xe-0/0/20 | cp1060 | TODO | To be decom - T208584
xe-0/0/21 | cp1061 | TODO | To be decom - T208584
xe-0/0/22 | cp1062 | TODO | To be decom - T208584
xe-0/0/23 | cp1063 | TODO | To be decom - T208584
xe-0/0/24 | cp1064 | TODO | To be decom - T208584
xe-0/0/25 | cp1065 | TODO | To be decom - T208584
xe-0/0/26 | cp1066 | TODO | To be decom - T208584
xe-0/0/27 | cp1067 | TODO | To be decom - T208584
xe-0/0/28 | cp1068 | TODO | To be decom - T208584

Event Timeline

ayounsi triaged this task as Medium priority. Dec 19 2018, 10:05 PM
ayounsi created this task.

Re: the ms-fe hosts: if possible, please do not co-locate them onto the same rack in row A. Less stringent, but the more the ms-be hosts are spread out, the better.

I will need to create space in the 10G racks to make this work and some juggling will be required.

I need to move several servers out of rack A4 to make room for three ms-be servers, cp1076, and cloudelastic1001. Each of the ms-be servers is 2U. I am not sure what would be best and easiest to move (see my comment below about A2).

I will need to move the following to make room for two ms-be servers, plus ms-fe1005, cp1078, and sodium in A7:
wtp1028-1030 from A7 to A5
analytics1071 from A7 to A5
ores1002 from A7 to A5
ganeti1008 from A7 to A5

Rack A2 is currently full because of limitations with available power. If it's preferred to swap the PDU sooner rather than later, that would add significant U space and we would not need to find as much space in A4. However, for redundancy, we should look at moving two if not three of the ms-be servers to A4.

@ayounsi I want to do all the server moves on Thursday this week. Can you ask the service owners to have everything depooled? I will get started at 1500 UTC. The server moves will take a couple of hours, and then we can do the network changes afterward. I think the physical moves will not exceed 3 hours.

wtp1028-1030 can indeed be moved anytime, provided they are shut down gracefully.
ores1002 can indeed be moved anytime, provided it is shut down gracefully.
ganeti1008: this one will need to be emptied of VMs first. It is easy and rather quick (some 10-30 mins) per https://wikitech.wikimedia.org/wiki/Ganeti#Reboot/Shutdown_for_maintenance_a_node. I can help with that; just ping me on or before the day of the move.
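For reference, emptying a Ganeti node per the wikitech page above comes down to live-migrating its primary instances to their secondaries; a hedged sketch (the FQDN is assumed, and this is not the authoritative runbook):

sudo gnt-node migrate -f ganeti1008.eqiad.wmnet
sudo gnt-instance list -o name,pnode | grep ganeti1008

The second command should return nothing before the node is powered off; the -f flag skips the per-instance confirmation prompt.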

I neglected to update this task with the email I sent out yesterday.

This task is being tracked on: https://phabricator.wikimedia.org/T212348

TL;DR: If you got this email, look below at the list of servers and your name. You need to review the migration plan listed for your server. Full details below.

If there are any changes required, PLEASE UPDATE THE TASK DIRECTLY RATHER THAN REPLY TO THIS EMAIL.

Full Details:

Please note that you are being emailed due to being one of the service owners of a system located in rack A5-eqiad. (Or you are one of the few service owners who will have their system move immediately into A5 post-switch-upgrade.)

A5-eqiad currently has a standalone (not in the switch stack/fabric) EX4500 with 10G connections. As part of the eqiad network switch upgrades, it will be replaced. While row A will end up with three 10G racks, A5 will not be one of them, so all 10G systems in A5 will need to relocate into one of the 10G racks within row A.

Chris has already reviewed his racking space in row A, and doesn't expect any issues with relocating ALL affected servers within row A itself (so no networking/IP changes).

The work is scheduled for Thursday, February 28th at 1500 UTC. At that time, Chris will begin taking down servers one at a time, relocating them to their new rack/network port, and powering them back on. Ideally, a service owner for each system type will be online and available via IRC to coordinate the downtime for their individual servers and bring them back into service post-migration.

I'll list off each system that is affected, and our current understanding of how to handle the migration via task updates thus far.

If there are any changes required, PLEASE UPDATE THE TASK DIRECTLY RATHER THAN REPLY TO THIS EMAIL.

Systems moving out of A5:

Systems Affected : system role : owner : how to handle the migration

sodium: mirrors server : faidon : clean shutdown and migrate to another public vlan 10G port in row a
cloudelastic1001 : cloud elastic system : service operations : currently role spare::system as it's not yet in service; just clean shutdown and relocation into another 10G port in the public vlan
cp1076 : cp system : traffic : clean shutdown will automatically depool; relocate to new 10G port in internal vlan, row A
cp1078 : cp system : traffic : clean shutdown will automatically depool; relocate to new 10G port in internal vlan, row A
ms-be1040 : ms backend : filippo : clean shutdown and relocation, bring back online in 10G row A internal vlan
ms-be1028 : ms backend : filippo : clean shutdown and relocation, bring back online in 10G row A internal vlan
ms-be1029 : ms backend : filippo : clean shutdown and relocation, bring back online in 10G row A internal vlan
ms-be1030 : ms backend : filippo : clean shutdown and relocation, bring back online in 10G row A internal vlan
ms-fe1005 : ms frontend : filippo : clean shutdown and relocation, bring back online in 10G row A internal vlan AND DO NOT TAKE DOWN AT THE SAME TIME AS ms-fe1006, bring fully back online before moving other ms-fe system
ms-fe1006 : ms frontend : filippo : clean shutdown and relocation, bring back online in 10G row A internal vlan AND DO NOT TAKE DOWN AT THE SAME TIME AS ms-fe1005; bring fully back online before moving the other ms-fe system

Systems moving into A5:

Systems Affected : system role : owner : how to handle the migration

wtp1028-1030 : parsoid : Alex (since he implemented them last on the SRE team?) : uncertain at this time, need feedback. Task has an old listing of 'can move anytime' and we assume a clean shutdown and power up is all that is needed. NEED CONFIRMATION.
analytics1071 : analytics machine : elukey & ottomata : Luca or Andrew will need to merge some patches to facilitate the move, and plan to be around during the migration.
ores1002 : ores machine : Alex : notes state clean shutdown and can move anytime, but we'll need confirmation from Alex. NEED CONFIRMATION.
ganeti1008 : ganeti vm host : Alex : empty the nodes from live VMs first and migrate, this is documented on wikitech (sync with Alex if possible to confirm all is still fine for this)

So, the list in my email is too long, and some of those hosts were previously moved by Chris in advance of my email (likely well in advance, a while ago, via independent projects).

The actual migration list is that much shorter. I've put it on this google sheet, which is ONLY accessible by SRE members (so don't bother requesting access if you aren't working on this project). This sheet lists every host to move, its current location, and its future projected location. If the host is moving out of a rack that IS NOT a5-eqiad, I'm also listing its old port info for removal post-migration. (Servers moving off asw2-a5-eqiad won't need ports disabled, since that switch is being removed.)

Ok, @ayounsi double checked my ports and migration plan on the gsheet, and I've started to make the needed port changes (he set up ganeti1008 for me as it's a special case).

robh@asw2-a-eqiad# show | compare 
[edit interfaces interface-range vlan-private1-a-eqiad]
     member ge-3/0/36 { ... }
+    member ge-5/0/40;
+    member ge-5/0/41;
+    member ge-5/0/42;
+    member ge-5/0/44;
[edit interfaces interface-range vlan-analytics1-a-eqiad]
     member xe-7/0/33 { ... }
+    member ge-5/0/43;
[edit interfaces]
+   ge-5/0/40 {
+       description wtp1028;
+   }
+   ge-5/0/41 {
+       description wtp1029;
+   }
+   ge-5/0/42 {
+       description wtp1030;
+   }
+   ge-5/0/43 {
+       description analytics1071;
+   }
+   ge-5/0/44 {
+       description ores1002;
+   }

I neglected to remove the ports above from the disabled group; I did so in the next update, and also removed all the others that were in disabled but are needed for this:

robh@asw2-a-eqiad# show | compare 
[edit interfaces interface-range disabled]
-    member ge-5/0/41;
-    member ge-5/0/42;
-    member ge-5/0/43;
-    member ge-5/0/40;
-    member ge-5/0/44;
[edit interfaces interface-range vlan-private1-a-eqiad]
     member ge-5/0/44 { ... }
+    member xe-2/0/10;
+    member xe-2/0/15;
+    member xe-2/0/24;
+    member xe-2/0/25;
+    member xe-4/0/28;
[edit interfaces interface-range vlan-public-a-eqiad]
     member ge-1/0/8 { ... }
+    member xe-7/0/37;
[edit interfaces]
+   xe-2/0/10 {
+       description ms-be1028;
+   }
+   xe-2/0/15 {
+       description ms-be1029;
+   }
+   xe-2/0/24 {
+       description ms-be1030;
+   }
+   xe-2/0/25 {
+       description ms-fe1005;
+   }
+   xe-4/0/28 {
+       description ms-fe1006;
+   }
+   xe-7/0/37 {
+       description sodium;
+   }

{master:7}[edit]
robh@asw2-a-eqiad#

I thought it was odd that some of the ports I ran the command on to remove from disabled turned out to be not in use, yet not disabled either, like xe-7/0/37. (I had to set a description; it had none set and thus was not in use, but it wasn't in any vlan or in disabled.)

All should function now.

> I thought it was odd that some of the ports I ran the command on to remove from disabled turned out to be not in use, yet not disabled either, like xe-7/0/37. (I had to set a description; it had none set and thus was not in use, but it wasn't in any vlan or in disabled.)

This is because the 10G switch's interfaces only exist when an optic is present. Since they don't exist from the switch's point of view, they don't need to be disabled.
This can be seen with:

show interfaces xe-7/0/37 
error: device xe-7/0/37 not found

It's perfectly fine to pre-configure them, and once an optic is there, they do need to be disabled if they are not in use.
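As a sketch in Junos set syntax (the port name is taken from the session above; this is illustrative, not a command that was run here), pre-configuring such a port and keeping it disabled until it's needed looks like:

{master:7}[edit]
robh@asw2-a-eqiad# set interfaces xe-7/0/37 description sodium
robh@asw2-a-eqiad# set interfaces xe-7/0/37 disable
robh@asw2-a-eqiad# commit

Once the port should go live, "delete interfaces xe-7/0/37 disable" followed by another commit brings it up; per the comment above, the disable statement only has an effect once an optic is actually inserted.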

Change 493169 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop analytics: set analytics1071 rack config to A5

https://gerrit.wikimedia.org/r/493169

We have an issue with these servers. I completely forgot about the lack of power availability in rack A2. Will we need to move these to a different row, or wait until we swap the PDUs?

+   xe-2/0/10 {
+       description ms-be1028;
+   }
+   xe-2/0/15 {
+       description ms-be1029;
+   }
+   xe-2/0/24 {
+       description ms-be1030;
+   }
+   xe-2/0/25 {
+       description ms-fe1005;
+   }

Mentioned in SAL (#wikimedia-operations) [2019-02-27T17:06:27Z] <cmjohnson1> powering off wtp1029 to move to different rack A5 T212348

Mentioned in SAL (#wikimedia-operations) [2019-02-27T17:14:30Z] <cmjohnson1> powering off wtp1029 to move to different rack A5 T212348

Mentioned in SAL (#wikimedia-operations) [2019-02-27T17:19:27Z] <cmjohnson1> powering off wtp1030 to move to different rack A5 T212348

Mentioned in SAL (#wikimedia-operations) [2019-02-27T17:22:54Z] <elukey> drain + shutdown of analytics1071 to allow its move to A5 - T212348

> We have an issue with these servers. I completely forgot about the lack of power availability in rack A2. Will we need to move these to a different row, or wait until we swap the PDUs?
>
> +   xe-2/0/10 {
> +       description ms-be1028;
> +   }
> +   xe-2/0/15 {
> +       description ms-be1029;
> +   }
> +   xe-2/0/24 {
> +       description ms-be1030;
> +   }
> +   xe-2/0/25 {
> +       description ms-fe1005;
> +   }

Ok, removed those from A2 and will add them into A7 instead!

[edit interfaces interface-range disabled]
     member ge-5/0/14 { ... }
+    member xe-2/0/10;
+    member xe-2/0/15;
+    member xe-2/0/24;
+    member xe-2/0/25;
[edit interfaces interface-range vlan-private1-a-eqiad]
-    member xe-2/0/10;
-    member xe-2/0/15;
-    member xe-2/0/24;
-    member xe-2/0/25;
[edit interfaces]
-   xe-2/0/10 {
-       description ms-be1028;
-   }
-   xe-2/0/15 {
-       description ms-be1029;
-   }
-   xe-2/0/24 {
-       description ms-be1030;
-   }
-   xe-2/0/25 {
-       description ms-fe1005;
-   }

Set up new ports for the mw systems in A5:

[edit interfaces interface-range disabled]
-    member ge-5/0/1;
-    member ge-5/0/0;
-    member ge-5/0/2;
-    member ge-5/0/3;
-    member ge-5/0/6;
-    member ge-5/0/8;
[edit interfaces interface-range vlan-private1-a-eqiad]
     member xe-4/0/28 { ... }
+    member ge-5/0/0;
+    member ge-5/0/1;
+    member ge-5/0/2;
+    member ge-5/0/3;
+    member ge-5/0/6;
+    member ge-5/0/8;
[edit interfaces]
+   ge-5/0/0 {
+       description mw1261;
+   }
+   ge-5/0/1 {
+       description mw1262;
+   }
+   ge-5/0/2 {
+       description mw1263;
+   }
+   ge-5/0/3 {
+       description mw1264;
+   }
+   ge-5/0/6 {
+       description mw1265;
+   }
+   ge-5/0/8 {
+       description mw1266;
+   }

{master:7}[edit]

The google sheet is also updated with the migration plan.

Change 493169 merged by Elukey:
[operations/puppet@production] hadoop analytics: set analytics1071 rack config to A5

https://gerrit.wikimedia.org/r/493169

Ports in A7 for the ms-be and ms-fe hosts are set up, using ports that belonged to old mw systems on the old switch stack and will be unused on the new stack:

robh@asw2-a-eqiad# show | compare 
[edit interfaces interface-range vlan-private1-a-eqiad]
     member ge-5/0/8 { ... }
+    member xe-7/0/0;
+    member xe-7/0/1;
+    member xe-7/0/2;
+    member xe-7/0/3;
[edit interfaces]
+   xe-7/0/0 {
+       description ms-be1028;
+   }
+   xe-7/0/1 {
+       description ms-be1029;
+   }
+   xe-7/0/2 {
+       description ms-be1030;
+   }
+   xe-7/0/3 {
+       description ms-fe1005;
+   }

They weren't set up at all before (optic ports), so there was nothing to remove from the disabled group.

Mentioned in SAL (#wikimedia-operations) [2019-02-27T18:15:09Z] <cmjohnson1> powering off mw1261 to move to different rack A5 T212348

Mentioned in SAL (#wikimedia-operations) [2019-02-27T18:21:23Z] <cmjohnson1> powering off mw1262 to move to different rack A5 T212348

Mentioned in SAL (#wikimedia-operations) [2019-02-27T18:28:16Z] <cmjohnson1> powering off mw126[3-6] one at a time to move to different rack A5 T212348

Mentioned in SAL (#wikimedia-operations) [2019-02-28T17:24:07Z] <cmjohnson1> powering down sodium to move racks T212348

Everything has been moved smoothly, thanks!