
Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it
Closed, Resolved · Public

Description

We now have a new QFX5120-48Y switch in codfw rack B1, which arrived recently with the replacement switches for the row A/B hardware refresh (see T312138). It is installed in the rack and reachable via the OpenGear console server.

Unlike asw-b1-codfw, which it was purchased to replace, the intention is to configure this switch as a stand-alone "cloudsw", mirroring the dedicated switches for WMCS in eqiad.

Creating this task to track the steps to configure this device, and begin migrating cloud hosts from the existing one over to it.

At a high level I would suggest we proceed as follows, open to discussion of course:

Add Physical Connections

  1. Connect the cloudsw to cr1-codfw or cr2-codfw for the routed uplink from the cloudsw to the core routers
    1. I expect 10G is sufficient bandwidth for this link
    2. One connection is probably sufficient; this does mean a lack of redundancy, but codfw is the WMCS test/staging site
  2. Connect the cloudsw to asw-b1-codfw as a vlan trunk port (see the sketch below)
    1. We only trunk Vlan 2118 - cloud-hosts1-codfw - over this link
    2. 10G is probably sufficient bandwidth; it could be a 2x10G LAG if we thought that needed
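
For illustration, the asw-b1 trunk side of this could look something like the following on the cloudsw (a minimal Junos sketch; the port number and description are assumptions, and the optional 2x10G LAG variant is omitted):

# Hypothetical uplink port towards asw-b1-codfw, trunking only vlan 2118
set vlans cloud-hosts1-codfw vlan-id 2118
set interfaces xe-0/0/47 description "asw-b1-codfw vlan trunk"
set interfaces xe-0/0/47 unit 0 family ethernet-switching interface-mode trunk
set interfaces xe-0/0/47 unit 0 family ethernet-switching vlan members cloud-hosts1-codfw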

Enable the routed CR -> Cloudsw logical links and BGP

Once the physical connections are in place we proceed like this to make the cloudsw <-> cr link live:

  1. Configure the cloudsw similarly to those in eqiad, using the same templates and vrf setup
  2. Configure the routed uplinks on the CR and cloudsw, and apply the labs-in and cloud-in filters to the sub-interfaces
  3. Configure the cloudsw with a currently unused IP from the cloud-hosts1-codfw subnet
  4. Validate that the CR receives the BGP announcement of the cloud-hosts1-codfw subnet from the cloudsw (a rough sketch of this follows below)
    1. It will still prefer its direct connection to the subnet on ae2.2118
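
As a rough sketch, the cloudsw side of this could look like the following (Junos set commands; the sub-interface numbering, /31 addressing, local AS, and the unused IRB address are illustrative assumptions, and the vrf/filter detail is omitted):

# Routed sub-interface towards cr1-codfw (hypothetical unit and /31)
set interfaces xe-0/0/46 vlan-tagging
set interfaces xe-0/0/46 unit 1000 vlan-id 1000
set interfaces xe-0/0/46 unit 1000 family inet address 10.192.254.1/31
# EBGP to the CR; 64710 is a placeholder local AS for the cloudsw
set routing-options autonomous-system 64710
set protocols bgp group cr-uplink type external
set protocols bgp group cr-uplink peer-as 14907
set protocols bgp group cr-uplink neighbor 10.192.254.0 export cloud-out
# IRB on vlan 2118 with a currently unused IP (placeholder address/mask)
set interfaces irb unit 2118 family inet address 10.192.20.3/24

Validation on the CR would then be something like:

show route receive-protocol bgp 10.192.254.1 10.192.20.0/24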

Move cloud vlan gateway IPs from CRs to cloudsw

For cloud-hosts1-codfw subnet:

  1. Change the GW IP on the cloudsw irb.2118 interface to 10.192.20.1
  2. Shut down ae2.2118 on cr1-codfw and cr2-codfw
    1. This will halt traffic, as hosts will still have the VRRP MAC for the gateway cached in their ARP tables
    2. We need to manually clear the ARP entry for the GW IP on servers connected to cloud-hosts1-codfw (commands sketched below)
  3. Validate things are working as before, all services etc., and that traffic is flowing via the cloudsw<-->cr link
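
A minimal sketch of the cutover, assuming a /24 mask on cloud-hosts1-codfw (host interface names will vary):

# On the cloudsw: take over the gateway address on irb.2118
set interfaces irb unit 2118 family inet address 10.192.20.1/24
# On cr1-codfw and cr2-codfw: disable the old gateway sub-interface
set interfaces ae2 unit 2118 disable
# On each server in cloud-hosts1-codfw: flush the stale ARP entry for the GW
sudo ip -s -s neigh flush to 10.192.20.1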

TODO - Add section on moving Vlan 2120 (cloud-instance-transport1-b-codfw) to the cloudsw using similar process.

Begin physical host moves, and CloudLB POC

With the gateway for hosts on cloud Vlans now moved over to the new switch, we can begin migrating physical host connections to it. We can also add the cloud-private Vlan as discussed here, which is needed to begin work on the CloudLB POC (see T324992).

Public Vlan

Vlan 2002 (public1-b-codfw) will not be trunked to the cloudsw as part of this move, so hosts connected to it (for instance cloudservice) should be left connected to asw-b1-codfw for now.

Ultimately the plan would be to validate the design for the CloudLB, and then migrate these hosts to that new model, moving them to the new switch in the process, but leaving them connected to the old switch for as long as they have to be on public1-b-codfw.

Event Timeline

Change 896200 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Restrict prefix length for public announce, allow bgp for cloud range

https://gerrit.wikimedia.org/r/896200

Change 896200 merged by jenkins-bot:

[operations/homer/public@master] Restrict prefix length for public announce, allow bgp for cloud range

https://gerrit.wikimedia.org/r/896200

I've added the BGP config and moved the GW interfaces from the two CRs in codfw to the cloudsw.

Things look ok on both counts; hosts can reach resources outside their own subnets / cloud network(s):

cmooney@cloudgw2003-dev:~$ sudo traceroute -I -w 1 elastic1057.eqiad.wmnet 
traceroute to elastic1057.eqiad.wmnet (10.64.32.93), 30 hops max, 60 byte packets
 1  irb-2118.cloudsw1-b1-codfw.codfw.wmnet (10.192.20.1)  1.030 ms  1.006 ms  1.000 ms
 2  xe-1-0-4-0-1000.cr1-codfw.wikimedia.org (10.192.254.0)  0.727 ms  0.765 ms  0.762 ms
 3  xe-4-2-0.cr1-eqiad.wikimedia.org (208.80.153.220)  31.097 ms  31.093 ms  31.089 ms
 4  elastic1057.eqiad.wmnet (10.64.32.93)  31.639 ms * *
cmooney@cloudgw2003-dev:~$ sudo traceroute6 -I -w 1 elastic1057.eqiad.wmnet 
traceroute to elastic1057.eqiad.wmnet (2620:0:861:103:10:64:32:93), 30 hops max, 80 byte packets
 1  irb-2118.cloudsw1-b1-codfw.codfw.wmnet (2620:0:860:118::1)  0.632 ms  0.596 ms  0.589 ms
 2  xe-1-0-4-0-1000.cr1-codfw.wikimedia.org (2620:0:860:130::1)  0.496 ms  0.531 ms  0.527 ms
 3  xe-4-2-0.cr1-eqiad.wikimedia.org (2620:0:860:fe01::1)  30.224 ms  30.220 ms  30.279 ms
 4  elastic1057.eqiad.wmnet (2620:0:861:103:10:64:32:93)  31.676 ms * *
cmooney@cloudgw2003-dev:~$ sudo ip vrf exec vrf-cloudgw traceroute -I -n -w 1 -s 185.15.57.9 1.1.1.1
traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
 1  208.80.153.185  4.638 ms  4.598 ms  4.591 ms
 2  208.80.153.176  0.498 ms  0.494 ms  0.527 ms
 3  208.80.153.219  0.369 ms  0.366 ms  0.399 ms
 4  * * *
 5  172.71.168.2  2.376 ms  2.373 ms  2.367 ms
 6  1.1.1.1  1.603 ms  1.622 ms  1.614 ms
root@uk:~# mtr -z -b -w -c 10 185.15.57.9
Start: 2023-03-10T00:58:13+0100
HOST: uk.rankinrez.net                                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS46261  91.132.85.1                                                        0.0%    10    1.1   1.2   1.0   1.4   0.1
  2. AS20860  185.91.76.101                                                      0.0%    10    0.6   0.7   0.5   1.4   0.3
  3. AS20860  be6.3222.asr01.dc13.as20860.net (130.180.203.223)                  0.0%    10    6.4   6.4   6.1   7.9   0.5
  4. AS20860  be16.asr01.ld5.as20860.net (130.180.202.0)                         0.0%    10    6.5   6.7   6.5   6.9   0.1
  5. AS6461   ae7.mpr1.lhr23.uk.zip.zayo.com (94.31.48.81)                       0.0%    10    7.0   6.8   5.9  10.9   1.5
  6. AS6461   ae11.cs1.lhr15.uk.zip.zayo.com (64.125.28.32)                     70.0%    10    7.0   7.2   7.0   7.3   0.1
  7. AS6461   ae3.mpr1.lhr15.uk.zip.zayo.com (64.125.28.151)                     0.0%    10    6.8   7.4   6.8   9.1   0.9
  8. AS1299   ldn-b1-link.ip.twelve99.net (62.115.59.45)                         0.0%    10    7.3   7.4   7.3   7.5   0.1
  9. AS1299   ldn-bb4-link.ip.twelve99.net (62.115.143.26)                       0.0%    10    9.0   9.1   7.2  14.0   2.2
 10. AS1299   nyk-bb1-link.ip.twelve99.net (62.115.112.244)                      0.0%    10   75.2  75.0  74.8  75.2   0.1
 11. AS1299   rest-bb1-link.ip.twelve99.net (62.115.141.244)                     0.0%    10   81.3  81.5  81.3  81.7   0.1
 12. AS1299   atl-bb1-link.ip.twelve99.net (62.115.138.71)                      70.0%    10   95.6  95.6  95.5  95.7   0.1
 13. AS???    ???                                                               100.0    10    0.0   0.0   0.0   0.0   0.0
 14. AS1299   dls-bb1-link.ip.twelve99.net (62.115.137.45)                      80.0%    10  114.2 114.3 114.2 114.4   0.2
 15. AS1299   dls-b24-link.ip.twelve99.net (62.115.136.83)                       0.0%    10  114.5 114.5 114.2 115.1   0.3
 16. AS1299   dls-b4-link.ip.twelve99.net (62.115.113.117)                       0.0%    10  114.9 115.1 114.8 115.5   0.2
 17. AS1299   wikimedia-ic308846-dls-b22.ip.twelve99-cust.net (80.239.192.102)   0.0%    10  114.7 118.8 114.6 155.3  12.8
 18. AS14907  xe-0-0-46-1001.cloudsw1-b1-codfw.wikimedia.org (208.80.153.177)    0.0%    10  120.7 119.9 116.2 129.7   3.8
 19. AS14907  virt.cloudgw.codfw1dev.wikimediacloud.org (185.15.57.9)            0.0%    10  114.7 114.9 114.6 117.9   1.0

Change 896310 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Allow cloudsw in codfw to announce 208.80.153.184/29

https://gerrit.wikimedia.org/r/896310

Change 896310 merged by jenkins-bot:

[operations/homer/public@master] Allow cloudsw in codfw to announce 208.80.153.184/29

https://gerrit.wikimedia.org/r/896310

Change 896329 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for new private IP ranges in use in codfw

https://gerrit.wikimedia.org/r/896329

Change 896331 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add Homer device entry for cloudsw1-b-codfw

https://gerrit.wikimedia.org/r/896331

Change 896333 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Puppet additions for new network device cloudsw1-b1-codfw

https://gerrit.wikimedia.org/r/896333

Change 896331 merged by jenkins-bot:

[operations/homer/public@master] Add Homer device entry for cloudsw1-b-codfw

https://gerrit.wikimedia.org/r/896331

Mentioned in SAL (#wikimedia-operations) [2023-03-10T12:49:04Z] <cmooney@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync data for new cloudsw1-b1-codfw device. - cmooney@cumin1001 - T327919"

Mentioned in SAL (#wikimedia-operations) [2023-03-10T12:50:57Z] <cmooney@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync data for new cloudsw1-b1-codfw device. - cmooney@cumin1001 - T327919"

Change 896333 merged by Cathal Mooney:

[operations/puppet@production] Puppet additions for new network device cloudsw1-b1-codfw

https://gerrit.wikimedia.org/r/896333

Change 896350 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Update switch_loopback prefix lists to include assigned codfw ranges

https://gerrit.wikimedia.org/r/896350

Change 896350 merged by jenkins-bot:

[operations/homer/public@master] Update switch_loopback prefix lists to include assigned codfw ranges

https://gerrit.wikimedia.org/r/896350

Change 896329 merged by Cathal Mooney:

[operations/dns@master] Add includes for new private IP ranges in use in codfw

https://gerrit.wikimedia.org/r/896329

We've tested PXE boot / DHCP / OS install for 2 new servers (cloudlb2002-dev and cloudlb2003-dev) on this switch now and it's working ok.

So all now looks ok in terms of this switch; it has been added to monitoring and all is green, apart from a system alarm warning which relates to the missing BGP license, which we've been chasing our supplier about for some time (we already paid the insane price for it).

Next step is to plan the migration of the existing hosts with WMCS and DC-Ops.

I noticed that it's running 19.1R3-S2.3; we should upgrade it to the latest recommended Junos before we move more servers to it.

I noticed that it's running 19.1R3-S2.3; we should upgrade it to the latest recommended Junos before we move more servers to it.

Yes @ayounsi, absolutely we should; I should have mentioned that above.
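
For reference, the upgrade itself would be the standard Junos procedure, roughly as follows (the package name is a placeholder for whatever the current recommended image is):

request system software add /var/tmp/<junos-package>.tgz
request system reboot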

Change 898678 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Modify policy to use in aggregate for 185.15.57.0/24 in codfw

https://gerrit.wikimedia.org/r/898678

Change 898684 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Move cloudsw prefix-list filters from templates to YAML

https://gerrit.wikimedia.org/r/898684

Change 898678 merged by jenkins-bot:

[operations/homer/public@master] Modify policy to use in aggregate for 185.15.57.0/24 in codfw

https://gerrit.wikimedia.org/r/898678

Icinga downtime and Alertmanager silence (ID=484494a0-6cb6-4421-b546-9b17aa96a3a6) set by cmooney@cumin1001 for 0:30:00 on 3 host(s) and their services with reason: cloudsw1-b1-codfw OS upgrade

cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2023-03-14T19:47:17Z] <topranks> Reboot cloudsw1-b1-codfw to upgrade JunOS version T327919

@Papaul, I suggest we move the ports from the existing switch to the new switch in 3 batches. We need at least 2 of the cloudcephosd hosts working at all times, so we will do 1 of them in each batch to avoid downtime.

Batch 1:

Host | Existing port | New port
cloudcephosd2001-dev | asw-b1-codfw ge-1/0/4 | cloudsw1-b1-codfw ge-0/0/4
cloudcephosd2001-dev | asw-b1-codfw ge-1/0/5 | cloudsw1-b1-codfw ge-0/0/5
cloudvirt2001-dev | asw-b1-codfw ge-1/0/6 | cloudsw1-b1-codfw ge-0/0/6
cloudcephmon2004-dev | asw-b1-codfw ge-1/0/9 | cloudsw1-b1-codfw ge-0/0/9
cloudvirt2003-dev | asw-b1-codfw ge-1/0/10 | cloudsw1-b1-codfw ge-0/0/10
cloudlb2001-dev | asw-b1-codfw ge-1/0/11 | cloudsw1-b1-codfw ge-0/0/11
cloudcontrol2005-dev | asw-b1-codfw ge-1/0/14 | cloudsw1-b1-codfw ge-0/0/14
clouddb2002-dev | asw-b1-codfw ge-1/0/15 | cloudsw1-b1-codfw ge-0/0/15

Batch 2:

Host | Existing port | New port
cloudcephosd2003-dev | asw-b1-codfw ge-1/0/7 | cloudsw1-b1-codfw ge-0/0/7
cloudcephosd2003-dev | asw-b1-codfw ge-1/0/8 | cloudsw1-b1-codfw ge-0/0/8
cloudgw2003-dev | asw-b1-codfw ge-1/0/16 | cloudsw1-b1-codfw ge-0/0/16
cloudvirt2002-dev | asw-b1-codfw ge-1/0/2 | cloudsw1-b1-codfw ge-0/0/17 *
cloudcontrol2001-dev | asw-b1-codfw ge-1/0/18 | cloudsw1-b1-codfw ge-0/0/18
cloudgw2002-dev | asw-b1-codfw ge-1/0/19 | cloudsw1-b1-codfw ge-0/0/19
cloudcephmon2005-dev | asw-b1-codfw ge-1/0/21 | cloudsw1-b1-codfw ge-0/0/21

Batch 3:

Host | Existing port | New port
cloudcephosd2002-dev | asw-b1-codfw ge-1/0/12 | cloudsw1-b1-codfw ge-0/0/12
cloudcephosd2002-dev | asw-b1-codfw ge-1/0/0 | cloudsw1-b1-codfw ge-0/0/13 *
cloudcephmon2006-dev | asw-b1-codfw ge-1/0/22 | cloudsw1-b1-codfw ge-0/0/22
cloudnet2005-dev | asw-b1-codfw ge-1/0/23 | cloudsw1-b1-codfw ge-0/0/23
cloudnet2006-dev | asw-b1-codfw ge-1/0/25 | cloudsw1-b1-codfw ge-0/0/25
cloudservices2004-dev | asw-b1-codfw ge-1/0/28 | cloudsw1-b1-codfw ge-0/0/36 *
cloudservices2005-dev | asw-b1-codfw ge-1/0/29 | cloudsw1-b1-codfw ge-0/0/37 *
cloudweb2002-dev | asw-b1-codfw ge-1/0/30 | cloudsw1-b1-codfw ge-0/0/38 *

I've marked with an asterisk the rows where we are using a different port number on the new switch. That can't be avoided for some of them, as the equivalent port on the new switch is in use, or the block it belongs to is running at 10G.

In terms of doing the work, we can do it all on the same day, one batch after another. The basic approach will be: you start moving the cables for a batch, I change the Netbox config for the switch ports and run Homer to update. Then, when we're both done, we check that the connections to the new switch look ok, and if so move on to the next batch.
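
For each batch, the switch-side update after the Netbox edits is just a Homer run against both devices, along these lines (the device queries and commit message here are illustrative):

homer 'asw-b-codfw*' commit "Move codfw B1 cloud hosts to cloudsw1-b1-codfw (T327919)"
homer 'cloudsw1-b1-codfw*' commit "Move codfw B1 cloud hosts to cloudsw1-b1-codfw (T327919)"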

Change 898684 merged by jenkins-bot:

[operations/homer/public@master] Move cloudsw prefix-list filters from templates to YAML

https://gerrit.wikimedia.org/r/898684

In terms of the move, we need to work with @aborrero and the team to decide when it's best to do the work. We can do it in a number of batches or all in one go, whatever you guys think is best. I can move the interfaces in Netbox and configure the new switch in advance in either case.

All hosts can be done anytime. For cloudceph* nodes in particular, I'll double check with @dcaro in case he wants to stay on top of it.

Change 900264 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add protocol direct to Cloud_outfilter protocols

https://gerrit.wikimedia.org/r/900264

Change 900264 merged by Cathal Mooney:

[operations/homer/public@master] Add protocol direct to Cloud_outfilter protocols

https://gerrit.wikimedia.org/r/900264

@cmooney thank you for getting the table ready for the cloud nodes move. As you can see on asw-b1 and in Netbox, I tried to keep each server's racked location matching its connection on the switch in codfw. For example, if a server is racked in U1 it will use [ge|xe]-0/0/0 on the switch.

I know the first 4 ports on cloudsw1-b1 are set up as 10G because we have sretest2001 connected to the first 2 ports. We can unrack that server, since it was a test server, set those interfaces to 1G, and have cloudcephosd2002-dev and cloudvirt2002-dev keep their same interfaces as on the old switch.

For cloudservices2004-dev, cloudservices2005-dev and cloudweb2002-dev I can relocate them to U35, U36 and U37.

This will also be valid for all the other racks in the row A/row B refresh.

Also, since we moved the switch to U48, I have already requested some cables to connect the switch to the servers racked from U1 to U10, since I need long cables.

Thank you.

I know the first 4 ports on cloudsw1-b1 are set up as 10G because we have sretest2001 connected to the first 2 ports. We can unrack that server, since it was a test server, set those interfaces to 1G, and have cloudcephosd2002-dev and cloudvirt2002-dev keep their same interfaces as on the old switch.

Yep that works fine. I'll delete the connection from sretest to the switch and remove the interfaces in Netbox.

For cloudservices2004-dev, cloudservices2005-dev and cloudweb2002-dev I can relocate them to U35, U36 and U37.

That works apart from cloudservices2004-dev. Port 35 is part of block 32, and is at 10G because xe-0/0/32 connects cloudlb2003-dev.

This will also be valid for all the other racks in the row A/row B refresh.

While I think this is a nice approach, I suspect it will be a lot more trouble to maintain longer term. Managing which blocks of ports are at 1/10/25G is difficult enough, but if that also dictates where the servers need to go in the rack it could get complicated. I can imagine us needing to move a server in the rack just to change it from 1G to 10G, for instance, which seems like a lot of trouble to me. But it's your call at the end of the day.

Also, since we moved the switch to U48, I have already requested some cables to connect the switch to the servers racked from U1 to U10, since I need long cables.

Ok cool, so we need to wait until we get those cables? Are they ordered, do you know?

Cheers.

cloudservices2004-dev = U37
cloudservices2005-dev = U38
cloudweb2002-dev = U39

Yes, we already ordered the cables, arriving sometime in April, but we don't have to wait; there are some servers that we can move. I will get you the list of those sometime next week if possible.

To keep the blocks of 1G and 10G ports on the switch, what I can suggest is to configure blocks of 1G from left to right, and blocks of 10G from right to left. For the racking part, leave it to me. Also, since we onsite will be racking those servers and doing the switch configuration, I can take care of where each server will be and on which switch port.

Ok cool. So I'd propose we take it like this:

1. Move sretest2001 from port xe-0/0/1 to xe-0/0/45

This can be done anytime; just let me know when it's done and I'll change Netbox.

2. First batch of cloud servers:

We can do these now without waiting on the longer cables to arrive:

Host | Existing port | New port
cloudlb2001-dev | asw-b1-codfw ge-1/0/11 | cloudsw1-b1-codfw ge-0/0/11
clouddb2002-dev | asw-b1-codfw ge-1/0/15 | cloudsw1-b1-codfw ge-0/0/15
cloudgw2003-dev | asw-b1-codfw ge-1/0/16 | cloudsw1-b1-codfw ge-0/0/16
cloudgw2002-dev | asw-b1-codfw ge-1/0/19 | cloudsw1-b1-codfw ge-0/0/19
cloudcephmon2005-dev | asw-b1-codfw ge-1/0/21 | cloudsw1-b1-codfw ge-0/0/21
cloudcephmon2006-dev | asw-b1-codfw ge-1/0/22 | cloudsw1-b1-codfw ge-0/0/22
cloudnet2005-dev | asw-b1-codfw ge-1/0/23 | cloudsw1-b1-codfw ge-0/0/23
cloudnet2006-dev | asw-b1-codfw ge-1/0/25 | cloudsw1-b1-codfw ge-0/0/25

3. Second batch of cloud servers:

These need to wait until the cables arrive; we will still do them in two batches after that, as we need to move the cloudceph hosts one by one.

Host | Existing port | New port
cloudcephosd2002-dev | asw-b1-codfw ge-1/0/0 | cloudsw1-b1-codfw ge-0/0/0
cloudvirt2002-dev | asw-b1-codfw ge-1/0/2 | cloudsw1-b1-codfw ge-0/0/2
cloudvirt2001-dev | asw-b1-codfw ge-1/0/6 | cloudsw1-b1-codfw ge-0/0/6
cloudcephosd2002-dev | asw-b1-codfw ge-1/0/12 | cloudsw1-b1-codfw ge-0/0/12

4. Third batch of cloud servers

Host | Existing port | New port
cloudcephosd2001-dev | asw-b1-codfw ge-1/0/4 | cloudsw1-b1-codfw ge-0/0/4
cloudcephosd2001-dev | asw-b1-codfw ge-1/0/5 | cloudsw1-b1-codfw ge-0/0/5

5. Fourth and last batch:

Host | Existing port | New port
cloudcephosd2003-dev | asw-b1-codfw ge-1/0/7 | cloudsw1-b1-codfw ge-0/0/7
cloudcephosd2003-dev | asw-b1-codfw ge-1/0/8 | cloudsw1-b1-codfw ge-0/0/8
cloudcephmon2004-dev | asw-b1-codfw ge-1/0/9 | cloudsw1-b1-codfw ge-0/0/9
cloudvirt2003-dev | asw-b1-codfw ge-1/0/10 | cloudsw1-b1-codfw ge-0/0/10

To keep the blocks of 1G and 10G ports on the switch, what I can suggest is to configure blocks of 1G from left to right, and blocks of 10G from right to left. For the racking part, leave it to me. Also, since we onsite will be racking those servers and doing the switch configuration, I can take care of where each server will be and on which switch port.

The left-to-right order isn't a bad idea. Ultimately it's your call, so I'm happy if you want to stick to the pattern.

Lastly, I'd included these in my previous list, but in fact they are connected to the codfw public-b vlan right now. We have a few options in terms of that, but right now I don't want to move them / bridge the public vlan onto the new switch, so let's not move them for now.

cloudcontrol2001-dev
cloudcontrol2005-dev
cloudservices2004-dev
cloudservices2005-dev
cloudweb2002-dev

Change 900448 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add cloudsw1-b1-codfw to Rancid

https://gerrit.wikimedia.org/r/900448

Change 900448 merged by Cathal Mooney:

[operations/puppet@production] Add cloudsw1-b1-codfw to Rancid

https://gerrit.wikimedia.org/r/900448

@cmooney We can move any servers racked from U11 up

@cmooney We can move any servers racked from U11 up

I'm not sure it's worth going to the trouble for the sake of a few weeks.

If we can do the first batch next week we should be able to continue with the tests on cloudlb, which is the main thing hanging on this; we can move the rest when the cables arrive.

@cmooney Please see the first batch proposal. We can move all those servers next week. @aborrero can you please let us know the best day and time next week for us to move those servers? Thank you.

Host | U space | Existing port | New port
cloudlb2001-dev | 15 | asw-b1-codfw ge-1/0/11 | cloudsw1-b1-codfw ge-0/0/14
clouddb2002-dev | 17 | asw-b1-codfw ge-1/0/15 | cloudsw1-b1-codfw ge-0/0/16
cloudgw2003-dev | 18 | asw-b1-codfw ge-1/0/16 | cloudsw1-b1-codfw ge-0/0/17
cloudgw2002-dev | 7 | asw-b1-codfw ge-1/0/19 | cloudsw1-b1-codfw ge-0/0/6
cloudcephmon2005-dev | 8 | asw-b1-codfw ge-1/0/21 | cloudsw1-b1-codfw ge-0/0/7
cloudcephmon2006-dev | 9 | asw-b1-codfw ge-1/0/22 | cloudsw1-b1-codfw ge-0/0/8
cloudnet2005-dev | 10 | asw-b1-codfw ge-1/0/23 | cloudsw1-b1-codfw ge-0/0/9
cloudnet2006-dev | 11 | asw-b1-codfw ge-1/0/25 | cloudsw1-b1-codfw ge-0/0/10

@cmooney Please see the first batch proposal. We can move all those servers next week. @aborrero can you please let us know the best day and time next week for us to move those servers? Thank you.

Can be done anytime, thanks!

Thanks Papaul for the update, looks good.

I'd say we either do Monday 27th or Wednesday 29th. Tuesday we'd best avoid so we can focus on the eqiad row B switch upgrade (T330165).

We moved the first batch of servers today; all went well.

@cmooney second batch proposal below

Host | U space | Existing port | New port
cloudcephosd2002-dev | 1 | asw-b1-codfw ge-1/0/0 | cloudsw1-b1-codfw ge-0/0/0
cloudcephosd2002-dev | 1 | asw-b1-codfw ge-1/0/12 | cloudsw1-b1-codfw ge-0/0/22
cloudvirt2002-dev | 2 | asw-b1-codfw ge-1/0/2 | cloudsw1-b1-codfw ge-0/0/1
cloudvirt2001-dev | 4 | asw-b1-codfw ge-1/0/6 | cloudsw1-b1-codfw ge-0/0/3

@Papaul looks good to me. I can do them any day this week except today (Tuesday), so whenever you are happy.

@aborrero are we ok to proceed with this second batch of 4 any time also?

@aborrero are we ok to proceed with this second batch of 4 any time also?

Yes! thanks

@cmooney can we do this on Thursday? Can we also do the other batches (3-4) on the same day?

@cmooney can we do this on Thursday? Can we also do the other batches (3-4) on the same day?

Yeah, that works from my point of view. We'll double-check with the cloud team that things look ok between each batch.

Just noticed what has probably been on the radar for @cmooney for some time now: cloudcontrol2004-dev (currently D1) needs to be relocated (reracked) into B1.

@aborrero cloudcontrol2004-dev is in a public VLAN; that is why we didn't relocate it to B1. But if something has changed on the server, I will be happy to relocate it.

@Papaul we're gonna reimage this one onto the new vlans (this will happen to all the public-vlan ones in time; this is the first test).

But that's a good point, @aborrero I think we'll need to follow this process:

https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs

You were going to reimage anyway, right? So we want to complete those first few steps (essentially a decommission), then we can safely move racks and follow the remaining steps to reimage onto the new vlan(s).
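
Once reracked, the reimage onto the new vlan(s) would presumably be the usual cookbook run, e.g. (the OS and flags here are illustrative; follow the Server Lifecycle doc for the exact steps):

sudo cookbook sre.hosts.reimage --os bullseye -t T327919 cloudcontrol2004-dev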

Sounds good to me. This is what we need to do with cloudcontrol2004-dev:

Third batch

Host | U space | Existing port | New port
cloudcephosd2001-dev | 3 | asw-b1-codfw ge-1/0/4 | cloudsw1-b1-codfw ge-0/0/2
cloudcephosd2001-dev | 3 | asw-b1-codfw ge-1/0/5 | cloudsw1-b1-codfw ge-0/0/23

Fourth and last batch

Host | U space | Existing port | New port
cloudcephosd2003-dev | 5 | asw-b1-codfw ge-1/0/7 | cloudsw1-b1-codfw ge-0/0/4
cloudcephosd2003-dev | 5 | asw-b1-codfw ge-1/0/8 | cloudsw1-b1-codfw ge-0/0/24
cloudcephmon2004-dev | 20 | asw-b1-codfw ge-1/0/9 | cloudsw1-b1-codfw ge-0/0/19
cloudvirt2003-dev | 6 | asw-b1-codfw ge-1/0/10 | cloudsw1-b1-codfw ge-0/0/5

All remaining (non public-vlan) hosts have been moved and look good to me (reachable, MAC addresses the same etc., ceph cluster health 'ok').

Thanks @Papaul for doing the work in the DC!

Icinga downtime and Alertmanager silence (ID=744d6bf2-4472-4a4c-b0a2-ebf0e4e9d466) set by cmooney@cumin1001 for 0:30:00 on 3 host(s) and their services with reason: cloudsw1-b1-codfw OS upgrade

cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt

@Papaul when you are back, can you advise on the status of these? They all appear as connected on asw-b1-codfw in Netbox; I'm not sure if they are still cabled up or if we should delete those connections in Netbox?

Port | Connected server | Server port
ge-1/0/3 | cloudvirt2002-dev | eno2
ge-1/0/13 | cloudvirt2003-dev | eno2
ge-1/0/17 | cloudgw2003-dev | SECONDARY
ge-1/0/20 | cloudgw2002-dev | eno2
ge-1/0/24 | cloudnet2005-dev | eno2
ge-1/0/26 | cloudnet2006-dev | eno2

@cmooney all those connections are no longer on the old switch; we can delete those. Thanks.

@Papaul thanks, I'll remove them from Netbox. Cheers.

The last host connected to asw-b1-codfw (the prod switch) is cloudweb2002-dev (https://netbox.wikimedia.org/dcim/devices/1399/interfaces/) I think for its prod public IP.
@aborrero Is there a task/plan about removing that link? That would allow us to decom that switch and clear some snowflakes/alerts.

The last host connected to asw-b1-codfw (the prod switch) is cloudweb2002-dev (https://netbox.wikimedia.org/dcim/devices/1399/interfaces/) I think for its prod public IP.
@aborrero Is there a task/plan about removing that link? That would allow us to decom that switch and clear some snowflakes/alerts.

Arturo and I discussed this in some detail last week. This host is not a regular "cloud" host, and I don't think we can easily fit it into the new setup for cloud hosts, with the cloud-private network and public IPs announced via BGP.

It currently hosts wikitech (although I think this is just a backup), so it is quite an important server. For now it will need to remain on the public vlan, so I think we should arrange to move it to another rack in row B so that we can decom asw-b1-codfw. I'll have a chat with @Papaul and see if we can arrange that.

@cmooney I am ok moving the server when it is ready. We can move it to B5/U20 ge-5/0/15. Also, I see that asw-b1 doesn't match Netbox: asw-b1 is showing just 1 interface enabled, but that is not what Netbox is showing. Who is responsible for doing the Netbox cleanup after moving the servers from asw-b1 to cloudsw1-b1?

ge-1/0/30       up    up   cloudweb2002-dev

@cmooney I am ok moving the server when it is ready. We can move it to B5/U20 ge-5/0/15. Also, I see that asw-b1 doesn't match Netbox: asw-b1 is showing just 1 interface enabled, but that is not what Netbox is showing. Who is responsible for doing the Netbox cleanup after moving the servers from asw-b1 to cloudsw1-b1?

ge-1/0/30       up    up   cloudweb2002-dev

Thanks @Papaul. I will do the cleanup / change the switch configs when you are ready to move it.

I think Netbox does match the setup right now; the server is actually connected to both switches in that rack:

cmooney@cloudsw1-b1-codfw> show interfaces descriptions | match cloudweb 
ge-0/0/38       up    up   cloudweb2002-dev
cmooney@asw-b-codfw> show interfaces descriptions | match cloudweb 
ge-1/0/30       up    up   cloudweb2002-dev

Our error here was thinking this was a regular cloud host, so we connected it to the cloudsw as well as the asw. But it turns out this is a special case: it can't be moved to the cloudsw, so let's move it to B5/U20 ge-5/0/15 and go back to one connection to the public vlan. I'll speak to Arturo and the team to find out when we can move it / whether we need to depool anything first. Thanks.

I think we are ready for this cloudweb2002-dev move today, assuming no IP change, just a poweroff-poweron operation while re-racking. CC @Papaul @Andrew

Mentioned in SAL (#wikimedia-cloud) [2023-07-17T15:10:57Z] <arturo> [codfw1dev] powered off cloudweb2002-dev for reracking (T327919)

Mentioned in SAL (#wikimedia-cloud) [2023-07-17T15:35:16Z] <arturo> [codfw1dev] cloudweb2002-dev up and running after reracking (T327919)