Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it
Open, In Progress, Medium, Public

Description

We now have a new QFX5120-48Y switch in codfw rack B1, which arrived with the replacement switches for the row A/B hardware refresh recently (see T312138). It is installed in the rack and reachable via the OpenGear console server.

The intention is to configure this switch, unlike asw-b1-codfw which it was purchased to replace, as a stand-alone "cloudsw", mirroring the dedicated switches for WMCS in eqiad.

Creating this task to track the steps to configure this device, and begin migrating cloud hosts from the existing one over to it.

At a high level I would suggest we proceed as follows, open to discussion of course:

Add Physical Connections

  1. Connect the cloudsw to cr1-codfw or cr2-codfw for the routed uplink from the cloudsw to the core routers.
    1. I expect 10G is sufficient bandwidth for this link
    2. 1 connection is probably sufficient. This does mean a lack of redundancy, but codfw is the WMCS test/staging site.
  2. Connect the cloudsw to asw-b1-codfw, as a vlan trunk port
    1. We only trunk Vlan 2118 - cloud-hosts1-codfw - over this link
    2. 10G is probably sufficient bandwidth. It could be a 2x10G LAG if we think that is needed.

Enable the routed CR -> Cloudsw logical links and BGP

Once the physical connections are in place we proceed like this to make the cloudsw <-> cr link live:

  1. Configure the cloudsw similarly to those in eqiad, using the same templates and VRF setup
  2. Configure the routed uplinks on the CR and cloudsw, and apply the labs-in and cloud-in filters to sub-ints
  3. Configure the cloudsw with a currently unused IP from the cloud-hosts1-codfw subnet
  4. Validate that the CR receives the BGP announcement of the cloud-hosts1-codfw subnet from the cloudsw
    1. It will still prefer its direct connection to it on ae2.2118
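
Why the CR keeps using ae2.2118 at this stage can be sketched as a simple preference comparison. This is only an illustration, assuming the Junos default route preferences (0 for directly connected routes, 170 for BGP). The `Route` type, the /24 prefix length and the next-hop labels are hypothetical and not taken from the device config.

```python
from dataclasses import dataclass

# Junos default route preferences (lower wins). These are defaults only;
# the real routers may override them in policy.
DEFAULT_PREFERENCE = {"direct": 0, "static": 5, "bgp": 170}


@dataclass(frozen=True)
class Route:
    prefix: str
    protocol: str
    next_hop: str

    @property
    def preference(self) -> int:
        return DEFAULT_PREFERENCE[self.protocol]


def best_route(candidates: list[Route]) -> Route:
    """Pick the active route for one prefix: lowest preference wins."""
    return min(candidates, key=lambda r: r.preference)


# Hypothetical view from a CR while ae2.2118 is still up.
# The /24 prefix length is an assumption for illustration.
routes = [
    Route("10.192.20.0/24", "direct", "ae2.2118"),
    Route("10.192.20.0/24", "bgp", "cloudsw1-b1-codfw"),
]
print(best_route(routes).next_hop)
```

Once the direct route is removed (the ae2.2118 shutdown in the next section), only the BGP route remains and traffic follows the cloudsw link.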

Move cloud vlan gateway IPs from CRs to cloudsw

For cloud-hosts1-codfw subnet:

  1. Change GW IP on cloudsw irb.2118 interface to 10.192.20.1
  2. Shut down ae2.2118 on cr1-codfw and cr2-codfw
    1. This will halt traffic as hosts have cached the VRRP MAC in their ARP table
    2. We need to manually clear the ARP cache on servers connected to cloud-hosts1-codfw for the GW IP
  3. Validate things are working as before, all services etc., and traffic flowing via the cloudsw<-->cr link
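
The manual ARP clearing in step 2 could be prepared ahead of time along these lines. This is a rough sketch that only builds command strings and does not execute anything. It assumes the hosts run iproute2 (`ip neigh flush to <address>`) and are reachable over ssh by short name; the host list is a hypothetical subset for illustration.

```python
import shlex

GATEWAY_IP = "10.192.20.1"  # GW IP from step 1 above


def flush_command(gateway_ip: str) -> list[str]:
    """Command that drops the cached neighbour entry for the gateway."""
    return ["sudo", "ip", "neigh", "flush", "to", gateway_ip]


def ssh_command(host: str, gateway_ip: str = GATEWAY_IP) -> str:
    """Render the flush as a one-liner to run on a remote host over ssh."""
    remote = shlex.join(flush_command(gateway_ip))
    return shlex.join(["ssh", host, remote])


# Hypothetical subset of hosts on cloud-hosts1-codfw, for illustration only.
hosts = ["cloudgw2003-dev", "cloudnet2005-dev"]
for host in hosts:
    print(ssh_command(host))
```

Building the commands as argument lists and quoting them with `shlex.join` avoids quoting mistakes if the list is later pasted into a shell or fed to an orchestration tool.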

TODO - Add section on moving Vlan 2120 (cloud-instance-transport1-b-codfw) to the cloudsw using similar process.

Begin physical host moves, and CloudLB POC

With the gateway for hosts on cloud Vlans now moved over to the new switch, we can then begin to migrate host physical connections to it. We can also add the cloud-private Vlan as discussed here, which is needed to begin work on the CloudLB POC (see T324992).

Public Vlan

Vlan 2002 (public1-b-codfw) will not be trunked to the cloudsw as part of this move, so hosts connected to it (for instance cloudservice) should be left connected to asw-b1-codfw for now.

Ultimately the plan would be to validate the design for the CloudLB, and then migrate these hosts to that new model, moving them to the new switch in the process. Until then they stay connected to the old switch for as long as they have to be on public1-b-codfw.

Event Timeline

@Papaul could you rename the switch to cloudsw1-b1-codfw (Netbox, label, console, etc.), for consistency with all the other switches?

@Papaul you can ignore this. I was working on the device in Netbox and made the change.

I've also added the interfaces for CR uplinks:

https://netbox.wikimedia.org/dcim/devices/4567/interfaces/

I'm assuming these links are gonna be SMF / 10GBase-LR but feel free to change that if you use something else.

@cmooney this looks good to me, just one question: is it possible to use xe-0/0/[46-47] for the links to cr*? The fiber from the switch leaves the rack at the top right side, so it would be great to have both fibers on the same side.
Thanks

I don't have any issue with that. Cabling is at your discretion.

> @cmooney this looks good to me just one question. Is it possible to use xe-0/0/[46-47] for the links to cr* the fiber coming out of the rack from the switch

Sure yep I have changed it in Netbox now.

I tried to enable the CR uplinks from the new cloudsw but there is a bit of a snag.

The CR doesn't show an optic present in slot 1/0/4:

cmooney@re0.cr1-codfw> show chassis pic pic-slot 0 fpc-slot 1   
FPC slot 1, PIC slot 0 information:
  Type                             MRATE-6xQSFPP-XGE-XLGE-CGE
  State                            Online    
  PIC version                      0.0
  Uptime                           143 days, 9 hours, 26 minutes, 54 seconds

PIC port information:
                         Fiber                    Xcvr vendor       Wave-                     Xcvr          JNPR     MSA
  Port Cable type        type  Xcvr vendor        part number       length                    Firmware      Rev      Version
  0    40GBASE SR4       MM    AVAGO              AFBR-79EQDZ-JU2   850 nm                    0.0           REV 01   SFF-8436 ver n/a
  1    4X10GBASE LR      SM    FS                 QSFP-PLR4-40G     1310 nm                   0.0           REV 01   SFF-8636 ver 2.7
  3    40GBASE SR4       MM    AVAGO              AFBR-79EQDZ-JU1   850 nm                    0.0           REV 01   SFF-8436 ver n/a

I added this configuration to channelize the QSFP:

cmooney@re0.cr1-codfw> show configuration chassis fpc 1 pic 0 port 4             
number-of-sub-ports 4;
speed 10g;

There was no change, but the system does now report that the PIC needs to be bounced:

cmooney@re0.cr1-codfw> show chassis alarms 
1 alarms currently active
Alarm time               Class  Description
2023-02-03 18:49:47 UTC  Minor  FPC 1 PIC 0 Need bounce

Speaking to Papaul I think we need to do this for the optic / interfaces to be usable. But given that is disruptive we will need to carefully plan it. Same goes for cr2.
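
For reference, the interface names to expect once the channelized port is active can be sketched as below. This assumes the usual Junos `type-fpc/pic/port:channel` naming for channelized ports; the helper function is hypothetical and only formats names.

```python
def channelized_interfaces(fpc: int, pic: int, port: int,
                           sub_ports: int = 4, prefix: str = "xe") -> list[str]:
    """Names of the sub-ports created when one port is split into channels."""
    return [f"{prefix}-{fpc}/{pic}/{port}:{ch}" for ch in range(sub_ports)]


# FPC 1, PIC 0, port 4 with number-of-sub-ports 4, as configured above.
print(channelized_interfaces(1, 0, 4))
```

With the config above this gives four 10G interfaces, xe-1/0/4:0 through xe-1/0/4:3, on the single QSFP.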

FYI, I tested temporarily connecting xe-0/0/47 to cr2 xe-5/0/0 and the link was okay:

papaul@re0.cr2-codfw# run show interfaces terse xe-5/0/0
Interface               Admin Link Proto    Local                 Remote
xe-5/0/0                up    up

This ticket had little activity in the last month. Did something happen offline that wasn't recorded in here?

@aborrero I've been getting the cloudsw configured in the background, which is nearly done. More recently I've been on leave so just getting back to it now. I'll set up a plan to get the CR links enabled and we'll be able to start thinking about the migration.

No more time off planned in March so will be able to prioritize it and getting the new vlans in place.

thanks for the update!

Please let me know if there is something I can do to help with this (no switch config but perhaps testing, double checking stuff, IP allocation, connectivity, etc)

Change 895848 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add reverse DNS origin entries for newly allocated IPv6 ranges

https://gerrit.wikimedia.org/r/895848

Change 895848 merged by Cathal Mooney:

[operations/dns@master] Add reverse DNS origin entries for newly allocated ranges.

https://gerrit.wikimedia.org/r/895848

cmooney added a subtask: Restricted Task.
cmooney closed subtask Restricted Task as Resolved. Thu, Mar 9, 11:14 AM

> Please let me know if there is something I can do to help with this (no switch config but perhaps testing, double checking stuff, IP allocation, connectivity, etc)

Thanks @aborrero

Config of the new switch is progressing well. I'm just waiting on two cable moves (see T331470#8676018), and then I will migrate the uplink/gateway for the cloud vlans from the CR routers to the new switch.

Once that's done we can try to reimage / install OS on the new cloudlbs. If that goes to plan we can migrate the existing hosts over from the old switch to the new one. If you can have a think about what's involved to do both of those that'd be great. No IPs etc. need to change so I think it should just be a matter of arranging the downtime and co-ordinating with DC-Ops. Thanks!

Ok thanks!

In the past we had problems with DHCP forwarding between the switches for PXE boot, so if you can double check that while configuring the links, that would be great.

Other than that, we're ready to reimage cloudlb anytime. Also, thanks to T329865 (Q3:rack/setup/install cloudlb200[23]-dev), we will soon have 3 cloudlbs in total to test with and continue T324992 (cloudlb: create PoC on codfw).

> In the past we had problems with DHCP forwarding between the switches for PXE boot, so if you can double check that while configuring the links, that would be great.

We don't really have the potential to hit that issue here, as the switch is the L3 gateway, and in any event there is only one involved. I'll definitely make sure everything is in place to support DHCP/reimage. Unfortunately I don't have a tried-and-true way to artificially check everything is working, so I was hoping we could try the cloudlb reimage and use that as a test?

> Other than that, we're ready to reimage cloudlb anytime. Also, thanks to T329865 (Q3:rack/setup/install cloudlb200[23]-dev), we will soon have 3 cloudlbs in total to test with and continue T324992 (cloudlb: create PoC on codfw).

Great!

Some updates on the physicals for the new cloudsw.

The links to core routers are now up and configured following T331601:

cmooney@re0.cr1-codfw> show interfaces descriptions | match cloudsw 
xe-1/0/4:0      up    up   Core: cloudsw1-b1-codfw:xe-0/0/46 {#122350}
xe-1/0/4:0.1000 up    up   cloudsw1-b1-codfw prod
xe-1/0/4:0.1001 up    up   cloudsw1-b1-codfw cloud-vrf

cmooney@re0.cr2-codfw> show interfaces descriptions | match cloudsw 
xe-1/0/4:0      up    up   Core: cloudsw1-b1-codfw:xe-0/0/47 {#122351_122352-1}
xe-1/0/4:0.1000 up    up   cloudsw1-b1-codfw prod
xe-1/0/4:0.1001 up    up   cloudsw1-b1-codfw cloud-vrf

As are the links to asw-b1-codfw thanks to T331470:

cmooney@cloudsw1-b1-codfw> show interfaces descriptions | match asw         
ge-0/0/40       up    up   Core: asw-b-codfw:ge-1/0/31
ge-0/0/41       up    up   Core: asw-b-codfw:ge-1/0/32
ae0             up    up   asw-b-eqiad ae0 - for migration

Next step is to bring up the BGP peering from cloudsw to core routers (in both VRFs), then migrate the gateways for the cloud-hosts1-codfw and cloud-instance-transport1-b-codfw vlans away from the CRs to the new switch.

Mentioned in SAL (#wikimedia-operations) [2023-03-09T16:51:46Z] <topranks> Add EBGP peering from cr1-codfw to cloudsw1-b1-codfw (cloud vrf) T327919

Mentioned in SAL (#wikimedia-operations) [2023-03-09T17:13:15Z] <topranks> Add EBGP peering from cr1-codfw to cloudsw1-b1-codfw (prod links) T327919

Change 896160 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad

https://gerrit.wikimedia.org/r/896160

Change 896160 merged by jenkins-bot:

[operations/homer/public@master] Homer changes as part of WMCS codfw migration to cloudsw1-b1-eqiad

https://gerrit.wikimedia.org/r/896160

Change 896172 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add uRPF checks for new cloudsw interfaces

https://gerrit.wikimedia.org/r/896172

Change 896172 merged by jenkins-bot:

[operations/homer/public@master] Add uRPF checks for new cloudsw interfaces

https://gerrit.wikimedia.org/r/896172

Mentioned in SAL (#wikimedia-operations) [2023-03-09T20:24:37Z] <topranks> move cloud-hosts1-b-codfw GW from core routers to cloudsw1-b1-codfw T327919

Change 896179 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove uRPF filter for interface ae1.2118 on codfw CRs

https://gerrit.wikimedia.org/r/896179

Change 896179 merged by jenkins-bot:

[operations/homer/public@master] Remove uRPF filter for interface ae1.2118 on codfw CRs

https://gerrit.wikimedia.org/r/896179

Change 896200 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Restrict prefix length for public announce, allow bgp for cloud range

https://gerrit.wikimedia.org/r/896200

Change 896200 merged by jenkins-bot:

[operations/homer/public@master] Restrict prefix length for public announce, allow bgp for cloud range

https://gerrit.wikimedia.org/r/896200

I've added the BGP config and moved the GW interfaces from the two CRs in codfw to the cloudsw.

Things look ok on both counts, hosts can reach resources outside their own subnets / cloud network(s):

cmooney@cloudgw2003-dev:~$ sudo traceroute -I -w 1 elastic1057.eqiad.wmnet 
traceroute to elastic1057.eqiad.wmnet (10.64.32.93), 30 hops max, 60 byte packets
 1  irb-2118.cloudsw1-b1-codfw.codfw.wmnet (10.192.20.1)  1.030 ms  1.006 ms  1.000 ms
 2  xe-1-0-4-0-1000.cr1-codfw.wikimedia.org (10.192.254.0)  0.727 ms  0.765 ms  0.762 ms
 3  xe-4-2-0.cr1-eqiad.wikimedia.org (208.80.153.220)  31.097 ms  31.093 ms  31.089 ms
 4  elastic1057.eqiad.wmnet (10.64.32.93)  31.639 ms * *
cmooney@cloudgw2003-dev:~$ sudo traceroute6 -I -w 1 elastic1057.eqiad.wmnet 
traceroute to elastic1057.eqiad.wmnet (2620:0:861:103:10:64:32:93), 30 hops max, 80 byte packets
 1  irb-2118.cloudsw1-b1-codfw.codfw.wmnet (2620:0:860:118::1)  0.632 ms  0.596 ms  0.589 ms
 2  xe-1-0-4-0-1000.cr1-codfw.wikimedia.org (2620:0:860:130::1)  0.496 ms  0.531 ms  0.527 ms
 3  xe-4-2-0.cr1-eqiad.wikimedia.org (2620:0:860:fe01::1)  30.224 ms  30.220 ms  30.279 ms
 4  elastic1057.eqiad.wmnet (2620:0:861:103:10:64:32:93)  31.676 ms * *
cmooney@cloudgw2003-dev:~$ sudo ip vrf exec vrf-cloudgw traceroute -I -n -w 1 -s 185.15.57.9 1.1.1.1
traceroute to 1.1.1.1 (1.1.1.1), 30 hops max, 60 byte packets
 1  208.80.153.185  4.638 ms  4.598 ms  4.591 ms
 2  208.80.153.176  0.498 ms  0.494 ms  0.527 ms
 3  208.80.153.219  0.369 ms  0.366 ms  0.399 ms
 4  * * *
 5  172.71.168.2  2.376 ms  2.373 ms  2.367 ms
 6  1.1.1.1  1.603 ms  1.622 ms  1.614 ms
root@uk:~# mtr -z -b -w -c 10 185.15.57.9
Start: 2023-03-10T00:58:13+0100
HOST: uk.rankinrez.net                                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS46261  91.132.85.1                                                        0.0%    10    1.1   1.2   1.0   1.4   0.1
  2. AS20860  185.91.76.101                                                      0.0%    10    0.6   0.7   0.5   1.4   0.3
  3. AS20860  be6.3222.asr01.dc13.as20860.net (130.180.203.223)                  0.0%    10    6.4   6.4   6.1   7.9   0.5
  4. AS20860  be16.asr01.ld5.as20860.net (130.180.202.0)                         0.0%    10    6.5   6.7   6.5   6.9   0.1
  5. AS6461   ae7.mpr1.lhr23.uk.zip.zayo.com (94.31.48.81)                       0.0%    10    7.0   6.8   5.9  10.9   1.5
  6. AS6461   ae11.cs1.lhr15.uk.zip.zayo.com (64.125.28.32)                     70.0%    10    7.0   7.2   7.0   7.3   0.1
  7. AS6461   ae3.mpr1.lhr15.uk.zip.zayo.com (64.125.28.151)                     0.0%    10    6.8   7.4   6.8   9.1   0.9
  8. AS1299   ldn-b1-link.ip.twelve99.net (62.115.59.45)                         0.0%    10    7.3   7.4   7.3   7.5   0.1
  9. AS1299   ldn-bb4-link.ip.twelve99.net (62.115.143.26)                       0.0%    10    9.0   9.1   7.2  14.0   2.2
 10. AS1299   nyk-bb1-link.ip.twelve99.net (62.115.112.244)                      0.0%    10   75.2  75.0  74.8  75.2   0.1
 11. AS1299   rest-bb1-link.ip.twelve99.net (62.115.141.244)                     0.0%    10   81.3  81.5  81.3  81.7   0.1
 12. AS1299   atl-bb1-link.ip.twelve99.net (62.115.138.71)                      70.0%    10   95.6  95.6  95.5  95.7   0.1
 13. AS???    ???                                                               100.0    10    0.0   0.0   0.0   0.0   0.0
 14. AS1299   dls-bb1-link.ip.twelve99.net (62.115.137.45)                      80.0%    10  114.2 114.3 114.2 114.4   0.2
 15. AS1299   dls-b24-link.ip.twelve99.net (62.115.136.83)                       0.0%    10  114.5 114.5 114.2 115.1   0.3
 16. AS1299   dls-b4-link.ip.twelve99.net (62.115.113.117)                       0.0%    10  114.9 115.1 114.8 115.5   0.2
 17. AS1299   wikimedia-ic308846-dls-b22.ip.twelve99-cust.net (80.239.192.102)   0.0%    10  114.7 118.8 114.6 155.3  12.8
 18. AS14907  xe-0-0-46-1001.cloudsw1-b1-codfw.wikimedia.org (208.80.153.177)    0.0%    10  120.7 119.9 116.2 129.7   3.8
 19. AS14907  virt.cloudgw.codfw1dev.wikimediacloud.org (185.15.57.9)            0.0%    10  114.7 114.9 114.6 117.9   1.0
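
The reverse DNS names in the traces above appear to follow a simple pattern: the interface name with `/`, `:` and `.` replaced by `-`. A small sketch of that mapping, inferred only from the output shown here (the tooling that actually generates these records is not described in this task, and the helper name is hypothetical):

```python
import re


def interface_to_dns_label(interface: str) -> str:
    """Turn a Junos interface name into the hostname label seen in traces."""
    return re.sub(r"[/:.]", "-", interface)


# Interface names from this task and the labels seen in the traceroutes.
for name in ["irb.2118", "xe-1/0/4:0.1000", "xe-0/0/46.1001"]:
    print(name, "->", interface_to_dns_label(name))
```

This makes it easy to read a hop such as xe-0-0-46-1001 back as unit 1001 (the cloud VRF sub-interface) on xe-0/0/46.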

Change 896310 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Allow cloudsw in codfw to announce 208.80.153.184/29

https://gerrit.wikimedia.org/r/896310

Change 896310 merged by jenkins-bot:

[operations/homer/public@master] Allow cloudsw in codfw to announce 208.80.153.184/29

https://gerrit.wikimedia.org/r/896310
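
As a quick sanity check of what this /29 covers, the hop addresses from the traceroutes earlier in this task can be tested against it with Python's standard `ipaddress` module. This is only an illustrative sketch; the helper name is hypothetical.

```python
import ipaddress

ANNOUNCED = ipaddress.ip_network("208.80.153.184/29")


def is_covered(address: str,
               network: ipaddress.IPv4Network = ANNOUNCED) -> bool:
    """True if the address falls inside the announced prefix."""
    return ipaddress.ip_address(address) in network


# Hop addresses seen in the traces earlier in this task.
for hop in ["208.80.153.185", "208.80.153.176", "208.80.153.177"]:
    print(hop, is_covered(hop))
```

Only 208.80.153.185 (the first hop from the cloudgw VRF) sits inside the /29; the .176 and .177 addresses on the cloudsw to CR link belong to a different range.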

Change 896329 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for new private IP ranges in use in codfw

https://gerrit.wikimedia.org/r/896329

Change 896331 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add Homer device entry for cloudsw1-b-codfw

https://gerrit.wikimedia.org/r/896331

Change 896333 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Puppet additions for new network device cloudsw1-b1-codfw

https://gerrit.wikimedia.org/r/896333

Change 896331 merged by jenkins-bot:

[operations/homer/public@master] Add Homer device entry for cloudsw1-b-codfw

https://gerrit.wikimedia.org/r/896331

Mentioned in SAL (#wikimedia-operations) [2023-03-10T12:49:04Z] <cmooney@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync data for new cloudsw1-b1-codfw device. - cmooney@cumin1001 - T327919"

Mentioned in SAL (#wikimedia-operations) [2023-03-10T12:50:57Z] <cmooney@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync data for new cloudsw1-b1-codfw device. - cmooney@cumin1001 - T327919"

Change 896333 merged by Cathal Mooney:

[operations/puppet@production] Puppet additions for new network device cloudsw1-b1-codfw

https://gerrit.wikimedia.org/r/896333

Change 896350 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Update switch_loopback prefix lists to include assigned codfw ranges

https://gerrit.wikimedia.org/r/896350

Change 896350 merged by jenkins-bot:

[operations/homer/public@master] Update switch_loopback prefix lists to include assigned codfw ranges

https://gerrit.wikimedia.org/r/896350

Change 896329 merged by Cathal Mooney:

[operations/dns@master] Add includes for new private IP ranges in use in codfw

https://gerrit.wikimedia.org/r/896329

We've tested PXEboot / DHCP / OS install for 2 new servers (cloudlb2002-dev and cloudlb2003-dev) on this switch now and it's working ok.

So all now looks ok in terms of this switch. It has been added to monitoring and all is green, apart from a system alarm warning which relates to the missing BGP license. We've been chasing our supplier about that for some time, having already paid the insane price for it.

Next step is to plan the migration of the existing hosts with WMCS and DC-Ops.

I noticed that it's running 19.1R3-S2.3. We should upgrade it to the latest recommended Junos before we move more servers to it.

> I noticed that it's running 19.1R3-S2.3. We should upgrade it to the latest recommended Junos before we move more servers to it.

Yes @ayounsi, absolutely we should. I should have mentioned that above.

Change 898678 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Modify policy to use in aggregate for 185.15.57.0/24 in codfw

https://gerrit.wikimedia.org/r/898678

Change 898684 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Move cloudsw prefix-list filters from templates to YAML

https://gerrit.wikimedia.org/r/898684

Change 898678 merged by jenkins-bot:

[operations/homer/public@master] Modify policy to use in aggregate for 185.15.57.0/24 in codfw

https://gerrit.wikimedia.org/r/898678

Icinga downtime and Alertmanager silence (ID=484494a0-6cb6-4421-b546-9b17aa96a3a6) set by cmooney@cumin1001 for 0:30:00 on 3 host(s) and their services with reason: cloudsw1-b1-codfw OS upgrade

cloudsw1-b1-codfw,cloudsw1-b1-codfw IPv6,cloudsw1-b1-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2023-03-14T19:47:17Z] <topranks> Reboot cloudsw1-b1-codfw to upgrade JunOS version T327919

@Papaul, I suggest we move the ports from the existing switch to the new switch in 3 batches. We need at least 2 of the cloudcephosd hosts working at all times, so we will move 1 of these in each batch to avoid downtime.

Batch 1:

| Host | Existing port | New Port |
| --- | --- | --- |
| cloudcephosd2001-dev | asw-b1-codfw ge-1/0/4 | cloudsw1-b1-codfw ge-0/0/4 |
| cloudcephosd2001-dev | asw-b1-codfw ge-1/0/5 | cloudsw1-b1-codfw ge-0/0/5 |
| cloudvirt2001-dev | asw-b1-codfw ge-1/0/6 | cloudsw1-b1-codfw ge-0/0/6 |
| cloudcephmon2004-dev | asw-b1-codfw ge-1/0/9 | cloudsw1-b1-codfw ge-0/0/9 |
| cloudvirt2003-dev | asw-b1-codfw ge-1/0/10 | cloudsw1-b1-codfw ge-0/0/10 |
| cloudlb2001-dev | asw-b1-codfw ge-1/0/11 | cloudsw1-b1-codfw ge-0/0/11 |
| cloudcontrol2005-dev | asw-b1-codfw ge-1/0/14 | cloudsw1-b1-codfw ge-0/0/14 |
| clouddb2002-dev | asw-b1-codfw ge-1/0/15 | cloudsw1-b1-codfw ge-0/0/15 |

Batch 2:

| Host | Existing port | New Port |
| --- | --- | --- |
| cloudcephosd2003-dev | asw-b1-codfw ge-1/0/7 | cloudsw1-b1-codfw ge-0/0/7 |
| cloudcephosd2003-dev | asw-b1-codfw ge-1/0/8 | cloudsw1-b1-codfw ge-0/0/8 |
| cloudgw2003-dev | asw-b1-codfw ge-1/0/16 | cloudsw1-b1-codfw ge-0/0/16 |
| cloudvirt2002-dev | asw-b1-codfw ge-1/0/2 | **cloudsw1-b1-codfw ge-0/0/17** |
| cloudcontrol2001-dev | asw-b1-codfw ge-1/0/18 | cloudsw1-b1-codfw ge-0/0/18 |
| cloudgw2002-dev | asw-b1-codfw ge-1/0/19 | cloudsw1-b1-codfw ge-0/0/19 |
| cloudcephmon2005-dev | asw-b1-codfw ge-1/0/21 | cloudsw1-b1-codfw ge-0/0/21 |

Batch 3:

| Host | Existing port | New Port |
| --- | --- | --- |
| cloudcephosd2002-dev | asw-b1-codfw ge-1/0/12 | cloudsw1-b1-codfw ge-0/0/12 |
| cloudcephosd2002-dev | asw-b1-codfw ge-1/0/0 | **cloudsw1-b1-codfw ge-0/0/13** |
| cloudcephmon2006-dev | asw-b1-codfw ge-1/0/22 | cloudsw1-b1-codfw ge-0/0/22 |
| cloudnet2005-dev | asw-b1-codfw ge-1/0/23 | cloudsw1-b1-codfw ge-0/0/23 |
| cloudnet2006-dev | asw-b1-codfw ge-1/0/25 | cloudsw1-b1-codfw ge-0/0/25 |
| cloudservices2004-dev | asw-b1-codfw ge-1/0/28 | **cloudsw1-b1-codfw ge-0/0/36** |
| cloudservices2005-dev | asw-b1-codfw ge-1/0/29 | **cloudsw1-b1-codfw ge-0/0/37** |
| cloudweb2002-dev | asw-b1-codfw ge-1/0/30 | **cloudsw1-b1-codfw ge-0/0/38** |

I've highlighted in bold the ones where we are using a different port number on the new switch. That can't be avoided for some of them as the equivalent port on the new one is in use, or the block it comes from is running at 10G.

In terms of doing the work we can do it all on the same day, one batch after another. Basic approach will be you start doing the cables for a batch, I will change Netbox config for the switch ports, and run Homer to update. Then when we're both done we check all looks ok with the connections to new switches, and if so move on to next batch.
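
The "at least 2 cloudcephosd hosts working at all times" constraint can be checked mechanically against the three batches proposed above. A minimal sketch, assuming the three cloudcephosd hosts listed in the batches are the only OSD hosts that matter; the function names are hypothetical.

```python
MIN_OSDS_UP = 2  # requirement stated above: at least 2 cloudcephosd hosts up

# Hosts per batch, copied from the tables above (one entry per host).
BATCHES = {
    "batch 1": ["cloudcephosd2001-dev", "cloudvirt2001-dev",
                "cloudcephmon2004-dev", "cloudvirt2003-dev",
                "cloudlb2001-dev", "cloudcontrol2005-dev", "clouddb2002-dev"],
    "batch 2": ["cloudcephosd2003-dev", "cloudgw2003-dev", "cloudvirt2002-dev",
                "cloudcontrol2001-dev", "cloudgw2002-dev",
                "cloudcephmon2005-dev"],
    "batch 3": ["cloudcephosd2002-dev", "cloudcephmon2006-dev",
                "cloudnet2005-dev", "cloudnet2006-dev",
                "cloudservices2004-dev", "cloudservices2005-dev",
                "cloudweb2002-dev"],
}


def osd_hosts(hosts):
    """Subset of hosts that are Ceph OSD nodes, judged by hostname prefix."""
    return {h for h in hosts if h.startswith("cloudcephosd")}


def unsafe_batches(batches, min_up=MIN_OSDS_UP):
    """Names of batches that would leave fewer than min_up OSD hosts online."""
    all_osds = set()
    for hosts in batches.values():
        all_osds |= osd_hosts(hosts)
    return [
        name for name, hosts in batches.items()
        if len(all_osds - osd_hosts(hosts)) < min_up
    ]


print(unsafe_batches(BATCHES))
```

An empty result means every batch moves at most one OSD host, so two of the three stay connected throughout.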

Change 898684 merged by jenkins-bot:

[operations/homer/public@master] Move cloudsw prefix-list filters from templates to YAML

https://gerrit.wikimedia.org/r/898684

In terms of the move we need to work with @aborrero and the team to decide when is good to do the work. We can do it in a number of batches or all in one go, whatever you guys think is best. I can move the interfaces in Netbox and configure the new switch in advance in either case.

All hosts can be done anytime. For cloudceph* nodes in particular, I'll double check with @dcaro in case he wants to stay on top of it.

Change 900264 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add protocol direct to Cloud_outfilter protocols

https://gerrit.wikimedia.org/r/900264

Change 900264 merged by Cathal Mooney:

[operations/homer/public@master] Add protocol direct to Cloud_outfilter protocols

https://gerrit.wikimedia.org/r/900264

@cmooney thank you for getting the table ready for the cloud nodes move. As you can see on asw-b1 and in Netbox, I tried to keep each server's rack location matched to its switch port in codfw. For example, if a server is racked in U1 it will use [ge-xe]-0/0/0 on the switch.

I know the first 4 ports on cloudsw1-b1 are set up as 10G because we have sretest2001 connected to the first 2 ports. We can unrack that server, since it was a test server, set those interfaces to 1G, and have cloudcephosd2002-dev and cloudvirt2002-dev keep the same interfaces as on the old switch.

I can relocate cloudservices2004-dev, cloudservices2005-dev and cloudweb2002-dev to U35, U36 and U37.

This will also apply to all the other racks in the row A/row B refresh.

Also, since we moved the switch to U48, I have already requested some long cables to connect the switch to the servers racked from U1 to U10.

Thank you.

> I know the first 4 ports on cloudsw1-b1 are set up as 10G because we have sretest2001 connected to the first 2 ports. We can unrack that server, since it was a test server, set those interfaces to 1G, and have cloudcephosd2002-dev and cloudvirt2002-dev keep the same interfaces as on the old switch.

Yep that works fine. I'll delete the connection from sretest to the switch and remove the interfaces in Netbox.

> I can relocate cloudservices2004-dev, cloudservices2005-dev and cloudweb2002-dev to U35, U36 and U37.

That works apart from cloudservices2004-dev. Port 35 is part of block 32, and is at 10G because xe-0/0/32 is connecting cloudlb2003-dev.
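
The block arithmetic behind this can be sketched as below, assuming port speed is set per group of 4 consecutive ports, as the discussion here implies ("the first 4 ports", "port 35 is part of block 32"). The helper names are hypothetical.

```python
BLOCK_SIZE = 4  # assumption: speed is configured per group of 4 ports


def block_start(port: int) -> int:
    """First port of the speed block that contains the given port."""
    return port - port % BLOCK_SIZE


def block_ports(port: int) -> list[int]:
    """All ports sharing a speed setting with the given port."""
    start = block_start(port)
    return list(range(start, start + BLOCK_SIZE))


print(block_start(35), block_ports(35))
print(block_start(36), block_ports(36))
```

Port 35 shares a block with port 32 (10G for cloudlb2003-dev), while ports 36 to 39 form the next block and can run at a different speed.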

> This will also apply to all the other racks in the row A/row B refresh.

While I think this is a nice approach, I am kind of thinking it will be a lot more trouble to try and maintain longer term. Managing what blocks of ports are at 1/10/25 is difficult enough, but if that also translates to where the servers need to go in the rack it could get complicated. I can imagine us needing to move a server in the rack to move from 1 to 10G for instance, which seems like a lot of trouble to me. But your call at the end of the day.

> Also, since we moved the switch to U48, I have already requested some long cables to connect the switch to the servers racked from U1 to U10.

Ok cool, so we need to wait until we get those cables? Are they ordered do you know?

Cheers.

cloudservices2004-dev = U37
cloudservices2005-dev = U38
cloudweb2002-dev = U39

Yes, we have already ordered the cables and they arrive sometime in April, but we don't have to wait as there are some servers that we can move now. I will get you the list of those sometime next week if possible.

To keep the 1G and 10G blocks tidy on the switch, what I can suggest is to configure 1G blocks from left to right and 10G blocks from right to left. For the racking part, leave it to me. Also, as we are onsite we will be setting up those servers and doing the switch configuration, so I can take care of where each server will be and on which switch port.

Ok cool. So I'd propose we take it like this:

1. Move sretest2001 from port xe-0/0/1 to xe-0/0/45

This can be done anytime, just let me know when done I'll change netbox.

2. First batch of cloud servers:

We can do these now without waiting on the longer cables to arrive:

| Host | Existing port | New Port |
| --- | --- | --- |
| cloudlb2001-dev | asw-b1-codfw ge-1/0/11 | cloudsw1-b1-codfw ge-0/0/11 |
| clouddb2002-dev | asw-b1-codfw ge-1/0/15 | cloudsw1-b1-codfw ge-0/0/15 |
| cloudgw2003-dev | asw-b1-codfw ge-1/0/16 | cloudsw1-b1-codfw ge-0/0/16 |
| cloudgw2002-dev | asw-b1-codfw ge-1/0/19 | cloudsw1-b1-codfw ge-0/0/19 |
| cloudcephmon2005-dev | asw-b1-codfw ge-1/0/21 | cloudsw1-b1-codfw ge-0/0/21 |
| cloudcephmon2006-dev | asw-b1-codfw ge-1/0/22 | cloudsw1-b1-codfw ge-0/0/22 |
| cloudnet2005-dev | asw-b1-codfw ge-1/0/23 | cloudsw1-b1-codfw ge-0/0/23 |
| cloudnet2006-dev | asw-b1-codfw ge-1/0/25 | cloudsw1-b1-codfw ge-0/0/25 |

3. Second batch of cloud servers:

These need to wait until the cables arrive. We will still do it in two batches after that, as we need to move the cloudcephosd hosts one by one.

| Host | Existing port | New Port |
| --- | --- | --- |
| cloudcephosd2002-dev | asw-b1-codfw ge-1/0/0 | cloudsw1-b1-codfw ge-0/0/0 |
| cloudvirt2002-dev | asw-b1-codfw ge-1/0/2 | cloudsw1-b1-codfw ge-0/0/2 |
| cloudvirt2001-dev | asw-b1-codfw ge-1/0/6 | cloudsw1-b1-codfw ge-0/0/6 |
| cloudcephosd2002-dev | asw-b1-codfw ge-1/0/12 | cloudsw1-b1-codfw ge-0/0/12 |

4. Third batch of cloud servers

| Host | Existing port | New Port |
| --- | --- | --- |
| cloudcephosd2001-dev | asw-b1-codfw ge-1/0/4 | cloudsw1-b1-codfw ge-0/0/4 |
| cloudcephosd2001-dev | asw-b1-codfw ge-1/0/5 | cloudsw1-b1-codfw ge-0/0/5 |

5. Fourth and last batch:

| Host | Existing port | New Port |
| --- | --- | --- |
| cloudcephosd2003-dev | asw-b1-codfw ge-1/0/7 | cloudsw1-b1-codfw ge-0/0/7 |
| cloudcephosd2003-dev | asw-b1-codfw ge-1/0/8 | cloudsw1-b1-codfw ge-0/0/8 |
| cloudcephmon2004-dev | asw-b1-codfw ge-1/0/9 | cloudsw1-b1-codfw ge-0/0/9 |
| cloudvirt2003-dev | asw-b1-codfw ge-1/0/10 | cloudsw1-b1-codfw ge-0/0/10 |
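
In these revised batches every host keeps its port number and only the FPC index changes (ge-1/0/N on asw-b1-codfw becomes ge-0/0/N on cloudsw1-b1-codfw). A small sketch of that mapping, which could be used to double check the tables; the helper and its regular expression are hypothetical and not part of any existing tooling.

```python
import re

PORT_RE = re.compile(r"^(?P<type>[a-z]+)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$")


def same_numbered_port(old_port: str, new_fpc: int = 0) -> str:
    """Map a port on the old switch to the same port number on the new one."""
    match = PORT_RE.match(old_port)
    if match is None:
        raise ValueError(f"unrecognised interface name: {old_port!r}")
    return f"{match['type']}-{new_fpc}/{match['pic']}/{match['port']}"


# A few rows from the batches above.
for old in ["ge-1/0/11", "ge-1/0/0", "ge-1/0/25"]:
    print(old, "->", same_numbered_port(old))
```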

> To keep the 1G and 10G blocks tidy on the switch, what I can suggest is to configure 1G blocks from left to right and 10G blocks from right to left. For the racking part, leave it to me. Also, as we are onsite we will be setting up those servers and doing the switch configuration, so I can take care of where each server will be and on which switch port.

The left to right order isn't a bad idea. Ultimately your call so I'm happy if you want to stick to the pattern.

Lastly I'd included these in my previous list, but in fact they are connected to the codfw public-b vlan right now. We have a few options in terms of that, but right now I don't want to move them / bridge the public vlan onto the new switch, so let's not move them for now.

cloudcontrol2001-dev
cloudcontrol2005-dev
cloudservices2004-dev
cloudservices2005-dev
cloudweb2002-dev

Change 900448 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add cloudsw1-b1-codfw to Rancid

https://gerrit.wikimedia.org/r/900448

Change 900448 merged by Cathal Mooney:

[operations/puppet@production] Add cloudsw1-b1-codfw to Rancid

https://gerrit.wikimedia.org/r/900448

@cmooney We can move any servers racked from U11 up

> @cmooney We can move any servers racked from U11 up

I'm not sure it's worth going to the trouble for the sake of a few weeks.

If we can do the first batch next week we should be able to continue with the tests on cloudlb which is the main thing hanging on this, we can move the rest when the cables arrive.

@cmooney Please see the first batch proposal. We can move all those servers next week. @aborrero can you please let us know what the best day and time next week would be for us to move those servers? Thank you

| Host | U space | Existing port | New Port |
| --- | --- | --- | --- |
| cloudlb2001-dev | 15 | asw-b1-codfw ge-1/0/11 | cloudsw1-b1-codfw ge-0/0/14 |
| clouddb2002-dev | 17 | asw-b1-codfw ge-1/0/15 | cloudsw1-b1-codfw ge-0/0/16 |
| cloudgw2003-dev | 18 | asw-b1-codfw ge-1/0/16 | cloudsw1-b1-codfw ge-0/0/17 |
| cloudgw2002-dev | 7 | asw-b1-codfw ge-1/0/19 | cloudsw1-b1-codfw ge-0/0/6 |
| cloudcephmon2005-dev | 8 | asw-b1-codfw ge-1/0/21 | cloudsw1-b1-codfw ge-0/0/7 |
| cloudcephmon2006-dev | 9 | asw-b1-codfw ge-1/0/22 | cloudsw1-b1-codfw ge-0/0/8 |
| cloudnet2005-dev | 10 | asw-b1-codfw ge-1/0/23 | cloudsw1-b1-codfw ge-0/0/9 |
| cloudnet2006-dev | 11 | asw-b1-codfw ge-1/0/25 | cloudsw1-b1-codfw ge-0/0/10 |
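
The new ports in this proposal follow the rack-unit convention described earlier in the thread (a server in U1 uses port 0), so the port number is the U space minus one. A quick sketch that checks the rows above against that rule; the function name is hypothetical.

```python
def port_for_rack_unit(rack_unit: int, media: str = "ge") -> str:
    """Switch port for a server racked at the given U position (U1 -> port 0)."""
    if rack_unit < 1:
        raise ValueError("rack units start at 1")
    return f"{media}-0/0/{rack_unit - 1}"


# (U space, proposed new port) per host, copied from the table above.
PROPOSAL = {
    "cloudlb2001-dev": (15, "ge-0/0/14"),
    "clouddb2002-dev": (17, "ge-0/0/16"),
    "cloudgw2003-dev": (18, "ge-0/0/17"),
    "cloudgw2002-dev": (7, "ge-0/0/6"),
    "cloudcephmon2005-dev": (8, "ge-0/0/7"),
    "cloudcephmon2006-dev": (9, "ge-0/0/8"),
    "cloudnet2005-dev": (10, "ge-0/0/9"),
    "cloudnet2006-dev": (11, "ge-0/0/10"),
}

mismatches = [
    host for host, (unit, port) in PROPOSAL.items()
    if port_for_rack_unit(unit) != port
]
print(mismatches)
```

An empty list confirms that every row is consistent with the convention.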

> @cmooney Please see the first batch proposal. We can move all those servers next week. @aborrero can you please let us know what the best day and time next week would be for us to move those servers? Thank you

Can be done anytime, thanks!

Thanks Papaul for the update looks good.

I'd say we either do Monday 27th, or Wednesday 29th. Tuesday we best avoid so we can focus on the Eqiad row B switch upgrade (T330165).