Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches
Closed, Resolved, Public

Authored By: cmooney, Jun 7 2024, 7:49 PM

Description

To complete the network configuration for the new codfw switches in rows C and D, we need to migrate the uplinks from the current asw-c-codfw and asw-d-codfw virtual-chassis so they terminate on the new Spine switches in row D instead of directly on the core routers.

We need to wait until we have the transceivers installed on the links between spines and CRs before we kick off (see T360789).

This is quite a delicate change but should be possible without causing any interruption to service. It may make sense to do a DNS depool of the site in advance, both as a precaution and to reduce traffic levels (at many points we will be on 1 leg rather than 2). Steps are roughly as follows:

In advance:

  1. Create ESI-LAG ae0 on both spine switches, with single member port, et-0/0/28, on each (for asw-c)
  2. Create ESI-LAG ae1 on both spine switches, with single member port, et-0/0/27, on each (for asw-d)
  3. Add the current set of vlans in use on asw-c-codfw and asw-d-codfw to ssw1-d1-codfw and ssw1-d8-codfw
  4. Trunk those vlans over ssw1-d8-codfw port et-0/0/31 towards cr2-codfw
  5. Trunk the row C vlans over ae0 on the spines
  6. Trunk the row D vlans over ae1 on the spines
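
The ESI-LAG layout in steps 1 and 2 can be written down as data and sanity-checked before any config is pushed. This is only a sketch: the `EsiLag` class and `spine_members` helper are hypothetical names made up for illustration, and nothing here talks to Netbox or Homer.

```python
from dataclasses import dataclass

# Spine switches named in the plan above.
SPINES = ("ssw1-d1-codfw", "ssw1-d8-codfw")


@dataclass(frozen=True)
class EsiLag:
    name: str           # AE interface name on both spines
    member_port: str    # single member port, same on each spine
    legacy_switch: str  # virtual-chassis the LAG faces


# Values taken from steps 1 and 2 of the "In advance" list.
PLAN = (
    EsiLag("ae0", "et-0/0/28", "asw-c-codfw"),
    EsiLag("ae1", "et-0/0/27", "asw-d-codfw"),
)


def spine_members(plan, spines=SPINES):
    """Return {(spine, port): ae_name}; raise if a port is claimed twice."""
    used = {}
    for lag in plan:
        for spine in spines:
            key = (spine, lag.member_port)
            if key in used:
                raise ValueError(
                    f"{spine} {lag.member_port} already in {used[key]}"
                )
            used[key] = lag.name
    return used


if __name__ == "__main__":
    for (spine, port), ae in sorted(spine_members(PLAN).items()):
        print(f"{spine} {port} -> {ae}")
```

Each ESI-LAG ends up with exactly one member port per spine, which is what lets the legacy virtual-chassis treat the two spines as a single LAG partner.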

On day of move:

NOTE: All these diagrams show 'asw-a', which is an error; it should be 'asw-c'. If I get time I will change and re-export them all.

out-1.png (527×773 px, 43 KB)

  1. Change the VRRP priority on cr2-codfw so that it is VRRP master for the row C vlans
  2. Disable interfaces et-1/1/0 and ae3 on cr1-codfw (ports facing asw-c-codfw)
    • Inbound and outbound traffic to row C still flows from cr2-codfw et-1/1/0 to asw-c7-codfw et-7/0/52
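
For context on the priority change: in VRRP the router advertising the highest priority for a group becomes master, with ties going to the higher primary IP address. A minimal Python sketch of that election, assuming preemption is enabled and both routers can hear each other. The priority values and 192.0.2.x addresses are hypothetical placeholders, not the real cr1/cr2 settings.

```python
from ipaddress import ip_address


def elect_master(routers):
    """Pick the expected VRRP master.

    routers maps a router name to (priority, primary_ip).
    Highest priority wins; a tie goes to the higher primary IP.
    """
    return max(
        routers,
        key=lambda name: (routers[name][0], ip_address(routers[name][1])),
    )


if __name__ == "__main__":
    # Hypothetical starting state: cr1 is master for the vlan.
    group = {
        "cr1-codfw": (110, "192.0.2.2"),
        "cr2-codfw": (100, "192.0.2.3"),
    }
    print(elect_master(group))

    # Raise cr2's priority so it takes over before cr1's ports are disabled.
    group["cr2-codfw"] = (120, "192.0.2.3")
    print(elect_master(group))
```

Moving mastership to cr2 first means the gateway is already on the router that keeps its link when the cr1 ports go down.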

out-2.png (527×773 px, 42 KB)

  1. Re-cable asw-c2-codfw et-2/0/51 to ssw1-d1-codfw et-0/0/28

out-3.png (527×773 px, 42 KB)

  1. Check row C MAC addresses are learnt on ssw1-d1-codfw ae0 and known on ssw1-d8-codfw via evpn
  2. Change the VRRP priority on cr2-codfw so that it is VRRP master for the row D vlans
  3. Disable interfaces et-1/1/3 and ae4 on cr1-codfw (ports facing asw-d-codfw)
    • Inbound and outbound traffic to row D still flows from cr2-codfw et-1/1/3 to asw-d7-codfw et-7/0/52

out-4.png (527×773 px, 43 KB)

  1. Re-cable asw-d2-codfw et-2/0/51 to ssw1-d1-codfw et-0/0/27

out-5.png (527×773 px, 40 KB)

  1. Check row D MAC addresses are learnt on ssw1-d1-codfw ae1 and known on ssw1-d8-codfw via evpn
  2. Drain transport circuits to cr4-ulsfo, cr2-eqdfw and cr1-eqiad on cr1-codfw
  3. Reconfigure PIC 1/1 on cr1-codfw
    • Delete the 40G configuration for ports 0 and 3
    • Add a 100G configuration for port 2
  4. Reset PIC 1/1 on cr1-codfw
  5. Un-drain transport circuits to cr4-ulsfo, cr2-eqdfw and cr1-eqiad on cr1-codfw
  6. Enable cr1-codfw et-1/1/2 in Netbox (connects to ssw1-d1-codfw) and push with Homer
  7. Move the cr1-codfw ae3 and ae4 vlan sub-interfaces to et-1/1/2
  8. Add the new sub-ints of et-1/1/2 as OSPF stub interfaces, remove the old ae3/ae4 sub-ints from OSPF
  9. Check on cr2-codfw that it now sees cr1-codfw as VRRP backup for all vlans again
    • L2 path here is from cr2-codfw -> asw -> ssw -> cr1-codfw

out-6.png (527×773 px, 45 KB)

  1. Reconfigure the cr1-codfw new sub-interfaces so they are VRRP masters for row C vlans

out-7.png (527×773 px, 45 KB)

  1. Disable cr2-codfw et-1/1/0 and ae3 (ports facing asw-c-codfw)

out-8.png (527×773 px, 45 KB)

  1. Move cr2-codfw ae3 sub-interface configs to et-1/0/2 (port facing ssw1-d8-codfw)
  2. Check on cr1-codfw that it sees cr2-codfw as VRRP backup for row C vlans again
    • L2 path here is from cr1-codfw -> ssw1-d1 -> ssw1-d8 -> cr2-codfw

out-9.png (527×773 px, 46 KB)

  1. Re-cable asw-c7-codfw et-7/0/52 to ssw1-d8-codfw et-0/0/28 (had been going to cr2)
  2. Reconfigure asw-c-codfw et-7/0/52 so it is part of ae1, and delete ae2 config
  3. Check ESI-LAG on et-0/0/28 of both spines is working and ae1 up on spines and asw-c-codfw

out-10.png (527×773 px, 50 KB)

  1. Reconfigure the cr1-codfw new sub-interfaces so they are VRRP masters for row D vlans
  2. Disable cr2-codfw et-1/1/3 and ae4 (ports facing asw-d-codfw)
  3. Move cr2-codfw ae4 sub-interface configs to et-1/0/2 (port facing ssw1-d8-codfw)
  4. Check on cr1-codfw that it sees cr2-codfw as VRRP backup for row D vlans again
    • L2 path here is from cr1-codfw -> ssw1-d1 -> ssw1-d8 -> cr2-codfw

out-11.png (527×773 px, 51 KB)

  1. Re-cable asw-d7-codfw et-7/0/52 to ssw1-d8-codfw et-0/0/27 (had been going to cr2)
  2. Reconfigure asw-d-codfw et-7/0/52 so it is part of ae1, and delete ae2 config
  3. Check ESI-LAG on et-0/0/27 of both spines is working and ae2 up (spines) and ae1 up (asw-d-codfw)

out-12.png (527×773 px, 47 KB)

  1. Re-balance VRRP settings on both CRs as they had been prior to start
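
As a rough sanity check of the sequence above, the row C uplinks can be modelled step by step to confirm there is always at least one live leg, alternating between 2 legs and 1. This is a simplified sketch: it tracks only the two physical uplinks from the virtual-chassis, ignores VRRP and the CR-to-spine links, and the step labels and `legs_per_step` helper are invented for illustration. Row D follows the same pattern on its own ports.

```python
# Each entry: (what just happened, set of row C uplinks carrying traffic).
ROW_C_STEPS = [
    ("start", {"asw-c2 -> cr1-codfw", "asw-c7 -> cr2-codfw"}),
    ("disable cr1-codfw et-1/1/0 and ae3", {"asw-c7 -> cr2-codfw"}),
    (
        "re-cable asw-c2 et-2/0/51 to ssw1-d1-codfw et-0/0/28",
        {"asw-c2 -> ssw1-d1-codfw", "asw-c7 -> cr2-codfw"},
    ),
    ("disable cr2-codfw et-1/1/0 and ae3", {"asw-c2 -> ssw1-d1-codfw"}),
    (
        "re-cable asw-c7 et-7/0/52 to ssw1-d8-codfw et-0/0/28",
        {"asw-c2 -> ssw1-d1-codfw", "asw-c7 -> ssw1-d8-codfw"},
    ),
]


def legs_per_step(steps):
    """Return [(description, live_leg_count)]; raise if a step has none."""
    result = []
    for description, legs in steps:
        if not legs:
            raise RuntimeError(f"no uplink left at step: {description}")
        result.append((description, len(legs)))
    return result


if __name__ == "__main__":
    for description, count in legs_per_step(ROW_C_STEPS):
        print(f"{count} leg(s) after: {description}")
```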

Event Timeline

cmooney triaged this task as Medium priority. Jun 7 2024, 7:49 PM
cmooney created this task.

@Papaul sorry I meant to get back to you sooner. I've made decent progress on T369106 and managed to test reimage working ok in one of the new racks with the test server, so we are in a good position.

How does Thurs July 18th sound as a day to do this? Gives us a bit of time to plan and line everything up and let wider SRE know.

cmooney updated the task description.

Change #1054942 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add identifiers for ESI-LAGs to legacy switches on codfw row D spines

https://gerrit.wikimedia.org/r/1054942

Change #1054942 merged by jenkins-bot:

[operations/homer/public@master] Add identifiers for ESI-LAGs to legacy switches on codfw row D spines

https://gerrit.wikimedia.org/r/1054942

Change #1055169 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add monitoring checks for codfw row D spines

https://gerrit.wikimedia.org/r/1055169

Change #1055183 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Disable config for RA generation on Spines in codfw

https://gerrit.wikimedia.org/r/1055183

Change #1055183 merged by jenkins-bot:

[operations/homer/public@master] Disable config for RA generation on Spines in codfw

https://gerrit.wikimedia.org/r/1055183

Change #1055205 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Disable BGP peering from cr2-codfw to ssw1-d8-codfw

https://gerrit.wikimedia.org/r/1055205

Change #1055205 abandoned by Cathal Mooney:

[operations/homer/public@master] Disable BGP peering from cr2-codfw to ssw1-d8-codfw

Reason:

will just delete the IRB ints in netbox and re-create them when I need

https://gerrit.wikimedia.org/r/1055205

Mentioned in SAL (#wikimedia-operations) [2024-07-18T12:52:36Z] <topranks> re-enabling BGP between spine-layer switches in codfw (problematic IP interfaces have been deleted) T366941

Mentioned in SAL (#wikimedia-operations) [2024-07-18T12:55:52Z] <topranks> re-enabling interface et-1/0/2 on cr2-codfw which connects to ssw1-d8-codfw (problematic IP interfaces have been deleted) T366941

Icinga downtime and Alertmanager silence (ID=8062b5f0-d6f0-401c-9dfd-590a5facd0ad) set by cmooney@cumin1002 for 3:00:00 on 4 host(s) and their services with reason: Move asw-c-codfw and asw-d-codfw CR uplinks

cr[1-2]-codfw,ssw1-d[1,8]-codfw

Mentioned in SAL (#wikimedia-operations) [2024-07-18T15:17:02Z] <topranks> disabling interface et-1/1/0 on cr1-codfw (facing asw-c-codfw) T366941

Mentioned in SAL (#wikimedia-operations) [2024-07-18T15:19:35Z] <topranks> disabling interface et-1/1/3 on cr1-codfw (facing asw-d-codfw) T366941

Icinga downtime and Alertmanager silence (ID=fdebcc6c-adaa-42f3-809d-4ec381a4798d) set by cmooney@cumin1002 for 0:20:00 on 2 host(s) and their services with reason: bouncing line card on cr1-codfw

cloudsw1-b1-codfw.mgmt,pfw3-codfw

Icinga downtime and Alertmanager silence (ID=1b177f94-1995-41ab-90b9-673cef9dbf94) set by cmooney@cumin1002 for 0:20:00 on 2 host(s) and their services with reason: bouncing line card on cr1-codfw

cloudsw1-b1-codfw.mgmt,pfw3-codfw

Icinga downtime and Alertmanager silence (ID=f32e4714-9c03-456e-bc05-238c01bacbca) set by cmooney@cumin1002 for 0:20:00 on 1 host(s) and their services with reason: bouncing line card on cr1-codfw

ssw1-a1-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2024-07-18T16:39:30Z] <topranks> resetting line card 1/1 on cr1-codfw (T366941)

Change #1055262 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add new cr1-codfw sub-ints for rows c/d to DHCP relay and RA gen

https://gerrit.wikimedia.org/r/1055262

Change #1055262 merged by jenkins-bot:

[operations/homer/public@master] Add new cr1-codfw sub-ints for rows c/d to DHCP relay and RA gen

https://gerrit.wikimedia.org/r/1055262

Mentioned in SAL (#wikimedia-operations) [2024-07-18T17:24:48Z] <topranks> making cr1-codfw interfaces connecting ssw1-d1-codfw VRRP master for row c & d vlans T366941

Mentioned in SAL (#wikimedia-operations) [2024-07-18T17:43:49Z] <topranks> disabling cr2-codfw port et-1/1/0 connecting to asw-c-codfw T366941

Change #1055266 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Move RA generation and dhcp relay to ssw1-d facing ports

https://gerrit.wikimedia.org/r/1055266

Change #1055266 merged by jenkins-bot:

[operations/homer/public@master] Move RA generation and dhcp relay to ssw1-d facing ports

https://gerrit.wikimedia.org/r/1055266

Work completed. Traffic is currently bridged through the two spine switches over the AEs from the row C/D virtual-chassis, and the CR interfaces connected to the Spines are working as VRRP gateway.

Thanks @Papaul for the help on site!

gNMI stats proved very helpful for keeping an eye on the bandwidth shifting around:

image.png (471×959 px, 96 KB)

image.png (459×949 px, 93 KB)

Change #1055169 merged by Cathal Mooney:

[operations/puppet@production] Add monitoring checks for codfw row D spines

https://gerrit.wikimedia.org/r/1055169