Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches
Closed, Resolved, Public

Description

Summary

This task will track the work to connect the new Nokia switches in eqiad row C to the existing Juniper devices in that row, and bridge the existing vlans on the old switches through to the Nokias. As we will re-use the ports currently connecting asw2-c-eqiad to our core routers for this (we have no others free), it will also involve moving the VRRP GW on the CRs from the current port (connected to asw2-c-eqiad) to et-1/0/5 (connected to Nokia Spines).

When complete the traffic path for hosts in row C will have changed FROM asw2-c-eqiad -> crX-eqiad TO asw2-c-eqiad -> ssw1-dX-eqiad -> crX-eqiad.

Traffic to end hosts should not be disrupted during the move, though it is a delicate operation and we need to move carefully and check at all times that things are ok.

We will follow the same process for all vlans. There are three vlans we need to consider:

1019 - private1-c-eqiad (vrrp)
1003 - public1-c-eqiad (vrrp)
1022 - analytics1-c-eqiad (vrrp)

The main things to watch while doing this are our alerts dashboard, the eqiad throughput dashboard, and the #operations and #sre channels on IRC.

Phase 1 - Migrate asw2-c2-eqiad et-2/0/53 -> ssw1-d1-eqiad ethernet-1/28

Step 1: Verify VRRP status on the CRs

The Netbox configuration for all three VRRP groups is set so that cr2-eqiad is primary in VRRP for all three subnets. To be sure, we want to connect to both CRs and verify with:

show vrrp summary | match ae3

Step 2: Shut down et-1/1/0 on cr1-eqiad

Next we want to shut down the port on cr1-eqiad which connects to asw2-c2-eqiad. This will mean the 'direct' / 'connected' route to that vlan will disappear on cr1-eqiad, and instead it should install the route to those destinations it learns in OSPF from cr2-eqiad.

deactivate interface et-1/1/0

Once done we want to verify that routes to all destinations are still there, going to cr2:

show route terse table inet.0 exact 10.64.32.0/22
show route terse table inet.0 exact 208.80.154.64/26
show route terse table inet.0 exact 10.64.36.0/24
show route table inet6.0 exact 2620:0:861:3::/64
show route table inet6.0 exact 2620:0:861:103::/64
show route table inet6.0 exact 2620:0:861:106::/64
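
As a quick offline sanity check, the prefixes queried above can be matched against the gateway addresses we later ping on each vlan. This is only a sketch using Python's standard `ipaddress` module; the `ROW_C_PREFIXES` mapping and the `vlan_for_address` helper are hypothetical names, and the pairing of each IPv6 prefix with a vlan is inferred from the gateway addresses listed in Step 7.

```python
import ipaddress

# Prefixes from the "show route" commands above, grouped by vlan.
# The vlan pairing is inferred from the gateway IPs in Step 7 (assumption).
ROW_C_PREFIXES = {
    "private1-c-eqiad": ["10.64.32.0/22", "2620:0:861:103::/64"],
    "public1-c-eqiad": ["208.80.154.64/26", "2620:0:861:3::/64"],
    "analytics1-c-eqiad": ["10.64.36.0/24", "2620:0:861:106::/64"],
}


def vlan_for_address(address):
    """Return the row C vlan whose prefix contains `address`, or None."""
    ip = ipaddress.ip_address(address)
    for vlan, prefixes in ROW_C_PREFIXES.items():
        for prefix in prefixes:
            network = ipaddress.ip_network(prefix)
            # Only compare addresses and networks of the same IP version.
            if ip.version == network.version and ip in network:
                return vlan
    return None
```

For example, `vlan_for_address("10.64.32.2")` should map to the private vlan, while an address outside row C returns `None`.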

We also want to check on the graphs that connectivity looks ok, and run some traceroutes from hosts in row A (which uses cr1-eqiad as its VRRP GW, so will use the routes shown by the above commands). Some hosts in row A we can test from include:

wikikube-worker1240
db1151
pki1001
maps1005
cp1102
dns1004
idp1004
apt1002
gitlab1003
an-conf1004
an-worker1118

Step 3: adjust netbox connection for asw2-c2-eqiad et-2/0/53 and run homer

Now that the CR port connected to asw2-c2-eqiad et-2/0/53 is down we can reconfigure it. In Netbox we need to:

  1. Adjust the cable so it shows it connected to ssw1-d1-eqiad ethernet-1/28 instead
  2. Make it a member of ae0 instead of ae1
  3. Enable the ae0 interface
  4. Delete the ae1 interface

After which we can run homer against asw2-c2-eqiad to update the port's AE membership and description.

Step 4: re-cable asw2-c2-eqiad et-2/0/53 to ssw1-d1-eqiad ethernet-1/28

Now that the CR port is disabled we can move the optic from the CR to the Nokia spine port, and re-terminate the fibre link on it.

Step 5: validate we see MAC addresses on the lag1 interface of ssw1-d1-eqiad

We should see MAC addresses learnt on the various vlans:

show network-instance vlan-1019 bridge-table mac-table all
show network-instance vlan-1003 bridge-table mac-table all
show network-instance vlan-1022 bridge-table mac-table all

We should then repeat these commands on ssw1-d8-eqiad, verifying the MAC addresses are being distributed in BGP EVPN within the Nokia cluster.

Step 6: Move cr1-eqiad ae3 sub-interfaces to et-1/0/5 in Netbox and run homer

At this point we can move the ae3.X sub-interfaces in Netbox from the ae3 LAG to port et-1/0/5 (connected to ssw1-d1-eqiad). We should double check that the VRRP groups update as expected when the interfaces are renamed.

When done we can run Homer against cr1-eqiad to enable the new sub-interfaces.

Step 7: verify L3 connectivity from row C hosts to cr1-eqiad

We should now be able to ping the various IPs configured on the moved sub-interfaces on cr1-eqiad. Some suggestions for hosts to test from are:

| Vlan | IPs to ping | Hosts to source pings |
| --- | --- | --- |
| 1019 - private1-c-eqiad | 10.64.32.2 & 2620:0:861:103:fe00::1 | es1045, wikikube-worker1063, db1242 |
| 1003 - public1-c-eqiad | 208.80.154.66 & 2620:0:861:3:fe00::1 | dns1006, alert1002, lists1004 |
| 1022 - analytics1-c-eqiad | 10.64.36.2 & 2620:0:861:106:fe00::1 | an-conf1006, an-worker1131, stat1011 |
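
To avoid typing each test by hand, the table above can be expanded into one ping command per source host and target address. The sketch below is illustrative only: `CR1_TARGETS` and `ping_commands` are hypothetical names, and the `-4`, `-6` and `-c` flags assume the Linux iputils `ping` found on the hosts.

```python
import ipaddress

# cr1-eqiad gateway addresses and suggested source hosts, from the table above.
CR1_TARGETS = {
    "private1-c-eqiad": (
        ["10.64.32.2", "2620:0:861:103:fe00::1"],
        ["es1045", "wikikube-worker1063", "db1242"],
    ),
    "public1-c-eqiad": (
        ["208.80.154.66", "2620:0:861:3:fe00::1"],
        ["dns1006", "alert1002", "lists1004"],
    ),
    "analytics1-c-eqiad": (
        ["10.64.36.2", "2620:0:861:106:fe00::1"],
        ["an-conf1006", "an-worker1131", "stat1011"],
    ),
}


def ping_commands(targets, count=3):
    """Return (host, command) pairs covering every host/address combination."""
    commands = []
    for addresses, hosts in targets.values():
        for host in hosts:
            for address in addresses:
                # Pick the address family flag from the address itself.
                version = ipaddress.ip_address(address).version
                family = "-6" if version == 6 else "-4"
                commands.append((host, f"ping {family} -c {count} {address}"))
    return commands
```

Each returned pair names the host to run the command on; how the command is dispatched (manual ssh or other tooling) is left open here.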

Step 8: Flip VRRP on the CRs to make cr1-eqiad the active GW

At this point we have connectivity to both CR routers on all the vlans again: to cr2-eqiad as things were, directly from asw2-c-eqiad, and to cr1-eqiad via asw2-c-eqiad -> ssw1-d1-eqiad -> cr1-eqiad.

In this step we will change the VRRP priority for all three vlans so they take this new path via the Nokia spine switch. The VRRP groups below should be modified, changing the priority for cr1-eqiad to 200:

1019 - private1-c-eqiad (vrrp)
1003 - public1-c-eqiad (vrrp)
1022 - analytics1-c-eqiad (vrrp)
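
The effect of the priority change can be sketched with a toy model of VRRP master election: the router with the highest priority wins, and a tie goes to the higher interface address. This assumes preemption is enabled, and the priority of 100 used for cr2-eqiad is an assumption for illustration only (this task does not state it); `elect_master` is a hypothetical helper, not part of any tooling.

```python
import ipaddress


def elect_master(routers):
    """Pick the VRRP master from a mapping of name -> (priority, address).

    Highest priority wins; on a tie the higher interface address wins.
    """
    def sort_key(name):
        priority, address = routers[name]
        return (priority, int(ipaddress.ip_address(address)))

    return max(routers, key=sort_key)


# Interface addresses on vlan 1019 as listed in this task.
# cr2-eqiad's priority of 100 is assumed, not taken from the task.
before = {"cr1-eqiad": (90, "10.64.32.2"), "cr2-eqiad": (100, "10.64.32.3")}
after = {"cr1-eqiad": (200, "10.64.32.2"), "cr2-eqiad": (100, "10.64.32.3")}
```

Under these assumptions `elect_master(before)` gives cr2-eqiad and `elect_master(after)` gives cr1-eqiad, which is the flip this step aims for.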

With it changed in Netbox we can run Homer against cr1-eqiad to promote it to master. Once in place we can validate on both CRs:

show vrrp summary | match "ae3|et-1/0/5"

Provided it is master we can look at this graph to validate that the traffic has flipped from one device to the other, and that it is the same order of magnitude as before.

We should check from the same hosts as in the last step that comms are ok to devices outside the current vlan (some are listed in step 2, and on the public vlan we can ping internet destinations).

Phase 2 - Migrate asw2-c7-eqiad et-7/0/49 -> ssw1-d8-eqiad ethernet-1/28

The status at this point is we have one of the links moved, and outbound traffic is flowing through the Nokia spine and out to cr1-eqiad. Next we need to move the other uplink from asw2-c-eqiad, effectively repeating the process for that link.

Step 1: Shut down et-1/1/0 on cr2-eqiad

As before we want to deactivate the interface, then confirm it still knows a route to the various subnets in OSPF from cr1:

deactivate interface et-1/1/0
show route terse table inet.0 exact 10.64.32.0/22
show route terse table inet.0 exact 208.80.154.64/26
show route terse table inet.0 exact 10.64.36.0/24
show route table inet6.0 exact 2620:0:861:3::/64
show route table inet6.0 exact 2620:0:861:103::/64
show route table inet6.0 exact 2620:0:861:106::/64

We should check from a variety of hosts in row D (which use cr2-eqiad as VRRP master) that they can reach hosts in row C, example hosts to source pings are:

es1052
restbase1042
wikikube-worker1034
aqs1019
wikikube-worker1163

Step 3: adjust netbox connection for asw2-c7-eqiad et-7/0/49 and run homer

Now that the CR port connected to asw2-c7-eqiad et-7/0/49 is down we can reconfigure it. In Netbox we need to:

  1. Adjust the cable so it shows it connected to ssw1-d8-eqiad ethernet-1/28 instead
  2. Make it a member of ae0 instead of ae2
  3. Enable the ae0 interface
  4. Delete the ae2 interface

After which we can run homer against asw2-c7-eqiad to update the port's AE membership and description.

Step 4: re-cable asw2-c7-eqiad et-7/0/49 to ssw1-d8-eqiad ethernet-1/28

Now that the CR port is disabled we can move the optic from the CR to the Nokia spine port, and re-terminate the fibre link on it.

Step 5: validate we see MAC addresses on the lag1 interface of ssw1-d8-eqiad

First, verify that the LAG looks healthy on both Nokia spines:

show system network-instance ethernet-segments LAG1

We should see MAC addresses learnt on the various vlans:

show network-instance vlan-1019 bridge-table mac-table all
show network-instance vlan-1003 bridge-table mac-table all
show network-instance vlan-1022 bridge-table mac-table all

Check that the ESI type routes and MAC addresses (type 2) learnt on the LAG port on ssw1-d8-eqiad are being announced in BGP and received on ssw1-d1-eqiad:

show network-instance default protocols bgp routes evpn route-type 1 summary
show network-instance default protocols bgp routes evpn route-type 4 summary
show network-instance default protocols bgp routes evpn route-type 2 summary
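
For reference when reading the output, the route-type numbers in these commands correspond to the EVPN route types defined in RFC 7432 (types 1 to 4) and RFC 9136 (type 5). The small sketch below just pairs each number with its name and rebuilds the three commands above; `EVPN_ROUTE_TYPES` and `evpn_summary_commands` are hypothetical names used only for illustration.

```python
# EVPN route type names per RFC 7432 (types 1-4) and RFC 9136 (type 5).
EVPN_ROUTE_TYPES = {
    1: "Ethernet Auto-Discovery",
    2: "MAC/IP Advertisement",
    3: "Inclusive Multicast Ethernet Tag",
    4: "Ethernet Segment",
    5: "IP Prefix",
}


def evpn_summary_commands(route_types=(1, 4, 2)):
    """Return (description, command) pairs for the checks listed above."""
    base = "show network-instance default protocols bgp routes evpn"
    return [
        (EVPN_ROUTE_TYPES[rt], f"{base} route-type {rt} summary")
        for rt in route_types
    ]
```

Types 1 and 4 relate to the ethernet segment (ESI) behind the LAG, while type 2 carries the learnt MAC addresses.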

Step 6: Move cr2-eqiad ae3 sub-interfaces to et-1/0/5 in Netbox and run homer

At this point we can move the ae3.X sub-interfaces in Netbox from the ae3 LAG to port et-1/0/5 (connected to the Nokia spine). We should double check that the VRRP groups update as expected when the interfaces are renamed.

When done we can run Homer against cr2-eqiad to enable the new sub-interfaces.

Step 7: verify L3 connectivity from row C hosts to cr2-eqiad

We should now be able to ping the various IPs configured on the moved sub-interfaces on cr2-eqiad. Some suggestions for hosts to test from are:

| Vlan | IPs to ping | Hosts to source pings |
| --- | --- | --- |
| 1019 - private1-c-eqiad | 10.64.32.3 & 2620:0:861:103:fe00::2 | es1045, wikikube-worker1063, db1242 |
| 1003 - public1-c-eqiad | 208.80.154.67 & 2620:0:861:3:fe00::2 | dns1006, alert1002, lists1004 |
| 1022 - analytics1-c-eqiad | 10.64.36.3 & 2620:0:861:106:fe00::2 | an-conf1006, an-worker1131, stat1011 |

Step 8: Flip VRRP on the CRs to make cr2-eqiad the active GW again

At this point we have connectivity to both CR routers on all the vlans again, now in both cases via the Nokia spines (asw2-c-eqiad -> ssw1-dX-eqiad -> crX-eqiad).

In this step we will change the VRRP priority for all three vlans so that cr2-eqiad becomes the active gateway again. The VRRP groups below should be modified, changing the priority for cr1-eqiad back to 90:

1019 - private1-c-eqiad (vrrp)
1003 - public1-c-eqiad (vrrp)
1022 - analytics1-c-eqiad (vrrp)

With it changed in Netbox we can run Homer against cr2-eqiad to promote it to master. Once in place we can validate on both CRs:

show vrrp summary | match "et-1/0/5"

Provided it is master we can look at this graph to validate that the traffic has flipped from one device to the other, and it is the same order of magnitude as before.

We should check from the same hosts as in the last step that comms are ok to devices outside the current vlan (some are listed in step 2, and on the public vlan we can ping internet destinations).

Phase 3 - Cleanup
  1. Delete ae3 and sub-interfaces from cr1-eqiad and disable port et-1/1/0
  2. Delete ae3 and sub-interfaces from cr2-eqiad and disable port et-1/1/0

Event Timeline

cmooney triaged this task as Medium priority.

After discussing this with @cmooney over IRC, I reviewed the moves on the Eqiad side and noted that we had one free port on the C2 and C7 ASW switches. To allow for easier rerunning and less juggling on the day of the move, I ran these cables using spare SR4 optics.

The connections are as follows:
asw2-c2-eqiad et52 is connected to ssw1-d1-eqiad ethernet-1/28
asw2-c7-eqiad et49 is connected to ssw1-d8-eqiad ethernet-1/28.

Thanks @Jclark-ctr this is a huge help!

I should have checked: in codfw we had no free 40G ports on any of the old switches, and I assumed the same was true in eqiad. I think the fact that we only have 6 production racks in eqiad row C is what makes the difference, and why these ports were free.

I'll tackle this task step-by-step over the coming week, which will be a lot easier than the "big bang" approach. I'll raise a similar task for row D, where we will need to follow similar steps to the above, but we'll have extra confidence for that if row C is working with the Nokias.

Mentioned in SAL (#wikimedia-operations) [2025-10-20T13:54:59Z] <topranks> enable 2x40G lag from asw2-c-eqiad to ssw1-dX-eqiad T405579

Mentioned in SAL (#wikimedia-operations) [2025-11-03T11:58:01Z] <topranks> move analytics1-c-eqiad gateway IPs to new spine switch ports eqiad T405579

Mentioned in SAL (#wikimedia-operations) [2025-11-03T12:27:30Z] <topranks> adjust VRRP priority for analytics1-d-eqiad to make cr1-eqiad active gateway T405579

Mentioned in SAL (#wikimedia-operations) [2025-11-03T12:35:02Z] <topranks> move analytics1-c-eqiad gateway IPs to new spine switch port cr2-eqiad T405579

Change #1201056 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Eqiad C/D migration: move analytics1-c-eqiad GW to CR et-1/0/5

https://gerrit.wikimedia.org/r/1201056

Change #1201056 merged by jenkins-bot:

[operations/homer/public@master] Eqiad C/D migration: move analytics1-c-eqiad GW to CR et-1/0/5

https://gerrit.wikimedia.org/r/1201056

Mentioned in SAL (#wikimedia-operations) [2025-11-06T14:00:06Z] <topranks> move public1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr2-eqiad (T405579)

Mentioned in SAL (#wikimedia-operations) [2025-11-06T14:07:51Z] <topranks> move public1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr1-eqiad (T405579)

Mentioned in SAL (#wikimedia-operations) [2025-11-06T14:20:01Z] <topranks> move private1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr2-eqiad (T405579)

Mentioned in SAL (#wikimedia-operations) [2025-11-06T14:27:45Z] <topranks> move private1-c-eqiad sub-interface from ae3 to et-1/0/5 on cr1-eqiad (T405579)

Change #1202729 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Eqiad row c: move vlan gateways to ports facing the Nokia spines

https://gerrit.wikimedia.org/r/1202729

This is now complete. For now we will leave things as they are and tackle the migration of the IP gateways to the switch layer once we have full confidence in the Nokia devices.

Change #1202729 merged by jenkins-bot:

[operations/homer/public@master] Eqiad row c: move vlan gateways to ports facing the Nokia spines

https://gerrit.wikimedia.org/r/1202729