
Configuration of New Switches Eqiad Rows E-F
Closed, Resolved | Public

Description

Creating task to track the configuration of the new Juniper QFX switches installed in rows E/F in the new cage in Eqiad.

In total there are 9 switches that have been installed:

LSW1-E1
LSW1-E2
LSW1-E3
LSW1-E4

LSW1-F1
LSW1-F2
LSW1-F3
LSW1-F4
LSW1-F5

These are all QFX5120-48Y devices (48xSFP28 + 8xQSFP28 ports), and as such are named "LSW" (Leaf Switch). They should be connected to dedicated Spine devices (QFX5120-32C, 32xQSFP28 ports), but due to supplier delays the Spines have not yet been delivered. Until they arrive, LSW1-E1 and LSW1-F1 will temporarily act as aggregation / Spine switches, terminating links from the existing cage to CR routers and LVS devices.

The high-level design will be as described in the Google design doc, implementing a routed access layer and utilizing EVPN/VXLAN. Low-level configuration will be completed by applying base configurations with homer, adding the new elements manually, and then tuning the homer templates / netbox data until they produce the same result.

Details

Repo                               Branch       Lines +/-
operations/puppet                  production   +6 -1
operations/homer/public            master       +790 -35
operations/homer/public            master       +8 -10
operations/homer/public            master       +328 -27
operations/homer/public            master       +0 -16
operations/homer/public            master       +18 -0
operations/homer/public            master       +1 -3
operations/homer/public            master       +28 -2
operations/homer/public            master       +40 -3
operations/homer/public            master       +12 -0
operations/homer/public            master       +62 -0
operations/software/homer/deploy   master       +1 -3
operations/puppet                  production   +6 -0
operations/puppet                  production   +1 -1
operations/homer/public            master       +68 -3
operations/software/homer/deploy   master       +53 -9
operations/puppet                  production   +18 -6
operations/homer/public            master       +18 -0
operations/dns                     master       +10 -0
operations/dns                     master       +10 -0
operations/dns                     master       +0 -40
operations/puppet                  production   +224 -0
operations/puppet                  production   +192 -0
operations/dns                     master       +180 -0
operations/software/homer          master       +29 -0
operations/homer/public            master       +6 -4
operations/homer/public            master       +56 -3
operations/homer/public            master       +0 -1
operations/homer/public            master       +50 -1
operations/puppet                  production   +25 -0

Event Timeline

cmooney triaged this task as Medium priority. Jan 21 2022, 12:24 PM
cmooney created this task.
Restricted Application added a subscriber: Aklapper.

Currently waiting on T299759 to be completed to gain console access to these devices and begin the process.

Just to update: we've had console access for most of this week, and configuration / testing is under way. Will submit CRs when the config is ready.

Just to add an update here: the main VXLAN/EVPN configuration has been added to the devices, and using some test servers kindly installed by dc-ops I've been able to validate that the basic fabric is working.

L2 Ping between servers on same vlan, connected to different top-of-rack switches:
root@an-worker1148-test:~# ping -I 10.64.134.11 -M do -s 9000 10.64.134.10
PING 10.64.134.10 (10.64.134.10) from 10.64.134.11 : 9000(9028) bytes of data.
9008 bytes from 10.64.134.10: icmp_seq=1 ttl=64 time=0.194 ms
9008 bytes from 10.64.134.10: icmp_seq=2 ttl=64 time=0.264 ms
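
The sizes in the ping output above can be reproduced with a little arithmetic. The sketch below assumes IPv4 with no options and an untagged inner Ethernet frame; the function names are only illustrative, and the underlay MTU actually configured on the switches is not stated in this task. Because `-M do` sets the don't-fragment bit, a successful reply suggests the path carries the full 9028-byte IP packet, plus the VXLAN encapsulation where the traffic crosses the fabric.

```python
# Header sizes in bytes (IPv4 without options, no VLAN tag on the inner frame).
ICMP_HEADER = 8
IPV4_HEADER = 20
ETHERNET_HEADER = 14
UDP_HEADER = 8
VXLAN_HEADER = 8


def icmp_ip_packet_size(payload: int) -> int:
    """Size of the IPv4 packet produced by `ping -s <payload>`."""
    return payload + ICMP_HEADER + IPV4_HEADER


def vxlan_outer_ip_size(inner_ip_packet: int) -> int:
    """Size of the outer IPv4 packet after VXLAN encapsulation of the inner frame."""
    inner_frame = inner_ip_packet + ETHERNET_HEADER
    return inner_frame + VXLAN_HEADER + UDP_HEADER + IPV4_HEADER


inner = icmp_ip_packet_size(9000)
outer = vxlan_outer_ip_size(inner)
print(inner)          # 9028, matching "9000(9028) bytes of data"
print(outer - inner)  # 50 bytes of encapsulation overhead
print(outer)          # 9078
```

The "9008 bytes from" figure in each reply is the 9000-byte payload plus the 8-byte ICMP header.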

Local/remote MACs as seen from that server:

root@an-worker1148-test:~# ip neigh show 10.64.134.10
10.64.134.10 dev eno2np1 lladdr e4:3d:1a:54:14:45 STALE
root@an-worker1148-test:~#
root@an-worker1148-test:~# ip -br link show eno2np1
eno2np1          UP             e4:3d:1a:54:ab:a7 <BROADCAST,MULTICAST,UP,LOWER_UP>

On the switch it connects to, you can see its own MAC address learned on port xe-0/0/6, and the other server's MAC reachable via VTEP/VXLAN:

cmooney@lsw1-f1-eqiad> show ethernet-switching table vlan-id 1035 

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC, O - ovsdb MAC)


Ethernet switching table : 4 entries, 4 learned
Routing instance : default-switch
   Vlan                MAC                 MAC      Logical                SVLBNH/      Active
   name                address             flags    interface              VENH Index   source
   private1-f1-eqiad   00:00:5e:11:fa:ce   DRP      esi.1780               1779         05:00:00:fd:2a:00:1e:88:8b:00 
   private1-f1-eqiad   a4:e1:1a:81:3a:80   DR       vtep.32769                          10.64.128.3                   
   private1-f1-eqiad   e4:3d:1a:54:14:45   DR       vtep.32769                          10.64.128.3                   
   private1-f1-eqiad   e4:3d:1a:54:ab:a7   D        xe-0/0/6.0
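
As a rough illustration of how to read this table, the sketch below parses the four rows shown above and marks each MAC as local or remote using the "R - remote PE MAC" flag from the legend. The regular expression and the `parse_mac_table` helper are hypothetical, written only against the column layout visible here; they are not part of homer or any Juniper tooling.

```python
import re

# Rows copied from the lsw1-f1-eqiad output above.
SAMPLE = """\
   private1-f1-eqiad   00:00:5e:11:fa:ce   DRP      esi.1780               1779         05:00:00:fd:2a:00:1e:88:8b:00
   private1-f1-eqiad   a4:e1:1a:81:3a:80   DR       vtep.32769                          10.64.128.3
   private1-f1-eqiad   e4:3d:1a:54:14:45   DR       vtep.32769                          10.64.128.3
   private1-f1-eqiad   e4:3d:1a:54:ab:a7   D        xe-0/0/6.0
"""

# vlan name, MAC address, flags, logical interface (later columns are ignored).
ROW = re.compile(
    r"^\s*(?P<vlan>\S+)\s+(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+"
    r"(?P<flags>[A-Z]+)\s+(?P<interface>\S+)"
)


def parse_mac_table(text):
    """Return one dict per table row, with a boolean 'remote' key."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            entry = match.groupdict()
            # Per the legend, R marks a MAC learned from a remote PE.
            entry["remote"] = "R" in entry["flags"]
            entries.append(entry)
    return entries


for entry in parse_mac_table(SAMPLE):
    where = "remote" if entry["remote"] else "local"
    print(entry["mac"], entry["interface"], where)
```

Only e4:3d:1a:54:ab:a7 on xe-0/0/6.0 is reported as local; the other three entries carry the R flag.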

The reverse is true on the far-side switch (connected to the ping destination):

cmooney@lsw1-e1-eqiad> show ethernet-switching table vlan-id 1035 

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC, O - ovsdb MAC)


Ethernet switching table : 4 entries, 4 learned
Routing instance : default-switch
   Vlan                MAC                 MAC      Logical                SVLBNH/      Active
   name                address             flags    interface              VENH Index   source
   private1-f1-eqiad   00:00:5e:11:fa:ce   DRP      esi.1797               1796         05:00:00:fd:2a:00:1e:88:8b:00 
   private1-f1-eqiad   a4:e1:1a:81:9e:80   DR       vtep.32770                          10.64.128.7                   
   private1-f1-eqiad   e4:3d:1a:54:14:45   D        xe-0/0/6.0           
   private1-f1-eqiad   e4:3d:1a:54:ab:a7   DR       vtep.32770                          10.64.128.7

L3 routing between local subnets in different racks

L3 routing via the overlay VRF/routing instance is also working between racks.

root@an-worker1148-test:~# ping -I 10.64.134.11 -M do -s 9000 10.64.130.10
PING 10.64.130.10 (10.64.130.10) from 10.64.134.11 : 9000(9028) bytes of data.
9008 bytes from 10.64.130.10: icmp_seq=1 ttl=63 time=0.214 ms
9008 bytes from 10.64.130.10: icmp_seq=2 ttl=63 time=0.251 ms
root@an-worker1148-test:~# traceroute -w 1 -n -s 10.64.134.11 10.64.130.10
traceroute to 10.64.130.10 (10.64.130.10), 30 hops max, 60 byte packets
 1  10.64.134.254  0.649 ms  0.620 ms  0.603 ms
 2  10.64.130.1  8.107 ms  8.090 ms  8.075 ms
 3  10.64.130.10  0.150 ms  0.136 ms  0.121 ms

In the trace above, the first hop is the IRB interface on the connected top-of-rack switch, the second hop is the IRB on the remote switch connected to the destination server, and hop 3 is the destination itself. Packets on the wire between hop 1 and hop 2 will have been tunneled using VXLAN.
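
To make the hop-to-subnet mapping concrete, here is a small sketch using Python's standard `ipaddress` module. The 10.64.130.0/24 prefix is the one shown in the PRODUCTION.inet.0 route output that follows; the /24 length for the source subnet is an assumption, since it is not stated in this task, and the `classify` helper is purely illustrative.

```python
import ipaddress

# Destination prefix as seen in the EVPN route for 10.64.130.10.
DEST_SUBNET = ipaddress.ip_network("10.64.130.0/24")
# Assumed prefix length for the source server's subnet (not given in the task).
SOURCE_SUBNET = ipaddress.ip_network("10.64.134.0/24")

# Hop addresses copied from the traceroute output.
HOPS = ["10.64.134.254", "10.64.130.1", "10.64.130.10"]


def classify(hop: str) -> str:
    """Say which of the two subnets a hop address falls in."""
    addr = ipaddress.ip_address(hop)
    if addr in SOURCE_SUBNET:
        return "source subnet"
    if addr in DEST_SUBNET:
        return "destination subnet"
    return "other"


for number, hop in enumerate(HOPS, start=1):
    print(number, hop, classify(hop))
```

Under those assumptions hop 1 sits in the source server's subnet, while hops 2 and 3 both fall inside the destination prefix, which is consistent with the description of the trace.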

The switch receiving the packets in the above example follows this EVPN type 5 route it's learnt from the remote switch:

cmooney@lsw1-f1-eqiad> show route table PRODUCTION.inet.0 10.64.130.10 

PRODUCTION.inet.0: 15 destinations, 17 routes (15 active, 0 holddown, 0 hidden)
@ = Routing Use Only, # = Forwarding Use Only
+ = Active Route, - = Last Active, * = Both

10.64.130.0/24     *[EVPN/170] 00:42:48
                    >  to 10.64.129.6 via et-0/0/52.0

Next Steps

With this basic connectivity confirmed, the next steps are as follows:

  1. Review testing checklist, adding any additional checks needed based on SONiC discussion.
     https://docs.google.com/spreadsheets/d/1Myz9OZkvB6dbnR8oIumaMtwGMEYd5N6gRUJdhvX_r7g
  2. Establish the link / BGP peering between the CR routers and lsw1-e1/f1 (acting as Spines for now).
  3. Test access to rest of network from hosts in new cage.
  4. Test external access from hosts in new cage.
  5. Test re-image / install / DHCP relay works as expected for hosts in the new cage.
  6. Agree on how to handle Analytics servers and set up Vlans as needed.
  7. Build connectivity to the LVS servers in existing rows, test they can communicate with hosts on every vlan.
  8. Validate any other server deploys go ok / work through any niggles.

Probably more than that but that's the high level.

The automation / homer piece also needs to be completed. I've used templates to create the current configs but they probably need some review / refactoring. My main goal thus far was to get the config in place, rather than produce the most efficient automation scripts. Once I'm happy the configuration being produced is what we need I will submit patches via Gerrit, then work with Arzhel and Riccardo to finesse things until we are happy with it.

Change 759331 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add eBGP peering between CR routers and datacenter switches.

https://gerrit.wikimedia.org/r/759331

Change 759467 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Added IP ranges for new subnets Eqiad expansion cage E/F

https://gerrit.wikimedia.org/r/759467

Change 759467 merged by Cathal Mooney:

[operations/puppet@production] Added IP ranges for new subnets Eqiad expansion cage E/F

https://gerrit.wikimedia.org/r/759467

Change 759331 merged by jenkins-bot:

[operations/homer/public@master] Add eBGP peering between CR routers and datacenter switches.

https://gerrit.wikimedia.org/r/759331

Mentioned in SAL (#wikimedia-operations) [2022-02-03T11:15:24Z] <topranks> Adding BGP peering to lsw1-f1-eqiad on cr2-eqiad. T299758.

Change 759473 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Removed local-as statement on eBGP peering from CR to SPINE

https://gerrit.wikimedia.org/r/759473

Change 759473 merged by jenkins-bot:

[operations/homer/public@master] Removed local-as statement on eBGP peering from CR to SPINE

https://gerrit.wikimedia.org/r/759473

Change 759500 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add inbound and outbound BGP filters on CR to SPINE eBGP sessions

https://gerrit.wikimedia.org/r/759500

Change 759500 merged by jenkins-bot:

[operations/homer/public@master] Add inbound and outbound BGP filters on CR to SPINE eBGP sessions

https://gerrit.wikimedia.org/r/759500

Change 759503 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Adjust CR templates so BGP_Switch_In doesn't reference K8s policy.

https://gerrit.wikimedia.org/r/759503

Change 759503 merged by jenkins-bot:

[operations/homer/public@master] Adjust CR templates so BGP_Switch_In doesn't reference K8s policy.

https://gerrit.wikimedia.org/r/759503

Change 759707 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer@master] Add new function to return device 'underlay' network links.

https://gerrit.wikimedia.org/r/759707

Change 759709 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Base config additions and updated templates to configure EVPN ASW

https://gerrit.wikimedia.org/r/759709

Change 759707 abandoned by Cathal Mooney:

[operations/software/homer@master] Add new function to return device 'underlay' network links.

Reason:

Logic should be in the WMF-specific module.

https://gerrit.wikimedia.org/r/759707

Change 760566 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer/deploy@master] Add new function to return device 'underlay' network links.

https://gerrit.wikimedia.org/r/760566

Change 761473 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Adding includes for Netbox-generated zone files for new eqiad subnets

https://gerrit.wikimedia.org/r/761473

Change 761473 merged by Cathal Mooney:

[operations/dns@master] Adding includes for Netbox-generated zone files for new eqiad subnets

https://gerrit.wikimedia.org/r/761473

Change 763766 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add new Eqiad private and analytics subnets to dhcp.conf

https://gerrit.wikimedia.org/r/763766

Change 763766 merged by Cathal Mooney:

[operations/puppet@production] Add new Eqiad private and analytics subnets to dhcp.conf

https://gerrit.wikimedia.org/r/763766

Change 764355 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add per-subnet netboot conf files for new row E-F subnets in Eqiad

https://gerrit.wikimedia.org/r/764355

Change 764355 merged by Cathal Mooney:

[operations/puppet@production] Add per-subnet netboot conf files for new row E-F subnets in Eqiad

https://gerrit.wikimedia.org/r/764355

Change 764791 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Adding more new LEAF switches from Eqiad rows E/F to monitoring

https://gerrit.wikimedia.org/r/764791

Change 766832 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Remove authdns includes for reverse zones Eqiad rack E4/F4 subnets

https://gerrit.wikimedia.org/r/766832

Change 766832 merged by Cathal Mooney:

[operations/dns@master] Remove authdns includes for reverse zones Eqiad rack E4/F4 subnets

https://gerrit.wikimedia.org/r/766832

Change 767562 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Adding includes for Netbox-generated zone files for eqiad evpn lb

https://gerrit.wikimedia.org/r/767562

Change 767570 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add site variable for EVPN overlay loopback subnets and CR filter

https://gerrit.wikimedia.org/r/767570

Change 767562 abandoned by Cathal Mooney:

[operations/dns@master] Adding includes for Netbox-generated zone files for eqiad evpn lb

Reason:

Will resubmit, messed up in git.

https://gerrit.wikimedia.org/r/767562

Change 767748 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Try 2 to add Netbox-generated zone files for eqiad evpn loopbacks

https://gerrit.wikimedia.org/r/767748

Change 767748 merged by Cathal Mooney:

[operations/dns@master] Try 2 to add Netbox-generated zone files for eqiad evpn loopbacks

https://gerrit.wikimedia.org/r/767748

Change 767772 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Remove puppet subnet definitions for private subnets rack E4/F4

https://gerrit.wikimedia.org/r/767772

Change 767570 merged by jenkins-bot:

[operations/homer/public@master] Add EVPN overlay loopback subnets to CR BGP policy to switches

https://gerrit.wikimedia.org/r/767570

Change 767772 merged by Cathal Mooney:

[operations/puppet@production] Add subnet definitions for new Analytics vlans to netops data.yaml

https://gerrit.wikimedia.org/r/767772

Change 767862 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add several ASNs to those that alert as critical from Icinga

https://gerrit.wikimedia.org/r/767862

Change 769478 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Initial changes to Homer config and templates for EVPN switches Eqiad

https://gerrit.wikimedia.org/r/769478

Change 760566 merged by Cathal Mooney:

[operations/software/homer/deploy@master] New function and changes to wmf-netbox plugin to support EVPN config.

https://gerrit.wikimedia.org/r/760566

Change 769478 merged by Cathal Mooney:

[operations/homer/public@master] Initial changes to Homer config and templates for EVPN switches Eqiad

https://gerrit.wikimedia.org/r/769478

Change 767862 merged by Cathal Mooney:

[operations/puppet@production] Add several ASNs to those that alert as critical from Icinga

https://gerrit.wikimedia.org/r/767862

Change 769950 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add new QFX switches in Eqiad row E/F to rancid for config backup

https://gerrit.wikimedia.org/r/769950

Change 769950 merged by Cathal Mooney:

[operations/puppet@production] Add new QFX switches in Eqiad row E/F to rancid for config backup

https://gerrit.wikimedia.org/r/769950

Change 770966 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer/deploy@master] Change _get_underlay_ints() to use fetch_device_interfaces()

https://gerrit.wikimedia.org/r/770966

Change 770966 merged by Cathal Mooney:

[operations/software/homer/deploy@master] Change _get_underlay_ints() to use fetch_device_interfaces()

https://gerrit.wikimedia.org/r/770966

Change 771461 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add ACL filter to Spine switch interface connecting CR routers Eqiad

https://gerrit.wikimedia.org/r/771461

Change 771461 abandoned by Cathal Mooney:

[operations/homer/public@master] Add ACL filter to Spine switch interface connecting CR routers Eqiad

Reason:

Most definitely not the right approach to this.

https://gerrit.wikimedia.org/r/771461

Change 772868 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add new Analytics subnets to static Capirca net definitions

https://gerrit.wikimedia.org/r/772868

Change 772868 merged by jenkins-bot:

[operations/homer/public@master] Add new Analytics subnets to static Capirca net definitions

https://gerrit.wikimedia.org/r/772868

Change 773587 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add template to configure IPv6 RAs on CRs and L3 Switches

https://gerrit.wikimedia.org/r/773587

Change 777855 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad

https://gerrit.wikimedia.org/r/777855

Change 777855 merged by jenkins-bot:

[operations/homer/public@master] Add inbound filter to analytics IRB interfaces on EVPN switches Eqiad

https://gerrit.wikimedia.org/r/777855

Change 773587 merged by jenkins-bot:

[operations/homer/public@master] Add template to configure IPv6 RAs on CRs and L3 Switches

https://gerrit.wikimedia.org/r/773587

Change 779100 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Modify homer automation for IPv6 RAs to allow for custom interfaces

https://gerrit.wikimedia.org/r/779100

Change 779100 merged by jenkins-bot:

[operations/homer/public@master] Modify homer automation for IPv6 RAs to allow for custom interfaces

https://gerrit.wikimedia.org/r/779100

Change 779101 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove IPv6 RA config on cr2-drmrs fxp0.0

https://gerrit.wikimedia.org/r/779101

Change 779101 merged by jenkins-bot:

[operations/homer/public@master] Remove IPv6 RA config on cr2-drmrs fxp0.0

https://gerrit.wikimedia.org/r/779101

Change 779844 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Remove config/var for defining bespoke interfaces for IPv6 RAs

https://gerrit.wikimedia.org/r/779844

Change 779844 merged by jenkins-bot:

[operations/homer/public@master] Remove config/var for defining bespoke interfaces for IPv6 RAs

https://gerrit.wikimedia.org/r/779844

Change 786296 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add automation templates for EVPN switch overlay BGP

https://gerrit.wikimedia.org/r/786296

Change 786296 merged by jenkins-bot:

[operations/homer/public@master] Add automation templates for EVPN switch overlay BGP

https://gerrit.wikimedia.org/r/786296

Change 789597 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Minor fixes to ASW EVPN templates

https://gerrit.wikimedia.org/r/789597

Change 789597 merged by jenkins-bot:

[operations/homer/public@master] Minor fixes to ASW EVPN templates

https://gerrit.wikimedia.org/r/789597

Change 759709 abandoned by Cathal Mooney:

[operations/homer/public@master] Base config additions and updated templates to configure EVPN ASW

Reason:

https://gerrit.wikimedia.org/r/759709

Change 764791 abandoned by Ayounsi:

[operations/puppet@production] Adding more new LEAF switches from Eqiad rows E/F to monitoring

Reason:

Taking the liberty to abandon it to clean my gerrit dashboard as it has been done differently.

https://gerrit.wikimedia.org/r/764791