ULSFO: New switch configuration
Closed, Resolved, Public

Description

Below is the configuration we will be using on the new switches, along with the modifications that need to be made on the core routers and the mgmt router.

  • IP addressing core
| Device side A | Interface | IPV4 | IPV6 | Device side B | Interface | IPV4 | IPV6 | Comments |
| cr3-ulsfo | et-0/0/1 | 198.35.26.142/31 | 2620:0:863:fe03::1/64 | asw1-22-ulsfo | ethernet-1/55 | 198.35.26.143/31 | 2620:0:863:fe03::2/64 | config done on asw |
| cr3-ulsfo | et-0/0/2 | 198.35.26.148/31 | 2620:0:863:fe09::1/64 | asw1-23-ulsfo | ethernet-1/55 | 198.35.26.149/31 | 2620:0:863:fe09::2/64 | config done on asw |
| cr4-ulsfo | et-0/0/1 | 198.35.26.146/31 | 2620:0:863:fe0a::1/64 | asw1-22-ulsfo | ethernet-1/56 | 198.35.26.147/31 | 2620:0:863:fe0a::2/64 | config done on asw |
| cr4-ulsfo | et-0/0/2 | 198.35.26.144/31 | 2620:0:863:fe0b::1/64 | asw1-23-ulsfo | ethernet-1/56 | 198.35.26.145/31 | 2620:0:863:fe0b::2/64 | config done on asw |
| cr3-ulsfo | et-0/0/0 | 198.35.26.136/31 | 2620:0:863:fe00::1/64 | cr4-ulsfo | et-0/0/0 | 198.35.26.137/31 | 2620:0:863:fe00::2/64 | |
| mr1-ulsfo | ge-0/0/3 | 10.128.127.3/31 | 2620:0:863:fe05::2/64 | asw1-22-ulsfo | ethernet-1/48 | 10.128.127.2/31 | 2620:0:863:fe05::1/64 | asw1-22 will take cr3 IPV6, mr1 keeps the same IPV6 |
| mr1-ulsfo | ge-0/0/4 | 10.128.127.5/31 | 2620:0:863:fe06::2/64 | asw1-23-ulsfo | ethernet-1/48 | 10.128.127.4/31 | 2620:0:863:fe06::1/64 | asw1-23 will take cr4 IPV6, mr1 keeps the same IPV6 |
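A quick sanity check of the link addressing can be sketched with Python's `ipaddress` module. This is only an illustration, not part of the deployed tooling: the `LINKS` list and both helper functions are hypothetical names, and the cr4/asw1-23 switch-side address is taken as 198.35.26.145, which is the peer address the cr4 BGP summary reports.

```python
import ipaddress

# (side A v4, side B v4, side A v6, side B v6) for the switch-facing links.
LINKS = [
    ("198.35.26.142/31", "198.35.26.143/31", "2620:0:863:fe03::1/64", "2620:0:863:fe03::2/64"),
    ("198.35.26.148/31", "198.35.26.149/31", "2620:0:863:fe09::1/64", "2620:0:863:fe09::2/64"),
    ("198.35.26.146/31", "198.35.26.147/31", "2620:0:863:fe0a::1/64", "2620:0:863:fe0a::2/64"),
    ("198.35.26.144/31", "198.35.26.145/31", "2620:0:863:fe0b::1/64", "2620:0:863:fe0b::2/64"),
    ("10.128.127.3/31", "10.128.127.2/31", "2620:0:863:fe05::2/64", "2620:0:863:fe05::1/64"),
    ("10.128.127.5/31", "10.128.127.4/31", "2620:0:863:fe06::2/64", "2620:0:863:fe06::1/64"),
]


def same_subnet(a: str, b: str) -> bool:
    """True when two interface addresses are distinct and share one network."""
    ia, ib = ipaddress.ip_interface(a), ipaddress.ip_interface(b)
    return ia.network == ib.network and ia.ip != ib.ip


def check_links(links):
    """Return the links whose two ends do not sit in the same v4 and v6 subnet."""
    return [
        link
        for link in links
        if not (same_subnet(link[0], link[1]) and same_subnet(link[2], link[3]))
    ]


if __name__ == "__main__":
    print("mismatched links:", check_links(LINKS))
    # A mistyped octet on one side is caught because the /31 networks differ.
    print(same_subnet("198.35.26.144/31", "198.35.16.145/31"))
```

A check like this is cheap to run before pushing config, since a single wrong octet on one end of a /31 keeps the BGP session from ever establishing.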
  • Setup BGP on asw1-22 to core routers
  • Setup BGP on asw1-23 to core routers
  • Setup BGP on cr3 to asw1-22/23
  • Setup BGP on cr4 to asw1-22/23
  • Setup BGP on mr1 to asw1-22/23
  • Note: Change et-0/0/1 speed to 100g on each router after disconnecting it from the old switch
  • Add both switches to monitoring
  • IP addressing loopback
  • cr3-ulsfo 198.35.26.128/32 - 2620:0:863:ffff::1/128
  • cr4-ulsfo 198.35.26.129/32 - 2620:0:863:ffff::2/128
  • mr1-ulsfo 198.35.26.130/32 - 2620:0:863:ffff::3/128
  • asw1-22-ulsfo 198.35.26.131/32 - 2620:0:863:ffff::4/128
  • asw1-23-ulsfo 198.35.26.132/32 - 2620:0:863:ffff::5/128
  • irb configuration

Right now the default gateway is set up on the routers since we are using a Virtual Chassis design. We will be moving the default gateway down to the switches.

on asw1-22-ulsfo

  • Create irb.411 public1-22-ulsfo = 198.35.26.1/27 2620:0:863:1::1/64 and change the /28 in Netbox to /27
  • Create irb.421 private1-22-ulsfo = 10.128.0.1/24 2620:0:863:101::1/64

on asw1-23-ulsfo

  • Create irb.412 public1-23-ulsfo = 198.35.26.33/27 2620:0:863:2::1/64
  • Create irb.422 private1-23-ulsfo = 10.128.1.1/24 2620:0:863:102::1/64
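The IRB plan above can be checked mechanically. The sketch below is an assumption-laden illustration (the `IRBS` mapping and helper names are made up for this example): it confirms that each gateway is the first host of its subnet and that no two per-rack subnets overlap.

```python
import ipaddress

# IPv4 gateway addresses from the irb plan above.
IRBS = {
    "irb.411": "198.35.26.1/27",   # public1-22-ulsfo
    "irb.421": "10.128.0.1/24",    # private1-22-ulsfo
    "irb.412": "198.35.26.33/27",  # public1-23-ulsfo
    "irb.422": "10.128.1.1/24",    # private1-23-ulsfo
}


def is_first_host(addr: str) -> bool:
    """True when the address is network address + 1 in its own subnet."""
    iface = ipaddress.ip_interface(addr)
    return iface.ip == iface.network.network_address + 1


def overlapping(irbs):
    """Return pairs of IRB names whose subnets overlap."""
    nets = {name: ipaddress.ip_interface(a).network for name, a in irbs.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


if __name__ == "__main__":
    for name, addr in IRBS.items():
        print(name, addr, "first host:", is_first_host(addr))
    print("overlaps:", overlapping(IRBS))
```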

-Some renaming
Private IPV4

  • private1-ulsfo = 10.128.0.0/24 rename the vlan to private1-22-ulsfo vlan id 421
  • create private1-23-ulsfo and assign it the prefix 10.128.1.0/24 vlan id 422

Private IPV6

  • private1-ulsfo = 2620:0:863:101::/64: keep this prefix and rename the vlan to private1-22-ulsfo with vlan id 421
  • create private1-23-ulsfo and assign it the prefix 2620:0:863:102::/64 with vlan id 422

Public IPV4

  • public1-ulsfo = 198.35.26.0/28 change this to 198.35.26.0/27 and rename the vlan to public1-22-ulsfo with vlan id 411
  • create public1-23-ulsfo and assign it the prefix 198.35.26.32/27 vlan id 412

Public IPV6

  • public1-ulsfo = 2620:0:863:1::/64: keep this prefix and rename the vlan to public1-22-ulsfo with vlan id 411
  • create public1-23-ulsfo and assign it the prefix 2620:0:863:2::/64 vlan id 412
  • homer configuration

Devices.yaml

cr3-ulsfo.wikimedia.org:
  config:
  |
  |
    device_bgp:
      sw_mr: # To be removed #
        mr1-ulsfo: {4: 198.35.26.199, 6: 2620:0:863:fe05::2} # To be removed#
      cr_switch:
        asw1-22-ulsfo: {4: 198.35.26.143, 6: 2620:0:863:fe03::2, peer_as: 4265004001}
        asw1-23-ulsfo: {4: 198.35.26.149, 6: 2620:0:863:fe09::2, peer_as: 4265004002}

cr4-ulsfo.wikimedia.org:
  config:
  |
  |
    device_bgp:
      sw_mr: # To be removed #
        mr1-ulsfo: {4: 198.35.26.201, 6: 2620:0:863:fe06::2} # To be removed #
      cr_switch:
        asw1-22-ulsfo: {4: 198.35.26.147, 6: 2620:0:863:fe0a::2, peer_as: 4265004001}
        asw1-23-ulsfo: {4: 198.35.26.145, 6: 2620:0:863:fe0b::2, peer_as: 4265004002}

asw1-22-ulsfo.mgmt.ulsfo.wmnet:
  config:
    asn: 4265004001
    capirca:
      -  srl-common-loopback
    device_bgp:
      sw_mr:
        mr1-ulsfo: {4: 10.128.127.3, 6: 2620:0:863:fe05::2}
      sw_external:
        cr3-ulsfo: {4: 198.35.26.142, 6: 2620:0:863:fe03::1}
        cr4-ulsfo: {4: 198.35.26.146, 6: 2620:0:863:fe0a::1}

asw1-23-ulsfo.mgmt.ulsfo.wmnet:
  config:
    asn: 4265004002
    capirca:
      -  srl-common-loopback
    device_bgp:
      sw_mr:
        mr1-ulsfo: {4: 10.128.127.5, 6: 2620:0:863:fe06::2}
      sw_external:
        cr3-ulsfo: {4: 198.35.26.148, 6: 2620:0:863:fe09::1}
        cr4-ulsfo: {4: 198.35.26.144, 6: 2620:0:863:fe0b::1}

#### mr1-ulsfo configuration ####

mr1-ulsfo.wikimedia.org:
  timeout: 120
  config:
    security_zones:
      - name: production
        services: ['ssh', 'ping', 'traceroute', 'snmp', 'bgp']
        interfaces: ['lo0.0', 'ge-0/0/3', 'ge-0/0/4']
      - name: untrust
        services: ['ssh', 'ping', 'traceroute']
        interfaces: ['ge-0/0/0']
      - name: mgmt
        services: ['ssh', 'ping', 'traceroute', 'dhcp']
        interfaces: ['irb.900']
    capirca:
      - mr-security-policies
    device_bgp:
      mr_sw:
        cr3-ulsfo: {4: 198.35.26.198, 6: 2620:0:863:fe05::1, peer_as: 14907} # To be removed #
        cr4-ulsfo: {4: 198.35.26.200, 6: 2620:0:863:fe06::1, peer_as: 14907} # To be removed #
        asw1-22-ulsfo: {4: 10.128.127.2, 6: 2620:0:863:fe05::1, peer_as: 4265004001}
        asw1-23-ulsfo: {4: 10.128.127.4, 6: 2620:0:863:fe06::1, peer_as: 4265004002}
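The two switch ASNs in the config above (4265004001 and 4265004002) sit in the 32-bit private-use range that RFC 6996 reserves, while 14907 is the routers' AS as seen in the BGP output. A minimal sketch of that check, with hypothetical helper and constant names:

```python
# Private-use ASN ranges per RFC 6996 (upper bounds are inclusive in the RFC,
# hence the +1 on each range() stop value).
PRIVATE_2BYTE = range(64512, 65534 + 1)
PRIVATE_4BYTE = range(4200000000, 4294967294 + 1)

# ASNs taken from the devices.yaml stanzas above.
SWITCH_ASNS = {
    "asw1-22-ulsfo": 4265004001,
    "asw1-23-ulsfo": 4265004002,
}


def is_private_asn(asn: int) -> bool:
    """True when the ASN is reserved for private use."""
    return asn in PRIVATE_2BYTE or asn in PRIVATE_4BYTE


def check_switch_asns(asns):
    """Every switch ASN should be private and unique per device."""
    values = list(asns.values())
    return all(is_private_asn(a) for a in values) and len(set(values)) == len(values)


if __name__ == "__main__":
    print(check_switch_asns(SWITCH_ASNS))
    print(is_private_asn(14907))
```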


----------------------------------------------------------------

  • Configuration generated in my LAB
  • BGP verification in my LAB
  • asw1-22-ulsfo
asw1-22-ulsfo> show bgp summary 
Groups: 1 Peers: 4 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               
                      26         15          0          0          0          0
inet6.0              
                      22         11          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
198.35.26.142         14907        104         95       0       0       41:48 Establ
  inet.0: 7/13/13/0
198.35.26.146         14907        102         94       0       0       41:49 Establ
  inet.0: 8/13/13/0
2620:0:863:fe03::1       14907        105         94       0       0       41:38 Establ
  inet6.0: 5/11/11/0
2620:0:863:fe0a::1       14907        104         94       0       0       41:35 Establ
  inet6.0: 6/11/11/0
  • asw1-23-ulsfo
asw1-23-ulsfo> show bgp summary 
Threading mode: BGP I/O
Groups: 1 Peers: 4 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               
                      28         15          0          0          0          0
inet6.0              
                      24         13          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
198.35.26.144         14907         83         77       0       0       33:25 Establ
  inet.0: 8/14/14/0
198.35.26.148         14907         83         77       0       0       33:21 Establ
  inet.0: 7/14/14/0
2620:0:863:fe09::1       14907         86         76       0       0       33:10 Establ
  inet6.0: 6/12/12/0
2620:0:863:fe0b::1       14907         87         77       0       0       33:14 Establ
  inet6.0: 7/12/12/0
  • cr3-ulsfo
cr3-ulsfo-dfw# run show bgp summary group Switch 
Threading mode: BGP I/O
Default eBGP mode: advertise - accept, receive - accept
Groups: 6 Peers: 13 Down peers: 7
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               
                      11          6          0          0          0          0
inet6.0              
                      10          3          0          0          0          0
inet.2               
                       0          0          0          0          0          0
inet6.2              
                       0          0          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
198.35.26.143    4265004001         86         92       0       0       37:13 Establ
  inet.0: 3/3/3/0
198.35.26.149    4265004002         71         75       0       0       30:23 Establ
  inet.0: 3/3/3/0
2620:0:863:fe03::2  4265004001         85         93       0       0       37:03 Establ
  inet6.0: 2/4/4/0
2620:0:863:fe09::2  4265004002         71         79       0       0       30:12 Establ
  inet6.0: 1/3/3/0
  • cr4-ulsfo
cr4-ulsfo-dfw> show bgp summary group Switch 
Threading mode: BGP I/O
Default eBGP mode: advertise - accept, receive - accept
Groups: 6 Peers: 13 Down peers: 7
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               
                      11          6          0          0          0          0
inet6.0              
                      10          3          0          0          0          0
inet.2               
                       0          0          0          0          0          0
inet6.2              
                       0          0          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
198.35.26.145    4265004002        170        173       0       0     1:15:00 Establ
  inet.0: 3/3/3/0
198.35.26.147    4265004001        184        188       0       0     1:21:47 Establ
  inet.0: 3/3/3/0
2620:0:863:fe0a::2  4265004001        183        190       0       0     1:21:33 Establ
  inet6.0: 2/4/4/0
2620:0:863:fe0b::2  4265004002        170        177       0       0     1:14:49 Establ
  inet6.0: 1/3/3/0
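The summaries above can also be checked by script instead of by eye. The following is a rough sketch that assumes the multi-table layout shown here, where the last column of each peer line is the session state; `parse_peers` and `SAMPLE` are names invented for the example, and other Junos output variants (for instance a single-table view that prints route counts in the state column) are not handled.

```python
import ipaddress

# Peer section of the cr4 output above.
SAMPLE = """\
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped...
198.35.26.145    4265004002        170        173       0       0     1:15:00 Establ
  inet.0: 3/3/3/0
198.35.26.147    4265004001        184        188       0       0     1:21:47 Establ
  inet.0: 3/3/3/0
2620:0:863:fe0a::2  4265004001        183        190       0       0     1:21:33 Establ
  inet6.0: 2/4/4/0
2620:0:863:fe0b::2  4265004002        170        177       0       0     1:14:49 Establ
  inet6.0: 1/3/3/0
"""


def parse_peers(text: str) -> dict:
    """Map peer address -> (peer AS, state) for every peer line in the text."""
    peers = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 8:
            continue  # per-table counter lines and blanks
        try:
            ipaddress.ip_address(tokens[0])
        except ValueError:
            continue  # header or any other non-peer line
        peers[tokens[0]] = (int(tokens[1]), tokens[-1])
    return peers


def peers_not_established(text: str) -> list:
    """Return the peers whose state column is anything other than Establ."""
    return [p for p, (_, state) in parse_peers(text).items() if state != "Establ"]


if __name__ == "__main__":
    print(parse_peers(SAMPLE))
    print("not established:", peers_not_established(SAMPLE))
```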
  • Verify asw1-22 can see the 2 new networks on asw1-23 and can reach them
asw1-22-ulsfo> show route 10.128.1.0/24 

inet.0: 23 destinations, 39 routes (23 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

10.128.1.0/24      *[BGP/170] 00:05:05, MED 0, localpref 100
                      AS path: 14907 4265004002 I, validation-state: unverified
                      to 198.35.26.146 via xe-0/0/10.0
                    > to 198.35.26.142 via xe-0/0/11.0
                    [BGP/170] 01:37:52, MED 0, localpref 100
                      AS path: 14907 4265004002 I, validation-state: unverified
                    > to 198.35.26.146 via xe-0/0/10.0

{master:0}
ppaul@asw1-22-ulsfo> show route 198.35.26.32/27  

inet.0: 23 destinations, 39 routes (23 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

198.35.26.32/27    *[BGP/170] 00:05:25, MED 0, localpref 100
                      AS path: 14907 4265004002 I, validation-state: unverified
                      to 198.35.26.146 via xe-0/0/10.0
                    > to 198.35.26.142 via xe-0/0/11.0
                    [BGP/170] 00:06:43, MED 0, localpref 100
                      AS path: 14907 4265004002 I, validation-state: unverified
                    > to 198.35.26.146 via xe-0/0/10.0
asw1-22-ulsfo> ping 10.128.1.1  
PING 10.128.1.1 (10.128.1.1): 56 data bytes
64 bytes from 10.128.1.1: icmp_seq=0 ttl=63 time=115.754 ms
64 bytes from 10.128.1.1: icmp_seq=1 ttl=63 time=110.220 ms

Details

Related Changes in Gerrit:
| Repo | Branch | Lines +/- |
| operations/puppet | production | +3 -15 |
| operations/dns | master | +23 -20 |
| operations/dns | master | +4 -0 |
| operations/dns | master | +4 -0 |
| operations/puppet | production | +6 -3 |
| operations/puppet | production | +0 -8 |
| operations/homer/public | master | +0 -4 |
| operations/puppet | production | +9 -3 |
| operations/puppet | production | +6 -0 |
| operations/puppet | production | +15 -6 |
| operations/dns | master | +12 -10 |
| operations/puppet | production | +1 -4 |
| operations/puppet | production | +0 -48 |
| operations/homer/public | master | +4 -8 |
| operations/homer/public | master | +6 -0 |
| operations/homer/public | master | +8 -0 |
| operations/homer/public | master | +82 -4 |
| operations/homer/public | master | +232 -0 |
| operations/homer/public | master | +237 -0 |
| operations/puppet | production | +1 -1 |
| operations/homer/public | master | +8 -0 |
| operations/homer/public | master | +19 -11 |
| operations/dns | master | +6 -24 |
| operations/homer/public | master | +3 -3 |
| operations/puppet | production | +7 -7 |
| operations/homer/public | master | +3 -3 |

Event Timeline

Depooling command: (depool datacenter and depool dns4003)

$ ssh cumin1003.eqiad.wmnet
$ sudo cookbook sre.dns.admin depool ulsfo -t 408892 -r "New switch configuration"
$ sudo confctl --reason 'ulsfo switch refresh T408892' select 'cluster=dnsbox,dc=ulsfo' set/pooled=no

Depool dns4003 to avoid breaking authdns-update during downtime.

@ssingh important note:
The public subnet mask for servers in rack 103.02.22 will be changing from /28 to /27, so we will have to manually change the subnet mask of dns4003 (198.35.26.8/28), which is the only host on the public VLAN in that rack.

Thanks.
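For reference, here is what the /28 to /27 change means for the dns4003 address mentioned above, worked out with Python's `ipaddress` module. This is only arithmetic on the address from the note, not the procedure used on the host.

```python
import ipaddress

# dns4003's current public address and mask, from the note above.
old = ipaddress.ip_interface("198.35.26.8/28")
# Same host address with the new prefix length.
new = ipaddress.ip_interface(f"{old.ip}/27")

if __name__ == "__main__":
    print("netmask:  ", old.netmask, "->", new.netmask)
    print("network:  ", old.network, "->", new.network)
    print(
        "broadcast:",
        old.network.broadcast_address,
        "->",
        new.network.broadcast_address,
    )
```

The host address itself stays the same; only the mask and broadcast address change.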

@RobH Remote hands instructions are ready @ https://docs.google.com/document/d/1EW6hxHCQjXPy1PXQWluwOTnCl_AHddI34iOYHdJuvek/edit?tab=t.0
Please review and let me know if all is good, so I can open a ticket and submit just the second and third stage steps. Thanks

Change #1279501 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/homer/public@master] Add BGP peering from asw1-23 to core routers and mr1

https://gerrit.wikimedia.org/r/1279501

Change #1279501 merged by Papaul:

[operations/homer/public@master] Add BGP peering from asw1-23 to core routers and mr1

https://gerrit.wikimedia.org/r/1279501

I've reviewed it once and it checks out to me, going to do a secondary review of it in about 30 minutes to ensure I didn't miss anything.

This passes my review. @Papaul, did you want to handle putting this in for remote hands since you wrote the phase 2 directions and will be doing the network implementation? Just please add me to the ticket CC list (it has an option to list multiple folks at the end of the ticket form on the portal). Thanks!

@cmooney please see below for all the DNS names for IPV6 needed. Thanks

irb0-411.asw1-22-ulsfo.wikimedia.org
irb0-421.asw1-22-ulsfo.ulsfo.wmnet
ethernet-1-48.asw1-22-ulsfo.ulsfo.wmnet
irb0-412.asw1-23-ulsfo.wikimedia.org
irb0-422.asw1-23-ulsfo.ulsfo.wmnet
ethernet-1-48.asw1-23-ulsfo.ulsfo.wmnet
ethernet-1-55.asw1-22-ulsfo.wikimedia.org
ethernet-1-56.asw1-22-ulsfo.wikimedia.org
ethernet-1-55.asw1-23-ulsfo.wikimedia.org
ethernet-1-56.asw1-23-ulsfo.wikimedia.org
et-0-0-1.cr3-ulsfo.wikimedia.org
et-0-0-2.cr3-ulsfo.wikimedia.org
et-0-0-1.cr4-ulsfo.wikimedia.org
et-0-0-2.cr4-ulsfo.wikimedia.org
ge-0-0-3.mr1-ulsfo.ulsfo.wmnet
ge-0-0-4.mr1-ulsfo.ulsfo.wmnet

@Papaul thanks. I see most of those don't exist even for IPv4, nor are there any IPv6 addresses listed, so I'm not sure exactly what might need to be added.

Basically we will need an INCLUDE statement in the dns zone for every /64 we create. The best way forward is just for you to add them in Netbox as you go. After you add the first IP with a dns_name in each /64 ping me and I will create a patch for the /64 in the dns repo (similar to this one). I usually use this script to do it btw, so feel free to have a stab yourself too.
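To make the per-/64 requirement concrete, the sketch below derives the `ip6.arpa` name that corresponds to an IPv6 prefix. It is not the script referred to above, and it does not produce the dns repo's actual INCLUDE syntax; `reverse_zone` is a hypothetical helper that only shows how the reverse name follows from the prefix.

```python
import ipaddress


def reverse_zone(prefix: str) -> str:
    """Return the ip6.arpa name covering an IPv6 prefix on a nibble boundary."""
    net = ipaddress.ip_network(prefix)
    if net.version != 6 or net.prefixlen % 4:
        raise ValueError("expected an IPv6 prefix on a nibble boundary")
    # Fully expanded hex digits of the network address, cut to the prefix length.
    nibbles = net.network_address.exploded.replace(":", "")[: net.prefixlen // 4]
    return ".".join(reversed(nibbles)) + ".ip6.arpa"


if __name__ == "__main__":
    for prefix in ("2620:0:863:fe0a::/64", "2620:0:863:fe09::/64"):
        print(prefix, "->", reverse_zone(prefix))
```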

But it's a busy week so I am happy to help, ping me for any of them I can do it quickly. I am not around Monday though as it's a holiday here, we can go back and add the dns_names for any you do Monday on Tuesday.

Mentioned in SAL (#wikimedia-operations) [2026-05-04T14:00:14Z] <slyngshede@cumin1003> START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: New switch configuration, T408892]

Mentioned in SAL (#wikimedia-operations) [2026-05-04T14:00:20Z] <slyngshede@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: New switch configuration, T408892]

Mentioned in SAL (#wikimedia-operations) [2026-05-04T14:00:44Z] <slyngshede@cumin1003> conftool action : set/pooled=no; selector: cluster=dnsbox,dc=ulsfo [reason: ulsfo switch refresh T408892]

Minor error in command, should have been:

$ ssh cumin1003.eqiad.wmnet
$ sudo cookbook sre.dns.admin depool ulsfo -t T408892 -r "New switch configuration"
$ sudo confctl --reason 'ulsfo switch refresh T408892' select 'cluster=dnsbox,dc=ulsfo' set/pooled=no

Otherwise OK.
ULSFO has now been depooled.
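The slip above (a task ID passed without its "T" prefix) is the kind of thing a small pre-flight helper could catch before anything runs. This is a hypothetical sketch: `depool_command` is an invented wrapper that only builds the argument list from the command shown above and never executes it, and the "T followed by digits" rule is an assumption about the expected format.

```python
import re
import shlex

# Assumed task ID format: a capital T followed by digits, e.g. T408892.
TASK_RE = re.compile(r"T\d+")


def depool_command(site: str, task: str, reason: str) -> list:
    """Build the depool argv, refusing task IDs that lack the T prefix."""
    if not TASK_RE.fullmatch(task):
        raise ValueError(f"task ID must look like T123456, got {task!r}")
    return ["sudo", "cookbook", "sre.dns.admin", "depool", site, "-t", task, "-r", reason]


if __name__ == "__main__":
    cmd = depool_command("ulsfo", "T408892", "New switch configuration")
    print(shlex.join(cmd))
    try:
        depool_command("ulsfo", "408892", "New switch configuration")
    except ValueError as exc:
        print("rejected:", exc)
```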

Depooling command output, for the records:

slyngshede@cumin1003:~$ sudo cookbook sre.dns.admin depool ulsfo -t T408892 -r "New switch configuration"
==> CURRENT STATE:
text-addrs: pooled at all sites
text-next: pooled at all sites
upload-addrs: pooled at all sites
ncredir-addrs: pooled at all sites
gerrit-addrs: pooled at all sites
<==
Acquired lock for key /spicerack/locks/cookbooks/sre.dns.admin: {'concurrency': 1, 'created': '2026-05-04 14:00:14.817325', 'owner': 'slyngshede@cumin1003 [3973470]', 'ttl': 60}
START - Cookbook sre.dns.admin DNS admin: depool ulsfo [reason: New switch configuration, T408892]
==> You are now about to: depool ulsfo
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Setting pooled=no for tags: {'name': 'ulsfo'}
==> APPLIED STATE:
text-addrs: depooled in ulsfo
text-next: depooled in ulsfo
upload-addrs: depooled in ulsfo
ncredir-addrs: depooled in ulsfo
gerrit-addrs: depooled in ulsfo
<==
Released lock for key /spicerack/locks/cookbooks/sre.dns.admin: {'concurrency': 1, 'created': '2026-05-04 14:00:14.817325', 'owner': 'slyngshede@cumin1003 [3973470]', 'ttl': 60}
END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool ulsfo [reason: New switch configuration, T408892]
slyngshede@cumin1003:~$ sudo confctl --reason 'ulsfo switch refresh T408892' select 'cluster=dnsbox,dc=ulsfo' set/pooled=no
The selector you chose has selected the following objects:
{"/ulsfo/dnsbox/ntp-a": ["dns4003.wikimedia.org"], "/ulsfo/dnsbox/ntp-b": ["dns4004.wikimedia.org"], "/ulsfo/dnsbox/recdns": ["dns4003.wikimedia.org", "dns4004.wikimedia.org"], "/ulsfo/dnsbox/authdns-ns2": ["dns4003.wikimedia.org", "dns4004.wikimedia.org"], "/ulsfo/dnsbox/authdns-update": ["dns4003.wikimedia.org", "dns4004.wikimedia.org"]}
Ok to continue? [y/N]
confctl>y
ulsfo/dnsbox/ntp-a/dns4003.wikimedia.org: pooled changed yes => no
ulsfo/dnsbox/ntp-b/dns4004.wikimedia.org: pooled changed yes => no
ulsfo/dnsbox/recdns/dns4003.wikimedia.org: pooled changed yes => no
ulsfo/dnsbox/recdns/dns4004.wikimedia.org: pooled changed yes => no
ulsfo/dnsbox/authdns-ns2/dns4003.wikimedia.org: pooled changed yes => no
ulsfo/dnsbox/authdns-ns2/dns4004.wikimedia.org: pooled changed yes => no
ulsfo/dnsbox/authdns-update/dns4003.wikimedia.org: pooled changed yes => no
ulsfo/dnsbox/authdns-update/dns4004.wikimedia.org: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: cluster=dnsbox,dc=ulsfo [reason: ulsfo switch refresh T408892]

Icinga downtime and Alertmanager silence (ID=6733bed9-572f-4b81-9a71-76b2217ca3b5) set by pt1979@cumin1003 for 4:00:00 on 4 host(s) and their services with reason: switch refresh

asw2-ulsfo,cr[3-4]-ulsfo,mr1-ulsfo

Icinga downtime and Alertmanager silence (ID=ea06e422-63a1-4feb-89ac-13f0b89b4956) set by pt1979@cumin1003 for 4:00:00 on 5 host(s) and their services with reason: switch refresh

cr[3-4]-ulsfo IPv6,cr[3-4]-ulsfo.mgmt,mr1-ulsfo IPv6

Change #1282374 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/homer/public@master] Add BGP peering from core routers to switches

https://gerrit.wikimedia.org/r/1282374

Change #1282374 merged by jenkins-bot:

[operations/homer/public@master] Add BGP peering from core routers to switches

https://gerrit.wikimedia.org/r/1282374

Icinga downtime and Alertmanager silence (ID=241a7848-479d-48b2-8824-9a08c17249ab) set by ayounsi@cumin1003 for 20:00:00 on 39 host(s) and their services with reason: switches replacement

bast4006.wikimedia.org,cp[4037-4052].ulsfo.wmnet,dns[4003-4004].wikimedia.org,doh[4003-4004].wikimedia.org,durum[4003-4004].ulsfo.wmnet,ganeti[4005-4008].ulsfo.wmnet,hcaptcha-proxy[4003-4004].wikimedia.org,install4004.wikimedia.org,lvs[4008-4010].ulsfo.wmnet,ncredir[4003-4004].ulsfo.wmnet,netflow4003.ulsfo.wmnet,prometheus4003.ulsfo.wmnet,tcp-proxy[4003-4004].ulsfo.wmnet
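The host expression above is compact enough that the "39 host(s)" figure is not obvious at a glance. Below is a simplified expander for this bracket notation, written only for illustration: it handles one bracket group per name with plain numeric ranges and lists, and it is not the implementation cumin itself uses.

```python
import re

HOSTS = (
    "bast4006.wikimedia.org,cp[4037-4052].ulsfo.wmnet,dns[4003-4004].wikimedia.org,"
    "doh[4003-4004].wikimedia.org,durum[4003-4004].ulsfo.wmnet,"
    "ganeti[4005-4008].ulsfo.wmnet,hcaptcha-proxy[4003-4004].wikimedia.org,"
    "install4004.wikimedia.org,lvs[4008-4010].ulsfo.wmnet,"
    "ncredir[4003-4004].ulsfo.wmnet,netflow4003.ulsfo.wmnet,"
    "prometheus4003.ulsfo.wmnet,tcp-proxy[4003-4004].ulsfo.wmnet"
)


def split_top_level(expr: str) -> list:
    """Split on commas that are not inside a [...] group."""
    parts, current, depth = [], [], 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def expand(expr: str) -> list:
    """Expand names such as cp[4037-4052].ulsfo.wmnet into individual hosts."""
    hosts = []
    for part in split_top_level(expr):
        match = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not match:
            hosts.append(part)  # plain hostname, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            for number in range(int(low), int(high or low) + 1):
                hosts.append(f"{prefix}{number}{suffix}")
    return hosts


if __name__ == "__main__":
    expanded = expand(HOSTS)
    print(len(expanded), "hosts")
    print(expanded[:3])
```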

Change #1282427 had a related patch set uploaded (by Papaul; author: Papaul):

[operations/homer/public@master] Add bgp from mr to core switches

https://gerrit.wikimedia.org/r/1282427

Change #1282427 merged by Papaul:

[operations/homer/public@master] Add bgp from mr to core switches

https://gerrit.wikimedia.org/r/1282427

All the servers in rack 22 are connected to the new switch and all the links are up. I only tested cp4037, but all the others should be online.
We had 2 issues:
1- The 1m MTP cable ordered for the switch/router connections was too short, so we didn't make the connection from asw1-23 ethernet-1/55 to cr3 et-0/0/2. @RobH has put in an order for some 2m cables.
2- The 1m 25G DAC cables were too short for the server/switch connections, so we used most of the 2m 25G DACs for rack 22. I asked them to provide the count of cables left onsite to see if we can use some 1m cables for the servers close to the switch in rack 23; the other servers we can keep at 10G and order more 2m 25G DACs.
What's left?

  • rack 23 servers migration to new switches
  • some cable IDs that I am still waiting for
  • move oob to ge-0/0/7
  • DNS name for IPV6 in Netbox

Change #1282711 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] ulsfo Bird (dns, ganeti, VMs) peer with ToR switch

https://gerrit.wikimedia.org/r/1282711

Change #1282711 merged by Ayounsi:

[operations/puppet@production] ulsfo Bird (dns, ganeti, VMs) peer with ToR switch

https://gerrit.wikimedia.org/r/1282711

Change #1282731 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] ulsfo LVS: peer with the ToR switch

https://gerrit.wikimedia.org/r/1282731

Change #1282780 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] ulsfo liberica BGP: peer with the ToR switch

https://gerrit.wikimedia.org/r/1282780

Change #1282780 merged by Ayounsi:

[operations/puppet@production] ulsfo liberica BGP: peer with the ToR switch

https://gerrit.wikimedia.org/r/1282780

Change #1282805 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] ulsfo: update and add missing includes

https://gerrit.wikimedia.org/r/1282805

Change #1282805 merged by Ayounsi:

[operations/dns@master] ulsfo: update and add missing includes

https://gerrit.wikimedia.org/r/1282805

Icinga downtime and Alertmanager silence (ID=bdfd24a0-f5cd-4c3b-945b-36deeb91ba1c) set by ayounsi@cumin1003 for 20:00:00 on 13 host(s) and their services with reason: switches replacement

cp[4038,4040,4042,4044,4046,4048,4050,4052].ulsfo.wmnet,dns4004.wikimedia.org,ganeti[4006,4008].ulsfo.wmnet,lvs[4008,4010].ulsfo.wmnet

Change #1282925 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] ulsfo: update switch monitoring

https://gerrit.wikimedia.org/r/1282925

Change #1282925 merged by Ayounsi:

[operations/puppet@production] ulsfo: update switch monitoring

https://gerrit.wikimedia.org/r/1282925

Change #1282971 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] ulsfo: re-add old switch

https://gerrit.wikimedia.org/r/1282971

Change #1282971 merged by Ayounsi:

[operations/puppet@production] ulsfo: re-add old switch

https://gerrit.wikimedia.org/r/1282971

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with errors:

  • cp4038 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp4038.ulsfo.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with errors:

  • cp4038 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp4038.ulsfo.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie executed with errors:

  • cp4038 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console cp4038.ulsfo.wmnet" to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2026-05-05T19:05:20Z] <cmooney@cumin1003> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set correct vlan group in netbox for new ulsfo vlans - cmooney@cumin1003 - T408892"

Mentioned in SAL (#wikimedia-operations) [2026-05-05T19:05:26Z] <cmooney@cumin1003> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set correct vlan group in netbox for new ulsfo vlans - cmooney@cumin1003 - T408892"

Change #1283070 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] network/data.yaml: add new ulsfo ranges

https://gerrit.wikimedia.org/r/1283070

Change #1283070 merged by Ayounsi:

[operations/puppet@production] network/data.yaml: add new ulsfo ranges

https://gerrit.wikimedia.org/r/1283070

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1003 for host cp4038.ulsfo.wmnet with OS trixie completed:

  • cp4038 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202605052014_pt1979_388431_cp4038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@RobH see below the list of nodes still on 10G DAC that we will need to move to 25G DAC. Can you please order 7x 2m 25G DAC? Thank you
A:papaul@asw1-23-ulsfo# show interface brief
| Port | Admin State | Oper State | Speed | Type | Description |
| ethernet-1/1 | enable | up | 10G | SFP+ PASSIVE | cp4038 {#cp4038d} |
| ethernet-1/2 | enable | up | 10G | SFP+ PASSIVE | cp4040 {#cp4040d} |
| ethernet-1/3 | enable | up | 10G | SFP+ PASSIVE | cp4042 {#cp4042d} |
| ethernet-1/4 | enable | up | 10G | SFP+ PASSIVE | cp4046 {#cp4046d} |
| ethernet-1/5 | enable | up | 10G | SFP+ PASSIVE | cp4044 {#cp4044d} |
| ethernet-1/6 | enable | up | 10G | SFP+ PASSIVE | cp4048 {#cp4048d} |
| ethernet-1/9 | enable | up | 10G | SFP+ PASSIVE | dns4004 {#1047} |

All the servers in rack 23 are online and ready for re-image. I tested the re-image on cp4038 and it completed with no issues after @ayounsi fixed the DHCP issue. The servers listed above are still on 10G because the 1m 25G DACs were too short to use. We will be ordering some 2m 25G cables as replacements. As of now, the migration on the DC-ops and Netops side is complete.

Mentioned in SAL (#wikimedia-operations) [2026-05-06T16:52:37Z] <topranks> rebooting asw1-22-ulsfo to upgrade SR-Linux OS on switch T408892

Icinga downtime and Alertmanager silence (ID=10a0938d-1a48-40b9-87ab-384b64ac02a6) set by cmooney@cumin1003 for 1:00:00 on 2 host(s) and their services with reason: upgrading sr-linux on asw1-23-ulsfo

asw1-22-ulsfo,asw1-22-ulsfo IPv6

Icinga downtime and Alertmanager silence (ID=a4b7dc3f-da06-4cb4-8580-9dac41f4da23) set by sukhe@cumin1003 for 3 days, 0:00:00 on 39 host(s) and their services with reason: ulsfo depooled for switch work

bast4006.wikimedia.org,cp[4037-4052].ulsfo.wmnet,dns[4003-4004].wikimedia.org,doh[4003-4004].wikimedia.org,durum[4003-4004].ulsfo.wmnet,ganeti[4005-4008].ulsfo.wmnet,hcaptcha-proxy[4003-4004].wikimedia.org,install4004.wikimedia.org,lvs[4008-4010].ulsfo.wmnet,ncredir[4003-4004].ulsfo.wmnet,netflow4003.ulsfo.wmnet,prometheus4003.ulsfo.wmnet,tcp-proxy[4003-4004].ulsfo.wmnet

Icinga downtime and Alertmanager silence (ID=cc7686ab-d152-4291-9303-296008017c88) set by cmooney@cumin1003 for 1:00:00 on 2 host(s) and their services with reason: upgrading sr-linux on asw1-23-ulsfo

asw1-23-ulsfo,asw1-23-ulsfo IPv6

Mentioned in SAL (#wikimedia-operations) [2026-05-06T17:28:11Z] <topranks> rebooting asw1-23-ulsfo to upgrade SR-Linux OS on switch T408892

Change #1284558 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Remove asw2-ulsfo

https://gerrit.wikimedia.org/r/1284558

Change #1284561 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] asw2-ulsfo: remove from monitoring

https://gerrit.wikimedia.org/r/1284561

Change #1284558 merged by jenkins-bot:

[operations/homer/public@master] Remove asw2-ulsfo

https://gerrit.wikimedia.org/r/1284558

Change #1284561 merged by Ayounsi:

[operations/puppet@production] asw2-ulsfo: remove from monitoring

https://gerrit.wikimedia.org/r/1284561

Mentioned in SAL (#wikimedia-operations) [2026-05-07T10:14:37Z] <slyngshede@cumin1003> START - Cookbook sre.dns.admin DNS admin: pool ulsfo [reason: New switch configuration, T408892]

Mentioned in SAL (#wikimedia-operations) [2026-05-07T10:14:45Z] <slyngshede@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool ulsfo [reason: New switch configuration, T408892]

Mentioned in SAL (#wikimedia-operations) [2026-05-07T12:11:04Z] <slyngshede@cumin1003> conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=ulsfo,service=authdns-update [reason: ulsfo switch refresh T408892]

Mentioned in SAL (#wikimedia-operations) [2026-05-07T13:02:51Z] <slyngshede@cumin1003> conftool action : set/pooled=yes; selector: name=dns4004.wikimedia.org [reason: ulsfo switch refresh T408892]

Change #1284640 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] ulsfo: remove VRRP checks on CR, add mgmt switch monitoring

https://gerrit.wikimedia.org/r/1284640

Change #1284640 merged by Ayounsi:

[operations/puppet@production] ulsfo: remove VRRP checks on CR, add mgmt switch monitoring

https://gerrit.wikimedia.org/r/1284640

Mentioned in SAL (#wikimedia-operations) [2026-05-07T14:32:48Z] <slyngshede@cumin1003> conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=ulsfo [reason: ulsfo switch refresh T408892]

RobH mentioned this in Unknown Object (Task). May 7 2026, 4:18 PM

Change #1286809 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Reverse PTR include: add statement for 2620:0:863:fe0a::/64

https://gerrit.wikimedia.org/r/1286809

Change #1286809 merged by Cathal Mooney:

[operations/dns@master] Reverse PTR include: add statement for 2620:0:863:fe0a::/64

https://gerrit.wikimedia.org/r/1286809

Change #1286831 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Reverse PTR Include: add for 2620:0:863:fe09::/64

https://gerrit.wikimedia.org/r/1286831

Change #1286831 merged by Cathal Mooney:

[operations/dns@master] Reverse PTR Include: add for 2620:0:863:fe09::/64

https://gerrit.wikimedia.org/r/1286831

Change #1286956 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add remaining INCLUDE statements for ulsfo IPv6 link address ranges

https://gerrit.wikimedia.org/r/1286956

Change #1286956 merged by Cathal Mooney:

[operations/dns@master] Add remaining INCLUDE statements for ulsfo IPv6 link address ranges

https://gerrit.wikimedia.org/r/1286956

The last BGP session between cr3 and asw1-23 is now up, so we can now close this task. Thanks to everyone who helped on this project.