
Stage and configure new Juniper switches in codfw rows E/F
Closed, Resolved, Public

Description

The following new devices have been racked and connected to management in codfw:

ssw1-e1-codfw
ssw1-f1-codfw
lsw1-e1-codfw
lsw1-e3-codfw
lsw1-f1-codfw
lsw1-f3-codfw

The next step is to get them configured and working as we need. These devices will be configured in a spine/leaf arrangement similar to rows A/B and C/D in codfw; however, they will not use VXLAN/EVPN. Instead we will run dual-stack on all links, with OSPF and OSPFv3 across them, and unicast IBGP between loopbacks in each address family. The plan is to configure some of this manually to nail down the correct configuration, then work backwards from there to adjust our automation to support both routing designs.
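To make the IBGP part of this design concrete, here is a minimal Python sketch that enumerates one unicast IBGP session per device pair per address family, sourced from the loopbacks. It assumes a full mesh between all six new devices, which the task does not state (the spines could equally act as route reflectors), and the loopback addresses are invented documentation-range placeholders. The `inet`/`inet6` labels follow Junos family naming.

```python
"""Sketch: enumerate IBGP sessions between loopbacks for codfw rows E/F.

Assumptions (hypothetical, not taken from the task):
  * every device peers with every other device (full mesh); the real
    design may instead use the spines as route reflectors;
  * loopback addresses below are documentation-range placeholders.
"""
from ipaddress import ip_address
from itertools import combinations

# Device names are from the task description; addresses are placeholders.
LOOPBACKS = {
    "ssw1-e1-codfw": ("192.0.2.1", "2001:db8::1"),
    "ssw1-f1-codfw": ("192.0.2.2", "2001:db8::2"),
    "lsw1-e1-codfw": ("192.0.2.11", "2001:db8::11"),
    "lsw1-e3-codfw": ("192.0.2.13", "2001:db8::13"),
    "lsw1-f1-codfw": ("192.0.2.21", "2001:db8::21"),
    "lsw1-f3-codfw": ("192.0.2.23", "2001:db8::23"),
}


def ibgp_sessions(loopbacks):
    """Return (family, device_a, addr_a, device_b, addr_b) tuples.

    One session per device pair per address family: IPv4 routes are
    carried on an IPv4 session between IPv4 loopbacks, IPv6 routes on
    an IPv6 session between IPv6 loopbacks.
    """
    sessions = []
    for a, b in combinations(sorted(loopbacks), 2):
        for idx, family in enumerate(("inet", "inet6")):
            addr_a = ip_address(loopbacks[a][idx])
            addr_b = ip_address(loopbacks[b][idx])
            sessions.append((family, a, addr_a, b, addr_b))
    return sessions


if __name__ == "__main__":
    for family, a, addr_a, b, addr_b in ibgp_sessions(LOOPBACKS):
        print(f"{family:5} {a} ({addr_a}) <-> {b} ({addr_b})")
```

Peering between loopbacks rather than link addresses means each session relies on OSPF (for IPv4) or OSPFv3 (for IPv6) to reach the remote loopback, so a session can survive a single link failure as long as the IGP still has another path.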

Connectivity to the core routers on site requires additional ports that will become available when T393552 is complete.

Related Objects

Status | Subtype | Assigned | Task
Resolved | | cmooney |
Resolved | | cmooney |

Event Timeline

cmooney triaged this task as Medium priority.

Change #1145194 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Network: add puppet data for new devices and networks codfw expansion

https://gerrit.wikimedia.org/r/1145194

Change #1145246 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add new INCLUDE statements in 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa zone

https://gerrit.wikimedia.org/r/1145246

Change #1145246 merged by Cathal Mooney:

[operations/dns@master] Add new INCLUDE statements in 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa zone

https://gerrit.wikimedia.org/r/1145246
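The zone named in this change, 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa, is the nibble-format reverse zone for 2620:0:860::/48, and the later change 1148842 adds an INCLUDE for 2620:0:860:139::/64. As an illustration of how such names are derived (one hex nibble per label, least significant first), here is a small Python sketch using only the standard `ipaddress` module. The function name is hypothetical, and it only handles prefixes whose length is a multiple of 4.

```python
from ipaddress import IPv6Network


def ip6_arpa_zone(prefix: str) -> str:
    """Return the nibble-format reverse zone name for an IPv6 prefix.

    Only prefixes on a nibble boundary (length divisible by 4) map
    cleanly onto a single ip6.arpa zone name.
    """
    net = IPv6Network(prefix)
    if net.prefixlen % 4:
        raise ValueError(f"{prefix} is not on a nibble boundary")
    # 32 hex digits for the full address, colons removed.
    nibbles = net.network_address.exploded.replace(":", "")
    # Keep only the nibbles covered by the prefix, then reverse them.
    significant = nibbles[: net.prefixlen // 4]
    return ".".join(reversed(significant)) + ".ip6.arpa"


print(ip6_arpa_zone("2620:0:860::/48"))
# → 0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
print(ip6_arpa_zone("2620:0:860:139::/64"))
# → 9.3.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa
```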

cmooney added a parent task: Unknown Object (Task). May 14 2025, 10:07 AM
cmooney added a subtask: Unknown Object (Task).

Change #1145977 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Enable link-protection and BFD on OSPF links on EVPN switches

https://gerrit.wikimedia.org/r/1145977

Change #1145977 merged by jenkins-bot:

[operations/homer/public@master] Enable link-protection on OSPF links on EVPN switches

https://gerrit.wikimedia.org/r/1145977

Change #1146662 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add EBGP between codfw row A-D spines and row E/F spines

https://gerrit.wikimedia.org/r/1146662

Mentioned in SAL (#wikimedia-operations) [2025-05-15T16:35:30Z] <topranks> add bgp peerings from codfw row A-D switches to new spines in rows E/F T394021

Change #1146662 merged by jenkins-bot:

[operations/homer/public@master] Add EBGP between codfw row A-D spines and row E/F spines

https://gerrit.wikimedia.org/r/1146662

Mentioned in SAL (#wikimedia-operations) [2025-05-15T17:23:16Z] <topranks> add remaining bgp peerings from codfw row A-D switches to new spines in rows E/F T394021

Change #1147014 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] New device additions for codfw expansion plus policy changes

https://gerrit.wikimedia.org/r/1147014

Change #1148842 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add new INCLUDE statement for 2620:0:860:139::/64 reverse

https://gerrit.wikimedia.org/r/1148842

Change #1148842 merged by Cathal Mooney:

[operations/dns@master] Add new INCLUDE statement for 2620:0:860:139::/64 reverse

https://gerrit.wikimedia.org/r/1148842

Change #1147014 merged by jenkins-bot:

[operations/homer/public@master] New device additions for codfw expansion plus policy changes

https://gerrit.wikimedia.org/r/1147014

Mentioned in SAL (#wikimedia-operations) [2025-05-21T13:16:49Z] <cmooney@cumin1003> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "import new switches from netbox to hiera now they are status active - cmooney@cumin1003 - T394021"

Mentioned in SAL (#wikimedia-operations) [2025-05-21T13:17:44Z] <cmooney@cumin1003> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "import new switches from netbox to hiera now they are status active - cmooney@cumin1003 - T394021"

Change #1148867 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw

https://gerrit.wikimedia.org/r/1148867

Change #1148867 merged by jenkins-bot:

[operations/homer/public@master] Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw

https://gerrit.wikimedia.org/r/1148867

Change #1145194 merged by Cathal Mooney:

[operations/puppet@production] Network: add puppet data for new devices and networks codfw expansion

https://gerrit.wikimedia.org/r/1145194

Change #1150642 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add entry for cagefive2* hosts in site.pp

https://gerrit.wikimedia.org/r/1150642

@Jhancock.wm hey, I'm having some problems reaching cagefive2001 over management: the IP it is assigned is not responding to ping. I know it previously had a different mgmt IP (10.193.1.94, since re-used for wikikube-worker2251), but from a few tests it does not appear to be listening on that IP either right now.

Would you be able to set the server's iDRAC IP to 10.193.3.224/16 manually so I can re-try?

Change #1150642 abandoned by Cathal Mooney:

[operations/puppet@production] Add entry for cagefive2* hosts in site.pp

Reason:

will try to rename the hosts

https://gerrit.wikimedia.org/r/1150642

@cmooney got it set and confirmed it pings

awesome, thank you!

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2007.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2007.codfw.wmnet with OS bookworm completed:

  • sretest2007 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505291556_cmooney_780917_sretest2007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Change #1152301 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add BGP session from cr1-codfw to ssw1-e1-codfw and remove nokia

https://gerrit.wikimedia.org/r/1152301

Change #1152301 merged by jenkins-bot:

[operations/homer/public@master] Add BGP session from cr1-codfw to ssw1-e1-codfw and remove nokia

https://gerrit.wikimedia.org/r/1152301

Change #1153120 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Codfw expansion: enable BGP from cr2-codfw to ssw1-f1-codfw

https://gerrit.wikimedia.org/r/1153120

Change #1153120 merged by jenkins-bot:

[operations/homer/public@master] Codfw expansion: enable BGP from cr2-codfw to ssw1-f1-codfw

https://gerrit.wikimedia.org/r/1153120

Change #1153141 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Typo in bgp peer definition on cr2-codfw

https://gerrit.wikimedia.org/r/1153141

Change #1153141 merged by jenkins-bot:

[operations/homer/public@master] Typo in bgp peer definition on cr2-codfw

https://gerrit.wikimedia.org/r/1153141
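Two of the follow-up changes above fixed typos in BGP peer definitions (1148867 on ssw1-e1-codfw and 1153141 on cr2-codfw). One way to catch that class of mistake before deployment is to cross-check each configured neighbour address against the addresses the intended peer actually holds. The Python sketch below does that under stated assumptions: the session pairings cr1-codfw/ssw1-e1-codfw and cr2-codfw/ssw1-f1-codfw come from this task, but every IP address is a hypothetical documentation-range value, the typo in the last entry is invented (it is not the actual error fixed above), and the hard-coded dict stands in for a real source of truth such as Netbox. This is not a description of any existing Homer check.

```python
"""Sketch: flag BGP neighbour definitions whose peer IP does not belong
to the intended peer device.

All addresses are hypothetical documentation-range values; in practice
the expected addresses would come from a source of truth, not a dict.
"""
from ipaddress import ip_address

# Hypothetical: addresses each device holds on its link to the other side.
DEVICE_ADDRESSES = {
    "cr1-codfw": {"198.51.100.0"},
    "ssw1-e1-codfw": {"198.51.100.1"},
    "cr2-codfw": {"198.51.100.2"},
    "ssw1-f1-codfw": {"198.51.100.3"},
}

# Hypothetical configured sessions: (local device, intended peer, peer IP).
# The last entry contains a deliberately invented typo (.22 instead of .2).
CONFIGURED = [
    ("cr1-codfw", "ssw1-e1-codfw", "198.51.100.1"),
    ("ssw1-e1-codfw", "cr1-codfw", "198.51.100.0"),
    ("cr2-codfw", "ssw1-f1-codfw", "198.51.100.3"),
    ("ssw1-f1-codfw", "cr2-codfw", "198.51.100.22"),
]


def bad_peer_ips(configured, device_addresses):
    """Return sessions whose peer IP is not held by the intended peer."""
    problems = []
    for local, peer, peer_ip in configured:
        expected = {ip_address(a) for a in device_addresses.get(peer, ())}
        if ip_address(peer_ip) not in expected:
            problems.append((local, peer, peer_ip))
    return problems


for local, peer, peer_ip in bad_peer_ips(CONFIGURED, DEVICE_ADDRESSES):
    print(f"{local}: neighbour {peer_ip} is not an address of {peer}")
# prints "ssw1-f1-codfw: neighbour 198.51.100.22 is not an address of cr2-codfw"
```

Keying the check on the intended peer device, rather than only on whether the IP parses, is what lets it catch a plausible-looking but wrong address.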