
Codfw row A/B top-of-rack switch refresh
Closed, Resolved (Public)

Description

New Juniper QFX5120-48Y top-of-rack switches have been delivered to codfw under T312138. These are intended to replace the existing switches in rows A and B in codfw, as part of a normal refresh cycle.

We have a current requirement for L2 Vlans that stretch across multiple racks (Ganeti, LVS), which for the current rows is achieved with Juniper's virtual-chassis feature. A different approach to this will be adopted on the replacement switches, using VXLAN/EVPN similar to the setup in Eqiad rows E and F. The plan involves deploying 2 QFX5120-32C Spine/aggregation switches to interconnect the top-of-racks in rows A and B, but these have not yet been delivered. Current estimate for those is August 2023.

In a meeting on Jan 24th 2023 Infra Foundations (Cathal, Arzhel, Riccardo) and DC-Ops (Papaul) agreed that there was not great urgency, and we should wait until the Spine switches are delivered and ready before we start moving anything. In other words no interim plan using only the 5120-48Y devices is being contemplated.

Physical Installation and Planning

Having the top-of-rack devices does allow us to get some things prepped in advance, however, and to plan the build/migration.

Spine Physical Location

The current plan is to install the two Spine switches, aggregating rows A and B, in racks A1 and A8. Those locations make the uplinks to the CR routers (also in those racks) easier, and were recommended by Papaul. It was provisionally decided to place the "future" Spines (which will aggregate the replacement switches in rows C and D when the time comes) in racks D1 and D8, but that is not part of the current task and can be revisited later.

Rack location

Papaul expressed a preference to install the switches at the top of each rack. This makes it easier to move the switches in and out of the rack, as they sit higher than the PDUs, so no power connectors get in the way.

Cables

Based on that we'll need 32 x single-mode LC-LC fibers, but I'm unsure of the exact lengths between all the racks:

Rack 1 | Rack 2 | Desc | Qty | Length | Cable ID
A1 | A1 | Spine<->Leaf within rack | 1 | 3m | 230403800036
A1 | A1 | Spine<->cr | 1 | 3m | 230403800035
A1 | A2 | Spine<->Leaf | 1 | 5m |
A1 | A3 | Spine<->Leaf | 1 | 5m |
A1 | A4 | Spine<->Leaf | 1 | 5m |
A1 | A5 | Spine<->Leaf | 1 | 8m |
A1 | A6 | Spine<->Leaf | 1 | 8m |
A1 | A7 | Spine<->Leaf | 1 | 8m |
A1 | A8 | Spine<->Leaf (leaf in each has link to spine in other) | 2 | 8m |
A1 | B2 | Spine<->Leaf | 1 | 12m | 230403800009
A1 | B3 | Spine<->Leaf | 1 | 12m | 230403800006
A1 | B4 | Spine<->Leaf | 1 | 12m | 230403800001
A1 | B5 | Spine<->Leaf | 1 | 12m | 230403800004
A1 | B6 | Spine<->Leaf | 1 | 12m | 230403800002
A1 | B7 | Spine<->Leaf | 1 | 12m | 230403800007
A1 | B8 | Spine<->Leaf | 1 | 12m | 230403800005
A8 | A1 | Spine<->Leaf | 1 | 8m | 230403800017
A8 | A2 | Spine<->Leaf | 1 | 8m | 230403800026
A8 | A3 | Spine<->Leaf | 1 | 8m | 230403800021
A8 | A4 | Spine<->Leaf | 1 | 8m | 230403800024
A8 | A5 | Spine<->Leaf | 1 | 5m | 230403800028
A8 | A6 | Spine<->Leaf | 1 | 5m | 230403800032
A8 | A7 | Spine<->Leaf | 1 | 5m | 230403800030
A8 | A8 | Spine<->Leaf within rack | 1 | 3m | 230403800017
A8 | A8 | Spine<->cr | 1 | 3m | 230403800040
A8 | B2 | Spine<->Leaf | 1 | 12m | 230403800016
A8 | B3 | Spine<->Leaf | 1 | 12m | 230403800003
A8 | B4 | Spine<->Leaf | 1 | 12m | 230403800013
A8 | B5 | Spine<->Leaf | 1 | 12m | 230403800015
A8 | B6 | Spine<->Leaf | 1 | 12m | 230403800012
A8 | B7 | Spine<->Leaf | 1 | 12m | 230403800008
A8 | B8 | Spine<->Leaf | 1 | 12m | 230403800010

Transceiver

We will need some 1000Base-T SFP (100m) transceivers to connect 1G servers to the new switches. I have a total of 76 on site right now.

Zeroconf

The preference is to use Zeroconf to take care of the initial base config for the new switches. This will take some development time but we've not identified any blockers thus far.
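As a very rough illustration of the kind of development involved (this is not the actual tooling; the template body, switch names and management addresses below are purely illustrative), the base config for each switch could be rendered from a small template and then served out by the Zeroconf/ZTP process:

```python
# Minimal sketch only: render a per-switch base config from a Jinja2 template
# so it can be served to the switches during zero-touch provisioning.
# Switch names, management IPs and the template body are illustrative.
from jinja2 import Template

BASE_TEMPLATE = Template("""\
system {
    host-name {{ hostname }};
}
interfaces {
    em0 {
        unit 0 {
            family inet {
                address {{ mgmt_ip }};
            }
        }
    }
}
""")

# In practice this data would come from Netbox rather than being hardcoded.
SWITCHES = {
    "lsw1-a2-codfw": {"mgmt_ip": "192.0.2.11/24"},
    "lsw1-a3-codfw": {"mgmt_ip": "192.0.2.12/24"},
}

for hostname, data in SWITCHES.items():
    with open(f"{hostname}.conf", "w") as fh:
        fh.write(BASE_TEMPLATE.render(hostname=hostname, **data))
```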

Configuration

Most of the Juniper configuration for these devices is currently automated through Homer. The one element that needs to be added is the BGP EVPN neighbor configuration (see T327934).

IP Allocations / Netbox

One element we may want to improve on is IP allocation and device assignment in Netbox, as well as DNS zone generation. There are a lot of point-to-point links, new subnets, loopbacks, irb interfaces etc. to be added across all 17 devices. For Eqiad row E/F one-off scripting was used to generate some of this, but it may be worth developing more robust, re-usable scripting for this, as we'll likely need it again.
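As a sketch of what such a script might look like (assumptions: pynetbox, a parent prefix reserved for the new fabric's point-to-point links, and illustrative device/port names), the per-link /31 allocation could be driven from a simple list of links:

```python
# Rough sketch, not production tooling: allocate a /31 per spine<->leaf link
# from a parent prefix and assign one address to each end in Netbox.
# The parent prefix, device names and ports below are illustrative.
import ipaddress
import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")

# Hypothetical prefix set aside for the new fabric's point-to-point links.
parent = nb.ipam.prefixes.get(prefix="192.0.2.0/24")

links = [
    ("ssw1-a1-codfw", "et-0/0/1", "lsw1-a2-codfw", "et-0/0/55"),
    ("ssw1-a8-codfw", "et-0/0/1", "lsw1-a2-codfw", "et-0/0/56"),
]

for spine, spine_if, leaf, leaf_if in links:
    # Carve the next free /31 out of the parent prefix for this link.
    p2p = parent.available_prefixes.create(
        {"prefix_length": 31, "description": f"{spine}:{spine_if} <-> {leaf}:{leaf_if}"}
    )
    # Assign one address from the /31 to each end of the link.
    ends = ((spine, spine_if), (leaf, leaf_if))
    for (device, ifname), addr in zip(ends, ipaddress.ip_network(p2p.prefix)):
        iface = nb.dcim.interfaces.get(device=device, name=ifname)
        nb.ipam.ip_addresses.create(
            address=f"{addr}/31",
            assigned_object_type="dcim.interface",  # field names per recent Netbox versions
            assigned_object_id=iface.id,
        )
```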

Migration

Once all the new switches are in place, connected and configured we can begin the work of migrating existing hosts.

Bridge existing Vlans?

Similar to Eqiad row E/F, the plan will be to add a per-rack private subnet which will be the default for new hosts installed in each rack. Ultimately the desire is for all hosts requiring normal private-vlan connectivity to be moved to these new, per-rack Vlans.

Unfortunately some hosts, specifically our Ganeti servers, have a requirement to be on the same Vlan (for VM live migration) and in separate racks (for redundancy and operational flexibility). In the absence of any host-level L2-extension or routed solution to this (see T300152), we will likely need to provision a row-wide Vlan on the new switches for these hosts. The simplest option is probably to extend the existing private and public Vlans to the new switches and use those, as it avoids renumbering.

VC to EVPN switch connectivity

Extending the Vlans from the existing virtual-chassis to the new switches presents some challenges. As these are important production networks we need to have redundant connectivity in place. Connecting the VC master switches (2 and 7) to the EVPN Spine switches is probably the sensible way to physically connect the devices.

This gives us a problem in terms of L2 loop-prevention, however. If both Spines have independent trunks to the VC, with the same allowed Vlans, we'll create a loop. One solution that pops into my head is to create an ESI-LAG between the Spines and connect to the VC from that. Alternatively, we could look at using Spanning Tree or other options.

Renumbering

If we don't extend the existing Vlans to the new switches we will need to renumber hosts when their physical connections are moved from old to new. And even if we do extend the Vlans, it might make sense to renumber hosts at this point anyway (only one interruption for the host, and we have to do it eventually).

To allow for renumbering, some development will need to happen to support a "--renumber" toggle for the reimage cookbook, which should delete the host's existing IP allocation and add a new one.
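To make the idea a bit more concrete, here is a minimal sketch of what the Netbox side of a hypothetical --renumber step could do. This is not the actual reimage cookbook; the per-rack vlan naming (private1-<rack>-<site>) and the host interface name are assumptions.

```python
# Sketch only, not the real reimage cookbook: what a --renumber step might do
# in Netbox. Assumes per-rack vlans follow a private1-<rack>-<site> naming
# scheme; the host interface name is illustrative.
import pynetbox

def renumber_host(nb, hostname):
    """Drop the host's current primary IPv4 and allocate one from its rack's new subnet."""
    device = nb.dcim.devices.get(name=hostname)
    rack = device.rack.name.lower()   # e.g. "a4"
    site = device.site.slug           # e.g. "codfw"

    # Find the per-rack private prefix, assuming the private1-<rack>-<site> naming.
    vlan = nb.ipam.vlans.get(name=f"private1-{rack}-{site}")
    prefix = nb.ipam.prefixes.get(vlan_id=vlan.id, family=4)

    # Delete the existing allocation...
    old_ip = device.primary_ip4
    if old_ip is not None:
        nb.ipam.ip_addresses.get(old_ip.id).delete()

    # ...and allocate a new one from the per-rack subnet.
    iface = nb.dcim.interfaces.get(device=hostname, name="eno1")  # interface name illustrative
    new_ip = prefix.available_ips.create(
        {"assigned_object_type": "dcim.interface", "assigned_object_id": iface.id}
    )
    device.primary_ip4 = new_ip.id
    device.save()
    return new_ip.address

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")
print(renumber_host(nb, "example2001"))  # hostname is just an example
```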

Renumbering presents additional challenges in terms of services running on the hosts, if they come back online with different IPs. A few things we need to consider (there are likely more):

  • DNS needs to be updated; old entries can still be in DNS caches
    • Is it possible to change the DNS TTLs in advance to help us here?
  • We may have hardcoded IPs in puppet for certain things. The renumbering script could possibly perform a git grep for the IP across multiple repositories to look for these (like the decommissioning cookbook does; see the sketch after this list):
    • Puppet
    • Puppet private
    • Mediawiki-config
    • Deployment charts
    • homer-public
  • DNS records resolved at catalog compile time by the Puppet master, and those resolved for example by ferm at reload time (but it could be any other service), will need to be updated, either by forcing a Puppet run, doing a ferm reload, or doing a specific service reload/restart.
  • Databases:
    • DB grants are issued per-IP
    • mediawiki connects to the DB via IP
    • dbctl has the IPs of the servers and passes them to the mediawiki config stored in etcd
    • Backend servers behind LVS: TBD
    • Ganeti servers: depends on the whole Ganeti discussion
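As referenced in the hardcoded-IPs bullet above, a rough sketch of the git grep idea (the local clone paths are illustrative, and this is separate from the existing decommission cookbook):

```python
# Sketch of the "grep for hardcoded IPs" idea; local clone paths are illustrative.
import subprocess

REPOS = [
    "/srv/git/operations/puppet",
    "/srv/git/puppet-private",          # path illustrative
    "/srv/git/operations/mediawiki-config",
    "/srv/git/operations/deployment-charts",
    "/srv/git/operations/homer/public",
]

def find_hardcoded_ip(ip):
    """Return a mapping of repo path -> matching lines for a literal IP string."""
    hits = {}
    for repo in REPOS:
        result = subprocess.run(
            ["git", "-C", repo, "grep", "-n", "--fixed-strings", ip],
            capture_output=True, text=True, check=False,
        )
        if result.stdout:
            hits[repo] = result.stdout.splitlines()
    return hits

for repo, lines in find_hardcoded_ip("192.0.2.42").items():  # example IP
    print(repo)
    for line in lines:
        print("  " + line)
```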

Related Objects

Status | Assigned
Resolved | cmooney
Resolved | Papaul
Resolved | cmooney
Resolved | Papaul
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | cmooney
Resolved | Papaul
Resolved | ayounsi
Resolved | cmooney

Event Timeline

There are a very large number of changes, so older changes are hidden.

Thanks for the summary!

Some additional notes/thoughts:

  • public1-a/b-codfw hosts might be better grouped in a single rack per row, still providing redundancy (4 racks per site), limiting wasted IPs, and making renumbering unnecessary

VC to EVPN switch connectivity

The current rows A and B have all their 40G ports in use, so unless we manage to decommission one switch (asw-b1-codfw, as that rack is being dedicated to WMCS) we will have to use 10G LAGs.
When we did a similar migration in the past we used a single LAG to prevent loops; in that case the switch on which the LAG terminates on the new fabric would be a SPOF. I don't have experience with ESI-LAG, so let's see what the trade-offs are.

On the renumbering, to help make some of the moves easier (especially the low-hanging fruit) and to test any automation script, one idea is to start renumbering hosts on their current switches.
For example, create private1-a4-codfw on the existing row A virtual-chassis and identify which hosts are not blockers (e.g. hosts not behind LVS, not Ganeti hosts, etc.).
Those hosts can then be easily migrated to the new fabric top-of-rack when it's ready: during a maintenance window, turn off the relevant vlan on the core router, move the hosts' cables, and start the prefix advertisement on the new fabric.

  • public1-a/b-codfw hosts might be better grouped in a single rack per row, still providing redundancy (4 racks per site), limiting wasted IPs, and making renumbering unnecessary

Sure. I guess the only drawback there is moving the servers which may already be spread out. But overall it works.

VC to EVPN switch connectivity

The current rows A and B have all their 40G ports in use, so unless we manage to decommission one switch (asw-b1-codfw, as that rack is being dedicated to WMCS) we will have to use 10G LAGs.

Maybe we should aim for that. If we do, we should be mindful of the issues we've seen before when changing vc-ports to regular trunks. But hopefully the upcoming upgrades will prevent any similar funny stuff.

When we did a similar migration in the past we used a single LAG to prevent loops; in that case the switch on which the LAG terminates on the new fabric would be a SPOF. I don't have experience with ESI-LAG, so let's see what the trade-offs are.

Yeah, if we can tolerate the SPOF it's certainly easier than implementing the multi-chassis solution. In terms of ESI-LAG I've not done it before either; I assume it's relatively straightforward and reliable (it's Juniper's recommended approach these days). But it would definitely require some decent research/learning/testing time, so if we can avoid it, great.

@ayounsi @Papaul one other thing we didn't discuss last week was QSFP28 optics for the 100G switch -> switch links (and CR uplinks). We used 100GBase-CWDM4s in Eqiad, with duplex single-mode fiber. It didn't work out that much more expensive than using 100GBase-SR4, due to the MPO / multi-core fiber those need being pricier. But we also had the cross-cage links there, so regular LC connectors were required for some of them, a constraint we don't have in codfw.

I've no particular preference, if I were doing it myself probably a slight one for the CWDM4/LC links, but happy to go with whatever the consensus/cheapest is.

I've no particular preference, if I were doing it myself probably a slight one for the CWDM4/LC links, but happy to go with whatever the consensus/cheapest is.

No strong preference; consistency with eqiad makes sense to me, or whatever is easiest for @Papaul. We won't be able to re-use the existing cabling, as the two fabrics will run in parallel for a while. And the current infra is 40G, so we won't be re-using the optics either.

@Papaul in terms of cables, what we will need is as follows. I'm assuming here that we go with 100GBase-CWDM4, and therefore single-mode LC-LC links. If you'd prefer we use multi-mode cables with MPO connectors we can revise.

1: Cables

Based on that we'll need 32 x single-mode LC-LC fibers, but I'm unsure of the exact lengths between all the racks. See table in task description for full list.

2: Optic Modules

To terminate these either side we will need to order 64 x 100GBase-CWDM4 optics, plus I'd recommend getting 2 spares. So 66 in total.

Other considerations

CR Links

I'm assuming here we can run 100G links to the QSFP28 ports on the CRs. Based on current usage we should be able to add 1 x 100G link on port 2 or 5 of either PIC on the MPC7E cards. Currently they're all at 120G (3 x 40G in use); adding 100G brings that to 220G, under the 240G limit.
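As a trivial sanity check of that arithmetic (the 240G per-PIC figure is the one quoted above, not taken from a datasheet):

```python
# Quick check of the per-PIC bandwidth sums discussed above.
PIC_LIMIT_G = 240             # per-PIC limit as quoted above
current_g = 3 * 40            # three 40G links currently in use
proposed_g = current_g + 100  # adding one 100G spine uplink

assert proposed_g <= PIC_LIMIT_G
print(f"{current_g}G in use now, {proposed_g}G after adding 100G (limit {PIC_LIMIT_G}G)")
```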

NOTE: We could potentially also do "cross links" from each CR to both Spines. At 100G this would use up all remaining bandwidth on the MPC7Es. Or we could decide to cable it like this but run at 40G, to increase redundancy but not bandwidth. @ayounsi interested to hear your thoughts; personally my instinct is to stick with the Spine1->CR1 and Spine2->CR2 setup, keeping things the same as Eqiad.

Migration Strategy / Direct links between VC and EVPN fabrics

These totals do not include cabling from the new Spines to existing virtual-chassis switches, which would be required if our plan is to bridge existing Vlans to the new switches (and thus allow us to move hosts from old to new switch without any change on hosts).

That question is tricky. Currently we have no QSFP+ free ports on the VC switches to facilitate such connections. Bringing cloudsw1-b1-codfw live, and migrating the cloud hosts in that rack to it, will free up 1 such port on asw-b2-codfw and asw-b7-codfw, which could then be re-used to connect to ssw1-a1-codfw/ssw1-a8-codfw.

But we don't have a similar option for row A, so I'm not sure what might be realistic here. Either way I suspect we could re-use the 40G optics currently in use if we re-use ports, or if we do something else like use 10G links we will need to order those closer to the time.

@cmooney I was about to update the table but I can't; only you can. Everything going from A1 to Bx and A8 to Bx should be 12m (x=1,2,3,4,5,6,7,8). I will get you the lengths from A1 to Ay and A8 to Ay (y=2,3,4,5,6,7) some time next week. Thanks

@ayounsi interested to hear your thoughts, personally my instinct is to stick with the Spine1->CR1 and Spine2->CR2 setup, keeping things the same as Eqiad.

Agreed!

I was about to update the table but I can't; only you can.

I copied the table to the task description

@ayounsi thanks for updating the desc!

@Papaul I'll update the table with the info provided and get back to you if any more questions.

I'll also put together a cost comparison of SR4/MPO vs CWDM4/LC-Duplex for the runs. There are quite a lot, and I want to make sure we're not needlessly wasting foundation funds, especially in the current climate.

@cmooney I updated the table with the lengths between all the racks.

RobH closed subtask Restricted Task as Resolved. May 1 2023, 12:22 PM
Papaul updated the task description.

@Papaul thanks for the work documenting the cable IDs. I've put the ones from above in Netbox now.

There is one discrepancy: the same label is listed for two different runs:

A8 | A1 | Spine<->Leaf | 1 | 8m | 230403800017
A8 | A8 | Spine<->Leaf within rack | 1 | 3m | 230403800017

I added those to Netbox with a generic label; can you check / confirm the right ones?

https://netbox.wikimedia.org/dcim/cables/?color=&length=&length_unit=&q=changeme_&site_id=9

Thanks!

Change 954684 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW

https://gerrit.wikimedia.org/r/954684

Change 954684 merged by Cathal Mooney:

[operations/dns@master] Add includes for IPv6 reverse ranges for new linknets from CRs to SSW

https://gerrit.wikimedia.org/r/954684

Change 954697 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Homer YAML additions for new row A/B switches in Codfw

https://gerrit.wikimedia.org/r/954697

Change 954893 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for Netbox generated dns for new per-rack codfw subnets

https://gerrit.wikimedia.org/r/954893

Change 954896 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add static network defs and DHCP config for new codfw subnets

https://gerrit.wikimedia.org/r/954896

Netbox cable ID update for ssw1-a8 to lsw1-a1 and lsw-a8

Change 954893 merged by Cathal Mooney:

[operations/dns@master] Add includes for Netbox generated dns for new per-rack codfw subnets

https://gerrit.wikimedia.org/r/954893

Change 954980 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Add includes for new /24s used in EVPN underlay network codfw

https://gerrit.wikimedia.org/r/954980

Change 954980 merged by Cathal Mooney:

[operations/dns@master] Add includes for new /24s used in EVPN underlay network codfw

https://gerrit.wikimedia.org/r/954980

cmooney closed subtask Restricted Task as Resolved. Sep 6 2023, 10:39 AM

@Papaul I've done some testing and I'm confident the IP GW moves for the row subnets to the Spines can be done gracefully. I've yet to work on BGP, but either way I think we need to plan out the links between existing switch rows and the new spines. As discussed we'll disconnect the links from CRs to ASWs to free up the ASW ports for these.

Ultimately we'll have:

Row | VC Switch | VC Switch Port | Spine Switch | Spine Switch Port
A | asw-a2-codfw | et-2/0/52 | ssw1-a1-codfw | et-0/0/29
A | asw-a7-codfw | et-7/0/52 | ssw1-a8-codfw | et-0/0/29
B | asw-b2-codfw | et-2/0/51 | ssw1-a1-codfw | et-0/0/30
B | asw-b7-codfw | et-7/0/52 | ssw1-a8-codfw | et-0/0/30

Those ports on the VC switches are in use at the moment though, for the uplinks to the CRs. So we need to move them one by one, coordinated with netops, while we make changes on the devices to move the GW IPs from the CRs to the Spines.

The ASWs have 40GBase-SR4 optics in them already, so we can re-use those. We can take the optics from the CRs and use them to terminate on the Spines, so we should be ok for modules. I'm not 100% sure if we need new multi-core/MPO multi-mode fibers, or if we can re-use the ones already in place (given they run between the same cabs).

Anyway just a heads up so you can be prepared. If you want me to open a separate task let me know. Thanks!

@cmooney thanks for the update. I think we can reuse those MPO fibers.

In terms of the LVS connections from rows C and D, when we move from old switches to new ones we need to land those on the Spines rather than on the top-of-racks as in the old design.

This needs to be carefully co-ordinated to not cause interruption, but in terms of the final cabling it will be like this:

LVS | Old Switch | Old Port | New Switch | New Port
lvs2013 | asw-a2-codfw | xe-2/0/43 | ssw1-a1-codfw | xe-0/0/32
lvs2014 | asw-a4-codfw | xe-4/0/47 | ssw1-a1-codfw | xe-0/0/33
lvs2013 | asw-b2-codfw | xe-2/0/43 | ssw1-a8-codfw | xe-0/0/32
lvs2014 | asw-b4-codfw | xe-4/0/47 | ssw1-a8-codfw | xe-0/0/33

We need to be careful to take note of this when migrating in cabs A2/A4/B2/B4.

Change 954896 merged by Cathal Mooney:

[operations/puppet@production] Add static network defs and DHCP config for new codfw subnets

https://gerrit.wikimedia.org/r/954896

Change 954697 merged by jenkins-bot:

[operations/homer/public@master] Homer YAML additions for new row A/B switches in Codfw

https://gerrit.wikimedia.org/r/954697

Change 959873 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Support configuration of EVPN anycast GW on switches

https://gerrit.wikimedia.org/r/959873

Change 959873 merged by jenkins-bot:

[operations/homer/public@master] Support configuration of EVPN anycast GW on switches

https://gerrit.wikimedia.org/r/959873

ayounsi mentioned this in Unknown Object (Task). Oct 2 2023, 7:14 AM

@cmooney adding a note here so we don't forget. We'll need to check how this will work for Ganeti VMs; in particular the makevm cookbook has knowledge of which DCs have per-rack subnets and treats them differently. If it now needs to be aware of rows as well, it will need some refactoring, and possibly to get that information live instead of having it hardcoded.

Change 965148 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add puppet elements for newly added switches.

https://gerrit.wikimedia.org/r/965148

Change 965148 merged by Cathal Mooney:

[operations/puppet@production] Add puppet elements for newly added switches.

https://gerrit.wikimedia.org/r/965148

@cmooney adding a note here so we don't forget. We'll need to check how this will work for Ganeti VMs; in particular the makevm cookbook has knowledge of which DCs have per-rack subnets and treats them differently. If it now needs to be aware of rows as well, it will need some refactoring, and possibly to get that information live instead of having it hardcoded.

Thanks. I think the plan really should be to keep the existing Ganeti logic, and not try to move the existing ganeti hosts to the per-rack vlans until we are in a position to move forward with T300152: Investigate Ganeti in routed mode. The logic we have at the POPs, with 2 racks, wouldn't be a good fit for our larger sites. We can support the legacy row-wide vlans until that is ready, and still migrate the remaining hosts to the new per-rack vlans.
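One way to avoid hardcoding which DCs (or racks) use per-rack subnets would be for the cookbook to ask Netbox directly. A rough sketch of that idea, assuming pynetbox and the private1-<rack>-<site> vlan naming used elsewhere in this task (this is not the actual makevm code):

```python
# Sketch only, not the actual makevm cookbook: determine from Netbox whether a
# rack has its own per-rack private vlan, rather than hardcoding a list of DCs.
# Assumes the private1-<rack>-<site> naming convention.
import pynetbox

def rack_has_per_rack_vlan(nb, site, rack):
    vlan = nb.ipam.vlans.get(name=f"private1-{rack.lower()}-{site}", site=site)
    return vlan is not None

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")
print(rack_has_per_rack_vlan(nb, "codfw", "A4"))  # True once private1-a4-codfw exists
print(rack_has_per_rack_vlan(nb, "codfw", "C3"))  # False while row C is still on the VC
```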

Change 973752 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add netboot config for new private vlans in codfw rows A/B

https://gerrit.wikimedia.org/r/973752

Change 973752 merged by Cathal Mooney:

[operations/puppet@production] Add netboot config for new private vlans in codfw rows A/B

https://gerrit.wikimedia.org/r/973752

cmooney renamed this task from "Plan codfw row A/B top-of-rack switch refresh" to "Codfw row A/B top-of-rack switch refresh". Jan 11 2024, 2:02 PM
cmooney closed this task as Resolved (edited). Mar 22 2024, 4:41 PM
cmooney claimed this task.

Closing this one. I've made some notes on wikitech (below) about how to approach these migrations for future rows.

https://wikitech.wikimedia.org/wiki/Migrate_from_VC_switch_stack_to_EVPN

We still have to renumber all the end devices, but we can deal with that separately from the migration to the new switches.

T354869: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets