Background
In 2024 Netops and DC-ops completed the upgrade of the network switches in all codfw racks to newer equipment.
The new switches are not configured as a row-wide "virtual chassis". Instead they are set up as individual devices and use EVPN/VXLAN to bridge the current row-wide vlans across multiple devices. The ultimate goal, however, is to migrate away from the row-wide vlans to per-rack vlans, matching the new network design, similar to that used in Eqiad rows E and F. The end result is a simplified, more scalable network with a per-rack redundancy model.
We are now in a position to start moving hosts from the old vlans/subnets to new ones. This will require co-ordination between the various service owners and netops, and the exact process will be different for different types of hosts.
Additional automation will need to be developed to aid us in performing these changes.
Basic Networking Changes
At the most basic level the following would be required to renumber a host:
- Depool and downtime the host so it is not serving any live traffic
- Update Netbox: assign the new IPs to the host interfaces and change the vlan configured on the connected switch port (see T350152)
- Adjust the following files on the host to reflect the new IPs and reboot the host:
  - /etc/network/interfaces
  - /etc/hosts
  - /etc/networks
- Run the DNS cookbook to update DNS entries to the new IPs
- Run the wipe-cache cookbook to clear the DNS recursors' caches for both direct and reverse records
- Push the updated configuration to the switch to change the connected vlan
- Adjust other elements as needed for the given type of host to function with the new IP, for example:
  - DB grants are issued based on IP address
  - Swift clusters use IPs as identifiers
  - Cassandra instances use IPs directly
  - Servers with BGP peering to CRs should instead BGP peer to the top-of-rack directly
  - etc.
- Repool the server
Steps 1-5 are where we are focusing our automation efforts currently. Step 6 is the most difficult part of the process, and is where we need to engage with the different service owners to plan and test for each type of host we have.
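As a rough illustration of the on-host file edits listed above, the sketch below replaces an old address with a new one in the text of a file such as /etc/hosts, after checking that the new address really sits inside the intended new subnet. This is only a minimal sketch and not the actual automation: the function name `renumber_text` is hypothetical, and the addresses are taken from the reserved documentation ranges rather than any real codfw subnet.

```python
import ipaddress
import re


def renumber_text(text: str, old_ip: str, new_ip: str, new_subnet: str) -> str:
    """Return `text` with each exact occurrence of old_ip replaced by new_ip.

    Hypothetical helper for illustration only. Raises ValueError if any
    argument is not a valid address/network, or if new_ip is outside
    new_subnet.
    """
    old = ipaddress.ip_address(old_ip)
    new = ipaddress.ip_address(new_ip)
    subnet = ipaddress.ip_network(new_subnet)
    if new not in subnet:
        raise ValueError(f"{new} is not inside {subnet}")

    # The lookarounds stop 192.0.2.1 from matching inside 192.0.2.10
    # (or inside a longer IPv6 address). A side effect is that an address
    # immediately followed by "." or ":" is deliberately left untouched.
    pattern = re.compile(
        r"(?<![0-9A-Fa-f:.])" + re.escape(str(old)) + r"(?![0-9A-Fa-f:.])"
    )
    return pattern.sub(str(new), text)


# Example with made-up documentation addresses and hostnames.
hosts_before = "192.0.2.1 host1001.example.org host1001\n192.0.2.10 other.example.org\n"
hosts_after = renumber_text(hosts_before, "192.0.2.1", "198.51.100.7", "198.51.100.0/24")
print(hosts_after)
```

Note that this only swaps the address literal. Related values such as the gateway and netmask in /etc/network/interfaces, or the subnet entries in /etc/networks, would need to be handled separately.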
We can create sub-tasks of this one to discuss and track the progress for all our various types of nodes.