
Re-IP codfw private baremetal hosts to new per-rack vlans/subnets
Open, Medium, Public

Description

Background

In 2024 Netops and DC-ops completed the upgrade of the network switches in all codfw racks to newer equipment.

The new switches are not configured as a row-wide "virtual chassis". Instead they are set up as individual devices, and use EVPN/VXLAN to bridge the current row-wide vlans across multiple devices. The ultimate goal, however, is to migrate away from the row-wide vlans to per-rack vlans, following a network design similar to that used in Eqiad rows E and F. The end result is a simpler, more scalable network with a per-rack redundancy model.

We are now in a position to start moving hosts from the old vlans/subnets to new ones. This will require co-ordination between the various service owners and netops, and the exact process will be different for different types of hosts.
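As a rough illustration of what this means per host, the sketch below checks whether an address still sits in an old row-wide subnet and, if not, which per-rack subnet it belongs to. All subnets, rack names and addresses here are made-up documentation ranges (assumptions for illustration, not the real codfw allocations), and the function names are hypothetical.

```python
import ipaddress

# Hypothetical values for illustration only (RFC 5737 documentation ranges),
# NOT the real codfw allocations.
OLD_ROW_SUBNET = ipaddress.ip_network("198.51.100.0/24")
NEW_RACK_SUBNETS = {
    "rack-1": ipaddress.ip_network("203.0.113.0/26"),
    "rack-2": ipaddress.ip_network("203.0.113.64/26"),
}


def needs_renumbering(host_ip: str) -> bool:
    """Return True if the address is still inside the old row-wide subnet."""
    return ipaddress.ip_address(host_ip) in OLD_ROW_SUBNET


def rack_for_ip(host_ip: str):
    """Return the name of the per-rack subnet containing the address, or None."""
    addr = ipaddress.ip_address(host_ip)
    for rack, subnet in NEW_RACK_SUBNETS.items():
        if addr in subnet:
            return rack
    return None
```

In practice the source of truth for these allocations would be Netbox rather than a hard-coded dictionary.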

Additional automation will need to be developed to aid us in performing these changes.

Basic Networking Changes

At the most basic level the following would be required to renumber a host:

  1. Depool and downtime the host so it is not serving any live traffic
  2. Update Netbox: assign the new IPs to the host interfaces and change the vlan configured on the connected switch port (see T350152)
  3. Adjust the following files on the host to reflect the new IPs and reboot the host:
    1. /etc/network/interfaces
    2. /etc/hosts
    3. /etc/networks
  4. Run the DNS cookbook to update DNS entries to the new IPs
  5. Run the wipe-cache cookbook to clear the DNS recursors' caches for both direct and reverse records
  6. Push the updated configuration to the switch to change the connected vlan
  7. Adjust other elements as needed for the given type of host to function with the new IP, for example:
    1. DB grants are issued based on IP address
    2. Swift clusters use IPs as identifiers
    3. Cassandra instances use IPs directly
    4. Servers with BGP peering to CRs should instead BGP peer to the top-of-rack directly
    5. etc. etc.
  8. Repool the server
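Step 3 amounts to swapping the old address for the new one in a handful of text files. A minimal sketch of that substitution is shown below, assuming a plain textual replacement is sufficient. The function name and the addresses are hypothetical, and this is not how the actual automation is implemented. The lookarounds guard against rewriting a longer address that merely contains the old one.

```python
import re


def replace_ip(text: str, old_ip: str, new_ip: str) -> str:
    """Replace whole occurrences of old_ip in text with new_ip.

    The lookbehind/lookahead assertions prevent partial matches, e.g.
    "198.51.100.2" must not match inside "198.51.100.20".
    """
    pattern = re.compile(
        r"(?<!\d)(?<!\d\.)" + re.escape(old_ip) + r"(?!\d)(?!\.\d)"
    )
    return pattern.sub(new_ip, text)


# Example with made-up addresses (documentation ranges, not real hosts).
sample = "address 198.51.100.2/24\n198.51.100.2 host1\n198.51.100.20 host2\n"
updated = replace_ip(sample, "198.51.100.2", "203.0.113.5")
```

Note that a new subnet usually also means a different gateway and possibly a different netmask in /etc/network/interfaces, which a simple address swap like this does not cover.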

Steps 1-5 are where we are currently focusing our automation efforts. Step 7 is the most difficult part of the process, and is where we need to engage with the different service owners to plan and test for each type of host we have.
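To make the ordering concrete, here is a minimal sketch of how the automatable steps could be chained so that a failure stops the run before later steps touch the host. It is an illustration only: `run_steps` and the placeholder actions are hypothetical and do not reflect the real cookbook code.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            print(f"FAIL: {name}: {exc}")
            break
        print(f"ok: {name}")
        completed.append(name)
    return completed


# Placeholder actions (no-ops); real automation would call the relevant tooling.
PLAN: List[Step] = [
    ("depool and downtime", lambda: None),
    ("update Netbox", lambda: None),
    ("rewrite host network files and reboot", lambda: None),
    ("run DNS cookbook", lambda: None),
    ("wipe DNS recursor caches", lambda: None),
]
```

Calling `run_steps(PLAN)` runs the placeholders in order. Returning the completed step names makes it easy to see where a run stopped and what may need to be rolled back.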

We can create sub-tasks of this one to discuss and track the progress for all our various types of nodes.

Related Objects

Status     Subtype           Assigned
Open                         None
Resolved                     Papaul
Open                         None
Resolved                     Clement_Goubert
Open                         None
Open                         None
Open                         None
Open                         Volans
Resolved                     cmooney
Resolved                     cmooney
Declined                     None
Resolved                     akosiaris
Resolved                     Jhancock.wm
Resolved                     None
Resolved                     Jhancock.wm
Duplicate                    None
Duplicate                    None
Resolved                     Jhancock.wm
Duplicate                    None
Duplicate                    None
Resolved                     MoritzMuehlenhoff
Resolved                     Jhancock.wm
Invalid                      None
Resolved   PRODUCTION ERROR  Clement_Goubert
Resolved                     JMeybohm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     Jhancock.wm
Resolved                     None
Resolved                     Jelto
Resolved                     Dzahn
Open                         Dzahn
Open                         None

Event Timeline

cmooney triaged this task as Medium priority. Jan 11 2024, 2:38 PM
cmooney created this task.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host phab2002.codfw.wmnet with OS bullseye

ayounsi renamed this task from "Re-IP hosts on codfw row A and B to new per-rack vlans/subnets" to "Re-IP codfw private baremetal hosts to new per-rack vlans/subnets". Oct 17 2024, 7:46 AM

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host phab2002.codfw.wmnet with OS bullseye executed with errors:

  • phab2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Host successfully migrated to the new VLAN
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410161833_dzahn_2685975_phab2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console phab2002.codfw.wmnet" to get a root shell, but depending on the failure this may not work.