Re-IP codfw private baremetal hosts to new per-rack vlans/subnets
Open, Medium, Public

Assigned To

None

Authored By

	cmooney
	Jan 11 2024, 2:38 PM

Description

Background

In 2024 Netops and DC-ops completed the upgrade of the network switches in all codfw racks to newer equipment.

The new switches are not configured as a row-wide "virtual chassis". Instead they are set up as individual devices, and use EVPN/VXLAN to bridge the current row-wide vlans across multiple devices. The ultimate goal, however, is to migrate away from the row-wide vlans to per-rack vlans, in line with the new network design, similar to that used in Eqiad rows E and F. The end result is a simpler, more scalable network with a per-rack redundancy model.

We are now in a position to start moving hosts from the old vlans/subnets to new ones. This will require co-ordination between the various service owners and netops, and the exact process will be different for different types of hosts.

Additional automation will need to be developed to aid us in performing these changes.
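To illustrate the row-wide versus per-rack split described above, here is a minimal sketch that checks whether a host address still sits in a legacy row-wide subnet and looks up the per-rack subnet it would move to. All prefixes, vlan names and rack names are made up (RFC 5737 documentation ranges); the real allocations live in Netbox, and the helper names are hypothetical.

```python
import ipaddress

# Hypothetical data for illustration only (RFC 5737 documentation ranges).
ROW_WIDE = {"private1-a-example": ipaddress.ip_network("198.51.100.0/24")}
PER_RACK = {
    "a1": ipaddress.ip_network("203.0.113.0/26"),
    "a2": ipaddress.ip_network("203.0.113.64/26"),
}


def needs_renumbering(ip: str) -> bool:
    """Return True if the address is still inside a legacy row-wide subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ROW_WIDE.values())


def target_subnet(rack: str) -> ipaddress.IPv4Network:
    """Return the per-rack subnet for a host in the given rack."""
    return PER_RACK[rack.lower()]


if __name__ == "__main__":
    print(needs_renumbering("198.51.100.17"))
    print(target_subnet("A2"))
```

In practice this lookup would be driven by Netbox data rather than a hard-coded dictionary.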

Basic Networking Changes

At the most basic level the following would be required to renumber a host:

1. Depool and downtime the host so it is not serving any live traffic.
2. Update Netbox: assign new IPs to the host interfaces and change the vlan configured on the connected switch port (see T350152).
3. Adjust the following files on the host to reflect the new IPs, then reboot the host:
   1. /etc/network/interfaces
   2. /etc/hosts
   3. /etc/networks
4. Run the DNS cookbook to update DNS entries to the new IPs.
5. Run the wipe-cache cookbook to clear the DNS recursors' cache for both forward and reverse records.
6. Push the updated configuration to the switch to change the connected vlan.
7. Adjust other elements as needed for the given type of host to function with the new IP, for example:
   1. DB grants are issued based on IP address
   2. Swift clusters use IPs as identifiers
   3. Cassandra instances use IPs directly
   4. Servers with BGP peering to CRs should instead BGP peer to the top-of-rack switch directly
8. Repool the server.

Steps 1-6 are where we are focusing our automation efforts currently. Step 7 is the most difficult part of the process, and is where we need to engage with the different service owners to plan and test for each type of host we have.
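As a rough illustration of the step that adjusts /etc/network/interfaces, /etc/hosts and /etc/networks, the sketch below swaps one IPv4 address for another in a file's text without touching longer addresses that merely start with the same digits. The function name and the example addresses are hypothetical, IPv6 is not handled, and this is not the logic of the actual cookbooks.

```python
import re


def replace_ip(text: str, old_ip: str, new_ip: str) -> str:
    """Replace whole occurrences of old_ip with new_ip in a config file's text."""
    # The lookarounds stop 198.51.100.1 from matching inside 198.51.100.17
    # or inside a longer dotted string.
    pattern = re.compile(r"(?<![\d.])" + re.escape(old_ip) + r"(?!\d|\.\d)")
    # A function replacement avoids any escape processing of new_ip.
    return pattern.sub(lambda _match: new_ip, text)


if __name__ == "__main__":
    # Made-up addresses from the RFC 5737 documentation ranges.
    interfaces = (
        "iface eno1 inet static\n"
        "    address 198.51.100.17/24\n"
        "    gateway 198.51.100.1\n"
    )
    updated = replace_ip(interfaces, "198.51.100.17", "203.0.113.70")
    updated = replace_ip(updated, "198.51.100.1", "203.0.113.65")
    print(updated)
```

Note that a real change also has to update the prefix length and any network entries, which a plain address swap does not cover.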

We can create sub-tasks of this one to discuss and track the progress for all our various types of nodes.
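For the DNS update and wipe-cache steps in the list above, it can help to see which record names are affected by a re-IP. The sketch below, which uses only Python's standard ipaddress module, lists the forward name plus the reverse (PTR) names for the old and new addresses. The function name, hostname and addresses are hypothetical, and the real cookbooks may work differently.

```python
import ipaddress


def dns_records_to_refresh(fqdn: str, old_ip: str, new_ip: str) -> list[str]:
    """List the forward name and the reverse (PTR) names affected by a re-IP."""
    return [
        fqdn,
        # reverse_pointer yields the in-addr.arpa (or ip6.arpa) name for an address.
        ipaddress.ip_address(old_ip).reverse_pointer,
        ipaddress.ip_address(new_ip).reverse_pointer,
    ]


if __name__ == "__main__":
    # Made-up host and addresses (RFC 5737 documentation ranges).
    for name in dns_records_to_refresh(
        "host1.example.org", "198.51.100.17", "203.0.113.70"
    ):
        print(name)
```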

Related Objects

Status	Subtype	Assigned	Task
Open		None	T354869 Re-IP codfw private baremetal hosts to new per-rack vlans/subnets
Resolved		Papaul	T352883 Test IP-renumbering on kubestage2002.codfw.wmnet
Open		None	T350152 Automation to change a server's vlan
Resolved		Clement_Goubert	T352893 Update puppet's topology.kubernetes.io/zone logic to take into account the new setup
Open		None	T354871 Re-IP hosts running Cassandra to per-rack subnets in codfw row A and B.
Open		None	T354872 Re-IP Swift hosts to per-rack subnets in codfw row A and B.
Open		None	T354878 Re-IP db servers in codfw row A/B moving to per-rack subnets
Open		Volans	T360029 Integrate dbctl IP changes as part of VLAN changes.
Resolved		cmooney	T352918 Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan
Resolved		cmooney	T352920 Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan
Declined		None	T360772 Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw
Resolved		akosiaris	T372878 Re-IP wikikube servers in codfw row A/B moving to per-rack subnets
Resolved		Jhancock.wm	T372916 Relabel codfw kubernetes nodes
Resolved		None	T373401 Relabel codfw kubernetes nodes
Resolved		Jhancock.wm	T373457 Relabel codfw kubernetes nodes
Duplicate		None	T373491 Relabel codfw kubernetes nodes
Duplicate		None	T373505 Relabel codfw kubernetes nodes
Resolved		Jhancock.wm	T373591 Relabel codfw kubernetes nodes
Duplicate		None	T373669 Relabel codfw kubernetes nodes mw2295,mw2296,mw2297
Duplicate		None	T373699 Relabel codfw kubernetes nodes mw237[789]
Resolved		MoritzMuehlenhoff	T373819 Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd
Resolved		Jhancock.wm	T373916 Relabel codfw kubernetes nodes
Invalid		None	T373934 Update iDRAC on mw2260.codfw.wmnet
Resolved	PRODUCTION ERROR	Clement_Goubert	T373982 wikikube-worker2080.codfw.wmnet can't auth to registry
Resolved		JMeybohm	T374019 kubernetes2035 (renamed to wikikube-worker2087) reporting "Comm Error: Backplane 0"
Resolved		Jhancock.wm	T374249 Relabel codfw kubernetes nodes
Resolved		Jhancock.wm	T374258 Comm Error: backplane 0 when reimaging wikikube-worker2095
Resolved		Jhancock.wm	T374380 Relabel codfw kubernetes nodes
Resolved		Jhancock.wm	T374622 Relabel codfw kubernetes nodes mw2390 and mw2394-mw2399
Resolved		Jhancock.wm	T374733 Relabel codfw kubernetes nodes
Resolved		Jhancock.wm	T375398 Relabel codfw kubernetes nodes mw2424 and mw2425
Resolved		None	T381244 Relabel codfw kubernetes nodes
Resolved		Jelto	T377374 reimage physical collab servers in legacy codfw VLANs
Resolved		Dzahn	T377396 PuppetFailure - phab2002
Open		Dzahn	T377643 SystemdUnitFailed - phab2002 - phabricator_task_dump / fix mysql grants for phab2002 after IP change
Open		None	T377534 Prepare/deploy new IPs for codfw cp nodes

Event Timeline

cmooney triaged this task as Medium priority. (Jan 11 2024, 2:38 PM)

cmooney created this task.

Restricted Application added a subscriber: Aklapper. (Jan 11 2024, 2:38 PM)

cmooney added a subtask: T352883: Test IP-renumbering on kubestage2002.codfw.wmnet. (Jan 11 2024, 2:38 PM)

cmooney added a subtask: T350152: Automation to change a server's vlan.

cmooney added a subtask: T352893: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup.

cmooney added a subtask: T354871: Re-IP hosts running Cassandra to per-rack subnets in codfw row A and B.. (Jan 11 2024, 2:46 PM)

Volans updated the task description. (Jan 11 2024, 3:42 PM)

cmooney added a subtask: T354872: Re-IP Swift hosts to per-rack subnets in codfw row A and B.. (Jan 11 2024, 3:48 PM)

cmooney added a subtask: T354878: Re-IP db servers in codfw row A/B moving to per-rack subnets. (Jan 11 2024, 3:55 PM)

Clement_Goubert mentioned this in T354791: Reclaim jobrunner hardware for k8s. (Jan 18 2024, 12:45 PM)

Clement_Goubert mentioned this in T351074: Move servers from the appserver/api cluster to kubernetes.

cmooney updated the task description. (Jan 18 2024, 1:46 PM)

cmooney added a subtask: T352918: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan. (Jan 18 2024, 5:54 PM)

cmooney added a subtask: T352920: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan.

Clement_Goubert closed subtask T352893: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup as Resolved. (Jan 22 2024, 10:33 AM)

Clement_Goubert closed subtask T352883: Test IP-renumbering on kubestage2002.codfw.wmnet as Resolved.

cmooney closed subtask T352918: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan as Resolved. (Feb 29 2024, 9:00 PM)

cmooney closed subtask T352920: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan as Resolved. (Mar 5 2024, 7:06 PM)

cmooney added a subtask: T360772: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw. (Mar 22 2024, 12:58 PM)

cmooney mentioned this in T327938: Codfw row A/B top-of-rack switch refresh. (Mar 22 2024, 4:42 PM)

Clement_Goubert changed the status of subtask T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets from Open to In Progress. (Aug 26 2024, 1:32 PM)

akosiaris mentioned this in T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets. (Sep 2 2024, 1:01 PM)

cmooney closed subtask T360772: Move public-vlan host BGP peerings from CRs to top-of-rack switches in codfw as Declined. (Oct 10 2024, 12:08 PM)

akosiaris closed subtask T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets as Resolved. (Oct 16 2024, 2:31 PM)

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host phab2002.codfw.wmnet with OS bullseye

ayounsi renamed this task from "Re-IP hosts on codfw row A and B to new per-rack vlans/subnets" to "Re-IP codfw private baremetal hosts to new per-rack vlans/subnets". (Oct 17 2024, 7:46 AM)

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host phab2002.codfw.wmnet with OS bullseye executed with errors:

phab2002 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Host successfully migrated to the new VLAN
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410161833_dzahn_2685975_phab2002.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console phab2002.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

cmooney updated the task description. (Oct 17 2024, 9:54 AM)

Jelto closed subtask T377374: reimage physical collab servers in legacy codfw VLANs as Resolved. (Oct 23 2024, 1:56 PM)

Dzahn reopened subtask T377374: reimage physical collab servers in legacy codfw VLANs as In Progress. (Oct 23 2024, 7:08 PM)

Dzahn closed subtask T377374: reimage physical collab servers in legacy codfw VLANs as Resolved.
