db2137 and es2026 don't get an IP via PXE boot
Closed, Resolved, Public
Assigned To

	cmooney
Authored By

	Marostegui
	Feb 20 2024, 6:00 AM

Description

Trying to reimage db2137 fails because the host isn't able to get an IP via PXE boot.
I can see the host attempting to PXE boot correctly:

CLIENT MAC ADDR: 2C EA 7F 3F E9 4B  GUID: 4C4C4544-0035-5810-8043-B1C04F523333
DHCP....\

However, it times out after a while and attempts to boot from disk.

Same thing happens with es2026:

CLIENT MAC ADDR: BC 97 E1 57 BA 98  GUID: 4C4C4544-0058-4D10-8043-B9C04F513533
DHCP....\
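
When checking whether these DHCP requests ever reach the DHCP server, it can help to convert the MAC address from the PXE console format into the colon-separated form that DHCP logs commonly use. The following is a minimal sketch; the helper name `normalize_pxe_mac` is hypothetical, and the target log format is an assumption rather than something confirmed in this task.

```python
import re


def normalize_pxe_mac(pxe_line: str) -> str:
    """Return the MAC from a PXE 'CLIENT MAC ADDR:' line as lower-case, colon-separated hex.

    Assumption: the console prints six space-separated hex octets, as in the
    two examples in the task description.
    """
    match = re.search(
        r"CLIENT MAC ADDR:\s*((?:[0-9A-Fa-f]{2}\s+){5}[0-9A-Fa-f]{2})",
        pxe_line,
    )
    if match is None:
        raise ValueError(f"no client MAC found in: {pxe_line!r}")
    # Split on whitespace, then join the six octets with colons.
    return ":".join(match.group(1).split()).lower()


if __name__ == "__main__":
    line = "CLIENT MAC ADDR: 2C EA 7F 3F E9 4B  GUID: 4C4C4544-0035-5810-8043-B1C04F523333"
    print(normalize_pxe_mac(line))
```

The resulting string can then be used as a search term in whatever DHCP or switch logs are available.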

Related Objects

Status	Assigned	Task
Open	None	T356960 Upgrade hosts to MariaDB 10.6
Resolved	Marostegui	T358080 Upgrade es2 to MariaDB 10.6
Resolved	Marostegui	T354826 Re-arrange core multi-instance hosts
Resolved	cmooney	T357951 db2137 and es2026 don't get an IP via PXE boot

Event Timeline

Marostegui created this task. Feb 20 2024, 6:00 AM

Restricted Application added a subscriber: Aklapper. Feb 20 2024, 6:00 AM

The host was being reimaged into bookworm.
Please feel free to start the reimage yourself anytime.

ayounsi edited projects, added DC-Ops; removed netops, Infrastructure-Foundations. Feb 20 2024, 7:22 AM

Marostegui mentioned this in T354826: Re-arrange core multi-instance hosts. Feb 20 2024, 10:10 AM

By the way, the host is up with Bullseye if something needs to be checked locally.

Marostegui added a subscriber: wiki_willy. Feb 21 2024, 5:55 AM

Same thing happens with es2026 - I just updated the task description

Marostegui mentioned this in T358080: Upgrade es2 to MariaDB 10.6. Feb 21 2024, 12:20 PM

Marostegui added a parent task: T358080: Upgrade es2 to MariaDB 10.6.

Do they have 10G NICs? Is the NIC firmware at the correct version? See Dell_Documentation#Urgent_Firmware_Revision_Notices

All our DBs have 1G and 10G ports, but we only use 1G ones:

root@es2026:~# ethtool eno3 | grep Speed
	Speed: 1000Mb/s

root@db2137:~# ethtool eno1 | grep Speed
	Speed: 1000Mb/s
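
To run the same speed check across several hosts or interfaces, the `Speed:` line can be parsed into a number. This is a minimal sketch that only handles output shaped like the two samples above; the function name `parse_link_speed_mbps` is hypothetical.

```python
import re


def parse_link_speed_mbps(ethtool_output: str) -> int:
    """Extract the negotiated link speed in Mb/s from `ethtool <iface>` output.

    Raises ValueError when no numeric 'Speed: <N>Mb/s' line is present
    (for example, if the speed is reported as a non-numeric value).
    """
    match = re.search(r"^\s*Speed:\s*(\d+)Mb/s", ethtool_output, re.MULTILINE)
    if match is None:
        raise ValueError("no numeric Speed line found in ethtool output")
    return int(match.group(1))


if __name__ == "__main__":
    sample = "\tSpeed: 1000Mb/s\n"
    speed = parse_link_speed_mbps(sample)
    # 1000 Mb/s corresponds to a 1G port, 10000 Mb/s to a 10G port.
    print("1G" if speed == 1000 else f"{speed} Mb/s")
```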

@wiki_willy could you help here? Thanks!

These two hosts belong to racks that have been recabled lately
db2137 - B5 - T355549
es2026 - A4 - T355863

@Marostegui this is a quirk with DHCP for devices connected to those new switches on the older vlans.

The best way forward is probably to move them to new IP addressing on vlans private1-b5-codfw and private1-a4-codfw respectively. Alternatively, we can make a temporary config change to enable DHCP to work from that switch on the legacy vlans (and change the IPs at a later stage).
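
The two vlan names above suggest a per-rack naming pattern. The sketch below derives the vlan name from a rack identifier; the pattern `private1-<rack>-<site>` is inferred only from the two names mentioned in this comment and is an assumption, as are the names `private_vlan_name` and `HOST_RACKS`.

```python
# Rack locations as stated earlier in this task (db2137 in B5, es2026 in A4).
HOST_RACKS = {"db2137": "B5", "es2026": "A4"}


def private_vlan_name(rack: str, site: str = "codfw") -> str:
    """Build the per-rack private vlan name.

    Assumption: names follow 'private1-<rack lower-case>-<site>', which matches
    the two examples in this task but is not confirmed as a general rule.
    """
    return f"private1-{rack.lower()}-{site}"


if __name__ == "__main__":
    for host, rack in HOST_RACKS.items():
        print(host, private_vlan_name(rack))
```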

@cmooney I am not sure I am following. Does this mean all the hosts migrated to those switches will fail to get reimaged until they are moved to the new vlans (which I believe implies changing IPs too?)?

In T357951#9563402, @Marostegui wrote:

@cmooney I am not sure I am following. Does this mean all the hosts migrated to those switches will fail to get reimaged until they are moved to the new vlans (which I believe implies changing IPs too?)?

Yeah, that's the situation right now, unfortunately; it's an edge case we didn't test for. For now we need to make some manual adjustments on the CRs to support a reimage in one of the new racks.

Once we have the last servers migrated to the new switches (by Tues Mar 5th) we can make that change permanently on the CRs as the old switches won't be in use. In fact we can make the change permanent for row A after today's move of rack A8, so just row B will be affected from tomorrow through Mar 5th.

Thank you, that should definitely help.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host es2026.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host es2026.codfw.wmnet with OS bookworm completed:

es2026 (WARN)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402211444_cmooney_1282075_es2026.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host db2137.codfw.wmnet with OS bookworm

++ @Jhancock.wm for visibility and in case any onsite support is needed

Jhancock.wm moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board. Feb 21 2024, 3:58 PM

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host db2137.codfw.wmnet with OS bookworm completed:

db2137 (WARN)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402211547_cmooney_1294951_db2137.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
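
Both summaries above share the same shape: a `host (STATUS)` header followed by `- ` step lines. A minimal sketch for splitting such a summary into its parts follows, so that a status such as WARN can be checked programmatically. The parser and its names (`ReimageSummary`, `parse_reimage_summary`) are hypothetical and assume only the layout seen in this task, not any documented cookbook output format.

```python
import re
from typing import List, NamedTuple


class ReimageSummary(NamedTuple):
    host: str
    status: str
    steps: List[str]


def parse_reimage_summary(text: str) -> ReimageSummary:
    """Parse a 'host (STATUS)' header followed by '- step' lines."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty summary")
    header = re.fullmatch(r"(\S+) \((\w+)\)", lines[0])
    if header is None:
        raise ValueError(f"unexpected header: {lines[0]!r}")
    # Keep only the step lines and drop their '- ' prefix.
    steps = [line[2:] for line in lines[1:] if line.startswith("- ")]
    return ReimageSummary(header.group(1), header.group(2), steps)


if __name__ == "__main__":
    sample = """db2137 (WARN)
- Forced PXE for next reboot
- Icinga status is not optimal, downtime not removed
"""
    summary = parse_reimage_summary(sample)
    print(summary.host, summary.status, len(summary.steps))
```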

All good, both hosts were reimaged fine. Thanks @cmooney for taking the time to explain and fix the issue.

Maintenance_bot added a project: SRE. Feb 21 2024, 4:29 PM
