db2137 and es2026 don't get an IP via PXE boot
Closed, Resolved, Public

Description

Reimaging db2137 fails because the host is unable to get an IP via PXE boot.
I can see the host attempting to PXE boot correctly:

CLIENT MAC ADDR: 2C EA 7F 3F E9 4B  GUID: 4C4C4544-0035-5810-8043-B1C04F523333
DHCP....\

However, it times out after a while and attempts to boot from disk.

Same thing happens with es2026:

CLIENT MAC ADDR: BC 97 E1 57 BA 98  GUID: 4C4C4544-0058-4D10-8043-B9C04F513533
DHCP....\
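
When checking DHCP server or switch logs for these requests, it can help to convert the MAC address from the form printed on the PXE screen into the usual colon-separated form. The following is a minimal sketch; the function name `normalize_pxe_mac` is hypothetical and not part of any existing tooling.

```python
import re


def normalize_pxe_mac(line: str) -> str:
    """Turn a PXE 'CLIENT MAC ADDR' line into a lowercase, colon-separated MAC.

    Hypothetical helper for illustration only.
    """
    # Six space-separated hex octets following the 'CLIENT MAC ADDR:' label.
    match = re.search(
        r"CLIENT MAC ADDR:\s*((?:[0-9A-Fa-f]{2}\s+){5}[0-9A-Fa-f]{2})", line
    )
    if match is None:
        raise ValueError(f"no PXE MAC address found in: {line!r}")
    return ":".join(match.group(1).split()).lower()


if __name__ == "__main__":
    print(normalize_pxe_mac(
        "CLIENT MAC ADDR: 2C EA 7F 3F E9 4B  GUID: 4C4C4544-0035-5810-8043-B1C04F523333"
    ))
```

The resulting string can then be used to search logs for the corresponding DHCP requests.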

Event Timeline

The host was being reimaged into bookworm.
Please feel free to start the reimage yourself anytime.

By the way, the host is up with Bullseye if something needs to be checked locally.

Marostegui renamed this task from db2137 doesn't get an IP via PXE boot to db2137 and es2026 don't get an IP via PXE boot. Feb 21 2024, 12:19 PM
Marostegui updated the task description.

The same thing happens with es2026. I just updated the task description.

Do they have 10G NICs? Is the NIC firmware at the correct version? See Dell_Documentation#Urgent_Firmware_Revision_Notices

All our DBs have 1G and 10G ports, but we only use the 1G ones:

root@es2026:~# ethtool eno3 | grep Speed
	Speed: 1000Mb/s
root@db2137:~# ethtool eno1 | grep Speed
	Speed: 1000Mb/s
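
If the same check needs to be repeated over captured `ethtool` output from several hosts, the speed line can be parsed programmatically. This is a minimal sketch that only parses text already collected; the function name `parse_ethtool_speed_mbps` is hypothetical.

```python
import re
from typing import Optional


def parse_ethtool_speed_mbps(ethtool_output: str) -> Optional[int]:
    """Return the link speed in Mb/s from ethtool output, or None if absent.

    Hypothetical helper: it only inspects text that was already captured.
    """
    match = re.search(r"Speed:\s*(\d+)Mb/s", ethtool_output)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Sample line as shown in the ethtool output above.
    print(parse_ethtool_speed_mbps("\tSpeed: 1000Mb/s"))
```

A result of `None` means no numeric speed was reported in the text, for example when the link is down.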

These two hosts belong to racks that have been recabled lately:
db2137 - B5 - T355549
es2026 - A4 - T355863

@Marostegui this is a quirk with DHCP for devices connected to those new switches on the older vlans.

The best way forward is probably to move them to new IP addressing on vlans private1-b5-codfw and private1-a4-codfw respectively. Alternatively, we can make a temporary config change to enable DHCP to work from that switch on the legacy vlans (and change the IPs at a later stage).
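
The two vlan names above suggest a per-rack naming pattern. The sketch below derives the name from a rack label; the pattern is an assumption inferred only from these two examples (B5 and A4 in codfw), and the function name `per_rack_private_vlan` is hypothetical.

```python
def per_rack_private_vlan(rack: str, site: str = "codfw") -> str:
    """Build a per-rack private vlan name such as 'private1-b5-codfw'.

    Assumption: the 'private1-<rack>-<site>' pattern is inferred from the two
    names mentioned in this task and may not hold for every rack or site.
    """
    return f"private1-{rack.strip().lower()}-{site}"


if __name__ == "__main__":
    # Racks taken from this task: db2137 is in B5, es2026 is in A4.
    for host, rack in {"db2137": "B5", "es2026": "A4"}.items():
        print(host, per_rack_private_vlan(rack))
```

Netbox remains the source of truth for the actual vlan assignment; the derived name is only a convenience for cross-checking.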

@cmooney I am not sure I am following. Does this mean all the hosts migrated to those switches will fail to get reimaged until they are moved to the new vlans (which I believe implies changing IPs too)?

Yeah, that's the situation right now, unfortunately. It's an edge case we didn't test for. For now we need to make some manual adjustments on the CRs to support a reimage in one of the new racks.

Once we have the last servers migrated to the new switches (by Tues Mar 5th) we can make that change permanently on the CRs as the old switches won't be in use. In fact we can make the change permanent for row A after today's move of rack A8, so just row B will be affected from tomorrow through Mar 5th.

Thank you, that should definitely help.

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host es2026.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host es2026.codfw.wmnet with OS bookworm completed:

  • es2026 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402211444_cmooney_1282075_es2026.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host db2137.codfw.wmnet with OS bookworm

wiki_willy added a subscriber: Jhancock.wm.

++ @Jhancock.wm for visibility and in case any onsite support is needed

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host db2137.codfw.wmnet with OS bookworm completed:

  • db2137 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402211547_cmooney_1294951_db2137.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Marostegui assigned this task to cmooney.

All good, both hosts were reimaged fine. Thanks @cmooney for taking the time to explain and fix the issue.