Multiple hosts in codfw fail to PXE boot upon reimage
Closed, Resolved (Public)

Description

These hosts fail to get an IP address when PXE booting: the IPMI commands to enable PXE boot succeed, but the DHCP request times out and the hosts boot from disk instead.

  • mw2379.codfw.wmnet
  • mw2380.codfw.wmnet
  • mw2383.codfw.wmnet

Perhaps similar to T355333

Event Timeline

All of these are connected to lsw1-a3-codfw (a new L3 switch), and they may be the first hosts we've tried to reimage while connected to a new switch.

Investigating whether this is related, or some teething problem with the new setup.

I think what's happening is that the new switch is not configured to insert the port information into DHCP requests on the legacy row-wide vlan.

The best way forward is to change them to the new per-rack vlan and retry. I discussed this briefly with @hnowlan on IRC and we'll try that now.

cmooney claimed this task.

Yeah, the issue here was that the hosts are connected to the new switches but still configured for the legacy vlan.

That's fine, but when re-imaging we need to move them to the new per-rack vlan so that DHCP works. There is an almost-ready cookbook for this that we will try to get over the line in the next few days (see here). Until it's merged we will need to move the server IPs in Netbox manually (happy to do so, just ping me).