
PXE boot failure on cloudvirt1023
Closed, Resolved, Public

Description

In response to the RAID alert T319025 I adjusted the HW raid and tried to reimage this host. It turns out that it won't PXE boot, though, so I can't reimage.

Booting from BRCM MBA Slot AF00 v214.0.241.0

Broadcom UNDI PXE-2.1 v214.0.241.0
Copyright (C) 2000-2019 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: B0 26 28 29 6A E0  GUID: 4C4C4544-005A-5910-8059-C4C04F515032
PXE-E51: No DHCP or proxyDHCP offers were received.

PXE-M0F: Exiting Broadcom PXE ROM.

Papaul suggests that this might be resolved by upgrading the NIC firmware to version 21.85.

Event Timeline

Same behavior as before:

PXE-E51: No DHCP or proxyDHCP offers were received.

@andew if the server is not in production, can I take a quick look at it?

@ayounsi @cmooney looks like we are having a situation similar to https://phabricator.wikimedia.org/T303296. The server racked in B7 is sending requests to the DHCP server but not getting a reply back. Can you please check this?

Thanks.

@Papaul ping me when you're around and I can walk you through it. TLDR is:
cloudsw1-c8-eqiad# deactivate vlans cloud-hosts1-eqiad forwarding-options dhcp-security option-82
In parallel I'll see if we can prioritize {T304677}

@Papaul were you able to make any progress with @ayounsi?

@Jclark-ctr @ayounsi is still looking into how he can prioritize T304677: Possible DHCP improvments

I'm not too worried about this particular host, but does it reflect an upcoming issue with all other cloudvirts, or at least all other cloudvirts on that rack?

Possibly all cloudvirts in that rack. I think you guys were in the process of moving those nodes to dedicated cloud racks; is that still doable?

I would not advise moving any of the cloudvirts other than 1023, since they're all likely to be decom'd next year (if not sooner) regardless. We /can/ move 1023 but my preference would be to have it just work in place until its upcoming replacement. @ayounsi can you advise about if/when it will be possible to rebuild that server, and lift the curse from the other hosts in that rack?

Cloudvirts in that rack are:

  • cloudvirt1017
  • cloudvirt1020
  • cloudvirt1022
  • cloudvirt1023

There are four other wmcs servers in the rack:

  • clouddumps2001
  • clouddb1016
  • cloudcephosd1001
  • cloudcephmon1001

The DHCP requests were making it to cloudsw1-c8 but no further. cloudsw1-c8 was not creating bindings either (so it was not processing them).

I enabled traceoptions for dhcpd:

[edit system]
+   processes {
+       dhcp-service {
+           traceoptions {
+               file dhcpd.log size 10m files 5;
+               level all;
+               flag all;
+           }
+       }
+   }

Which showed:

Nov 9 19:04:05.910515 [MSTR][NOTE] [default:default][RLY][INET][irb.1118] jdhcpd_add_interface_or_option82: Option-82 found in packet from intf irb.1118, trust-option-82 override not configured - dropping

Of course that doesn't show up in the normal logs...
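
For anyone hitting this again, here is a minimal Python sketch for spotting these drops in a saved copy of the trace file. The regular expression is modelled only on the single trace line quoted above, so other jdhcpd messages may well be formatted differently; the function name and the sample list are hypothetical and just for illustration.

```python
import re
from collections import Counter

# Pattern based solely on the one trace line quoted in this task.
# Assumption: other drop messages follow the same wording.
DROP_RE = re.compile(
    r"Option-82 found in packet from intf (?P<intf>[^,\s]+), "
    r"trust-option-82 override not configured - dropping"
)


def count_option82_drops(lines):
    """Count option-82 drop messages per receiving interface."""
    drops = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            drops[match.group("intf")] += 1
    return drops


# The first entry is the line from the trace above; the second is a placeholder.
sample = [
    "Nov 9 19:04:05.910515 [MSTR][NOTE] [default:default][RLY][INET][irb.1118] "
    "jdhcpd_add_interface_or_option82: Option-82 found in packet from intf "
    "irb.1118, trust-option-82 override not configured - dropping",
    "some unrelated trace line (placeholder)",
]

print(count_option82_drops(sample))
```

To run it against a real file, pass an open file handle instead of `sample`, since the function only needs an iterable of lines.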

From there I enabled trust-option-82

And DHCP worked.

The reason is that the cloud-hosts VLAN used to terminate directly on the core routers. It now terminates on cloudsw1-c8/d5 (c8 being the VRRP master), and the new DHCP relay daemon by default discards DHCP packets that arrive from other switches (here row B, but it would be the same with e4/f4) with option 82 already set.
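
As a rough illustration of that behaviour, the sketch below models the relay's decision in Python. It is a toy model built only from the trace message and the explanation above, not the actual jdhcpd logic, which likely checks more than this. `DhcpPacket`, `relay_accepts` and the `trust_option_82` parameter are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class DhcpPacket:
    client_mac: str
    # True if an upstream switch already inserted option 82 into the packet
    has_option_82: bool


def relay_accepts(packet: DhcpPacket, trust_option_82: bool = False) -> bool:
    """Simplified relay decision: drop packets that already carry
    option 82 unless the trust override is configured."""
    if packet.has_option_82 and not trust_option_82:
        return False
    return True


# A request from cloudvirt1023 that crossed another switch, which added option 82.
request = DhcpPacket(client_mac="b0:26:28:29:6a:e0", has_option_82=True)

print(relay_accepts(request))                        # default: dropped
print(relay_accepts(request, trust_option_82=True))  # with the override: relayed
```

In this model a host attached directly to the relaying switch (no option 82 in the packet) is accepted either way, which matches only hosts behind other switches being affected.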

So possibly using option 97 (T304677) would have fixed the issue too.

I manually pushed the change, and will add it to Homer tomorrow.

Change 855057 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] netboot cloudvirts: only preserve /srv on cloudvirt1028

https://gerrit.wikimedia.org/r/855057

Change 855057 merged by Andrew Bogott:

[operations/puppet@production] netboot cloudvirts: only preserve /srv on cloudvirt1028

https://gerrit.wikimedia.org/r/855057