Maniphest T319042

PXE boot failure on cloudvirt1023
Closed, Resolved (Public)

Assigned To

Authored By

	Andrew
	Sep 30 2022, 4:10 PM

Description

In response to the RAID alert T319025 I adjusted the HW RAID and tried to reimage this host. It turns out that it won't PXE boot, though, so I can't reimage.

Booting from BRCM MBA Slot AF00 v214.0.241.0

Broadcom UNDI PXE-2.1 v214.0.241.0
Copyright (C) 2000-2019 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: B0 26 28 29 6A E0  GUID: 4C4C4544-005A-5910-8059-C4C04F515032
PXE-E51: No DHCP or proxyDHCP offers were received.

PXE-M0F: Exiting Broadcom PXE ROM.

Papaul suggests that this might be resolved by upgrading the NIC firmware to version 21.85.

Details

	Subject	Repo	Branch	Lines +/-
	netboot cloudvirts: only preserve /srv on cloudvirt1028	operations/puppet	production	+3 -1

Related Objects

Status	Assigned	Task
Resolved	Jclark-ctr	T319001 Degraded RAID on cloudvirt1023
Resolved	Jclark-ctr	T319042 PXE boot failure on cloudvirt1023
Resolved	dcaro	T319043 NeutronAgentDown openstack.eqiad1.wikimediacloud.org:12345 A Neutron agent is down, VMs will have connectivity issues

Event Timeline

Andrew created this task.Sep 30 2022, 4:10 PM

• nskaggs mentioned this in T319029: NodeDown.Sep 30 2022, 4:13 PM

dcaro added a subtask: T319043: NeutronAgentDown openstack.eqiad1.wikimediacloud.org:12345 A Neutron agent is down, VMs will have connectivity issues.Oct 3 2022, 8:47 AM

Andrew mentioned this in T319001: Degraded RAID on cloudvirt1023.Oct 3 2022, 1:58 PM

@Andrew updated firmware to 21.85

Same behavior as before:

PXE-E51: No DHCP or proxyDHCP offers were received.

@Andrew if the server is not in production, can I take a quick look at it?

@ayounsi @cmooney looks like we are having a situation similar to https://phabricator.wikimedia.org/T303296. The server racked in B7 is sending requests to the DHCP server but not getting a reply back. Can you please check this?

Thanks.

@Papaul ping me when you're around and I can walk you through it. TLDR is:
cloudsw1-c8-eqiad# deactivate vlans cloud-hosts1-eqiad forwarding-options dhcp-security option-82
In parallel I'll see if we can prioritize T304677.

dcaro closed subtask T319043: NeutronAgentDown openstack.eqiad1.wikimediacloud.org:12345 A Neutron agent is down, VMs will have connectivity issues as Resolved.Oct 17 2022, 2:41 PM

@Papaul were you able to make any progress with @ayounsi?

@Jclark-ctr @ayounsi is still looking into how he can prioritize T304677: Possible DHCP improvments

I'm not too worried about this particular host, but does it reflect an upcoming issue with all other cloudvirts, or at least all other cloudvirts in that rack?

Possibly all cloudvirts in that rack. I think you guys were in the process of moving those nodes into dedicated cloud racks; is that still doable?

• Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.Oct 27 2022, 2:51 PM

I would not advise moving any of the cloudvirts other than 1023, since they're all likely to be decom'd next year (if not sooner) regardless. We /can/ move 1023 but my preference would be to have it just work in place until its upcoming replacement. @ayounsi can you advise about if/when it will be possible to rebuild that server, and lift the curse from the other hosts in that rack?

Cloudvirts in that rack are:

-cloudvirt1017
-cloudvirt1020
-cloudvirt1022
-cloudvirt1023

There are four other wmcs servers in the rack:

-clouddumps2001
-clouddb1016
-cloudcephosd1001
-cloudcephmon1001

The DHCP requests were making it to cloudsw1-c8 but no further. cloudsw1-c8 was not creating bindings either (so it was not processing them).

I enabled traceoptions for dhcpd:

[edit system]
+   processes {
+       dhcp-service {
+           traceoptions {
+               file dhcpd.log size 10m files 5;
+               level all;
+               flag all;
+           }
+       }
+   }

Which showed

Nov 9 19:04:05.910515 [MSTR][NOTE] [default:default][RLY][INET][irb.1118] jdhcpd_add_interface_or_option82: Option-82 found in packet from intf irb.1118, trust-option-82 override not configured - dropping

Of course that doesn't show up in the normal logs...
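For anyone hitting this again, a minimal Python sketch of scanning a saved copy of the trace file for these drops. The regular expression is modeled only on the single line quoted above, so other jdhcpd messages may well be formatted differently; the function name `find_option82_drops` is hypothetical.

```python
import re

# Pattern built from the one trace line quoted in this task (assumption:
# other drops are logged with the same wording).
DROP_RE = re.compile(
    r"Option-82 found in packet from intf (?P<intf>[^,\s]+), "
    r"trust-option-82 override not configured - dropping"
)


def find_option82_drops(lines):
    """Return the interface name from every line reporting an Option-82 drop."""
    drops = []
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            drops.append(match.group("intf"))
    return drops


sample = (
    "Nov 9 19:04:05.910515 [MSTR][NOTE] [default:default][RLY][INET][irb.1118] "
    "jdhcpd_add_interface_or_option82: Option-82 found in packet from intf "
    "irb.1118, trust-option-82 override not configured - dropping"
)
print(find_option82_drops([sample]))
```

Passing the lines of the trace file (dhcpd.log here) instead of `sample` would list the interfaces on which relayed packets are being discarded.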

From there I enabled trust-option-82

And DHCP worked.

The reason is that the cloud-hosts vlan used to terminate directly on the core routers, whereas it now terminates on cloudsw1-c8/d5 (c8 being the VRRP master). By default, the new DHCP relay daemon discards DHCP packets that arrive from other switches (here row B, but it would be the same with e4/f4) with option 82 already set.
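The drop decision described here can be modeled roughly in Python. This is a sketch of the behavior as explained in this task, not the relay daemon's actual logic: `has_option` and `relay_accepts` are hypothetical names, and the only protocol facts assumed are that DHCP options are encoded as code/length/value triples, with code 0 as padding, 255 as the end marker, and 82 as Relay Agent Information.

```python
RELAY_AGENT_INFO = 82  # DHCP option 82 (Relay Agent Information)
PAD, END = 0, 255      # single-byte options with no length field


def has_option(options: bytes, code: int) -> bool:
    """Walk the code/length/value encoded options and report whether `code` occurs."""
    i = 0
    while i < len(options):
        current = options[i]
        if current == END:
            break
        if current == PAD:
            i += 1
            continue
        if current == code:
            return True
        if i + 1 >= len(options):
            break  # truncated option: stop rather than read past the end
        i += 2 + options[i + 1]  # skip code byte, length byte, and value
    return False


def relay_accepts(options: bytes, trust_option_82: bool) -> bool:
    """Model: a packet already carrying option 82 is dropped unless it is trusted."""
    return trust_option_82 or not has_option(options, RELAY_AGENT_INFO)


# Option 53 (message type, DISCOVER) followed by an option 82 added upstream.
tagged = bytes([53, 1, 1, 82, 4, 1, 2, 0x61, 0x62, 255])
plain = bytes([53, 1, 1, 255])

print(relay_accepts(tagged, trust_option_82=False))
print(relay_accepts(tagged, trust_option_82=True))
print(relay_accepts(plain, trust_option_82=False))
```

In this model the tagged request from another row is only accepted once the trust flag is set, which matches what was observed after enabling trust-option-82.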

So possibly using option 97 (T304677) would have fixed the issue too.

I manually pushed the change, and will add it to Homer tomorrow.

Change 855057 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] netboot cloudvirts: only preserve /srv on cloudvirt1028

https://gerrit.wikimedia.org/r/855057

gerritbot added a project: Patch-For-Review.Nov 9 2022, 7:53 PM

Change 855057 merged by Andrew Bogott:

[operations/puppet@production] netboot cloudvirts: only preserve /srv on cloudvirt1028

https://gerrit.wikimedia.org/r/855057

Maintenance_bot removed a project: Patch-For-Review.Nov 9 2022, 8:30 PM

Andrew closed this task as Resolved.Nov 9 2022, 9:14 PM