
Reimage cookbook on new eqiad hosts stuck at PXE booting
Closed, Resolved · Public · BUG REPORT

Description

During the reimage of new cp hosts in eqiad (see T342159 and T349244) we noticed an odd behavior: on the first 3 hosts (cp110[0-2]) the reimage cookbook (e.g. sudo -i cookbook sre.hosts.reimage --os bullseye --new cp1100 -t T349244) waits until timeout after the first reboot:

[snip]
2023-10-31 16:32:04,251 [INFO] dhcp config test passed!
2023-10-31 16:32:06,333 [INFO] reloaded isc-dhcp-server
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Released lock for key /spicerack/locks/modules/spicerack.dhcp.DHCP:eqiad: {'concurrency': 1, 'created': '2023-10-31 16:32:02.959747', 'owner': 'fabfur@cumin1001 [94491]', 'ttl': 120}
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis bootparam set bootflag force_pxe options=reset
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Forced PXE for next reboot
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis power status
Running IPMI command: ipmitool -I lanplus -H cp1104.mgmt.eqiad.wmnet -U root -E chassis power cycle
Host rebooted via IPMI
[1/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cp1104.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[2/240, retrying in 10.00s] Attempt to run 'spicerack.remote.RemoteHosts.wait_reboot_since' raised: Reboot for cp1104.eqiad.wmnet not found yet, keep polling for it: unable to get uptime
Caused by: Cumin execution failed (exit_code=2)
[snip]

The host's console meanwhile is "stuck" at

Booting from BRCM MBA Slot 4B00 v218.0.219.1

Broadcom UNDI PXE-2.1 v218.0.219.1
Copyright (C) 2000-2020 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

If the cookbook is manually stopped and re-launched the installation seems to work fine until the end.
This could indicate an issue that could be solved by a "manual" reboot after this first step.

We're however experiencing a different behavior on cp1103-1104 (the latest hosts we're reinstalling just now). It seems that on those hosts, even with the "double reboot", the cookbook doesn't proceed with the installation as it did on the previous ones.

If you need any help from our side to debug this behavior, don't hesitate to contact us!

Event Timeline


Thanks for the comment and debugging @Volans! Adding some more points from the Traffic team:

  • on some of the hosts that were failing and took repeated attempts, we verified the following:
    • the correct firmware version on the NICs
    • the correct PXE boot order (PXE on the first integrated NIC, no PXE boot enabled on the embedded NIC)
    • that the links are up
    • on some hosts, the BIOS settings that might generally be related to any of this

One other point is that we observed this not only on the cp hosts in eqiad but also in ulsfo, specifically on cp4052, which took 4-5 attempts to reimage.

Adding @cmooney / @ayounsi to this task so they can check the switch side -- thanks folks!

Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye completed:

  • cp1108 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311092013_volans_1816800_cp1108.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Can you ping me when you're around so we can have a look?
afaik nothing changed on the switch side.

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors:

  • cp1109 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors:

  • cp1109 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors:

  • cp1109 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye completed:

  • cp1109 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311101035_fabfur_2187475_cp1109.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I hit this problem when re-imaging ms-fe* nodes (for T317616). Most of them PXE booted fine, but two didn't - ms-fe2014.codfw.wmnet needed one further reboot (which I did from the HTML console) before it would PXE, and ms-fe1013.eqiad.wmnet needed two further reboots - i.e. it wedged twice at the same point before finally PXEing properly.

I can confirm an essentially identical failure mode; the console said:

Booting from BRCM MBA Slot 8A00 v218.0.219.1

Broadcom UNDI PXE-2.1 v218.0.219.1
Copyright (C) 2000-2020 Broadcom Limited
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

and that was that - seemingly no attempt made to DHCP/PXE at all.

Can you try this? T348119#9224341

Fun fact, I found that task on Google after starting to look for that specific Broadcom PXE string.

@ayounsi: Just as another data point, I did check this (twice for many cp hosts) and all had the correct boot order. Someone should confirm this again but we did for the cp ones.

I can try if/when I get another one that fails (I'd be surprised if that were the solution, given "enough reboots" seems to have worked with the troublesome nodes I've had so far...)

Mentioned in SAL (#wikimedia-operations) [2023-12-06T15:23:42Z] <sukhe> depool cp4037 for reimage testing: T350179

@Volans @ssingh asked me to take a look at the issue to see what I can find.
Working on cp4037.
Test 1
When I start the reimage cookbook, the cookbook reboots the server; after the reboot the server gets stuck with nothing on the console. tcpdump on the install server shows nothing from cp4037 (test done 2 times, with the issue both times).
Test 2
Start the dhcp cookbook, manually reboot the server and press F12: the server PXE boots without an issue and you can see the request on the install server when you run tcpdump (test done 3 times, no issue).

@Papaul another test we could do is use the dhcp cookbook and then try to reboot into PXE using remote IPMI like the cookbook does.
The cookbook does this:

ipmitool -I lanplus -H HOST.mgmt.DC.wmnet -U root -E chassis bootparam set bootflag force_pxe options=reset

Then it gets:

ipmitool -I lanplus -H HOST.mgmt.DC.wmnet -U root -E chassis bootparam get 5

and checks that the line Boot Device Selector has Force PXE as value.

Then it does

ipmitool -I lanplus -H HOST.mgmt.DC.wmnet -U root -E chassis power status
ipmitool -I lanplus -H HOST.mgmt.DC.wmnet -U root -E chassis power cycle  # or on if the power is off

and later to be sure to reset the Force PXE:

ipmitool -I lanplus -H ms-be1082.mgmt.eqiad.wmnet -U root -E chassis bootparam set bootflag none options=reset
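
For anyone who wants to reproduce the same sequence by hand outside the cookbook, here is a minimal sketch wrapping those exact ipmitool calls (assuming ipmitool is installed, the management password is supplied the way -E expects, and using cp1104 purely as an example target):

#!/usr/bin/env python3
"""Minimal sketch reproducing the cookbook's force-PXE sequence by hand."""
import subprocess
import sys

MGMT = "cp1104.mgmt.eqiad.wmnet"  # example target, adjust as needed


def ipmi(*args: str) -> str:
    """Run one ipmitool command against the management interface and return stdout."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", MGMT, "-U", "root", "-E", *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# 1. Force PXE for the next boot, same as the cookbook does.
ipmi("chassis", "bootparam", "set", "bootflag", "force_pxe", "options=reset")

# 2. Read back boot parameter 5 and check the Boot Device Selector value.
bootparam = ipmi("chassis", "bootparam", "get", "5")
if "Force PXE" not in bootparam:
    sys.exit("Boot Device Selector is not set to Force PXE:\n" + bootparam)

# 3. Power cycle (or power on if the host is currently off).
status = ipmi("chassis", "power", "status")
ipmi("chassis", "power", "on" if "off" in status.lower() else "cycle")
print("Host rebooted via IPMI, watch the console for the PXE/DHCP attempt")

This is only meant to mirror the cookbook's IPMI calls for manual testing, not to replace them.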

@Volans after I enter the mgmt password the only line I get is

Set Boot Device to force_pxe

@Volans I did the test 4 times. The first 2 times the server did PXE boot, but the last 2 times it didn't.

Interesting... I guess we could try to do the same test with redfish API instead and see if that works all the time and consider converting the ipmi calls to redfish...

After discussions in yesterday's office hours it seems that remote IPMI is working correctly as the host does reboot and does try to boot via pxe and gets stuck, hence the Redfish test is unnecessary.

After discussions in yesterday's office hours it seems that remote IPMI is working correctly as the host does reboot and does try to boot via pxe and gets stuck, hence the Redfish test is unnecessary.

The info we have suggests it enters the PXE process, but does not attempt the first step, DHCP. I think we need to get to the bottom of why that is. We're also told that manually requesting PXE boot works. So there is some difference between "PXE requested by IPMI" and "PXE requested manually". So maybe "PXE requested by Redfish" would also be different. Clutching at straws, but it might still be worth trying.
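
If we do want to try the Redfish path, a rough sketch of what a one-time Redfish-requested PXE boot could look like is below. It uses the standard Redfish Boot override properties; the System.Embedded.1 path and the credential handling are assumptions that may need adjusting for our iDRACs.

#!/usr/bin/env python3
"""Sketch: force a one-time PXE boot via Redfish instead of IPMI."""
import requests

MGMT = "cp1104.mgmt.eqiad.wmnet"                      # example target
SYSTEM = f"https://{MGMT}/redfish/v1/Systems/System.Embedded.1"

session = requests.Session()
session.auth = ("root", "********")                   # placeholder credentials
session.verify = False                                # mgmt interfaces use self-signed certs

# Ask for a one-time PXE boot on the next reset.
resp = session.patch(
    SYSTEM,
    json={"Boot": {"BootSourceOverrideTarget": "Pxe",
                   "BootSourceOverrideEnabled": "Once"}},
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()

# Reset the system through the standard ComputerSystem.Reset action.
resp = session.post(
    f"{SYSTEM}/Actions/ComputerSystem.Reset",
    json={"ResetType": "ForceRestart"},
)
resp.raise_for_status()
print("Redfish-requested PXE boot issued; compare console behaviour with the IPMI path")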

Let's do a deep dive when it next happens and see if we can spot anything, I agree redfish should be down on the priority list of things to try.

The plural of anecdote is not data, but: I had one system (ms-fe2013) that did this when rebooted by the reimage cookbook; I did a cold power cycle and hit F12 for PXE boot and it again didn't DHCP; I then set "PXE" from the Boot menu (via the HTML iDRAC interface) and did a warm boot, and then it PXEd OK.

The console supports your "doesn't attempt to DHCP" theory (in that you don't get the usual bit of text that pops up as it starts to DHCP).

Hi folks: Just wondering if there is a path forward on this task as we hit the same issue last week while reimaging cp4052. No PXE boot during the first reimage attempt but it worked the second time.

I guess my primary concern is that we have a series of hardware refreshes upcoming but most importantly, we will need this to work without any issues when setting up magru in April. Running the cookbook 2-3 times slows down the process quite significantly and is of course a problem in general.

That being said, I actually want to be helpful here and not just offload this work to other teams. What can we do to move forward with a resolution on this?

One possible path forward is to work with Dell's support to solve T304483: PXE boot NIC firmware regression

Another one is to narrow down the characteristics causing this issue (server models, firmware versions, etc).

Lastly, maybe we could explore relying less on PXE; for example, is it possible to pass the host and TFTP server IP through Redfish?

Traffic has been reimaging hosts in esams (we have done three so far for T360430) and we observed that we didn't have this issue on any of those hosts. Relatedly, we reimaged cp4052 last week where we did hit the issue again.

To figure out why we didn't have this issue in esams, I spent some time trying to see if we could isolate some differentiating factor between these hosts (all of them with the same NIC, Broadcom BCM57414). My first thought was to compare the NIC firmware versions, but they seem to be the same (21.85.21.92 across all affected hosts), so there is no difference there.

As a reminder:

  • three cp hosts in esams have worked so far without any issue
  • no hosts in the eqiad refresh (T349244) worked without running the cookbook 2-3 times before they would PXE boot
  • in ulsfo, we have had persistent issues with cp4052 at least, so that's one confirmation there

That leads us to the iDRAC firmware version, which I think is the differentiating factor here. To confirm that, I ran some PQL queries, trying to see if something varies between the hosts in esams, eqiad, and ulsfo, also including ms-fe1013, pointed out by @MatthewVernon above, which seemed to have the same issue.

>>> from pypuppetdb import connect
>>> db = connect()
>>> pql = 'inventory[facts.firmware_idrac,certname] { nodes { certname ~ "cp30.*.esams.wmnet" } }'
>>> results = db.pql(pql)
>>> for result in results: print(result)
... 
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3078.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3071.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3077.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3069.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3070.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3066.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3075.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3079.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3074.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3080.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3067.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3068.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3081.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3072.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3076.esams.wmnet'}
{'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3073.esams.wmnet'}

As you can see, all hosts in esams have the same iDRAC firmware version, 6.10.30.20.

Running the same for the cp hosts in eqiad where not a single host worked on the first attempt gives us:

>>> pql = 'inventory[facts.firmware_idrac,certname] { nodes { certname ~ "cp110.*.eqiad.wmnet" } }'
... 
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1107.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1104.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1103.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1105.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1109.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1101.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1106.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1102.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1100.eqiad.wmnet'}
{'facts.firmware_idrac': '7.00.00.00', 'certname': 'cp1108.eqiad.wmnet'}

And then running it for cp4052.ulsfo.wmnet and ms-fe1013.eqiad.wmnet where it failed as well:

>>> pql = 'inventory[facts.firmware_idrac,certname] { nodes { certname = "cp4052.ulsfo.wmnet" or certname = "ms-fe1013.eqiad.wmnet" } }'
...
{'facts.firmware_idrac': '6.00.02.00', 'certname': 'cp4052.ulsfo.wmnet'}
{'facts.firmware_idrac': '6.10.00.00', 'certname': 'ms-fe1013.eqiad.wmnet'}

Based on the above, I suspect that iDRAC version 7.x is the one causing issues; maybe if we try 6.10.30.20 and just stick to that, it might be the solution to this issue, taking into account the versions on cp4052 and ms-fe1013.

The next step to try here is to upgrade cp4052's iDRAC to 6.10.30.20, reimage, and see if it works in the first attempt. If it does, that's a good confirmation of the resolution of this issue.

Update: I ran the firmware-upgrade cookbook on cp4052 and updated its firmware to 6.10.30.20, did a racreset to be absolutely sure, and it still failed for me on the first attempt. It seems my joy that this issue had been resolved was short-lived, so we move on and try to debug it more :)

Any other opinions/thoughts on how we can try and fix this and where? I am very happy to do the legwork but kind of lost here on what to check next. The worry continues to be that magru is coming close and we should not be running the cookbook multiple times to get a reimage done. esams is working fine for us without any issues but ulsfo continues to be an issue and that uncertainty extends to magru as well.

Any other opinions/thoughts on how we can try and fix this and where? I am very happy to do the legwork but kind of lost here on what to check next.

Yeah it's very odd alright. That pattern of firmware versions looked so promising - nice sleuthing all the same!

esams is working fine for us without any issues but ulsfo continues to be an issue and that uncertainty extends to magru as well.

100% clutching at straws here, but I wonder if the type of switch is having any effect? esams and drmrs have newer QFX5120 switches (I guess the small good news here is that magru will have the same setup as these locations). ulsfo and (most of) eqiad have older QFX5100 devices. I fail to see why that would make such a difference, but we're at the clutching-at-straws level, so maybe?

Eqiad rows E and F, and codfw rows A and B, have the newer model switches too. So if we have the same problem in those places we can rule out it being anything to do with the switch.

Lastly, maybe we could explore relying less on PXE; for example, is it possible to pass the host and TFTP server IP through Redfish?

I did a quick search online and there is a UEFI 'HTTP Boot' framework (also see here), but it also relies on DHCP. Might be worth investigating, but seeing as the issue appears to be the system failing to initiate the DHCP process during PXE, there's no guarantee the same thing wouldn't happen when trying to use this instead.

Lastly, maybe we could explore relying less on PXE; for example, is it possible to pass the host and TFTP server IP through Redfish?

Actually there could be a way to mount a 'virtual CD/USB' drive over HTTP. That would leverage the iDRAC's management networking, and then boot the system. This thread discusses it but the Dell staff seem to think it's probably not achievable to set up via Redfish.

We could also consider passing this over to Dell support?

Any other opinions/thoughts on how we can try and fix this and where? I am very happy to do the legwork but kind of lost here on what to check next.

Yeah it's very odd alright. That pattern of firmware versions looked so promising - nice sleuthing all the same!

esams is working fine for us without any issues but ulsfo continues to be an issue and that uncertainty extends to magru as well.

100% clutching at straws here, but I wonder if the type of switch is having any effect? esams and drmrs have newer QFX5120 switches (I guess the small good news here is that magru will have the same setup as these locations). ulsfo and (most of) eqiad have older QFX5100 devices. I fail to see why that would make such a difference, but we're at the clutching-at-straws level, so maybe?

Eqiad rows E and F, and codfw rows A and B, have the newer model switches too. So if we have the same problem in those places we can rule out it being anything to do with the switch.

All cp hosts in eqiad are in rows A, B, C, and D, so that does look worth trying out I guess! Can you remind me when and if this transition was made in the recent past?

codfw rows A and B have the cp hosts, so that's certainly worth trying again, but codfw also has older hardware at this point, so I am not sure whether the result would be tainted by that; still, it's worth a shot. I will also try a host in drmrs, given it also has the new switch. I think this is certainly worth trying!

We could also consider passing this over to Dell support?

My only concern is that until we have unified a few things (such as the firmwares above) and made sure we understand the differentiating factors that make this work in some sites, such as esams, but not in others, such as eqiad, I am not even sure what we will report to Dell. But if people think that's worth it, I am happy to write up something.

All cp hosts in eqiad are in rows A, B, C, and D, so that does look worth trying out I guess! Can you remind me when and if this transition was made in the recent past?

Rows E and F were only added in 2022 and had the newer model switches from day one.

Rows A and B in codfw were upgraded last quarter from old to new model.

codfw rows A and B have the cp hosts, so that's certainly worth trying again, but codfw also has older hardware at this point, so I am not sure whether the result would be tainted by that; still, it's worth a shot. I will also try a host in drmrs, given it also has the new switch. I think this is certainly worth trying!

Worth trying.

Also, a really unlikely possibility is some odd thing with the DAC cables or a batch we got? Maybe some odd race condition where it takes like 10ms longer to electrically activate the link or something? I think it's a 0.0001% chance, and I will eat my hat if it turned out to be that, tbh.

We could also consider passing this over to Dell support?

My only concern is that until we have unified a few things (such as the firmwares above) and made sure we understand the differentiating factors that make this work in some sites, such as esams, but not in others, such as eqiad, I am not even sure what we will report to Dell. But if people think that's worth it, I am happy to write up something.

I think it's probably worth raising. As I understand it, the BIOS fails to execute the DHCP part of the PXE boot process when this happens? And the network link in this case shows 'up' (unlike the issue where the link doesn't activate within the Debian installer on a certain NIC firmware).

If you like we can do one and monitor the events on the switch-side, as well as virtual VGA and virtual console, and try to get some data for a case?

Continuing to try to isolate the possible causes of this, I noticed when comparing the facter output between the different hosts (ones that work vs. the ones that don't) that the BIOS versions also seem to vary:

eqiad BIOS version
{'facts.bios_version': '1.10.2', 'certname': 'cp1110.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1114.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1103.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1107.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1113.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1111.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1115.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1104.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1105.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1109.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1112.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1101.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1106.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1102.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1100.eqiad.wmnet'}
{'facts.bios_version': '1.10.2', 'certname': 'cp1108.eqiad.wmnet'}

esams BIOS version
{'facts.bios_version': '1.9.2', 'certname': 'cp3078.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3071.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3077.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3069.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3070.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3066.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3075.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3079.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3074.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3080.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3067.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3068.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3081.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3072.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3076.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'certname': 'cp3073.esams.wmnet'}

ulsfo BIOS version
{'facts.bios_version': '1.6.5', 'certname': 'cp4037.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4047.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4050.ulsfo.wmnet'}
{'facts.bios_version': '1.7.5', 'certname': 'cp4052.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4048.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4049.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4039.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4041.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4044.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4038.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4045.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4040.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4051.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4043.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4042.ulsfo.wmnet'}
{'facts.bios_version': '1.6.5', 'certname': 'cp4046.ulsfo.wmnet'}

This is definitely not the first time one firmware version has made or broken things for us, so I will try to upgrade the BIOS on cp4052 and report back.
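
For reference, both facts can be pulled in a single PQL query with the same pypuppetdb approach shown earlier; a minimal sketch (the host regex is just an example):

# Sketch: compare BIOS and iDRAC firmware versions across sites in one query,
# using the same pypuppetdb approach as above.
from pypuppetdb import connect

db = connect()
pql = ('inventory[facts.bios_version,facts.firmware_idrac,certname] '
       '{ nodes { certname ~ "cp.*.wmnet" } }')
for result in db.pql(pql):
    print(result['certname'],
          result['facts.bios_version'],
          result['facts.firmware_idrac'])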

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp3069.esams.wmnet with OS bullseye

cp3069 also PXE booted successfully on the first attempt, which makes it the fourth host in esams without any issue. I think maybe focusing on why it works in esams but not in eqiad/ulsfo might be the way forward.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp3069.esams.wmnet with OS bullseye completed:

  • cp3069 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404081451_sukhe_1385322_cp3069.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

cp4052 BIOS version 1.9.2 also didn't work; no PXE boot. I am going to focus on the install server now and see if we can pick up something there.

Mentioned in SAL (#wikimedia-operations) [2024-04-09T16:16:47Z] <sukhe> depool cp1113 for PXE boot issue related testing T350179

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS bullseye executed with errors:

  • cp1113 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp1113.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1113.eqiad.wmnet with OS bullseye completed:

  • cp1113 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404091656_sukhe_1628989_cp1113.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Some updates with the TL;DR that it is still failing for hosts in eqiad and ulsfo:

I was talking with @ayounsi and he suggested, in the context of T304483, upgrading the NIC firmware to the very latest available version (22.x) instead of sticking to 21.x.

On cp4052, after doing so, the host reimaged on the first attempt. The big caveat here is that, as we have seen in the past, an earlier successful reimage (which is what happened with cp4052 in the morning) means you will most likely get a successful one later as well, for some period of time. (Don't ask me how long; it is all random and we haven't been able to pinpoint it.) I will try cp4052 again tomorrow.

I then repeated the same for cp1113, a host that failed multiple times in T349244, upgrading the firmware to 22.71.3 (similar to cp4052 above) and it failed again the first time and worked the next.

If there is some sort of caching going on here, where it fails the first time and then caches something for the next, the question still remains what causes this caching and where.

One more thing I will try is to successively go through all the NIC firmwares in 22.x instead of picking the highest supported version, but we can't do multiple reimages in a day as it clears the caches, and while one host being down is not a big deal, it's not ideal. So I will report back tomorrow.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp4052.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp4052.ulsfo.wmnet with OS bullseye completed:

  • cp4052 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101333_sukhe_1811204_cp4052.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-10T17:04:49Z] <sukhe> depool cp1115 for firmware downgrade for PXE boot testing: T350179

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bullseye executed with errors:

  • cp1115 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp1115.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1115.eqiad.wmnet with OS bullseye completed:

  • cp1115 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404101808_sukhe_1849805_cp1115.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

For cp1115, which we tried today, I downgraded the BIOS, NIC and iDRAC firmwares to match what we have in esams, where 6/6 hosts have been reimaged without any issue (PXE-booting the first time).

{'facts.bios_version': '1.9.2', 'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp1115.eqiad.wmnet'}

esams:

{'facts.bios_version': '1.9.2', 'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3066.esams.wmnet'}
{'facts.bios_version': '1.9.2', 'facts.firmware_idrac': '6.10.30.20', 'certname': 'cp3067.esams.wmnet'}

facts.net_driver also points to the same NIC firmware:

'facts.net_driver': {'idrac': {'speed': -1, 'driver': 'cdc_ether', 'duplex': 'unknown', 'firmware_version': 'CDC Ethernet Device'}, 'eno8303': {'speed': -1, 'driver': 'tg3', 'duplex': 'unknown', 'firmware_version': 'FFV22.31.6 bc 5720-v1.39'}, 'eno8403': {'speed': -1, 'driver': 'tg3', 'duplex': 'unknown', 'firmware_version': 'FFV22.31.6 bc 5720-v1.39'}, 'eno12399np0': {'speed': 25000, 'driver': 'bnxt_en', 'duplex': 'full', 'firmware_version': '218.0.219.13/pkg 21.85.21.92'}, 'eno12409np1': {'speed': -1, 'driver': 'bnxt_en', 'duplex': 'unknown', 'firmware_version': '218.0.219.13/pkg 21.85.21.92'}},

So we have firmware version parity on the same hardware, and I think we can safely exclude this.
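
As an aside, the per-interface NIC firmware strings can be compared across hosts straight from that fact, with the same pypuppetdb approach used earlier; a sketch (the host regex is only an example):

# Sketch: pull facts.net_driver (structure as shown above) and extract the
# bnxt_en firmware string per host to check NIC firmware parity.
from pypuppetdb import connect

db = connect()
pql = ('inventory[facts.net_driver,certname] '
       '{ nodes { certname ~ "cp(11|30).*.wmnet" } }')
for result in db.pql(pql):
    bnxt = {name: d['firmware_version']
            for name, d in result['facts.net_driver'].items()
            if d.get('driver') == 'bnxt_en'}
    print(result['certname'], bnxt)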

Traffic reimaged 8 text nodes in esams and all of them PXE-booted the first time, without any issues. I think looking at why things worked flawlessly in esams but not in other sites such as eqiad and ulsfo is probably how we should try to get to the bottom of this ticket!

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp2042.codfw.wmnet with OS bullseye

@Papaul suggested trying a host in codfw, and cp2042 PXE booted successfully. In one of the messages above, @cmooney suggested looking at whether the new QFX5120 switch in esams/drmrs could be why it works in esams. All cp hosts in eqiad are in rows A-D and not in E or F (with the new switch), so that's not an option.

I will reimage a drmrs host as well.

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp2042.codfw.wmnet with OS bullseye completed:

  • cp2042 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404111406_sukhe_2053536_cp2042.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-12T14:08:10Z] <sukhe> depool cp1115 for PXE boot issue testing: T350179

@ssingh one thing I found different between the server NIC and the switch interface is the vendor. In eqiad, I checked 3 nodes (cp1115, cp1113 and cp1100): all have W2W as the vendor under the transceiver inventory, while in esams the vendor is FS. Since @ayounsi mentioned this morning that the request was not reaching the switch, I focused on the media type used in esams and in eqiad: it looks like both connections are Direct Attach Copper, but from different vendors.

What I would like to test next:

  • use a DAC from FS
  • just use a transceiver and connect a fiber

@Jclark-ctr @VRiley-WMF if you are next on site, can you please find a DAC cable from FS.com and replace the cable on cp1115; if you have no DAC cable from FS.com, can you please use a 10G transceiver with a fiber.

@ssingh I also checked cp2042: we are using an FS.com DAC there.
FYI W2W = Wave2Wave

@ssingh one thing I found different between the server NIC and the switch interface is the vendor. In eqiad, I checked 3 nodes (cp1115, cp1113 and cp1100): all have W2W as the vendor under the transceiver inventory, while in esams the vendor is FS. Since @ayounsi mentioned this morning that the request was not reaching the switch, I focused on the media type used in esams and in eqiad: it looks like both connections are Direct Attach Copper, but from different vendors.

What I would like to test next:

  • use a DAC from FS
  • just use a transceiver and connect a fiber

@Jclark-ctr @VRiley-WMF if you are next on site, can you please find a DAC cable from FS.com and replace the cable on cp1115; if you have no DAC cable from FS.com, can you please use a 10G transceiver with a fiber.

Thanks @Papaul, that's quite some investigation! I have no idea though how it can be affecting this, given that it works the next time, but I do think this is worth pursuing. How do I find out the type of DAC used? I wanted to check a few other hosts affected by this and it doesn't seem that Netbox has this information.

@cmooney above said:

Also, a really unlikely possibility is some odd thing with the DAC cables or a batch we got? Maybe some odd race condition where it takes like 10ms longer to electrically activate the link or something? I think it's a 0.0001% chance, and I will eat my hat if it turned out to be that, tbh.

We should get a hat for him if this turns out to be true :)

@Jclark-ctr, @VRiley-WMF: The host is depooled so please feel free to proceed when you can. Thanks!

@ssingh unfortunately using the FS DAC didn't fix the issue, so we are back to square one. I am still working on it.

@ssingh After 2 days working on this issue, I finally got to the bottom of the problem. After many reboots on cp1115, I checked the model of the NIC (Broadcom 57414) and decided to test every single firmware version available on Dell's website.
With all the 22.xx versions, the server PXE boots but gives the error "Failed to load ldlinux.c32".
With the 21.8x versions, the server sometimes boots and other times gets stuck.
The last version, 21.60.22.11, which is not listed on Dell's product-support website (https://www.dell.com/support/home/en-us/product-support/servicetag/0-bTkxNWhsYWF2OFdQRm04TmF3QjhwZz090/drivers), is the only working version. I installed this version and rebooted cp1115 six times, and all six times it did PXE boot without an issue.

@Papaul: Thanks for the update! This looks promising indeed, and to actually close this we should downgrade another host in eqiad and then try it out, because what sometimes happens is that if a given host reimaged successfully once, it continues to reimage successfully for some period of time (we don't know how long, but at least the same day :).

I guess I still don't know why esams with 21.85 continues to boot correctly, but maybe it's the difference in speed? I will just ignore that bit if this works for eqiad.

Mentioned in SAL (#wikimedia-operations) [2024-04-17T14:20:36Z] <sukhe> depool cp1114.eqiad.wmnet for PXE boot testing issues and downgrade NIC firmware: T350179

Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bullseye

@Papaul deserves a lot of love for fixing this persistent issue. The 21.x firmware (specifically, Network_Firmware_YK81Y_WN64_21.60.22.11_03) worked in the first attempt when reimaging cp1114. I think we can consider this closed given we have observed the fix on two hosts now.

Thanks, 1-800-Call-Papaul!

@Papaul deserves a lot of love for fixing this persistent issue. The 21.x firmware (specifically, Network_Firmware_YK81Y_WN64_21.60.22.11_03) worked in the first attempt when reimaging cp1114. I think we can consider this closed given we have observed the fix on two hosts now.

Thanks, 1-800-Call-Papaul!

Fully agreed on the "Papaul is awesome" part!

So if the current outcome is that only an old firmware version no longer available on the Dell website makes this work, let's escalate this to Dell, so that they make sure it's also working in the current firmware releases?

+1 to thanks to Papaul for getting to the bottom of this!

Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp1114.eqiad.wmnet with OS bullseye completed:

  • cp1114 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404171531_sukhe_3219677_cp1114.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I'd like to join the chorus of thanks to Papaul, you resolved a very nasty and long-running issue for us here! Thanks again!

I'd like to join the chorus of thanks to Papaul, you resolved a very nasty and long-running issue for us here! Thanks again!

Second that, nice work!

...let's escalate this to Dell, so that they make sure it's also working in the current firmware releases?

Yeah, this is really an unsatisfactory situation. It is different from the issue we have seen with the Broadcom BCM57412 10G NIC (which we've had to use firmware 21.85.21.92 with). With that one, the reimage issue only occurs once the system has booted into the Debian installer environment, and thus the kernel we have loaded and the driver it is using are part of the equation.

This problem occurs at the initial PXE boot stage initiated by the BIOS/system itself. There is no software/firmware running other than what is on board the system itself, so the failure is entirely within Dell's area of responsibility.

Are we planning to open a separate task to document the pattern we have observed and follow up with Dell on this? We should maybe prep a test system with this NIC. The support process will probably involve them asking us to try various things (perhaps we'd get lucky and they'd acknowledge a known bug immediately, but that's doubtful), so it would be good to have a system ready to test/reproduce the problem on.

Lastly, I see we have this same NIC in esams and drmrs, and we didn't seem to have the same issue there. Is the feeling that this issue only happens with the BCM57414 25G card with an SFP+ (10G) module? Did we do any tests on the same hosts, with a known "bad" firmware version and an SFP28 (25G) DAC in place? If not, it's worth a shot; it would help for the support ticket if we had a better sense of what the difference is here, versus the places this card has worked reliably on 21.85.21.92.