
eqsin hosts are not rebooting when running sre.hosts.reimage cookbook
Closed, Resolved · Public

Description

Attempting to image cp5017 and cp5025 (eqsin) randomly ends with an indefinite wait after the cookbook issues a reboot. Re-running the cookbook will occasionally yield success. According to @ssingh, other DCs do not appear to be affected.

The final lines before the cookbook waits indefinitely are:

Running IPMI command: ipmitool -I lanplus -H cp5025.mgmt.eqsin.wmnet -U root -E chassis power cycle                                                                                                                  
Host rebooted via IPMI

No useful output can be found in the console when the server is stuck.

Is there a connectivity issue preventing a successful reconnection after reboot?

Event Timeline

This task was brought to my attention by @ssingh today because cp4037 did the same.
It was reimaged first around 12:15 and it failed, and then again around 14:07 and it succeeded.

Looking at the DHCP logs I don't see any request coming for the first attempt. The first DHCPDISCOVER on install4001 is at Jan 25 14:11:03.
One possible explanation is that the server didn't actually reboot at all, or didn't reboot into PXE; we've historically seen a few cases like this. But in both cases you should have seen a normal login prompt or a reboot at the console.

The cookbook sets the PXE bit via IPMI (we want to move to Redfish once the fleet has a recent enough version everywhere that we trust to work reliably) and then checks it via the chassis bootparam get 5 IPMI command, which returned:

Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 8004000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

And those are exactly the same values reported by the second attempt that worked fine.
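
For reference, the rough manual equivalent of setting and then checking the PXE bit (example host below; not necessarily the exact invocation the cookbook uses) is:

ipmitool -I lanplus -H cp4037.mgmt.ulsfo.wmnet -U root -E chassis bootdev pxe
ipmitool -I lanplus -H cp4037.mgmt.ulsfo.wmnet -U root -E chassis bootparam get 5

(-E takes the password from the IPMI_PASSWORD environment variable, as in the command shown in the task description.)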

From the HW logs there is a reboot, but it seems to be ~1h before the actual reimage:

--------------------------------------------------------------------------------
SeqNumber       = 396
Message ID      = SYS1005
Category        = Audit
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2023-01-25 11:14:22
Message         = The server power action is initiated because the management controller initiated a power-down operation.
Message Arg   1 = the management controller initiated a power-down operation
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 395
Message ID      = RAC0702
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2023-01-25 11:14:16
Message         = Requested system powercycle.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

A few questions to narrow down the options here, either if you already know the answers or for next time (see the example commands below). While the cookbook was polling:

  • was the host IP pingable?
  • if pingable, did it have an SSH server listening on port 22?
  • was there any pending firmware upgrade or iDRAC setting that required a reboot to be applied?
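
For next time, something along these lines (example hostname; adjust as needed) should answer the first two questions:

ping -c 3 cp4037.ulsfo.wmnet
nc -zv cp4037.ulsfo.wmnet 22

For the third, pending iDRAC jobs should show up in the job queue; from memory (double-check the subcommand) that's racadm jobqueue view from the mgmt interface.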

Thanks for the response @Volans!

This task was brought to my attention by @ssingh today because cp4037 did the same.
It was reimaged first around 12:15 and it failed, and then again around 14:07 and it succeeded.

Looking at the DHCP logs I don't see any request coming for the first attempt. The first DHCPDISCOVER on install4001 is at Jan 25 14:11:03.
One possible explanation is that the server didn't actually reboot at all, or didn't reboot into PXE; we've historically seen a few cases like this. But in both cases you should have seen a normal login prompt or a reboot at the console.

Correct and we didn't see that; nothing except a blank output in the console. (I use "we" in the context that both @BCornwall and I have faced this, independently.)

The cookbook sets the PXE bit via IPMI (we want to move to Redfish once the fleet has a recent enough version everywhere that we trust to work reliably) and then checks it via the chassis bootparam get 5 IPMI command, which returned:

Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 8004000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

And those are exactly the same values reported by the second attempt that worked fine.

From the HW logs there is a reboot, but it seems to be ~1h before the actual reimage:

--------------------------------------------------------------------------------
SeqNumber       = 396
Message ID      = SYS1005
Category        = Audit
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2023-01-25 11:14:22
Message         = The server power action is initiated because the management controller initiated a power-down operation.
Message Arg   1 = the management controller initiated a power-down operation
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 395
Message ID      = RAC0702
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2023-01-25 11:14:16
Message         = Requested system powercycle.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

A few questions to narrow down the options here, either if you already know the answers or for next time. While the cookbook was polling:

  • was the host IP pingable?
  • if pingable, did it have an SSH server listening on port 22?
  • was there any pending firmware upgrade or iDRAC setting that required a reboot to be applied?

I am surprised: so the above output is for cp4037? Because we certainly didn't reboot it and in any case, had that happened (without depooling), we would have got some other alerts. I did reboot cp2031 for the iDRAC/NIC updates, but I am sure we didn't touch cp4037 prior to the reimage.

We will try the above diagnostic steps next time. Thanks for looking into this!

I am surprised: so the above output is for cp4037? Because we certainly didn't reboot it and in any case, had that happened (without depooling), we would have got some other alerts. I did reboot cp2031 for the iDRAC/NIC updates, but I am sure we didn't touch cp4037 prior to the reimage.

Yes, that's from cp4037.mgmt.ulsfo.wmnet racadm... unless the iDRAC clock is off by 1h, but the time of my last login actually seems to be correctly in UTC...

I just went through the logs now:

Timestamp       = 2023-01-25 14:07:50
Message         = The server power action is initiated because the management controller initiated a power-down operation.

This matches the time when the cookbook was started again, as per: T321309#8557391.

The weird part, though, is this earlier entry in the logs:

Timestamp       = 2023-01-25 11:14:22
Message         = The server power action is initiated because the management controller initiated a power-down operation.

This doesn't match the first attempt of the reimage, which was at 12:15 UTC (T321309#8556947), and I also don't see it in the logs.

racadm>>lclog view -r "2023-01-25 12:01:00" -e "2023-01-25 12:20:00"
SeqNumber       = 402
Message ID      = SYS336
Category        = Config
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2023-01-25 12:18:19
Message         = An existing hash value is updated because some system configuration items are changed.
--------------------------------------------------------------------------------

However, an hour earlier, and as per your comment above too:

racadm>>lclog view -r "2023-01-25 11:01:00" -e "2023-01-25 11:20:00"
--------------------------------------------------------------------------------
SeqNumber       = 396
Message ID      = SYS1005
Category        = Audit
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2023-01-25 11:14:22
Message         = The server power action is initiated because the management controller initiated a power-down operation.
Message Arg   1 = the management controller initiated a power-down operation
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------

So if the iDRAC clock is off by an hour, I wonder why the second time matches but not the first? Did I get some time calculation wrong here?
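
If it happens again, one way to rule the clock in or out is to read it straight from the iDRAC and compare it with date -u on a cumin host; from memory (please double-check the exact subcommand) something like:

racadm getractime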

Update: This happened again when imaging cp4038. I was unable to ping the interfaces but was able to connect to the mgmt interface/iDRAC. The console screenshot suggests it was stuck on PXE boot:

Screenshot at 2023-01-26 12-06-04.png (182×592 px, 33 KB)

There was nothing in the Maintenance → Job Queue section

Re-running the cookbook, I watched it get past that screen with no delay:

Screenshot at 2023-01-26 12-39-59.png (357×1 px, 81 KB)

This happens the first time I run the cookbook on any of the newer servers. I've now adapted to the workflow of running the cookbook, waiting for it to get stuck, Ctrl-C-ing the cookbook, then running it again with --new (maybe needing to remove the DHCP file on the install host). It always works the second time.
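
For anyone else hitting this, the second run is roughly the following (flags and target OS from memory and only illustrative; check the cookbook's --help before copying):

sudo cookbook sre.hosts.reimage --os bullseye --new cp5017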

This happens the first time I run the cookbook on any of the newer servers. I've now adapted to the workflow of running the cookbook, waiting for it to get stuck, Ctrl-C-ing the cookbook, then running it again with --new (maybe needing to remove the DHCP file on the install host). It always works the second time.

Thanks @BCornwall! I can confirm Brett's observations, so we have two data points.

Another thing to keep in mind: these new hosts (all cp hosts in eqsin and ulsfo, added in Nov/Dec 2022) were reimaged last month, and while we had "similar" issues with eqsin, we could never pinpoint the exact cause and attributed them to one-offs. ulsfo had some issues but those were mostly due to the new hardware settings -- we didn't observe this "run twice" thing in ulsfo. So whatever we observed in eqsin before is happening again, now in both eqsin and ulsfo.

I am also wondering if this has something to do with the firmware of both the iDRAC and the NIC in these hosts. We have seen similar issues with the older hardware, where we had to upgrade both firmwares to get the d-i bullseye installer to continue. I will investigate this tomorrow in a bit more detail.

I guess my theory about incorrect/outdated firmware was wrong. On cp4047:

Integrated Dell Remote Access Controller 	5.10.30.00 
Broadcom Adv. Dual 25Gb Ethernet - 84:16:0C:73:7B:30 	21.85.21.92
Broadcom Adv. Dual 25Gb Ethernet - 84:16:0C:73:7B:31 	21.85.21.92

cp4039:

Broadcom Adv. Dual 25Gb Ethernet - 84:16:0C:74:BA:20 	21.85.21.92
Broadcom Adv. Dual 25Gb Ethernet - 84:16:0C:74:BA:21 	21.85.21.92 
Integrated Dell Remote Access Controller 	5.10.30.00

These are the correct/expected versions of the iDRAC and NIC firmware. So yeah, we can rule out this theory.
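
For completeness, the versions above can also be pulled from the iDRAC CLI rather than the web UI; from memory (double-check on an actual host) the relevant racadm subcommands are:

racadm getversion
racadm swinventory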

On cp5029, reimage steps:

  • start the reimage cookbook in one terminal
  • start the console in another terminal

On the console terminal the server PXE boots and gets stuck loading the Debian image.
On the cookbook terminal the cumin execution eventually times out because the server is still stuck loading the Debian image: "Unable to verify that the host is inside the Debian installer, please verify manually".
If you wait longer, the server gets past loading the Debian image and starts the install.
Back on the cumin console, typing go then makes cumin report "Unable to verify that the host rebooted into the new OS, it might still be in the Debian install", because the host is still doing the OS install.
In short, the OS install step simply took too long.
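
For anyone trying to reproduce this, the "console" terminal above can be attached with plain IPMI serial-over-LAN if no local wrapper is at hand (example host):

ipmitool -I lanplus -H cp5029.mgmt.eqsin.wmnet -U root -E sol activate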

@Volans and I were discussing this on IRC today; some more observations with cp5019, which failed on the first attempt but worked on the second.

  1. For the first attempt, we don't see anything in the DHCP logs on install5001, so there isn't much to go on there.
  2. On the second attempt, we see:
Feb  1 15:57:44 install5001 dhcpd[26603]: DHCPDISCOVER from 84:16:0c:b7:bc:e0 via 10.132.0.3
Feb  1 15:57:44 install5001 dhcpd[26603]: DHCPOFFER on 10.132.0.19 to 84:16:0c:b7:bc:e0 via 10.132.0.3
Feb  1 15:57:44 install5001 dhcpd[26603]: DHCPDISCOVER from 84:16:0c:b7:bc:e0 via 10.132.0.2
Feb  1 15:57:44 install5001 dhcpd[26603]: DHCPOFFER on 10.132.0.19 to 84:16:0c:b7:bc:e0 via 10.132.0.2
Feb  1 15:57:48 install5001 dhcpd[26603]: DHCPREQUEST for 10.132.0.19 (103.102.166.13) from 84:16:0c:b7:bc:e0 via 10.132.0.2
Feb  1 15:57:48 install5001 dhcpd[26603]: DHCPACK on 10.132.0.19 to 84:16:0c:b7:bc:e0 via 10.132.0.2
Feb  1 15:57:48 install5001 dhcpd[26603]: DHCPREQUEST for 10.132.0.19 (103.102.166.13) from 84:16:0c:b7:bc:e0 via 10.132.0.3
Feb  1 15:57:48 install5001 dhcpd[26603]: DHCPACK on 10.132.0.19 to 84:16:0c:b7:bc:e0 via 10.132.0.3
Feb  1 15:57:48 install5001 atftpd[460]: Serving lpxelinux.0 to 10.132.0.19:2070
Feb  1 15:57:50 install5001 atftpd[460]: Serving lpxelinux.0 to 10.132.0.19:2071
Feb  1 15:57:50 install5001 atftpd[460]: Serving lpxelinux.0 to 10.132.0.19:2072

And the host then reboots successfully and the PXE boot continues, as expected.

So then, in the case where that fails, is it possible that this is a network issue? Adding @ayounsi and @cmooney to see if we can debug this further from the switch side of things and figure out what is happening. Just for some more information: almost every single reimage in eqsin has failed for us, so something is definitely happening here! Thanks, and happy to debug this further.
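
As an extra data point for the next failure, it may also be worth capturing DHCP traffic on the install server itself while the host is supposedly PXE booting, to see whether the requests get lost in the network or are never sent at all; something like:

sudo tcpdump -vni any port 67 or port 68

(-v makes tcpdump decode the DHCP payload and print the client MAC address, which is needed here because the packets arrive via the relays rather than directly from the host.)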

Install servers are being migrated to Bullseye in T327867: Migrate the install servers to Bullseye so even though the observed issue is probably not related, it would be better to wait for the migration to finish before investigating it.

I had a look at the config and can't see anything out of the ordinary (it's also all standardized now).

When provisioning a server you can run: cr2/3-eqsin> monitor traffic interface ae1.520 matching "ether host 84:16:0c:b7:bc:e0" extensive (substituting the MAC address of the server shown on the PXE screen) to see if it's actually sending DHCP queries.

Feel free to ping us when you're experiencing the issue too, so we can troubleshoot it live.

@ssingh, we can agree that this was a NIC issue, yeah? If so, this can be marked as resolved since upgrading the NIC firmware allowed us to move past the stuck PXE boot.

Hi @BCornwall: Thanks for checking! I am pretty convinced that this is not related to the NIC firmware, for the following reasons; but I may be missing something and so just writing it here:

  • The PXE boot worked fine for us in cases with the old firmware as well; what failed there was the DHCP step inside d-i, which is a different step and the one fixed by the firmware upgrade. What was failing for us in eqsin was that the host, after rebooting, failed to even reach the PXE boot stage.
  • The hosts in ulsfo and eqsin got new hardware as part of the refreshes last quarter and also had the newer (or at least the base) version of the firmware that worked. For some of the hosts in ulsfo that I reimaged, I didn't update any firmware and the process still completed without any issues.
  • We had faced similar issues with the eqsin refresh in the last quarter as well, which leads me to believe that the issue is something else and perhaps something that is specific to eqsin.
    • In ulsfo, we did have this issue on one or two hosts this quarter, but I didn't perform any firmware upgrade; simply restarting the cookbook alleviated the issue. This is also true for all the eqsin hosts, where the cookbook finished after simply being restarted.
  • We can follow @ayounsi's recommendation above and run the command from the core router in eqsin to see if we get something useful.

I guess the main question is: for the hosts in eqsin that failed and where you restarted the cookbook, was there a firmware upgrade in between, or just a simple restart? If you did update the firmware, that certainly changes things!

I am not sure if I missed something but in any case, I think we should keep this task open and try to understand this issue in a bit more detail.

I guess the main question is: for the hosts in eqsin that failed and where you restarted the cookbook, was there a firmware upgrade in between, or just a simple restart? If you did update the firmware, that certainly changes things!

The eqsin hosts that failed needed only a second run of the cookbook. I was under the impression that the NIC upgrade had resolved that issue (i.e. no second cookbook run was necessary after the upgrade). But I might be misremembering...

RobH subscribed.

This seems to have stalled with an inconclusive assumption that the NIC firmware update solved it, but we have no confirmation of that (see @ssingh's last comment above).

I'm going to pull out the ops-eqsin tag, as this is no longer a DC ops specific issue (so it'll come off my daily tracking of caching site tasks). If this needs to have the project added back and DC ops involvement, please feel free to do so.

BCornwall changed the task status from Open to Stalled.Mar 17 2023, 6:37 PM
BCornwall moved this task from Backlog to Ready for work on the Traffic board.

@ssingh do you think this is still an issue that's worth keeping open, and should it then be tagged to IF?

Thanks @BCornwall, I think we can close this one as we have done some other reimages in eqsin and not observed this issue.

BCornwall claimed this task.