Q1:rack/setup/install ms-be208[1-8]
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ms-be208[1-8]

Hostname / Racking / Installation Details

Hostnames: ms-be208[1-8]
Racking Proposal: Spread across rows as evenly as possible
Networking Setup: # of Connections: 1 - Speed: 10G - VLAN: Private - AAAA records: Y - Additional IP records (Cassandra)? No
Partitioning/Raid: JBOD
OS Distro: bullseye
Sub-team Technical Contact: @MatthewVernon

Per host setup checklist

ms-be2081
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-be2082
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-be2083
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-be2084
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-be2085
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-be2086
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-be2087
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ms-be2088
  • Receive in system on procurement task T368928 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

I think, given @jhathaway's update on T378584 we should try booting these nodes in UEFI mode, since it sounds like our infrastructure is about ready for that.

It's probably worth seeing if that resolves the /dev/disk/by-path duplication too (one can but hope!)?

We do have support for UEFI in the provision cookbook and in reimage (after https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1077497 is merged), but there are a couple of things that we are still working on; see the subtasks of T373519. Nothing there is a hard blocker, but we are going outside the perimeter of what is battle-tested in production. Please be advised that there may be further issues to debug, and the ms-be hosts would be the first production hosts to use UEFI. If everybody is on board with this, we can go ahead :)

I think the two big issues from that are the partman cookbooks (we'd obviously need the one we're using for these nodes to work!) and some loss of boot resilience if we lose one of the two SSDs that the OS is on? The latter is, I think, not a show-stopper, as we could reimage a node if need be.

I'm now expecting the two new thanos-be nodes to be arriving this week (cf T368445 and T368446), and they need to be brought into service ASAP; I think at this point if UEFI booting works (especially if it resolves the duplicate by-path issue) and means we have systems where we can swap disks without having to reboot and change boot-mode round, then that seems like the best path-to-deployment for these new nodes.

Seem sensible?

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2001.codfw.wmnet with OS bookworm completed:

  • sretest2001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411041623_pt1979_2825314_sretest2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1087505 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] ms-be: partman EFI recipe

https://gerrit.wikimedia.org/r/1087505

Change #1087505 abandoned by JHathaway:

[operations/puppet@production] ms-be: partman EFI recipe

Reason:

Chatted with MVernon on IRC, we actually want to modify ms-be_simple.cfg

https://gerrit.wikimedia.org/r/1087505

Change #1087538 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] ms-be-simple: partman EFI recipe

https://gerrit.wikimedia.org/r/1087538

Change #1087538 merged by Elukey:

[operations/puppet@production] ms-be-simple: partman EFI recipe

https://gerrit.wikimedia.org/r/1087538

Change #1087858 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::installserver::preseed: use the EFI recipe for ms-be2083

https://gerrit.wikimedia.org/r/1087858

Change #1087858 merged by Elukey:

[operations/puppet@production] profile::installserver::preseed: use the EFI recipe for ms-be2083

https://gerrit.wikimedia.org/r/1087858

@MatthewVernon I tried to provision/reimage ms-be2083 with UEFI, but we have the same /dev/disk/by-path duplication issue. I think it is something intrinsic to how the SAS controller is supported by udev/Linux. We can either wait for the new controller to be deployed or adjust the Puppet fact code to take into account the new format in /dev/disk/by-path.

Alas :(

I think adjusting the fact is the way to go? Presumably it now needs to keep track of the targets of the symlinks in /dev/disk/by-path and only emit one symlink per target...

It would be my preference as well yes, it seems the quickest way forward. I can help in reviewing the changes for the new facts if you want! After that we should be able to reprovision/reimage all ms-be nodes with SAS controllers.
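
The "one symlink per target" idea discussed above can be sketched as follows. This is only an illustration in Python, not the actual swift_disks fact: the function name and the "first link in sorted order wins" rule are assumptions made for the example.

```python
import os
from typing import Dict


def unique_by_path_links(directory: str = "/dev/disk/by-path") -> Dict[str, str]:
    """Map each resolved device to a single symlink that points at it."""
    chosen: Dict[str, str] = {}
    for name in sorted(os.listdir(directory)):
        link = os.path.join(directory, name)
        if not os.path.islink(link):
            continue
        target = os.path.realpath(link)
        # Assumption: the first link in sorted order wins; later aliases
        # of the same device are ignored.
        chosen.setdefault(target, link)
    return chosen
```

With two by-path aliases resolving to the same device, only one of them is returned, so a consumer that counts links no longer sees the disk twice.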

Change #1087891 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] facts: adjust swift_disks fact to handle new SM kit

https://gerrit.wikimedia.org/r/1087891

Change #1087891 merged by MVernon:

[operations/puppet@production] facts: adjust swift_disks fact to handle new SM kit

https://gerrit.wikimedia.org/r/1087891

Change #1087935 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] swift: use regex to handle Dell & SM accounts|containers disks

https://gerrit.wikimedia.org/r/1087935

Change #1087935 merged by MVernon:

[operations/puppet@production] swift: use regex to handle Dell & SM accounts|containers disks

https://gerrit.wikimedia.org/r/1087935

Change #1087949 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] preseed - use ms-be_simple-efi.cfg for new SM Config-J nodes

https://gerrit.wikimedia.org/r/1087949

Change #1087949 merged by MVernon:

[operations/puppet@production] preseed - use ms-be_simple-efi.cfg for new SM Config-J nodes

https://gerrit.wikimedia.org/r/1087949

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2082 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411072103_jhathaway_3627150_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2082.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm completed:

  • ms-be2082 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411072237_jhathaway_3649509_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

@elukey I tried reproducing the double Debian installer bug, but I failed. The steps I tried:

  1. UEFI reimage, just to confirm existing setup, successful
  2. Re-provisioned in BIOS mode
  3. Re-provisioned in UEFI mode
  4. UEFI reimage, successful, single Debian installer

The only notable piece was that when switching to UEFI mode during provisioning, the Supermicro host rebooted itself once.

Perhaps there is a factory setting that is causing the bug, which somehow is removed after the first provisioning?

@jhathaway thanks a ton for the tests, it was exactly what I had in mind to do today :)

> The only notable piece was that when switching to UEFI mode during provisioning, the Supermicro host rebooted itself once.

The above is really interesting, let's reason out loud. The BIOS updates for Supermicro, as I coded them, happen in two steps:

  1. Only a subset of changes is applied, most notably the BIOS/UEFI boot mode. I noticed in the past that flipping between the two can cause some BIOS options to be reconfigured or changed (name, accepted parameters, etc.), so as a precautionary measure I chose to flip Legacy/UEFI first.
  2. The second pass applies all the rest, including PXE settings etc.

After both steps there is a chassis reset, so in theory we should get to the end of the cookbook with all the settings applied.
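
A minimal sketch of that two-pass flow, assuming the BMC is reached through three caller-supplied callables (`get_current`, `apply`, `reset`, all hypothetical). The first-pass key set is taken from the log lines in this task, and the "reset only when something changed" behaviour is an assumption of the sketch, not a description of the provision cookbook itself.

```python
from typing import Any, Callable, Dict

# Assumption: settings applied in the first pass, named after the log lines
# in this task. Everything else goes into the second pass.
FIRST_PASS_KEYS = {"BootModeSelect", "QuietBoot", "IntelVirtualizationTechnology"}


def diff_settings(current: Dict[str, Any], desired: Dict[str, Any]) -> Dict[str, Any]:
    """Return the desired settings whose current value differs."""
    return {key: value for key, value in desired.items() if current.get(key) != value}


def apply_in_two_passes(
    get_current: Callable[[], Dict[str, Any]],
    apply: Callable[[Dict[str, Any]], None],
    reset: Callable[[], None],
    desired: Dict[str, Any],
) -> int:
    """Apply boot-mode settings first, then the rest; return the number of resets."""
    passes = [
        {k: v for k, v in desired.items() if k in FIRST_PASS_KEYS},
        {k: v for k, v in desired.items() if k not in FIRST_PASS_KEYS},
    ]
    resets = 0
    for wanted in passes:
        # Re-read the settings on each pass: flipping the boot mode may have
        # changed which options exist or what values they hold.
        changes = diff_settings(get_current(), wanted)
        if changes:
            apply(changes)
            reset()
            resets += 1
    return resets
```

Reading the settings again before the second pass is the point of the split: the second diff is computed against whatever the firmware exposes after the boot mode flip.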

Here is an example for ms-be2083:

2024-11-06 08:46:44,406 elukey 671471 [INFO] Setting up BootMode and basic BIOS settings.
2024-11-06 08:46:44,406 elukey 671471 [INFO] BIOS: BootModeSelect is set to Legacy, while we want UEFI
2024-11-06 08:46:44,406 elukey 671471 [INFO] BIOS: QuietBoot is set to True, while we want False
2024-11-06 08:46:44,406 elukey 671471 [INFO] BIOS: IntelVirtualizationTechnology is set to Enable, while we want Disable
2024-11-06 08:46:44,407 elukey 671471 [INFO] Found differences between our desired status and the current one, applying new BIOS settings (a reboot will be performed).
2024-11-06 08:46:44,407 elukey 671471 [INFO] Applying BIOS settings...
2024-11-06 08:46:44,591 elukey 671471 [INFO] Rebooting the host with policy ChassisResetPolicy.FORCE_RESTART and waiting for 5 minutes
2024-11-06 08:46:44,592 elukey 671471 [INFO] Resetting chassis power status for ms-be2083 to ForceRestart   <===============
2024-11-06 08:51:44,824 elukey 671471 [INFO] Retrieving BIOS settings (second round).
2024-11-06 08:51:44,824 elukey 671471 [INFO] Retrieving updated BIOS settings...
2024-11-06 08:51:45,115 elukey 671471 [INFO] BIOS: IPv4HTTPSupport is set to Disabled, while we want Enabled
2024-11-06 08:51:45,115 elukey 671471 [INFO] BIOS: IPv4PXESupport is set to Enabled, while we want Disabled
2024-11-06 08:51:45,115 elukey 671471 [INFO] BIOS: IPv6PXESupport is set to Enabled, while we want Disabled
2024-11-06 08:51:45,116 elukey 671471 [INFO] Found differences between our desired status and the current one, applying new BIOS settings (a reboot will be performed).
2024-11-06 08:51:45,116 elukey 671471 [INFO] Applying BIOS settings...
2024-11-06 08:51:45,245 elukey 671471 [INFO] Applying Network changes to the BMC.
2024-11-06 08:51:46,048 elukey 671471 [INFO] Rebooting the host with policy ChassisResetPolicy.FORCE_RESTART and waiting for 5 minutes
2024-11-06 08:51:46,049 elukey 671471 [INFO] Resetting chassis power status for ms-be2083 to ForceRestart   <===============

Now let's analyze your provision runs for ms-be2082:

UEFI -> BIOS

2024-11-07 21:27:00,539 jhathaway 3635956 [INFO] Setting up BootMode and basic BIOS settings.
2024-11-07 21:27:00,540 jhathaway 3635956 [INFO] BIOS: BootModeSelect is set to UEFI, while we want Legacy
2024-11-07 21:27:00,540 jhathaway 3635956 [INFO] Found differences between our desired status and the current one, applying new BIOS settings (a reboot will be performed).
2024-11-07 21:27:00,540 jhathaway 3635956 [INFO] Applying BIOS settings...
2024-11-07 21:27:00,701 jhathaway 3635956 [INFO] Rebooting the host with policy ChassisResetPolicy.GRACEFUL_RESTART and waiting for 5 minutes
2024-11-07 21:27:00,701 jhathaway 3635956 [INFO] Resetting chassis power status for ms-be2082 to GracefulRestart
2024-11-07 21:32:00,920 jhathaway 3635956 [INFO] Retrieving BIOS settings (second round).
2024-11-07 21:32:00,921 jhathaway 3635956 [INFO] Retrieving updated BIOS settings...
2024-11-07 21:32:01,111 jhathaway 3635956 [INFO] BIOS: IPv4HTTPSupport is set to Enabled, while we want Disabled
2024-11-07 21:32:01,111 jhathaway 3635956 [INFO] BIOS: IPv4PXESupport is set to Disabled, while we want Enabled
2024-11-07 21:32:01,111 jhathaway 3635956 [INFO] BIOS: AOC_A25G_b2SLAN1OPROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,112 jhathaway 3635956 [INFO] BIOS: M_2_HC_1OPROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,112 jhathaway 3635956 [INFO] BIOS: M_2_HC_2OPROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,112 jhathaway 3635956 [INFO] BIOS: OnboardNVMe0OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,112 jhathaway 3635956 [INFO] BIOS: OnboardNVMe1OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,112 jhathaway 3635956 [INFO] BIOS: OnboardNVMe2OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,112 jhathaway 3635956 [INFO] BIOS: OnboardNVMe3OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,112 jhathaway 3635956 [INFO] BIOS: OnboardNVMe4OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] BIOS: OnboardNVMe5OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] BIOS: OnboardNVMe6OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] BIOS: OnboardNVMe7OptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] BIOS: OnboardVideoOptionROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] BIOS: P2SLOT1PCI_E4_0X16OPROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] BIOS: P2SLOT2PCI_E4_0X16OPROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] BIOS: P2SLOT3PCI_E4_0X16OPROM is set to EFI, while we want Legacy
2024-11-07 21:32:01,113 jhathaway 3635956 [INFO] Found differences between our desired status and the current one, applying new BIOS settings (a reboot will be performed).
2024-11-07 21:32:01,114 jhathaway 3635956 [INFO] Applying BIOS settings...
2024-11-07 21:32:01,212 jhathaway 3635956 [INFO] Applying Network changes to the BMC.
2024-11-07 21:32:01,611 jhathaway 3635956 [INFO] Rebooting the host with policy ChassisResetPolicy.GRACEFUL_RESTART and waiting for 5 minutes
2024-11-07 21:32:01,611 jhathaway 3635956 [INFO] Resetting chassis power status for ms-be2082 to GracefulRestart

I can't find logs on cumin2002 related to BIOS -> UEFI, so I'm not sure where/when it was flipped back, but I re-ran the cookbook and I see all the options set up correctly.

I tried with ms-be2085, doing the following:

  • Provision to UEFI, manual/extra chassis reset triggered via spicerack-shell.
  • Verify via Redfish that no pending BIOS-config changes were listed (namely, in need of a chassis reset to get applied).
  • Configured the JBOD drives via BIOS-config utility, then save/exit
  • Kicked off a reimage, failed with double d-i
  • Stopped the reimage, kicked off another one, all good.
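
The "no pending BIOS-config changes" check in the list above can be sketched like this. In Redfish, the Bios resource exposes the active values under `Attributes` and references, via `@Redfish.Settings`, a settings object holding the values staged for the next reset. The sketch assumes both JSON payloads have already been fetched into dicts; how they are fetched, and the function name, are not taken from this task.

```python
from typing import Any, Dict


def pending_bios_changes(current: Dict[str, Any], pending: Dict[str, Any]) -> Dict[str, Any]:
    """Return the staged attributes whose value differs from the active one."""
    active = current.get("Attributes", {})
    staged = pending.get("Attributes", {})
    return {name: value for name, value in staged.items() if active.get(name) != value}
```

An empty result means nothing is waiting for a chassis reset. Comparing values rather than just checking for an empty settings object covers BMCs that list every attribute in the pending payload.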

So this seems to be something that happens only the first time we do provisioning. I wonder whether the state left after changing the disks to JBOD has any influence on this. With the next host I'll try a cleaner chassis reset after configuring the disks; maybe this is the culprit (although I don't see how the two could be related, but who knows).

Very interesting - I watched the sol1 console of ms-be2086 while provisioning, and right after the second round of reboots (for BIOS updates) I noticed an attempt to PXE boot over HTTP, which failed and ended up in:

>>Checking Media Presence......
>>Media Present......
>>Start HTTP Boot over IPv4.

[..]
UEFI Interactive Shell v2.2
EDK II
UEFI v2.80 (American Megatrends, 0x00050016)
Mapping table
     BLK0: Alias(s):
          PciRoot(0x0)/Pci(0x11,0x5)/Sata(0x0,0xFFFF,0x0)
     BLK1: Alias(s):
          PciRoot(0x0)/Pci(0x11,0x5)/Sata(0x1,0xFFFF,0x0)
Press ESC in 1 seconds to skip startup.nsh or any other key to continue.
Shell>

And this happened every time I tried to reboot. So maybe the first reimage configures the EFI partition on the disks, skipping the HTTP boot attempt?

This is the boot order right after provisioning:

'BootModeSelect': 'UEFI',
'BootOption_1': 'UEFI Hard Disk',
'BootOption_1_4': '(B23/D0/F1) UEFI HTTP IPv4: Broadcom '
                  'Network Device - '
                  '90:5A:08:00:B7:BB(MAC:905a0800b7bb)',
'BootOption_1_5': 'UEFI: Built-in EFI Shell',

Maybe we need to remove BootOption_1_4, or just move BootOption_1_5 (EFI shell) up after Hard Disk?

Edit: changing the boot order is possible from X13 onward, and we have X12.
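
To make the boot order above easier to reason about, here is a small sketch that lists the `BootOption_*` values in sequence and flags the case where the EFI shell comes right after the HTTP boot entry (so a failed HTTP boot drops into the shell). It assumes that the numeric parts of the key names reflect the boot position, which is a guess based on the dump above; both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def boot_sequence(bios: Dict[str, str]) -> List[str]:
    """List BootOption_* values ordered by the numeric parts of their key."""

    def position(key: str) -> Tuple[int, ...]:
        # "BootOption_1_4" -> (1, 4); assumed to encode the boot position.
        return tuple(int(part) for part in key.split("_")[1:])

    keys = [key for key in bios if key.startswith("BootOption_")]
    return [bios[key] for key in sorted(keys, key=position)]


def shell_follows_http(bios: Dict[str, str]) -> bool:
    """True when the EFI shell is the entry tried right after an HTTP boot entry."""
    order = boot_sequence(bios)
    return any(
        "HTTP" in first and "EFI Shell" in second
        for first, second in zip(order, order[1:])
    )
```

For the settings shown above this reports hard disk, then HTTP boot, then the EFI shell, which matches the console output seen on ms-be2086.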

Another test, leading to weird results. I tried to do the following:

  • Manually disabled IPv4HTTPSupport via spicerack-shell, basically to stop ms-be2086 from automatically booting via PXE/HTTP if for some reason the HDD step failed.
  • Verified that, before any d-i, I was indeed ending up in the EFI shell.
  • Kicked off a reimage with these changes: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1088524 (via test-cookbook). Basically the idea was to enable/disable HTTP support only before/after d-i, as a test.
  • The reimage failed, since I ended up in the EFI shell anyway. I rechecked the BIOS boot order via spicerack-shell, and now I see that the second boot item after HDD is the EFI shell, not HTTP boot anymore.
  • Tried to disable HTTP, reboot, enable HTTP, reboot, and re-checked the BIOS values. Same issue, EFI shell showing errors.
  • Then tried to restore the boot order manually via BIOS-config (mgmt console), first HDD then HTTP PXE boot, and that cleared the issue (namely no more EFI shell errors etc.). No idea why it happened.
  • Kicked off the reimage of ms-be2086, no issue while reimaging.
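
The "enable HTTP support only around d-i" idea from the test above can be expressed as a context manager. This is a sketch of the idea only, not the content of the Gerrit change linked above: `set_bios` is a hypothetical callable that pushes settings to the BMC, and the `Enabled`/`Disabled` values are the ones that appear in the provisioning logs in this task.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


@contextmanager
def http_boot_enabled(set_bios: Callable[[Dict[str, str]], None]) -> Iterator[None]:
    """Enable UEFI HTTP boot for the duration of the block, then disable it."""
    set_bios({"IPv4HTTPSupport": "Enabled"})
    try:
        yield
    finally:
        # Always switch HTTP boot off again, even if the install step fails.
        set_bios({"IPv4HTTPSupport": "Disabled"})
```

The sketch does not model the chassis reset needed for a BIOS change to take effect, which is where the real test ran into trouble with the boot order.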

So far I provisioned up to ms-be2087, and ms-be2088 was left untouched. The ADMIN/root password should already be set to the one on pwstore, so if you want to go ahead and test with 2088 please do it :)

I also tried not configuring any special JBOD config for ms-be2087 after provisioning, and kicked off a reimage to see if the double d-i issue appeared (to rule out special SAS controller features/settings), but no luck: still double d-i at the first try.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm executed with errors:

  • ms-be2082 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2082.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2082 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2082.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye completed:

  • ms-be2082 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411081734_jhathaway_3855743_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-11-08T19:59:27Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-08T19:59:41Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2082 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2082.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2082 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • XXX Forced UEFI regular Boot for next reboot
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2082.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye completed:

  • ms-be2082 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • XXX Forced UEFI regular Boot for next reboot
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411082244_jhathaway_3941074_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

@elukey I was able to reproduce the issue by wiping the files from the EFI partition before kicking off another reimage. I think the problem is actually in the debian-installer rather than on the Supermicro side, which is why we don't see this issue on sretest2001.codfw.wmnet. I think the debian-installer is failing to install grub properly and create the EFI boot entry, which is part of the grub install process. I think the issue is related to setting grub-installer/bootdev, which is done by autoinstall/scripts/partman_early_command.sh on the ms-be boxes. On ms-be2082 this evaluated to grub-installer/bootdev /dev/sdj /dev/sdk, which seems correct, but perhaps /dev/sdk needs to be first? I also tried setting grub-installer/only_debian boolean false, which we set in raid1-2dev-efi.cfg, but that didn't seem to have any effect, so I don't think we are still hitting "#this workarounds LP #1012629 / Debian #666974", but I'm also not sure. I am off Monday, but happy to investigate more on Tuesday.
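For reference, the two debconf settings mentioned in this comment can be written out as preseed lines. The sketch below only builds the text for both device orderings so they can be compared side by side. The helper name `bootdev_preseed` is made up for illustration, and it does not reproduce whatever autoinstall/scripts/partman_early_command.sh actually does to compute the value.

```python
from itertools import permutations


def bootdev_preseed(devices, only_debian=False):
    """Return preseed lines for the GRUB settings discussed in this comment.

    Hypothetical helper for illustration only; it does not reproduce
    the logic of partman_early_command.sh.
    """
    if not devices:
        raise ValueError("at least one boot device is required")
    flag = "true" if only_debian else "false"
    return [
        f"d-i grub-installer/only_debian boolean {flag}",
        f"d-i grub-installer/bootdev string {' '.join(devices)}",
    ]


if __name__ == "__main__":
    # Device names are the ones observed on ms-be2082; print both orderings.
    for order in permutations(["/dev/sdj", "/dev/sdk"]):
        print("\n".join(bootdev_preseed(list(order))))
        print()
```

Printing both orderings makes it easy to test the "perhaps /dev/sdk needs to be first" hypothesis with one change at a time.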

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye completed:

  • ms-be2082 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411102317_jhathaway_194183_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Very interesting! Let's chat tomorrow :)
I am wondering why the status is cleared after the first reimage, but let's discuss next steps when you are back online.

As an FYI, I wanted to finish ms-be2088 but I wasn't able to reimage it, since afaics there is no NIC connected to a switch. I pinged dcops :)

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye completed:

  • ms-be2082 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411122155_jhathaway_639598_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye completed:

  • ms-be2082 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411122311_jhathaway_654291_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@jhathaway another episode of the saga, ms-be2088 :D

I tried to reimage it to see if the latest reimage patch (forcing Hdd boot after the Debian install) changed anything. Afaics the double d-i issue didn't appear, but while checking the mgmt console I saw d-i stuck at:

   ┌───────────────────┤ [!!] Configuring puppet-agent ├───────────────────┐
   │                                                                       │
  ┌│                         Installation complete                         │
  ││ Installation is complete, so it is time to boot into your new system. │
  ││ Make sure to remove the installation media, so that you boot into the │
  ││ new system rather than restarting the installation.                   │
  ││                                                                       │
  ││     <Go Back>                                          <Continue>     │
  └│

Once I gave the Continue command, everything completed fine. This is new, I have never seen it before, and afaics from gerrit we didn't change much.

Edit: my bad, this happened because all the disks were not in JBOD mode, I thought I had configured them but I misremembered.

I didn't see the double d-i issue, but the test is not conclusive since multiple d-i runs happened before the good one. I think we should keep going with Jesse's approach (which I believe is to boot into the OS, wipe the EFI partition with dd or similar, and then reimage).
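A minimal dry-run sketch of the "wipe the EFI partition with dd" step, assuming the wipe is done by zeroing the start of each EFI partition. It only builds and prints the commands and never executes them. The function name, the partition paths and the 16 MB size are all placeholders, not values from this task.

```python
import shlex


def efi_wipe_commands(partitions, megabytes=16):
    """Build (but do not run) dd commands that zero the start of each partition."""
    commands = []
    for part in partitions:
        if not part.startswith("/dev/"):
            raise ValueError(f"not a device path: {part}")
        commands.append(
            ["dd", "if=/dev/zero", f"of={part}", "bs=1M",
             f"count={megabytes}", "conv=fsync"]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical partition names; check the real EFI partitions first.
    for cmd in efi_wipe_commands(["/dev/sdX1", "/dev/sdY1"]):
        print(shlex.join(cmd))
```

Keeping it as a dry run lets the commands be reviewed before anything destructive is run on the host, after which the reimage can be started as usual.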

Sorry, egg on my face, that was my fault. I commented out the auto reboot so I could do some debugging yesterday, before the reboot, but forgot to remove the puppet override.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye executed with errors:

  • ms-be2082 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ms-be2082.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye completed:

  • ms-be2082 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411131749_jhathaway_851469_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-11-13T18:13:24Z] <jhathaway@cumin2002> START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-13T18:13:30Z] <jhathaway@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm completed:

  • ms-be2082 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411132016_jhathaway_877652_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm completed:

  • ms-be2082 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411132119_jhathaway_887862_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm completed:

  • ms-be2082 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Forced UEFI regular Boot for next reboot
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411132257_jhathaway_905076_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bullseye completed:

  • ms-be2082 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411140411_jhathaway_959588_ms-be2082.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

@jhathaway something interesting that I found on Redfish related to BIOS boot options:

ms-be2088:

BootModeSelect UEFI
BootOption_1 UEFI Hard Disk:debian
BootOption_1_4 debian(SATA,Port:0)
BootOption_1_5 (B202/D0/F0) UEFI HTTP IPv4: Intel(R) I350 Gigabit Network Connection(MAC:7cc255696588)
BootOption_1_6 UEFI: Built-in EFI Shell
BootOption_2 UEFI CD/DVD
BootOption_2_4 debian(SATA,Port:1)
BootOption_2_5 (B202/D0/F1) UEFI HTTP IPv4: Intel(R) I350 Gigabit Network Connection(MAC:7cc255696589)
BootOption_3 UEFI USB Hard Disk
BootOption_3_4 (B23/D0/F0) UEFI HTTP IPv4: Broadcom Network Device - 90:5A:08:00:B3:3A(MAC:905a0800b33a)
BootOption_4 UEFI USB CD/DVD
BootOption_4_4 (B23/D0/F1) UEFI HTTP IPv4: Broadcom Network Device - 90:5A:08:00:B3:3B(MAC:905a0800b33b)
BootOption_5 UEFI USB Key
BootOption_6 UEFI USB Floppy
BootOption_7 UEFI USB Lan
BootOption_8 UEFI Network:(B202/D0/F0) UEFI HTTP IPv4: Intel(R) I350 Gigabit Network Connection(MAC:7cc255696588)
BootOption_9 UEFI AP:UEFI: Built-in EFI Shell

ms-be2082:

BootModeSelect UEFI
BootOption_1 UEFI Hard Disk:debian
BootOption_1_4 debian(SATA,Port:1)
BootOption_1_5 (B202/D0/F0) UEFI HTTP IPv4: Intel(R) I350 Gigabit Network Connection(MAC:7cc255696498)
BootOption_1_6 UEFI: Built-in EFI Shell
BootOption_2 UEFI CD/DVD
BootOption_2_4 (B202/D0/F1) UEFI HTTP IPv4: Intel(R) I350 Gigabit Network Connection(MAC:7cc255696499)
BootOption_2_5 Disabled
BootOption_3 UEFI USB Hard Disk
BootOption_3_4 (B23/D0/F0) UEFI HTTP IPv4: Broadcom Network Device - 90:5A:08:00:B3:8A(MAC:905a0800b38a)
BootOption_4 UEFI USB CD/DVD
BootOption_4_4 (B23/D0/F1) UEFI HTTP IPv4: Broadcom Network Device - 90:5A:08:00:B3:8B(MAC:905a0800b38b)
BootOption_5 UEFI USB Key
BootOption_6 UEFI USB Floppy
BootOption_7 UEFI USB Lan
BootOption_8 UEFI AP:UEFI: Built-in EFI Shell
BootOption_9 UEFI Network:(B202/D0/F0) UEFI HTTP IPv4: Intel(R) I350 Gigabit Network Connection(MAC:7cc255696498)

I am not 100% sure what the BootOption format is, but ms-be2088 (reimaged with the continuous Hdd flag) shows two debian(SATA,Port:X) values, while ms-be2082 shows only one. The EFI partition is installed on two disk partitions on both nodes (and this is good), so we should be able to boot even if one of the two rotational disks has failed, but I am wondering if the BIOS/UEFI boot settings agree with this. Does it make sense?
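The comparison above can be made mechanical with a small sketch like the following, which parses the pasted "key value" listing and picks out the debian(SATA,Port:N) entries per host. The function names are made up for illustration, the sample data is an excerpt of the listings above, and nothing here talks to Redfish.

```python
import re


def parse_boot_options(text):
    """Parse 'Key value' lines (as pasted in this task) into a dict."""
    options = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            options[key] = value
    return options


def debian_disk_entries(options):
    """Return the options whose value is a debian(SATA,Port:N) entry."""
    pattern = re.compile(r"^debian\(SATA,Port:\d+\)$")
    return {k: v for k, v in options.items() if pattern.match(v)}


# Excerpts of the two listings pasted above.
MS_BE2088 = """
BootOption_1 UEFI Hard Disk:debian
BootOption_1_4 debian(SATA,Port:0)
BootOption_2 UEFI CD/DVD
BootOption_2_4 debian(SATA,Port:1)
"""
MS_BE2082 = """
BootOption_1 UEFI Hard Disk:debian
BootOption_1_4 debian(SATA,Port:1)
BootOption_2 UEFI CD/DVD
BootOption_2_5 Disabled
"""

if __name__ == "__main__":
    for name, text in (("ms-be2088", MS_BE2088), ("ms-be2082", MS_BE2082)):
        entries = debian_disk_entries(parse_boot_options(text))
        print(name, sorted(entries.values()))
```

Running the same check on each host after a reimage would show whether they converge on the same set of disk boot entries.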

Also, it may not be relevant, but I am wondering how dangerous it is to have UEFI HTTP entries listed multiple times after all the disks.

I tried to manually set the continuous flag on sretest2001 and rebooted, but I didn't see the boot options change like on ms-be2088. So at this point it may not be relevant, but I can't explain the above differences. Maybe we just need to reimage all of them one more time and they will get the same config?

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm completed:

  • maps-test2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411182121_pt1979_2058156_maps-test2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Mentioned in SAL (#wikimedia-operations) [2024-11-19T17:41:49Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-19T17:42:02Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-19T18:32:13Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-19T18:32:17Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-19T19:05:07Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-19T19:05:17Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-19T20:10:23Z] <jhathaway@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Mentioned in SAL (#wikimedia-operations) [2024-11-19T20:10:27Z] <jhathaway@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2082.codfw.wmnet with reason: T371400

Jhancock.wm updated the task description.