Q4:rack/setup/install sretest2010 Config J 1P test host
Open, MediumPublic

Description

This task will track the racking, setup, and OS installation of sretest2010 Config J 1P test host

Hostname / Racking / Installation Details

Hostnames: sretest2010 (just assign next sequence number available, 2003 at time of this task filing.)
Racking Proposal: Anywhere, zero restrictions.
Networking Setup: # of Connections:1 Speed:10G. - VLAN:Private
OS Distro: Bookworm (default unless otherwise specified)
Boot Method: Test UEFI if possible, otherwise BIOS fine.
Sub-team Technical Contact: RobH

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

sretest2010:
  • Receive in system on procurement task <enter task # here> & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook

Event Timeline

Change #1170085 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.provision: add custom settings for Supermicro

https://gerrit.wikimedia.org/r/1170085

I retried the provision cookbook but for some reason that I cannot explain, I am not able to trigger a host reboot/powercycle:

  • Tried via reset /system1/pwrmgtsvc1 in the mgmt console.
  • Tried via Redfish, chassis force reset and graceful restart ( r.request('get', '/redfish/v1/Systems/1').json()['BootProgress'] shows also LastState: None).
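For reference, the Redfish power actions mentioned above map to the standard ComputerSystem.Reset action. The following is only a sketch of how the request could be built, assuming the system resource lives at /redfish/v1/Systems/1 as in the BootProgress query; build_reset_request is a hypothetical helper, not part of spicerack, and it sends nothing.

```python
# ResetType values defined by the Redfish ComputerSystem schema. A given BMC
# may accept only a subset of them (it advertises the subset in
# "ResetType@Redfish.AllowableValues" on the system resource).
RESET_TYPES = {
    "On", "ForceOff", "GracefulShutdown", "GracefulRestart",
    "ForceRestart", "PowerCycle",
}


def build_reset_request(system_path, reset_type):
    """Return the (path, JSON body) pair for a ComputerSystem.Reset action.

    The path follows the conventional layout; the authoritative target is the
    one listed under Actions in the system resource itself.
    """
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported ResetType: {reset_type}")
    path = f"{system_path.rstrip('/')}/Actions/ComputerSystem.Reset"
    return path, {"ResetType": reset_type}


path, body = build_reset_request("/redfish/v1/Systems/1", "ForceRestart")
print(path, body)
```

In the spicerack shell the result could presumably be sent with r.request('post', path, json=body), mirroring the request calls used elsewhere in this task.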

@Jhancock.wm could you please try to powercycle the host in the DC? I am not sure if it is stuck in a weird way, cannot do much on my side :(

@elukey host was found powered off. pulled the power and then restarted.

i also checked the settings and it _looks_ okay. I'll try running the imaging script this afternoon and see what happens.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host sretest2010.codfw.wmnet with OS bookworm

Change #1174405 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add sretest2010 to the catalog

https://gerrit.wikimedia.org/r/1174405

Change #1174405 merged by Elukey:

[operations/puppet@production] Add sretest2010 to the catalog

https://gerrit.wikimedia.org/r/1174405

@Jhancock.wm sretest2010 successfully provisioned and reimaged. Given the number of extra ~7TB disks, I suspect this is another hadoop-like node to test, so the current Partman recipe may not be the correct one (but it can easily be changed before the test).

elukey@sretest2010:~$ sudo fdisk -l
Disk /dev/sdb: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk model: Micron_5400_MTFD
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: F7FA6268-A6C6-41BF-B864-9BF458E1A27A

Device      Start       End   Sectors   Size Type
/dev/sdb1    2048      4095      2048     1M BIOS boot
/dev/sdb2    4096    503807    499712   244M EFI System
/dev/sdb3  503808 937701375 937197568 446.9G Linux RAID


Disk /dev/sda: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk model: Micron_5400_MTFD
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: A294E904-A397-4D80-BF7B-E2D737990491

Device      Start       End   Sectors   Size Type
/dev/sda1    2048      4095      2048     1M BIOS boot
/dev/sda2    4096    503807    499712   244M EFI System
/dev/sda3  503808 937701375 937197568 446.9G Linux RAID


Disk /dev/md0: 446.76 GiB, 479709888512 bytes, 936933376 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdc: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1ACC6C12-A01C-4DAE-969C-99EDCC71F507

Device      Start         End     Sectors  Size Type
/dev/sdc1    2048        4095        2048    1M Linux filesystem
/dev/sdc2    4096      503807      499712  244M EFI System
/dev/sdc3  503808 15628052479 15627548672  7.3T Linux RAID


Disk /dev/sdd: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/mapper/vg0-swap: 976 MiB, 1023410176 bytes, 1998848 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/mapper/vg0-root: 74.5 GiB, 79997960192 bytes, 156246016 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/mapper/vg0-srv: 281.95 GiB, 302736474112 bytes, 591282176 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdf: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sde: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdh: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdg: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdj: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdi: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdk: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdl: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdm: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdo: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdn: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdq: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdr: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdp: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sds: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdt: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdu: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdv: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdw: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdy: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdx: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/sdz: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: ST8000NM018B    
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


elukey@sretest2010:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                  252G     0  252G   0% /dev
tmpfs                  51G  1.7M   51G   1% /run
/dev/mapper/vg0-root   73G  2.8G   67G   5% /
tmpfs                 252G     0  252G   0% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
/dev/mapper/vg0-srv   277G   28K  263G   1% /srv
/dev/sdb2             241M  166K  241M   1% /boot/efi
tmpfs                  51G     0   51G   0% /run/user/13926
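To double-check the disk inventory without counting by hand, the fdisk output above can be summarized with a few lines of Python. This is a rough sketch: summarize_disks and to_tib are hypothetical helper names, and SAMPLE holds only a handful of the "Disk" lines copied from the listing above.

```python
import re
from collections import Counter

# Matches lines such as:
#   Disk /dev/sdc: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
# Only physical sdX devices match; md and device-mapper entries are skipped.
DISK_RE = re.compile(r"^Disk (/dev/sd[a-z]+): [^,]+, (\d+) bytes")


def summarize_disks(fdisk_output):
    """Count sdX disks per size in bytes."""
    sizes = Counter()
    for line in fdisk_output.splitlines():
        match = DISK_RE.match(line.strip())
        if match:
            sizes[int(match.group(2))] += 1
    return sizes


def to_tib(size_bytes):
    """Convert bytes to TiB (2**40 bytes), the unit fdisk prints."""
    return size_bytes / 2**40


SAMPLE = """\
Disk /dev/sdb: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk /dev/sda: 447.13 GiB, 480103981056 bytes, 937703088 sectors
Disk /dev/md0: 446.76 GiB, 479709888512 bytes, 936933376 sectors
Disk /dev/sdc: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk /dev/sdd: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
"""

for size, count in sorted(summarize_disks(SAMPLE).items()):
    print(f"{count} x {to_tib(size):.2f} TiB")
```

Fed the full listing instead of SAMPLE, this should report the two 447 GiB SSDs plus the 24 large drives (sdc through sdz).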
Jhancock.wm added a subscriber: MatthewVernon.

@MatthewVernon we're finished with this test server if you want to run some tests on it. It's a 1 CPU version of the config-J servers you use.

Change #1185973 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] sretest2010: set to be installed like a new ms-be* node

https://gerrit.wikimedia.org/r/1185973

Change #1185973 merged by MVernon:

[operations/puppet@production] sretest2010: set to be installed like a new ms-be* node

https://gerrit.wikimedia.org/r/1185973

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bullseye completed:

  • sretest2010 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509090957_mvernon_1345486_sretest2010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm executed with errors:

  • sretest2010 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS bookworm completed:

  • sretest2010 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202509091204_mvernon_1415771_sretest2010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors:

  • sretest2010 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie

Hi @Jhancock.wm / @elukey . I've found 2 show-stoppers thus far (the second of which has left me blocked):

  1. this node cannot PXE boot without manual intervention from the HTML5 console (unlike the Dell systems, I've not found any way of sending F12 over the ssh-to-mgmt console), so the reimage cookbook doesn't work without manual intervention over the HTML5 console
  2. the trixie installer doesn't work with this host - once the installer has been downloaded, you just get a black screen. Looking at the log console on tty4, I see the final logs:
WARNING **: no packages matching running kernel 6.12.43+deb13-amd64 in archive
debconf: --> INPUT critical anna/no_kernel_modules
debconf: <-- 0 question will be asked
debconf: --> GO

I'm guessing this latter issue might be resolved by updating the installer image, but that may not fix the broken video for the installer prompt, which will be a problem.

I'll leave this host alone for now, so feel free to poke it as you see fit.

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie

Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host sretest2010.codfw.wmnet with OS trixie executed with errors:

  • sretest2010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console sretest2010.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Change #1187669 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] preseed: fix dse-k8s-worker1014's partman config

https://gerrit.wikimedia.org/r/1187669

Change #1187669 merged by Elukey:

[operations/puppet@production] preseed: fix dse-k8s-worker1014's partman config

https://gerrit.wikimedia.org/r/1187669

elukey reopened this task as Open.

Edited Sep 16 2025, 10:29 AM

> Hi @Jhancock.wm / @elukey . I've found 2 show-stoppers thus far (the second of which has left me blocked):
>
>   1. this node cannot PXE boot without manual intervention from the HTML5 console (unlike the Dell systems, I've not found any way of sending F12 over the ssh-to-mgmt console), so the reimage cookbook doesn't work without manual intervention over the HTML5 console

This is very weird since I reimaged the node successfully, so I am not 100% sure what's changed. I re-ran the provision cookbook but the issue persists.

Provision finds only one NIC in the BIOS, AOC_A25G_b2SLAN1OPROM, and sets it to "EFI". I checked the LinkUp status and it seems to be on NIC port 1 of the aforementioned controller, so nothing strange there. There is also a specific network boot option in the BIOS for it:

'UEFINETWORKBootOption_4': '(B111/D0/F0) UEFI HTTP IPv4 '
                           'Broadcom Network Device - '
                           '90:5A:08:9F:08:80(MAC:905A089F0880)',

While watching the mgmt console I noticed that PXE / HTTP boot is attempted during reimage, but it finishes very quickly and then the regular OS boot kicks in. I suspect that something is causing the HTTP boot to fail.

Edit: I see this:

>>Checking Media Presence......
>>No Media Present......

There is something definitely off, I just tested the following and everything hangs:

-> reset /system1/pwrmgtsvc1
/system1/pwrmgtsvc1

I am trying to set Legacy and then UEFI via cookbook provisioning to see if anything changes.

@Jhancock.wm Hi! When you have a moment could you please check if sretest2010 is in a weird state? I am not able to powercycle it..

@elukey i found it booted to the sretest and was responsive. mgmt ip pinged. rebooted it via keyboard. mgmt pings and i can login to the BMC. Not sure what's causing it to freeze up.

Change #1189118 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] WIP: sre.hosts.provision: check attributes after rebooting

https://gerrit.wikimedia.org/r/1189118

Updated the BMC and the BIOS firmware; the update still seems to be in progress. Will check later on :)

Edit: all good, plus I reset the BMC to factory defaults.

Tried provision and then reimage, this time I clearly noticed a PXE/HTTP boot request but it ended up in the OS booting (it was quick and I didn't see the error).

I checked on install2005 and indeed there was a DHCP request:

elukey@install2005:~$ sudo journalctl -u isc-dhcp-server.service --since '1 hour ago' | grep -i 90:5A:08:9F:08:80
Sep 17 11:04:12 install2005 dhcpd[2529133]: DHCPDISCOVER from 90:5a:08:9f:08:80 via 10.192.58.1
Sep 17 11:04:12 install2005 dhcpd[2529133]: DHCPOFFER on 10.192.58.5 to 90:5a:08:9f:08:80 via 10.192.58.1
Sep 17 11:04:15 install2005 dhcpd[2529133]: DHCPREQUEST for 10.192.58.5 (208.80.153.70) from 90:5a:08:9f:08:80 via 10.192.58.1
Sep 17 11:04:15 install2005 dhcpd[2529133]: DHCPACK on 10.192.58.5 to 90:5a:08:9f:08:80 via 10.192.58.1
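The same check can be scripted when several attempts need to be compared. Below is a small sketch that extracts the DHCP message types logged for a given MAC and verifies that the DISCOVER/OFFER/REQUEST/ACK sequence is complete; dhcp_messages and handshake_completed are hypothetical names, and the regular expression assumes the dhcpd log format shown above.

```python
import re

# Captures the message type and the first MAC address that follows it.
DHCP_RE = re.compile(
    r"dhcpd\[\d+\]: (DHCP[A-Z]+) .*?([0-9a-f]{2}(?::[0-9a-f]{2}){5})",
    re.IGNORECASE,
)

EXPECTED = ["DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK"]


def dhcp_messages(journal_lines, mac):
    """Return the DHCP message types logged for `mac`, in log order."""
    mac = mac.lower()
    seen = []
    for line in journal_lines:
        match = DHCP_RE.search(line)
        if match and match.group(2).lower() == mac:
            seen.append(match.group(1).upper())
    return seen


def handshake_completed(messages):
    """True if EXPECTED appears as an in-order subsequence of `messages`."""
    remaining = iter(messages)
    return all(step in remaining for step in EXPECTED)


LOG = [
    "Sep 17 11:04:12 install2005 dhcpd[2529133]: DHCPDISCOVER from 90:5a:08:9f:08:80 via 10.192.58.1",
    "Sep 17 11:04:12 install2005 dhcpd[2529133]: DHCPOFFER on 10.192.58.5 to 90:5a:08:9f:08:80 via 10.192.58.1",
    "Sep 17 11:04:15 install2005 dhcpd[2529133]: DHCPREQUEST for 10.192.58.5 (208.80.153.70) from 90:5a:08:9f:08:80 via 10.192.58.1",
    "Sep 17 11:04:15 install2005 dhcpd[2529133]: DHCPACK on 10.192.58.5 to 90:5a:08:9f:08:80 via 10.192.58.1",
]

messages = dhcp_messages(LOG, "90:5A:08:9F:08:80")
print(messages, handshake_completed(messages))
```

A complete exchange only shows that the NIC obtained a lease; it says nothing about whether the subsequent HTTP download of the boot image succeeded.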

I retried again, "No media present" :(

I tried to powercycle it via mgmt/serial console but I don't see anything happening. This is really weird.

Change #1189118 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: check attributes after rebooting

https://gerrit.wikimedia.org/r/1189118

anything i can try onsite to help?

@Jhancock.wm not sure, I tried to upgrade the BIOS/BMC firmware + BMC reset but it didn't help. I have no idea why it worked perfectly when I tested it the first time..

Anyway, the only diff that I can see compared to other Supermicro models is that this is a X13 (so a higher end model), and it offers the following:

'HTTPBootPolicy': 'Apply to each LAN',
'HTTPSBootChecksHostname': 'Enabled',

In theory this should force the host to try PXE/HTTP boot on every NIC before giving up, without the need for us to set the right NIC with EFI/PXE. In other models where I saw this, there was no option to set PXE/EFI on a NIC via BIOS, meanwhile in this one I see it:

'AOC_A25G_b2SLAN1OPROM': 'EFI',

It shouldn't cause any problems but who knows.. Next step is probably to explore the BIOS manually and see if anything pops up :(
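One way to make this kind of model-to-model comparison less manual is to diff the BIOS attribute dictionaries returned by Redfish. The sketch below is illustrative only: attribute_diff is a hypothetical helper, the SRETEST2010 values are the ones quoted above, and OTHER_MODEL is an assumed stand-in for an older model that has the HTTP boot settings but no per-NIC OPROM attribute.

```python
def attribute_diff(this_host, other_host):
    """Return (attributes only on this host, attributes whose values differ)."""
    only_here = {
        key: this_host[key]
        for key in sorted(this_host.keys() - other_host.keys())
    }
    changed = {
        key: (other_host[key], this_host[key])
        for key in sorted(this_host.keys() & other_host.keys())
        if this_host[key] != other_host[key]
    }
    return only_here, changed


# Values quoted in this task for sretest2010.
SRETEST2010 = {
    "HTTPBootPolicy": "Apply to each LAN",
    "HTTPSBootChecksHostname": "Enabled",
    "AOC_A25G_b2SLAN1OPROM": "EFI",
}

# Assumption for illustration: an older model exposing the same HTTP boot
# settings but no per-NIC OPROM attribute. Not taken from a real host.
OTHER_MODEL = {
    "HTTPBootPolicy": "Apply to each LAN",
    "HTTPSBootChecksHostname": "Enabled",
}

only_here, changed = attribute_diff(SRETEST2010, OTHER_MODEL)
print("only on sretest2010:", only_here)
print("different values:", changed)
```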

@Jhancock.wm I cannot reboot the host, tried via console and BMC/Redfish API, it seems stuck in some weird limbo. If you have a moment could you please check it?

@elukey found the server off. i could ping the BMC and login to it. I've powered it back up for you.

The host doesn't PXE/HTTP boot for some reason, I reopened the provision task in T394357#11184292.

I spent some time trying to debug the woes with this host, but the behavior is very strange.

Things I tried:

  1. Reset the BIOS to optimized defaults
  2. Re-installed the same version of the BIOS, while discarding all settings except SMBIOS
  3. Issued a cold reset to the BMC

None of these actions changed the behavior of the BMC; notably, issuing a reset /system1/pwrmgtsvc1 or a stop /system1/pwrmgtsvc1 command does not seem to have any effect.

@Jhancock.wm me and Jesse are running out of ideas, if you have time could you please open the host and check if the bus between the BMC and the motherboard etc.. is working as expected? From our point of view it seems as if the BMC was disconnected from the host, we cannot powercycle it or see anything via mgmt console :(

@Jhancock.wm ping :)

@elukey whoops! I actually had to open it in a new browser to get it to work. but should be accessible now. If all else fails, try accessing the console in the webui and clear your cache if you need to. =/

Provisioned the host, retried a reimage, but it didn't boot into d-i. I checked on the DHCP server:

elukey@install2005:~$ sudo journalctl -u isc-dhcp-server.service --since '1 hour ago' | grep 10.192.58.5
Oct 22 13:04:59 install2005 dhcpd[1664598]: DHCPOFFER on 10.192.58.5 to 90:5a:08:9f:08:80 via 10.192.58.1
Oct 22 13:05:03 install2005 dhcpd[1664598]: DHCPREQUEST for 10.192.58.5 (208.80.153.70) from 90:5a:08:9f:08:80 via 10.192.58.1
Oct 22 13:05:03 install2005 dhcpd[1664598]: DHCPACK on 10.192.58.5 to 90:5a:08:9f:08:80 via 10.192.58.1

And I checked on spicerack shell:

>>> pprint(r.request("GET", f'{r.system_manager}/EthernetInterfaces/1').json())
{'@odata.etag': '"a0827973e5317a76aebc26546b840b94"',
 '@odata.id': '/redfish/v1/Systems/1/EthernetInterfaces/1',
 '@odata.type': '#EthernetInterface.v1_12_4.EthernetInterface',
 'Description': '( AOC-A25G-b2SM #1)',
 'FQDN': None,
 'HostName': None,
 'IPv4Addresses': [{'Address': None,
                    'AddressOrigin': None,
                    'Gateway': None,
                    'SubnetMask': None}],
 'Id': '1',
 'InterfaceEnabled': None,

 'LinkStatus': 'LinkUp', <====================

 'Links': {'NetworkDeviceFunctions': [{'@odata.id': '/redfish/v1/Chassis/1/NetworkAdapters/1/NetworkDeviceFunctions/1'}]},
 
 'MACAddress': '90:5a:08:9f:08:80', <====================

 'Name': 'AOC LAN 1',
 'NameServers': [],
 'SpeedMbps': 10000,
 'Status': {'Health': 'OK', 'State': 'Enabled'},
 'VLANs': {'@odata.id': '/redfish/v1/Systems/1/EthernetInterfaces/1/VLANs'}}

So the MAC address doing the DHCP request seems to be the right one, and the interface reports LinkUp.
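The manual comparison above can be expressed as a small check on the EthernetInterface document. This is only a sketch: check_boot_nic is a hypothetical helper, and INTERFACE is an excerpt of the Redfish response pasted above.

```python
def check_boot_nic(interface, dhcp_mac):
    """Compare a Redfish EthernetInterface document with the MAC seen by DHCP."""
    status = interface.get("Status") or {}
    return {
        "link_up": interface.get("LinkStatus") == "LinkUp",
        "mac_matches": (interface.get("MACAddress") or "").lower() == dhcp_mac.lower(),
        "healthy": status.get("Health") == "OK" and status.get("State") == "Enabled",
    }


# Excerpt of the response shown above.
INTERFACE = {
    "Id": "1",
    "LinkStatus": "LinkUp",
    "MACAddress": "90:5a:08:9f:08:80",
    "Name": "AOC LAN 1",
    "SpeedMbps": 10000,
    "Status": {"Health": "OK", "State": "Enabled"},
}

result = check_boot_nic(INTERFACE, "90:5A:08:9F:08:80")
print(result)
```

All three checks pass for this host, which is consistent with the conclusion that the NIC itself is not the obvious culprit.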

I think I found a possible lead - HTTPSBootChecksHostname may be the problem, since we use a bare IP when doing the HTTP boot. I am not able to set the option to "Disabled" via spicerack shell though..

>>> pprint(r.request("PATCH", "/redfish/v1/Systems/1/Bios", json={"Attributes": { "HTTPSBootChecksHostname": "Enabled" }}).json())
{'Success': {'code': 'Base.1.10.Success',
             'message': 'Successfully Completed Request.'}}


>>> pprint(r.request("PATCH", "/redfish/v1/Systems/1/Bios", json={"Attributes": { "HTTPSBootChecksHostname": "Disabled" }}).json())
PATCH https://10.193.2.115/redfish/v1/Systems/1/Bios returned HTTP 400
Response payload: {'error': {'code': 'Base.1.10.GeneralError', 'message': 'A general error has occurred. See ExtendedInfo for more information.', '@Message.ExtendedInfo': [{'MessageId': 'Base.1.10.PropertyValueTypeError', 'Severity': 'Warning', 'Resolution': 'Correct the value for the property in the request body and resubmit the request if the operation failed.', 'Message': 'The value \'"Disabled"\' for the property HTTPSBootChecksHostname is of a different type than the property can accept.', 'MessageArgs': ['"Disabled"', 'HTTPSBootChecksHostname'], 'RelatedProperties': ['HTTPSBootChecksHostname']}]}}

I have the feeling that this option cannot be disabled via redfish..
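A PropertyValueTypeError like the one above can also mean that the string is not one of the values the BIOS accepts for that attribute. In the DMTF Redfish schema the accepted values of an enumeration attribute are listed in the BIOS attribute registry, so one way to check is to look the attribute up there. A minimal sketch, assuming the BMC exposes a standard AttributeRegistry document (not verified on this host); `matching_values` is a hypothetical helper and `registry` is the already-fetched registry JSON:

```python
def matching_values(registry, attribute_name, prefix=""):
    """Return the ValueName strings of `attribute_name` that start with `prefix`.

    `registry` is expected to follow the Redfish AttributeRegistry layout:
    {"RegistryEntries": {"Attributes": [{"AttributeName": ..., "Value": [...]}]}}
    """
    for entry in registry.get("RegistryEntries", {}).get("Attributes", []):
        if entry.get("AttributeName") == attribute_name:
            names = [v.get("ValueName", "") for v in entry.get("Value", [])]
            return [name for name in names if name.startswith(prefix)]
    raise KeyError(attribute_name)
```

Calling `matching_values(registry, "HTTPSBootChecksHostname", "Disabled")` would then show the exact string to send in the PATCH, if the registry lists one.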

@Jhancock.wm Hi! The host seems stuck again after trying reset /system1/pwrmgtsvc1, it feels like there is something wrong with the host. What do you think?

@elukey found the server up. maybe it takes 5 million years to boot? i remember some of the ms-be supermicro servers had the same issue before with a slightly different model. believe it was the SSG-620P-E1CR24H - ConfigJ. they seriously took 10-15 minutes to reboot.

Something is definitely wrong, since I waited for an hour and -> reset /system1/pwrmgtsvc1 still hung (not sure if the host rebooted in the meantime). Powercycling via the Redfish API seems to work though, no idea what's happening. I tried to watch the console via start /system1/sol1 when the powercycle via Redfish happened, but I couldn't see anything. So the host is kind of doing something, but it is not reflected to the outside world :(

i ran reset /system1/pwrmgtsvc1 with a physical console up to observe. it didn't reboot for me.

i powered it down manually and checked the insides again. reseating as many things as i could reach. it didn't feel like anything was loose but who knows. (It was surprisingly easy to accidentally disconnect the power for the NVME risers.)

i've physically power cycled it again. give it a shot and ping me if you need me to come poke it with a stick again.

It seems to work now! I powercycled it and I now see the console displaying some data, including Trixie booting. I have no idea what went wrong before but now it seems resolved :) Going to leave the task open a little bit more to keep experimenting, I'll close when we are good. Thanks a lot!

I spent a lot of time this morning trying to get a reimage to work, but there seems to be something pathologically wrong with this host. I have the feeling that I made it worse when I upgraded the BIOS and IDRAC firmwares in T394357#11213539, but I am not 100% sure.

The current problem is that reimage always leads to the host booting into the OS. Things that I tried:

  1. Set the HTTPSBootChecksHostname BIOS option

I figured out the problem with the HTTPSBootChecksHostname BIOS option: the value to disable it is Disabled (WARNING: Security Risk!!) and not simply Disabled. However, setting it didn't make any difference to reimage. The option seemed related since, due to a Supermicro feature/bug, we have to set an IP address instead of a DNS name among the HTTP boot options that we pass via DHCP, and I suspected that the TLS checks made the boot from the NIC fail.

  2. Try to access the BIOS panel manually via console

This turned out to be really difficult, because I don't see anything via /system1/sol1 when the system boots. I only see the OS booting and its login prompt, but nothing about the host's own boot sequence or the prompts to enter the BIOS etc. I then tried hitting DEL via the WebUI's HTML5 console, and I was able to enter the BIOS and explore the options. I changed some of the Boot options settings and retried the reimage, but I ended up with a "No media failure" when trying to PXE/HTTP boot. That is progress, but not what I expected: these models should try all the available NICs, not just one (at least this is what the X13's Redfish manual says). The "No media failure" error is also not consistent, namely it doesn't happen every time I try, which makes me think that the boot is not always forced to the network, since the network attempt seems to happen randomly.

It worked like this for ML hosts from Supermicro that were UEFI only, so not sure why this one doesn't.

  3. Tried to force the Boot from Network

I then tried to reimage (so DHCP settings would be generated and set up), then I used the HTML5 console from the WebUI to hit F11 and enter the menu to select the right NIC. I chose the one with MAC 90:5a:08:9f:08:80 because, as written in T394357#11298181, this is the one with LinkUp. The operation went well and the Debian Installer booted correctly.

I don't recall the previous version of the BIOS/IDRAC firmwares and I don't see older ones available on the Supermicro website, so I am wondering if there is something wrong with the host or with the firmwares (or both).

I may have found something interesting:

>>> pprint(r.request("GET", "/redfish/v1/Systems/1/Oem/Supermicro/FixedBootOrder").json())
{'@odata.etag': '"c9fb8f99ef20e503cbb37b143df10d09"',
 '@odata.id': '/redfish/v1/Systems/1/Oem/Supermicro/FixedBootOrder',
 '@odata.type': '#SmcFixedBootOrder.v1_0_0.SmcFixedBootOrder',
 'BootModeSelected': 'UEFI',
 'FixedBootOrder': ['UEFI Hard Disk:debian (HDD,Port:FFFF)',
                    'UEFI Network:(B111/D0/F0) UEFI HTTP IPv4 Broadcom Network '
                    'Device - 90:5A:08:9F:08:80(MAC:905A089F0880)',
                    'UEFI USB Hard Disk',
                    'UEFI USB CD/DVD',
                    'UEFI USB Key',
                    'UEFI USB Floppy',
                    'UEFI USB Lan',
                    'UEFI CD/DVD',
                    'UEFI AP:UEFI: Built-in EFI Shell'],
 'FixedBootOrderDisabledItem': ['Disabled'],
 'Id': '1',
 'Name': 'Fixed Boot Order',
 'UEFIAP': ['UEFI: Built-in EFI Shell'],
 'UEFIAPDisabledItem': ['Disabled'],
 'UEFIHardDisk': ['debian (HDD,Port:FFFF)', 'debian (HDD,Port:1)'],
 'UEFIHardDiskDisabledItem': ['Disabled'],
 'UEFINetwork': ['(B111/D0/F0) UEFI HTTP IPv4 Broadcom Network Device - '
                 '90:5A:08:9F:08:80(MAC:905A089F0880)',
                 '(B199/D0/F1) UEFI HTTP IPv4 Intel(R) I350 Gigabit Network '
                 'Connection(MAC:905A08129D11)',
                 '(B199/D0/F0) UEFI HTTP IPv4 Intel(R) I350 Gigabit Network '
                 'Connection(MAC:905A08129D10)',
                 '(B111/D0/F1) UEFI HTTP IPv4 Broadcom Network Device - '
                 '90:5A:08:9F:08:81(MAC:905A089F0881)'],
 'UEFINetworkDisabledItem': ['Disabled']}

This specific API is available for X13+ models, and until now this meant only high end ML hosts (all the other Supermicros were X12, which don't have this option). On ml-serve1012 (one of the high end ML hosts) I didn't have to modify this API, since there was a special instruction in the BIOS that forced an HTTP Boot attempt on all NIC interfaces; the same setting is also present on sretest2010: 'HTTPBootPolicy': 'Apply to each LAN'. For some reason it doesn't seem to work for sretest2010, because I had to explicitly set (via BIOS) the first device of the UEFINetwork section to the right NIC/MAC combination (the one with the link up, 90:5A:08:9F:08:80). After setting it, I was able to see the Debian Installer booting.

IIUC FixedBootOrder lists the first item of UEFINetwork, and apparently only uses it to attempt a HTTP UEFI boot. It may be related to some peculiarities of the motherboard, but if so it means more special cases for us to handle while provisioning :(

I'll keep doing more tests..

Really interesting: I retried a reimage today and got a "no media present" error when trying to PXE/HTTP boot. Then I checked the boot order, and the wrong UEFI network card is listed first:

'FixedBootOrder': ['UEFI Hard Disk:debian (HDD,Port:FFFF)',
                   'UEFI Network:(B199/D0/F1) UEFI HTTP IPv4 Intel(R) I350 '
                   'Gigabit Network Connection(MAC:905A08129D11)',
                   'UEFI USB Hard Disk',
                   'UEFI USB CD/DVD',
                   'UEFI USB Key',
                   'UEFI USB Floppy',
                   'UEFI USB Lan',
                   'UEFI CD/DVD',
                   'UEFI AP:UEFI: Built-in EFI Shell'],

So it seems as if the FixedBootOrder is not preserved :(
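Since the order appears to change on its own, a check before each reimage could flag the drift. The following is a rough sketch; `network_boot_mac` and `boot_order_drifted` are hypothetical helpers, and the parsing assumes the `(MAC:...)` suffix seen in the FixedBootOrder entries above.

```python
import re

# FixedBootOrder entries end with e.g. "(MAC:905A089F0880)".
MAC_RE = re.compile(r"\(MAC:([0-9A-Fa-f]{12})\)")


def normalize_mac(mac):
    """Lower-case a MAC and drop separators: '90:5A:08:9F:08:80' -> '905a089f0880'."""
    return re.sub(r"[^0-9a-f]", "", mac.lower())


def network_boot_mac(fixed_boot_order):
    """Return the MAC of the 'UEFI Network:' entry (12 hex digits), or None."""
    for item in fixed_boot_order:
        if item.startswith("UEFI Network:"):
            match = MAC_RE.search(item)
            return match.group(1).lower() if match else None
    return None


def boot_order_drifted(fixed_boot_order, expected_mac):
    """True if the network entry in FixedBootOrder is not the expected NIC."""
    return network_boot_mac(fixed_boot_order) != normalize_mac(expected_mac)
```

Here `fixed_boot_order` would be the 'FixedBootOrder' list from the Oem/Supermicro/FixedBootOrder GET, and `expected_mac` the MAC of the NIC with LinkUp.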

In theory the HttpBootPolicy should hit the right HTTP boot after some tries without stopping at the first failure:

['(B199/D0/F0) UEFI HTTP IPv4 Intel(R) I350 Gigabit Network '
 'Connection(MAC:905A08129D10)',
 '(B111/D0/F1) UEFI HTTP IPv4 Broadcom Network Device - '
 '90:5A:08:9F:08:81(MAC:905A089F0881)',
 '(B111/D0/F0) UEFI HTTP IPv4 Broadcom Network Device - '
 '90:5A:08:9F:08:80(MAC:905A089F0880)',                                                    <<<========== NIC with LinkUp
 '(B199/D0/F1) UEFI HTTP IPv4 Intel(R) I350 Gigabit Network '
 'Connection(MAC:905A08129D11)',
 '(B199/D0/F0) UEFI PXE IPv4 Intel(R) I350 Gigabit Network '
 'Connection(MAC:905A08129D10)',
 '(B111/D0/F0) UEFI PXE IPv4 Broadcom Network Device - '
 '90:5A:08:9F:08:80(MAC:905A089F0880)',
 '(B111/D0/F1) UEFI PXE IPv4 Broadcom Network Device - '
 '90:5A:08:9F:08:81(MAC:905A089F0881)',
 '(B199/D0/F1) UEFI PXE IPv4 Intel(R) I350 Gigabit Network '
 'Connection(MAC:905A08129D11)',
 '(B111/D0/F0) UEFI PXE IPv6 Broadcom Network Device - '
 '90:5A:08:9F:08:80(MAC:905A089F0880)',
 '(B111/D0/F1) UEFI PXE IPv6 Broadcom Network Device - '
 '90:5A:08:9F:08:81(MAC:905A089F0881)',
 '(B199/D0/F0) UEFI PXE IPv6 Intel(R) I350 Gigabit Network '
 'Connection(MAC:905A08129D10)',
 '(B199/D0/F1) UEFI PXE IPv6 Intel(R) I350 Gigabit Network '
 'Connection(MAC:905A08129D11)']

I checked on ml-serve1012 and there the NIC with LinkUp is not the first either, so something does not seem to be working on sretest2010.

I've set up the UEFINetwork list with 90:5A:08:9F:08:80 UEFI HTTP first, and it got reflected to FixedBootOrder. Ran a chassis reset, waited for the os to boot, and then restarted reimage. No media presence issue. Then checked the list again, nothing changed, restarted reimage again, HTTP boot started without any issue.
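The reordering described above could be expressed as a small helper that moves the wanted NIC to the front of UEFINetwork. This is a sketch only: `prioritize_nic` is a hypothetical name, and the assumption that the FixedBootOrder endpoint accepts a PATCH body of the form `{"UEFINetwork": [...]}` is not confirmed by the output in this task.

```python
def prioritize_nic(uefi_network, mac, protocol="UEFI HTTP IPv4"):
    """Return a payload dict with the entry for `mac`/`protocol` moved first.

    `uefi_network` is the 'UEFINetwork' list as returned by the
    Oem/Supermicro/FixedBootOrder GET; the other entries keep their order.
    """
    wanted = "(MAC:%s)" % mac.replace(":", "").upper()
    first = [e for e in uefi_network if wanted in e and protocol in e]
    if not first:
        raise ValueError("no %s entry found for MAC %s" % (protocol, mac))
    rest = [e for e in uefi_network if e not in first]
    # Assumed payload shape, to be sent with PATCH to the FixedBootOrder URI.
    return {"UEFINetwork": first + rest}
```

After applying such a change, re-reading FixedBootOrder following a reset would show whether the setting survived.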

I am reasonably sure that something horrible is happening between the BMC/BIOS firmware and possibly the NIC firmwares, so it is worth opening a ticket with Supermicro.

I opened a Supermicro ticket to explain the problem, we'll see if they have suggestions.


@elukey did we get anything back from SM on the ticket you opened for this one?

Update: I tested the new SM firmwares for BIOS and BMC, but the latter seems to lead to an inconsistent state: the update doesn't start because of a weird issue, and it is reproducible. I already contacted SM with this info; we'll see if they can suggest how to proceed. The BIOS firmware has been applied, but I think that the BMC one is responsible for the issue outlined above :(

Hey @elukey - do you have the Supermicro case number for this one? Thanks, Willy

@wiki_willy sorry for the lag, didn't see your question! 00062974

I keep getting this error when reimaging:

Server IP address is ...208.80.153.70
NBP filename is http://208.80.154.10/efiboot/snponly.efi
NBP filesize is 0 Bytes
PXE-E23: Client received TFTP error from server.

That is similar to what we got in the past: usually it means that HttpBoot options are not set up correctly. I checked and they are, so this is very weird. Moreover, one time I was able to kick off a reimage that worked, and I am not sure what changed.

Thanks @elukey, I went ahead and sent it over to Ken from Supermicro, so that he can try to push this along a bit quicker.
