
Q2:rack/setup/install ganeti105[34].eqiad.wmnet
Closed, ResolvedPublic

Description

This task will track the racking, setup, and OS installation of ganeti105[34].eqiad.wmnet

Hostname / Racking / Installation Details

Hostnames: ganeti1053.eqiad.wmnet, ganeti1054.eqiad.wmnet
Racking Proposal: Both in row A
Networking Setup: same VLAN/IP setup as existing Ganeti servers
OS Distro: Bookworm
Sub-team Technical Contact: Moritz
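
The task title uses the shorthand ganeti105[34] for the two hostnames listed above. As a minimal sketch of how that bracket notation expands, assuming a single bracketed set of one-character alternatives (the helper name `expand_hosts` is hypothetical and not part of any WMF tooling):

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one [...] character set, e.g. 'ganeti105[34]' into two names."""
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        # No bracket set present: the pattern is already a single hostname.
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    # Each character inside the brackets yields one hostname.
    return [f"{head}{char}{tail}" for char in match.group(1)]


print(expand_hosts("ganeti105[34].eqiad.wmnet"))
```

This only handles the simple form used in this task's title; ranges such as [3-4] are deliberately not covered.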

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ganeti1053
  • Receive in system on procurement task T376164 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti1054
  • Receive in system on procurement task T376164 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Run the sre.hardware.upgrade-firmware cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
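
The checklist above is strictly ordered (for instance, the DNS and provision cookbooks must follow the Netbox script). A small sketch of tracking that order per host might look like the following; the step wording is abbreviated from the checklist, and the `next_step` helper and the progress numbers are purely illustrative assumptions:

```python
from typing import Optional

# Steps in the order given by the per-host checklist (wording abbreviated).
STEPS = [
    "Receive in system on procurement task and in Coupa",
    "Rack system and update Netbox",
    "Run the Netbox network-attributes provisioning script",
    "Run the sre.dns.netbox cookbook",
    "Run the sre.hosts.provision cookbook",
    "Run the sre.hardware.upgrade-firmware cookbook",
    "Update the operations/puppet repo (preseed.yaml, site.pp)",
    "Run the sre.hosts.reimage cookbook",
]


def next_step(done: int) -> Optional[str]:
    """Return the next step, given how many steps are already complete."""
    return STEPS[done] if 0 <= done < len(STEPS) else None


# Hypothetical progress values, only to show the helper in use.
progress = {"ganeti1053": 3, "ganeti1054": 8}
for host, done in progress.items():
    print(host, "->", next_step(done) or "all steps complete")
```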

Event Timeline

RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
RobH unsubscribed.

Moritz,

Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. This is due to the majority of DC Ops not having root/merge puppet rights.

Please update the site.pp file with the insetup role for your team (detailed on https://wikitech.wikimedia.org/wiki/SRE/Dc-operations) and add the new servers to preseed.yaml for partition info.

If possible, please reference this task number in your patch set, so it is clear when it is complete. Once complete, just un-assign yourself (leaving no assignee) from this task; once the hardware arrives, on-site staff will claim this task for racking and setup. Please don't re-subscribe me to this task unless there is a direct question for me.

Thank you!

Change #1101031 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti1053/1054 to site.pp

https://gerrit.wikimedia.org/r/1101031

Change #1101031 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti1053/1054 to site.pp

https://gerrit.wikimedia.org/r/1101031

The servers are failing provision, I will take another look at it later.

@VRiley-WMF can you verify the SNs for these two servers? They should end with either 391 or 392.

@elukey these servers are not provisioning at all. For some reason I can ping the mgmt interface of ganeti1053, but I can't log into it. I'm not finding any devices with a duplicate mgmt IP either. I noticed the SNs didn't match what was on the shipping page; I tried both SNs but couldn't get it to work. I'm 99% sure I didn't just typo the cookbook command, but at this point I have no idea.

jhancock@cumin2002:~$ sudo secure-cookbook sre.hosts.provision ganeti1053
Exception raised while initializing the Cookbook sre.hosts.provision:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 205, in run
  runner = self.instance.get_runner(args)
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 107, in get_runner
  raise RuntimeError(
RuntimeError: Virtualization not enabled but this host will need it.
jhancock@cumin2002:~$ sudo secure-cookbook sre.hosts.provision ganeti1054. 
Exception raised while initializing the Cookbook sre.hosts.provision:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 205, in run
  runner = self.instance.get_runner(args)
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 107, in get_runner
  raise RuntimeError(
RuntimeError: Virtualization not enabled but this host will need it.
jhancock@cumin2002:~$ sudo secure-cookbook sre.hosts.provision ganeti1053 --no-user --no-dhcp
Exception raised while initializing the Cookbook sre.hosts.provision:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 205, in run
  runner = self.instance.get_runner(args)
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 107, in get_runner
  raise RuntimeError(
RuntimeError: Virtualization not enabled but this host will need it.

@Jhancock.wm I checked the serial number on 1053; it is the serial number ending with 392. Try re-running the cookbook with the --enable-virtualization flag.
Since the server already has the user and IP address set up, run the full command:

sudo secure-cookbook sre.hosts.provision ganeti1053 --no-user --no-dhcp --enable-virtualization

if this doesn't work, you can paste the output here and send it to @elukey . Thank you
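
For readers unfamiliar with the failure, the "Virtualization not enabled but this host will need it." error is raised while the cookbook is still initializing, before any host is touched. A minimal sketch of that kind of pre-flight guard follows. It is not the real provision.py code: the function name and the assumption that Ganeti hosts are recognised by their hostname prefix are both hypothetical.

```python
def check_virtualization(hostname: str, enable_virtualization: bool) -> None:
    """Fail early if a host that will run VMs is provisioned without the flag."""
    # Assumption: Ganeti nodes are identified by hostname prefix. The actual
    # cookbook's detection logic is not shown in this task.
    needs_virtualization = hostname.startswith("ganeti")
    if needs_virtualization and not enable_virtualization:
        raise RuntimeError("Virtualization not enabled but this host will need it.")


# With the flag set, the guard passes silently.
check_virtualization("ganeti1053", enable_virtualization=True)

# Without it, the guard raises before any provisioning work starts.
try:
    check_virtualization("ganeti1053", enable_virtualization=False)
except RuntimeError as error:
    print(error)
```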

That's a new flag for me, thank you. It did work, and the cookbook at least started this time, but it then crashed at a similar spot to two SM servers in codfw:

BIOS: RSC_DR_6G4PCI_E4_0X16OPROM is set to EFI, while we want Legacy
BIOS: RSC_D_6G4PCI_E4_0X16OPROM is set to EFI, while we want Legacy
Found differences between our desired status and the current one, applying new BIOS settings (a reboot will be performed).
Applying BIOS settings...
PATCH https://10.65.1.77/redfish/v1/Systems/1/Bios returned HTTP 400
Response payload: {'error': {'code': 'Base.v1_10_3.GeneralError', 'message': 'A general error has occurred. See ExtendedInfo for more information.', 
'@Message.ExtendedInfo': [{'MessageId': 'Base.1.10.PropertyValueTypeError', 'Severity': 'Warning', 'Resolution': 'Correct the value for the property in the request body 
and resubmit the request if the operation failed.', 'Message': "The value 'null' for the property P1_AIOMAOC_AG_i2LAN1OPROM is of a different type than the 
property can accept.", 'MessageArgs': ['null', 'P1_AIOMAOC_AG_i2LAN1OPROM'], 'RelatedProperties': ['P1_AIOMAOC_AG_i2LAN1OPROM']}]}}
Exception raised while executing cookbook sre.hosts.provision:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/redfish.py", line 372, in request
  return self._api_client.request(method, uri, **kwargs)
File "/usr/lib/python3/dist-packages/spicerack/apiclient.py", line 101, in request
  raise APIClientResponseError(response)
spicerack.apiclient.APIClientResponseError: PATCH https://10.65.1.77/redfish/v1/Systems/1/Bios returned HTTP 400

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 404, in _config_host
  self._patch_bios_settings()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 353, in _patch_bios_settings
  self.redfish.request(
File "/usr/lib/python3/dist-packages/spicerack/redfish.py", line 378, in request
  raise RedfishError(str(e)) from e
spicerack.redfish.RedfishError: PATCH https://10.65.1.77/redfish/v1/Systems/1/Bios returned HTTP 400

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 265, in _run
  raw_ret = runner.run()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 271, in run
  self._config_host()
File "/srv/deployment/spicerack/cookbooks/sre/hosts/provision.py", line 420, in _config_host
  raise RuntimeError(
RuntimeError: Error while configuring BIOS or mgmt interface: PATCH https://10.65.1.77/redfish/v1/Systems/1/Bios returned HTTP 400
Released lock for key /spicerack/locks/cookbooks/sre.hosts.provision:ganeti1053: {'concurrency': 1, 'created': '2025-02-04 15:50:18.996058', 'owner': 'jhancock@cumin2002 [1530399]', 'ttl': 1800}
END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
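
The HTTP 400 response above already names the offending BIOS attribute in its extended info. As a rough sketch, the relevant fields can be pulled out of such a payload as shown below. The payload literal is abbreviated from the log above, and `rejected_properties` is a hypothetical helper, not a spicerack function.

```python
# Error payload abbreviated from the cookbook output above.
payload = {
    "error": {
        "code": "Base.v1_10_3.GeneralError",
        "message": "A general error has occurred. See ExtendedInfo for more information.",
        "@Message.ExtendedInfo": [
            {
                "MessageId": "Base.1.10.PropertyValueTypeError",
                "Severity": "Warning",
                "MessageArgs": ["null", "P1_AIOMAOC_AG_i2LAN1OPROM"],
                "RelatedProperties": ["P1_AIOMAOC_AG_i2LAN1OPROM"],
            }
        ],
    }
}


def rejected_properties(payload: dict) -> list[tuple[str, str]]:
    """Return (MessageId, property) pairs named in the extended error info."""
    pairs = []
    for info in payload.get("error", {}).get("@Message.ExtendedInfo", []):
        for prop in info.get("RelatedProperties", []):
            pairs.append((info.get("MessageId", ""), prop))
    return pairs


print(rejected_properties(payload))
```

Here the only rejected attribute is P1_AIOMAOC_AG_i2LAN1OPROM, which the BMC reports as having been sent the value 'null'.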

@Jhancock.wm thank you for working on this. Yes, for all Ganeti nodes we enable virtualization because these nodes will be running VMs. Back to your error:

spicerack.redfish.RedfishError: PATCH https://10.65.1.77/redfish/v1/Systems/1/Bios returned HTTP 400

A request was sent to the server's Redfish API, but the server rejected it as a bad request (HTTP 400). So what is that bad request? Maybe check whether the --enable-virtualization flag that works with the Dell nodes works the same way with the Supermicro nodes. I don't know, so I will leave this to @Volans and @elukey.

Thank you.

Very interesting - running the provision cookbook with --uefi worked fine, then I retried to restore legacy/bios (removing --uefi from the cmd line) and again I got the same error.

This is very weird - I went into the BIOS, selected BIOS Mode UEFI, then reselected Legacy, saved and reset. I re-ran the cookbook and:

elukey@cumin1002:~$ sudo cookbook sre.hosts.provision --enable-virtualization ganeti1053 --no-user --no-switch --no-dhcp
Management Password: 
Using the BMC's MAC address for the DHCP config.
Testing Redfish API connection to cumin2002 (10.193.0.139)
==> Are you sure to proceed to apply BIOS/iDRAC settings for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Acquired lock for key /spicerack/locks/cookbooks/sre.hosts.provision:ganeti1053: {'concurrency': 1, 'created': '2025-02-04 17:42:30.570408', 'owner': 'elukey@cumin1002 [1504576]', 'ttl': 1800}
START - Cookbook sre.hosts.provision for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
Testing Redfish API connection to ganeti1053 (10.65.1.77)
Retrieving the BMC's firmware version.
BMC firmware release date: 2024-08-27T00:00:00Z
Retrieving BIOS settings (first round).
Retrieving updated BIOS settings...
Setting up BootMode and basic BIOS settings.
No BIOS settings applied since the config is already good.
Retrieving BIOS settings (second round).
Retrieving updated BIOS settings...
BIOS: IPv4HTTPSupport is set to Enabled, while we want Disabled
BIOS: IPv4PXESupport is set to Disabled, while we want Enabled
Found differences between our desired status and the current one, applying new BIOS settings (a reboot will be performed).
Applying BIOS settings...
Applying Network changes to the BMC.
Rebooting the host with policy ChassisResetPolicy.FORCE_RESTART and waiting for 5 minutes
Resetting chassis power status for ganeti1053 to ForceRestart
Testing Redfish API connection to ganeti1053 (10.65.1.77)
Skipping root user password change
Running IPMI command: ipmitool -I lanplus -H ganeti1053.mgmt.eqiad.wmnet -U root -E chassis power status
Released lock for key /spicerack/locks/cookbooks/sre.hosts.provision:ganeti1053: {'concurrency': 1, 'created': '2025-02-04 17:42:30.570408', 'owner': 'elukey@cumin1002 [1504576]', 'ttl': 1800}
END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti1053.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART

I double checked via Redfish and P1_AIOMAOC_AG_i2LAN1OPROM is set to PXE (as expected).
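
The "is set to X, while we want Y" lines in the output above come from comparing the current BIOS settings against the desired ones and applying only the differences. A simplified sketch of that comparison is below. The function is hypothetical, and the BootMode value is an assumed placeholder; only the two IPv4 settings are taken from the log.

```python
def bios_diff(current: dict, desired: dict) -> dict:
    """Return only the desired settings whose current value differs."""
    changes = {}
    for name, wanted in desired.items():
        actual = current.get(name)
        if actual != wanted:
            # Same message shape as the cookbook output in this task.
            print(f"BIOS: {name} is set to {actual}, while we want {wanted}")
            changes[name] = wanted
    return changes


# IPv4 values are from the log above; the BootMode value is an assumption.
current = {"IPv4HTTPSupport": "Enabled", "IPv4PXESupport": "Disabled", "BootMode": "Legacy"}
desired = {"IPv4HTTPSupport": "Disabled", "IPv4PXESupport": "Enabled", "BootMode": "Legacy"}

changes = bios_diff(current, desired)
if not changes:
    print("No BIOS settings applied since the config is already good.")
```

Sending only the changed keys keeps the request small and avoids touching attributes that are already correct.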

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1054 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1054.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1054 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1054.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1054 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1054.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1054 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1054.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

@MoritzMuehlenhoff I have tried to finish up this reimage, but the preseed seems to be off regarding how many drives are supposed to be there. It seems to expect 4 drives and not 8 (these servers currently have 8 drives).

Hmmh, I'm not sure why these have eight drives? These are config C, so they should simply have 4x960G SSDs, right? Did Supermicro send a wrong build? We also can't really make sensible use of the extra capacity, so if these were sent in error, we could return them or otherwise keep them as spares?

@VRiley-WMF I checked the packing slip; it says that each server has 4 drives, but when I log in to the BIOS on 1053 I see only 2 drives installed. Can you please physically check the number of drives each server has?

Thank you

Hey @Papaul You're correct, I do apologize about that. The drive blanks (fillers for empty slots) made it seem like it was different. There are only 2 drives per server.

@VRiley-WMF no problem. Can you send an email to our rep, with the packing slip attached, to let him know that we were supposed to receive 4 disks per server, as the packing slip says, but the servers came with only 2 disks each, so we are missing 2 disks per server?

Thank you.

Hi @MoritzMuehlenhoff - the normal hardware specs for Config C is actually 2x 960gb hard drives (not 4x 960gb). I think maybe you were looking at the column for the number of DIMMs (which is 4x DIMMs for Config C) instead of the hard drives below:

https://docs.google.com/spreadsheets/d/1y3kh8JAYlb3VqJOazwq7y6EksIKlBw-GY5IaWXqQ_VA/edit?gid=1108734831#gid=1108734831

Since the recent codfw install T382898 was also a Config C with 2x drives across 6 servers, for next steps, can you open a procurement task to get the remaining hard drives ordered? (4x 960gb for eqiad and 12x 960gb for (cc @RobH)

https://phabricator.wikimedia.org/maniphest/task/edit/form/66/

Thanks,
Willy
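
The drive counts requested above follow from simple arithmetic: each Config C server shipped with 2 drives, while 4 per server are wanted. A short worked sketch (the constant and function names are illustrative only):

```python
DRIVES_WANTED = 4     # per server, as expected earlier in this thread
DRIVES_DELIVERED = 2  # per server, the standard Config C build


def drives_to_order(servers: int) -> int:
    """Number of additional drives needed for a batch of servers."""
    return servers * (DRIVES_WANTED - DRIVES_DELIVERED)


print(drives_to_order(2))  # the two eqiad hosts in this task
print(drives_to_order(6))  # the six servers of the codfw install T382898
```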

Ah, that explains. Thanks for the pointer.

Absolutely, created https://phabricator.wikimedia.org/T388816 for this.

Change #1131293 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch new ganeti servers to use EFI

https://gerrit.wikimedia.org/r/1131293

Change #1131293 merged by Muehlenhoff:

[operations/puppet@production] Switch new ganeti servers to use EFI

https://gerrit.wikimedia.org/r/1131293

A quick note: now that the new SSDs have been added (T390319), these servers are configured with a UEFI-compatible Partman config, so they will need to be re-provisioned to UEFI before they can be installed. For that reason, I've cleared the relevant tick box.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Currently attempting to image these servers; however, the host does not seem to be detected after the reboot. Will continue to investigate this issue.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1054 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1054.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

For the reimages to succeed, these need to be re-provisioned with EFI. We adapted the install procedure while debugging some installation issues for the codfw counterpart (T384838) of the expansion.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1054 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1054.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

While trying to image these servers, they seem to lock up during the reboot, with only a generic timeout given as the reason. I verified that the servers are set to boot in UEFI mode.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Note for myself:

BIOS - Found a NIC device: P1_AIOMAOC_AG_i2LAN1OPROM
Set PXE to the NIC P1_AIOMAOC_AG_i2LAN1OPROM
BIOS: P1_AIOMAOC_AG_i2LAN1OPROM is set to EFI, while we want PXE
Found differences between our desired status and the current one, applying new BIOS settings (a reboot will be performed).

This keeps happening, so this line of servers probably wants/sets EFI instead of PXE (that is, even if we set PXE, it stores EFI).
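
One way to confirm this kind of behaviour is to read the settings back after applying them and report anything that did not stick. The sketch below simulates this with a fake BMC. Both the `FakeBmc` class and the helper are hypothetical stand-ins, written only to mirror the PXE/EFI observation in the note above.

```python
class FakeBmc:
    """Stand-in for a BMC that stores EFI even when asked for PXE."""

    def __init__(self) -> None:
        self.settings = {"P1_AIOMAOC_AG_i2LAN1OPROM": "EFI"}

    def patch(self, changes: dict) -> None:
        for name, value in changes.items():
            # Simulate the observed behaviour: PXE is silently stored as EFI.
            self.settings[name] = "EFI" if value == "PXE" else value

    def read(self) -> dict:
        return dict(self.settings)


def settings_that_did_not_stick(bmc: FakeBmc, desired: dict) -> dict:
    """Apply the desired settings, read them back, and return mismatches."""
    bmc.patch(desired)
    read_back = bmc.read()
    return {
        name: (wanted, read_back.get(name))
        for name, wanted in desired.items()
        if read_back.get(name) != wanted
    }


print(settings_that_did_not_stick(FakeBmc(), {"P1_AIOMAOC_AG_i2LAN1OPROM": "PXE"}))
```

A non-empty result on every run would indicate that the firmware normalises the value, rather than that the PATCH itself failed.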

I tried to reimage after two runs of the provision cookbook with UEFI, and this is what I get:

  ┌────────────────────┤ [!!] Configure the network ├─────────────────────┐
  │                                                                       │    
  │ The network autoconfiguration was successful. However, no default     │    
  │ route was set: the system does not know how to communicate with hosts │    
  │ on the Internet. This will make it impossible to continue with the    │    
  │ installation unless you have the first image from a set of            │    
  │ installation media, a 'Netinst' image, or packages available on the   │    
  │ local network.                                                        │    
  │                                                                       │    
  │ If you are unsure, you should not continue without a default route:   │    
  │ contact your local network administrator about this problem.          │    
  │                                                                       │    
  │ Continue without a default route?                                     │    
  │                                                                       │    
  │     <Go Back>                                       <Yes>    <No>     │    
  │
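
To illustrate what "no default route was set" means on a Linux system, the sketch below checks an IPv4 routing table in the /proc/net/route format for a default route. It is only an illustration under stated assumptions: the sample tables use the 192.0.2.0/24 documentation range rather than real addresses from this task, and the installer's own check is not shown here.

```python
def has_default_route(route_table: str) -> bool:
    """True if a /proc/net/route dump contains an IPv4 default route."""
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        destination, flags, mask = fields[1], int(fields[3], 16), fields[7]
        # 0x1 is RTF_UP and 0x2 is RTF_GATEWAY; a usable default route has both.
        is_up_gateway = (flags & 0x3) == 0x3
        if destination == "00000000" and mask == "00000000" and is_up_gateway:
            return True
    return False


HEADER = "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
# Addresses are little-endian hex: 000200C0 is 192.0.2.0, 010200C0 is 192.0.2.1.
LOCAL_ONLY = HEADER + "eth0\t000200C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
WITH_DEFAULT = LOCAL_ONLY + "eth0\t00000000\t010200C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"

print(has_default_route(LOCAL_ONLY))
print(has_default_route(WITH_DEFAULT))
```

In the first sample the interface has an address and an on-link route but no gateway, which matches the situation the installer dialog describes.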

@VRiley-WMF Hi! Anything happening on the network on your side? We can't really understand how DHCP could fail in that way, unless some test was ongoing and/or a possible config change needed etc.. lemme know :)

@elukey I believe we were getting this error due to it not being in a 10 gig rack. I have updated the location of ganeti1053. Will attempt to try this process again

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm executed with errors:

  • ganeti1053 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti1053.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm completed:

  • ganeti1053 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507162234_vriley_643342_ganeti1053.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

ganeti1054 has moved into A4 U30

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm completed:

  • ganeti1054 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507171848_vriley_1152194_ganeti1054.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
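
Given how many reimage attempts this task accumulated, it can help to tally the per-host PASS/FAIL summary lines. A small sketch, assuming the bullet format used by the cookbook reports in this task (the regular expression and the `tally` helper are illustrative, and the sample lines are a short excerpt, not the full history):

```python
import re
from collections import Counter

# Matches summary lines such as "  • ganeti1053 (FAIL)" but not the step lines.
RESULT_LINE = re.compile(r"^\s*•\s+(\S+)\s+\((PASS|FAIL)\)\s*$")


def tally(lines: list[str]) -> Counter:
    """Count (host, result) pairs found in cookbook report lines."""
    counts: Counter = Counter()
    for line in lines:
        match = RESULT_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "  • ganeti1053 (FAIL)",
    "    • Forced PXE for next reboot",
    "  • ganeti1053 (PASS)",
    "  • ganeti1054 (PASS)",
]
for (host, result), count in sorted(tally(sample).items()):
    print(host, result, count)
```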
VRiley-WMF updated the task description.

These have been imaged