
Q3:rack/setup/install ganeti20[45-50]
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of ganeti20[45-50]

Hostname / Racking / Installation Details

Hostnames: ganeti20[45-50]
Racking Proposal: These replace four nodes in B and two nodes in A, so four need to go in B again and two in A.
Networking Setup: 10G - VLAN setup like existing Ganeti nodes
OS Distro: Bookworm
Sub-team Technical Contact: @MoritzMuehlenhoff

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

ganeti2045:
  • Receive in system on procurement task T382898 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti2046:
  • Receive in system on procurement task T382898 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti2047:
  • Receive in system on procurement task T382898 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti2048:
  • Receive in system on procurement task T382898 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti2049:
  • Receive in system on procurement task T382898 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
ganeti2050:
  • Receive in system on procurement task T382898 & in Coupa
  • Rack system with proposed racking plan (see above) & update Netbox (include all system info plus location, state of planned)
  • Run the Provision a server's network attributes Netbox script - Note that you must run the DNS and Provision cookbook after completing this step
  • Immediately run the sre.dns.netbox cookbook
  • Immediately run the sre.hosts.provision cookbook
  • Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with roles defined by service group: https://wikitech.wikimedia.org/wiki/SRE/Dc-operations
  • Run the sre.hosts.reimage cookbook
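The per-host checklists above are identical apart from the hostname, so they could be generated rather than pasted. A minimal sketch, assuming the `ganeti20[45-50]` range notation means "45 through 50 inclusive"; the function names and the abridged step wording are illustrative and not part of any Wikimedia tooling:

```python
import re

# Checklist steps, abridged from the task template above.
STEPS = [
    "Receive in system on procurement task T382898 & in Coupa",
    "Rack system with proposed racking plan & update Netbox",
    "Run the Provision a server's network attributes Netbox script",
    "Immediately run the sre.dns.netbox cookbook",
    "Immediately run the sre.hosts.provision cookbook",
    "Update the operations/puppet repo (preseed.yaml, site.pp)",
    "Run the sre.hosts.reimage cookbook",
]


def expand_hosts(pattern: str) -> list[str]:
    """Expand 'prefix[start-end]suffix' into a list of hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range in the pattern: treat it as a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


def render_checklist(pattern: str) -> str:
    """Render one checklist per host in the same layout as the task."""
    lines = []
    for host in expand_hosts(pattern):
        lines.append(f"{host}:")
        lines.extend(f"  • {step}" for step in STEPS)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_checklist("ganeti20[45-50]"))
```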

Event Timeline


@Papaul i found a weird little thing. I racked ganeti2049 in B5, U40. There are three other servers in the same set of 4 on the switch. two of them are set to SFP+ (10GE). The other server, centrallog2002, is set to 10GBASE-T (10GE). Because of this I haven't been able to provision ganeti2049 on netbox.
Is it possible that netbox considers them distinct?
Is this something I can update on netbox without affecting the network?
I didn't want to change anything without checking first.

@Jhancock.wm thanks for checking. I see in netbox that ganeti2049 is racked in B4, U41 and not U40 like you mentioned, so I am guessing that you want to use port 40 for it on the switch, but interface xe-0/0/40 doesn't exist in netbox.
What you need to do in netbox is add interface xe-0/0/40 on lsw1-5-codfw with type SFP+ (10GE). After you add the interface, you can run the provision script in netbox.

Let me know if you have any questions on how to add an interface in netbox.

@Papaul i found a weird little thing. I racked ganeti2049 in B5, U40. There are three other servers in the same set of 4 on the switch. two of them are set to SFP+ (10GE). The other server, centrallog2002, is set to 10GBASE-T (10GE). Because of this I haven't been able to provision ganeti2049 on netbox.
Is it possible that netbox considers them distinct?

Yeah the port-block validation for Juniper looks at the "type". Which gets set based on the speed we use.

Automation should only add them as "SFP+ (10GE)", but that particular one was incorrectly added manually as "10GBASE-T". Even though they are both 10G types the validator expects them to be the exact same or it will fail.

Is this something I can update on netbox without affecting the network?

Yeah it's no problem to change. I changed xe-0/0/43 to "SFP+ 10G" now so you should be ok to proceed.
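The check described above (every interface in a port block must have exactly the same type) can be sketched roughly as follows. This is not the actual Netbox validator: the block size of four, the alignment of blocks on multiples of four, and the names xe-0/0/41 and xe-0/0/42 are assumptions made for illustration.

```python
import re
from collections import defaultdict

BLOCK_SIZE = 4  # assumption: ports are validated in aligned groups of four


def port_number(name: str) -> int:
    """Return the trailing port index of a name such as 'xe-0/0/43'."""
    match = re.search(r"/(\d+)$", name)
    if match is None:
        raise ValueError(f"unrecognised interface name: {name}")
    return int(match.group(1))


def mismatched_blocks(interfaces: dict[str, str]) -> dict[int, set[str]]:
    """Map block index to its set of types, for blocks mixing several types."""
    types_by_block = defaultdict(set)
    for name, if_type in interfaces.items():
        types_by_block[port_number(name) // BLOCK_SIZE].add(if_type)
    return {blk: types for blk, types in types_by_block.items() if len(types) > 1}


# Hypothetical state of the block before xe-0/0/43 was corrected.
before = {
    "xe-0/0/40": "SFP+ (10GE)",
    "xe-0/0/41": "SFP+ (10GE)",
    "xe-0/0/42": "SFP+ (10GE)",
    "xe-0/0/43": "10GBASE-T (10GE)",
}
print(mismatched_blocks(before))
```

Both types are 10G, but a plain equality comparison like this one treats them as different, which matches the failure reported in the thread.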

@Jhancock.wm one thing to make sure is all ganeti hosts are added to row-wide vlans.

So in the Provision Script leave "VLAN Type" empty and select the row-wide vlan manually beside "VLAN", e.g:

image.png (235×819 px, 40 KB)

I opened T388005 for us to fix up the ones that got done the other way already. Thanks!

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2045.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

having an issue with getting the last 4 provisioned. I run the provisioning script but it times out on the redfish call.

Retrieving the BMC's firmware version.
Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 
'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f64bfb4dca0>, 'Connection to 10.193.1.71 timed out. (connect timeout=10)')': 
/redfish/v1/UpdateService/FirmwareInventory/BMC
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 
'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f64bfb4d970>, 'Connection to 10.193.1.71 timed out. (connect timeout=10)')': 
/redfish/v1/UpdateService/FirmwareInventory/BMC
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 
'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f64bfb01c70>, 'Connection to 10.193.1.71 timed out. (connect timeout=10)')': 
/redfish/v1/UpdateService/FirmwareInventory/BMC
Exception raised while initializing the Cookbook sre.hosts.provision:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in _new_conn
    conn = connection.create_connection(
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
socket.timeout: timed out

saw this at the end.

raise RedfishError(str(e)) from e
spicerack.redfish.RedfishError: Failed to perform GET request to https://10.193.1.71/redfish/v1/UpdateService/FirmwareInventory/BMC

i thought it might be another instance where the bmc password wasn't set to our requested default. i set the BMC ip manually on the server so i could test it on a browser, but the default password is set correctly.

I tried changing the port off the private vlan type pointed out earlier, but it still fails in the same way.
I am using the --enable-virtualization flag, and have tried it with and without the --uefi flag, but it always fails at the same spot. Not sure of the cause.

servers are powered on and i confirmed the mgmt ports on all of them show activity on the physical level.

@elukey @cmooney @Papaul any thoughts on this one? not in a rush since we're waiting on additional drives for these.
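Since the traceback ends in a plain connect timeout, a quick way to separate "BMC not reachable at all" from "Redfish rejects the request" is a bare TCP check before rerunning the cookbook. A minimal sketch using only the standard library; the function name is made up, and port 443 and the 10 second timeout are taken from the HTTPS URL and the `connect timeout=10` in the log above:

```python
import socket


def bmc_reachable(host: str, port: int = 443, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


# Example call (address from the log above):
# bmc_reachable("10.193.1.71")
```

A False result here points at addressing or the network path rather than at BMC credentials, which a browser test against a manually configured IP would not reveal.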

JFTR, I started a patch to add a Partman config with EFI, so we should be good to use UEFI with these servers eventually once reviewed/merged.

@Jhancock.wm I was able to run the cookbook for 2047, but I guess that is the one where you set the BMC IP manually, so I moved to 2048. I was able to reproduce the issue there, so I checked install2004 and found:

Mar 18 15:57:51 install2004 dhcpd[1821940]: DHCPDISCOVER from 7c:c2:55:54:44:39 via 10.193.0.1: network 10.193.0.0/16: no free leases

So something is off on the DHCP side, and the BMC doesn't get a DHCP reply with correct data to set.
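To find every BMC hitting the same problem, the dhcpd log can be scanned for that message. A small sketch, assuming all relevant lines follow the format of the one quoted above; the function name and regular expression are illustrative:

```python
import re

NO_LEASE = re.compile(
    r"DHCPDISCOVER from (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) "
    r"via (?P<relay>\S+): network (?P<network>\S+): no free leases",
    re.IGNORECASE,
)


def find_no_free_leases(lines):
    """Yield (mac, relay, network) for each DHCPDISCOVER left without a lease."""
    for line in lines:
        match = NO_LEASE.search(line)
        if match:
            yield (
                match.group("mac").lower(),
                match.group("relay"),
                match.group("network"),
            )


sample = [
    "Mar 18 15:57:51 install2004 dhcpd[1821940]: DHCPDISCOVER from "
    "7c:c2:55:54:44:39 via 10.193.0.1: network 10.193.0.0/16: no free leases",
]
for mac, relay, network in find_no_free_leases(sample):
    print(mac, relay, network)
```

The MAC addresses it returns can then be compared with the BMC MACs of the hosts being provisioned.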

Change #1128914 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/cookbooks@master] sre.hosts.provision: retrieve Supermicro's BMC firmware after DHCP

https://gerrit.wikimedia.org/r/1128914

It turned out to be my fault! I sent a fix (https://gerrit.wikimedia.org/r/1128914), once it gets merged the provision cookbook should run as expected. Sorry!

all good. Thanks for your help! (i just assumed it was me and i missed something)

Change #1128914 merged by Elukey:

[operations/cookbooks@master] sre.hosts.provision: retrieve Supermicro's BMC firmware after DHCP

https://gerrit.wikimedia.org/r/1128914

Change #1131293 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch new ganeti servers to use EFI

https://gerrit.wikimedia.org/r/1131293

Change #1131293 merged by Muehlenhoff:

[operations/puppet@production] Switch new ganeti servers to use EFI

https://gerrit.wikimedia.org/r/1131293

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2045.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

heads up. i got the new drives in and installed them. i redid the provisioning successfully. when i tried to image ganeti2045, I got an error during the installer about the disks again. Wanted to confirm that there is no hardware raid on these machines? If not, the partman recipe might need a second look. If there's anything I'm missing please let me know so i can go chase it down.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

@MoritzMuehlenhoff could you check the preseed file is correct for me? I'm getting an error on the partitioning section of the installer. I think that might be the cause of the issue. Thanks!

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2045.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2050.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2050.codfw.wmnet with OS bookworm completed:

  • ganeti2050 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202504211741_jhancock_3444753_ganeti2050.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

figured it out. gonna finish the rest this evening :fingers-crossed:

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2049.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2045.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2049.codfw.wmnet with OS bookworm completed:

  • ganeti2049 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202504240119_jhancock_2664099_ganeti2049.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2048 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2048.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2047 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2047.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2046 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2046.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2045.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2047 (FAIL)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2047.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm completed:

  • ganeti2046 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202504282108_jhancock_1346357_ganeti2046.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2045 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2045.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2047 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2047.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

@Papaul can you take a look at this one. 2047 is installed on 2048 and 2048 is installed on 2047. not sure where the swap happened. i checked the serial numbers and BMC mac addresses this morning. they are in different racks so it's not a port issue. Can you take a look and let me know what i missed. Thanks!

@Jhancock.wm you have a mismatch on the serial numbers in netbox: 91 is ganeti2047 and 90 is ganeti2048.

@Papaul so fun time. the external labels for the serial numbers on these servers got swapped. gonna update netbox to match internal. reimage, and then see if i can get the physical labels moved to the correct server without damaging them. I'll put this in your master supermicro tracking task too.

photo_2025-05-01_09-25-59.jpg (693×1 px, 78 KB)

photo_2025-05-01_09-26-01.jpg (258×1 px, 38 KB)
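A swap like this can be caught before reimaging by comparing the serial recorded in Netbox for each host with the serial the machine itself reports (for example via its BMC). A minimal sketch over plain dictionaries; the serial strings below are invented placeholders, and fetching the real values from Netbox or Redfish is deliberately left out:

```python
def find_swaps(recorded: dict[str, str], reported: dict[str, str]) -> list[tuple[str, str]]:
    """Return host pairs whose recorded serials are each other's reported serials."""
    # Which host actually reports each serial number.
    owner = {serial: host for host, serial in reported.items()}
    swaps = []
    for host, serial in recorded.items():
        actual_host = owner.get(serial)
        if actual_host is None or actual_host == host:
            continue
        # A swap means the other host's recorded serial is the one this host reports.
        if recorded.get(actual_host) == reported.get(host) and host < actual_host:
            swaps.append((host, actual_host))
    return swaps


# Placeholder serials, not the real ones.
recorded_in_netbox = {"ganeti2047": "EXAMPLE90", "ganeti2048": "EXAMPLE91"}
reported_by_host = {"ganeti2047": "EXAMPLE91", "ganeti2048": "EXAMPLE90"}
print(find_swaps(recorded_in_netbox, reported_by_host))
```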

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm completed:

  • ganeti2045 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505011542_jhancock_1225700_ganeti2045.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2047 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2047.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2048 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2048.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2047 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2047.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

@Papaul ganeti2047 tried to connect to the wrong puppetserver. failed there.

[8/10, retrying in 640.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for ganeti2047.codfw.wmnet
[9/10, retrying in 1280.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for ganeti2047.codfw.wmnet

can you delete it from the puppetserver when you can? i'll take another swing at it this evening.

(i know we're both still trying to figure out what's wrong with 2048. pinging here about 2047 to be picked up later rather than confuse the conversation)
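The two retry lines above show the wait doubling between attempts 8 and 9. Assuming the doubling holds from the first attempt, which is an inference from those two lines and not something checked against the spicerack source, the full schedule can be reconstructed to see how long a missing CSR keeps the cookbook waiting:

```python
def backoff_delays(tries: int = 10, base_seconds: float = 5.0) -> list[float]:
    """Wait after each failed attempt; the last attempt is not followed by a wait."""
    return [base_seconds * 2 ** (attempt - 1) for attempt in range(1, tries)]


delays = backoff_delays()
# delays[7] is the wait after attempt 8, delays[8] the wait after attempt 9.
print(delays[7], delays[8])
print(f"total wait before giving up: {sum(delays) / 60:.1f} minutes")
```

Under that assumption the base delay is 5 seconds, and a host that talks to the wrong puppetserver holds the reimage for a little over 40 minutes before it fails.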

@jjanhone both 47 and 48 were on the wrong puppetserver. Remove all yours

sudo puppet cert --list
Warning: `puppet cert` is deprecated and will be removed in a future release.
   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')
  "ganeti2047.codfw.wmnet" (SHA256) F9:79:73:FD:FB:2C:35:C1:30:83:B2:15:1F:1F:47:1A:AC:27:55:1F:C3:A1:C4:2E:88:CF:90:
  "ganeti2048.codfw.wmnet" (SHA256) A7:36:F4:26:53:91:58:13:A7:E1:CC:2D:07:D6:F3:D3:72:44:8A:B1:CA:E4:36:38:34:E4:D9:7C:D

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm executed with errors:

  • ganeti2048 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console ganeti2048.codfw.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2048.codfw.wmnet with OS bookworm completed:

  • ganeti2048 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505082036_jhancock_3357775_ganeti2048.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually

Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2047.codfw.wmnet with OS bookworm completed:

  • ganeti2047 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202505082040_jhancock_3357420_ganeti2047.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • Failed to run the sre.puppet.sync-netbox-hiera cookbook, run it manually
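With this many reimage attempts in one task, it helps to reduce the bot output to the latest result per host. A sketch that assumes the summary lines keep the "• hostname (STATUS)" shape seen in this thread; the function name is illustrative:

```python
import re

RESULT = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>PASS|FAIL|WARN)\)\s*$")


def latest_status(lines):
    """Return {host: status}, keeping the last summary line seen per host."""
    status = {}
    for line in lines:
        match = RESULT.match(line)
        if match:
            status[match.group("host")] = match.group("status")
    return status


sample = [
    "  • ganeti2047 (FAIL)",
    "    • Removed from Debmonitor if present",
    "  • ganeti2048 (WARN)",
    "  • ganeti2047 (WARN)",
]
print(latest_status(sample))
```

For the two WARN results above, the remaining manual step is the one named in the last bullet: rerunning the sre.puppet.sync-netbox-hiera cookbook.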
Jhancock.wm updated the task description.

@MoritzMuehlenhoff this is finally done. thanks for your patience!