Bring an-presto10[16-20] into service to replace an-presto100[1-5]
Closed, Resolved (Public)

Assigned To
Authored By
BTullis
Sep 17 2024, 10:50 AM
Referenced Files
Restricted File
Oct 31 2024, 7:53 AM
F57658670: image.png
Oct 30 2024, 12:30 PM

Description

The warranty on an-presto100[1-5] has now expired, so they are due for a refresh.

an-presto10[16-20] are ready to be brought into service now.

  • create keytabs
  • add the hosts to site.pp
  • reimage the hosts to bullseye: all hosts except an-presto1018 (which is having DRAC issues, see subticket) are back on Bullseye.

Event Timeline

Gehel triaged this task as Medium priority. Sep 24 2024, 2:28 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.
Gehel moved this task from Scratch to Hardware refresh on the Data-Platform-SRE board.

Change #1083755 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] Add dummy keytabs for new presto hosts

https://gerrit.wikimedia.org/r/1083755

Change #1083756 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add new presto hosts to presto cluster

https://gerrit.wikimedia.org/r/1083756

Change #1083755 merged by Stevemunene:

[labs/private@master] Add dummy keytabs for new presto hosts

https://gerrit.wikimedia.org/r/1083755

Change #1083756 merged by Stevemunene:

[operations/puppet@production] Add new presto hosts to presto cluster

https://gerrit.wikimedia.org/r/1083756

Having a look at some Debian 12 (bookworm) related issues/conflicts on the new hosts:

Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20241029-3688111-tiecuw.lock does not exist or is a dangling symbolic link (file: /srv/puppet_code/environments/production/modules/presto/manifests/server.pp, line: 96)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20241029-3688111-tiecuw.lock does not exist or is a dangling symbolic link (file: /srv/puppet_code/environments/production/modules/presto/manifests/server.pp, line: 96)
Wrapped exception:
No such file or directory - A directory component in /etc/presto/jvm.config20241029-3688111-tiecuw.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Presto::Server/File[/etc/presto/jvm.config]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/presto/jvm.config20241029-3688111-tiecuw.lock does not exist or is a dangling symbolic link (file: /srv/puppet_code/environments/production/modules/presto/manifests/server.pp, line: 96)

Error: Could not prefetch package provider 'apt': Execution of '/usr/bin/apt-mark showmanual' returned 100: E: Conflicting values set for option Signed-By regarding source http://apt.wikimedia.org/wikimedia/ bookworm-wikimedia: /etc/apt/keyrings/wikimedia-archive-keyring.gpg != 
E: The list of sources could not be read.
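apt reports "Conflicting values set for option Signed-By" when the same source is declared more than once with differing Signed-By values; the empty right-hand side in the message above suggests one declaration had no Signed-By at all. The thread does not show the offending files, so the following is only a minimal sketch for spotting such duplicates. It assumes one-line `deb ...` entries (deb822 `.sources` files are not handled) and that apt compares the option per URI and suite; the function name and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Matches one-line entries such as:
#   deb [signed-by=/path/key.gpg] http://host/path suite component
ENTRY = re.compile(
    r"^deb\s+(?:\[(?P<opts>[^\]]*)\]\s+)?(?P<uri>\S+)\s+(?P<suite>\S+)"
)


def signed_by_conflicts(lines):
    """Return {(uri, suite): {signed-by values}} for sources declared inconsistently."""
    seen = defaultdict(set)
    for line in lines:
        match = ENTRY.match(line.strip())
        if not match:
            continue  # comments, blank lines, deb-src and unsupported formats
        opts = dict(
            opt.split("=", 1)
            for opt in (match.group("opts") or "").split()
            if "=" in opt
        )
        # A missing option is recorded as "" so it shows up as a conflict.
        seen[(match.group("uri"), match.group("suite"))].add(opts.get("signed-by", ""))
    return {src: values for src, values in seen.items() if len(values) > 1}


# Hypothetical sample input, not taken from the affected hosts.
sample = [
    "deb [signed-by=/etc/apt/keyrings/wikimedia-archive-keyring.gpg] "
    "http://apt.wikimedia.org/wikimedia/ bookworm-wikimedia main",
    "deb http://apt.wikimedia.org/wikimedia/ bookworm-wikimedia main",
    "deb http://mirrors.wikimedia.org/debian bookworm main",
]
for (uri, suite), values in signed_by_conflicts(sample).items():
    print(uri, suite, sorted(values))
```

In practice the input would be the concatenated contents of /etc/apt/sources.list and the files under /etc/apt/sources.list.d/.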

We do not yet fully support Debian 12, and multiple packages still need to be ported to bookworm. For now we shall reimage the hosts to bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1016 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • New OS is bookworm but bullseye was requested
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye

Boot keeps failing with the error shown below, "failed to load ldlinux.c32". Investigating.

image.png (1×1 px, 582 KB)

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Icinga downtime and Alertmanager silence (ID=3af47a4d-1c41-4c5a-9077-135ce9e95829) set by stevemunene@cumin1002 for 5 days, 0:00:00 on 3 host(s) and their services with reason: reimaging the hosts to bullseye

an-presto[1017-1019].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=ee189dc9-c5dd-498b-b417-60f81f4c1b39) set by stevemunene@cumin1002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: reimaging the hosts to bullseye

an-presto1020.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1017.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1017.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1017 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1017.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1017.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1017.eqiad.wmnet with OS bullseye completed:

  • an-presto1017 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202410301601_stevemunene_3447320_an-presto1017.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Found the root cause of the "failed to load ldlinux.c32" error: it is related to T304483 and is fixed as described in https://phabricator.wikimedia.org/T363576#10017772, by adding the --force-dhcp-tftp flag to force pxelinux.0 and TFTP only (no HTTP). an-presto1017 was reimaged successfully with this.

an-presto1016 and an-presto1019 seem to be having issues booting, probably caused by the boot loader or partitioning, which I am having a look at.
{F57659833}

an-presto1017 is having some errors starting the presto-server:

Oct 30 16:35:51 an-presto1017 presto-server[8671]: /usr/bin/env: ‘python’: No such file or directory
Oct 30 16:46:00 an-presto1017 presto-server[11651]: /usr/bin/env: ‘python’: No such file or directory
Oct 30 16:47:29 an-presto1017 presto-server[12255]: /usr/bin/env: ‘python’: No such file or directory

This is similar to an error we encountered previously in https://phabricator.wikimedia.org/T323783#8438884, which we resolved by installing the python-is-python3 package. The puppetized change, however, had to be reverted due to issues with hive: https://phabricator.wikimedia.org/rOPUP587c8ed71b99c19845e68f900778c2df64d3a98f
The package is nevertheless still installed on the existing presto hosts:

stevemunene@an-presto1015:~$ apt-cache policy python-is-python3
python-is-python3:
  Installed: 3.9.2-1
  Candidate: 3.9.2-1
  Version table:
 *** 3.9.2-1 500
        500 http://mirrors.wikimedia.org/debian bullseye/main amd64 Packages
        100 /var/lib/dpkg/status
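The `/usr/bin/env: ‘python’: No such file or directory` message means that no executable named `python` was found on the service's PATH. A minimal sketch for checking this on a freshly reimaged host before starting presto-server follows; the function name is hypothetical, and the check uses the PATH of whoever runs it, which may differ from the PATH the presto-server unit sees.

```python
import shutil


def missing_interpreters(names=("python", "python3"), which=shutil.which):
    """Return the interpreter names that cannot be resolved on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_interpreters()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("python and python3 both resolve on PATH")
```

If only `python` is reported missing, the host is in the state that python-is-python3 was installed to address.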

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1019.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1019 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye

The boot issues come from the partman recipe, which has sda and sdb hardcoded; on the hosts with a smaller sda the recipe fails. See https://github.com/wikimedia/operations-puppet/blob/production/modules/install_server/files/autoinstall/partman/custom/analytics-presto-worker.cfg#L38C1-L78C1
Commit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/890488. As per the comments there, I am looking to see if we can refactor a standard recipe from this.
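Because the recipe names sda and sdb explicitly, a host whose disks do not match what the recipe expects will fail during partitioning. The sketch below shows one way to compare a host's disks against per-device minimums before attempting a reimage. All sizes are hypothetical and are not taken from analytics-presto-worker.cfg; the function name is also an assumption.

```python
def unmet_disk_requirements(actual_gib, required_gib):
    """Return {device: reason} for each hardcoded device the host cannot satisfy."""
    problems = {}
    for dev, minimum in required_gib.items():
        size = actual_gib.get(dev)
        if size is None:
            problems[dev] = "device not present"
        elif size < minimum:
            problems[dev] = f"{size:g} GiB is below the required {minimum:g} GiB"
    return problems


# Hypothetical numbers, for illustration only.
required = {"sda": 400, "sdb": 1000}
host = {"sda": 220, "sdb": 3500}
print(unmet_disk_requirements(host, required))
```

An empty result means the host satisfies every hardcoded device; anything else points at a host that needs a different recipe.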

Change #1085357 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] netboot: create dedicated partman recipe for certain presto workers

https://gerrit.wikimedia.org/r/1085357

Change #1085357 merged by Stevemunene:

[operations/puppet@production] netboot: create dedicated partman recipe for certain presto workers

https://gerrit.wikimedia.org/r/1085357

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1020.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1020 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1020.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-presto1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1019 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1019 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1019 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1019.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1019.eqiad.wmnet with OS bullseye completed:

  • an-presto1019 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411010125_bking_2007894_an-presto1019.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bookworm executed with errors:

  • an-presto1020 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1020.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1020 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1020.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Per an IRC conversation with @Papaul, the newer hosts need TFTP to reimage successfully. Thus, when reimaging, we should use the --force-dhcp-tftp flag, for example: sudo cookbook sre.hosts.reimage --force-dhcp-tftp --new --os bullseye an-presto1020 -t T374924
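To avoid forgetting --force-dhcp-tftp on the remaining hosts, the invocation can be generated rather than typed by hand. This is a small sketch that only builds and prints the command line; it does not run anything. The flags are copied from the command quoted in this comment, while the helper name and its keyword arguments are hypothetical. Whether --new is appropriate for a given host depends on its state and is not decided here.

```python
import shlex


def reimage_command(host, task, os_name="bullseye", new=True, force_tftp=True):
    """Return the reimage cookbook invocation as an argv list (nothing is executed)."""
    argv = ["sudo", "cookbook", "sre.hosts.reimage"]
    if force_tftp:
        argv.append("--force-dhcp-tftp")
    if new:
        argv.append("--new")
    argv += ["--os", os_name, host, "-t", task]
    return argv


for number in (1016, 1019, 1020):
    print(shlex.join(reimage_command(f"an-presto{number}", "T374924")))
```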

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1020 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1020.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bullseye

Created T378835 and T378824 to hopefully address some of the problems we have experienced with these reimages. I'll also update our docs to include Dell device enumeration, as that was necessary to get the desired ordering of block devices (/dev/sda for the system disk, /dev/sdb for the hardware RAID virtual disk).
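One way to confirm the resulting ordering from a shell on the host (for instance from the installer console) is to list the sd* devices with their sizes and check that /dev/sda and /dev/sdb are the disks expected. A minimal sketch follows, assuming Python is available in that environment; it reads sysfs, where /sys/block/<dev>/size is expressed in 512-byte sectors. The function names are hypothetical.

```python
from pathlib import Path

SECTOR_BYTES = 512  # sysfs reports block device sizes in 512-byte sectors


def sectors_to_gib(sectors):
    """Convert a sysfs sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30


def block_device_sizes(sys_block=Path("/sys/block"), prefix="sd"):
    """Return {device name: size in GiB} for devices whose name starts with prefix."""
    sizes = {}
    for dev in sorted(sys_block.glob(f"{prefix}*")):
        size_file = dev / "size"
        if size_file.is_file():
            sizes[dev.name] = sectors_to_gib(int(size_file.read_text()))
    return sizes


if __name__ == "__main__":
    for name, gib in block_device_sizes().items():
        print(f"/dev/{name}: {gib:.1f} GiB")
```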

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1020.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1020 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411011636_bking_2149966_an-presto1020.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1020.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host an-presto1016.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2024-11-01T19:47:14Z] <inflatador> bking@an-presto[1016:1020].eqiad.wmnet temporarily install perccli to check disk status without requiring reboot T374924

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host an-presto1016.eqiad.wmnet with OS bullseye completed:

  • an-presto1016 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411011934_bking_2179304_an-presto1016.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1020.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1020 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1020.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-presto1016.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1016 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1016.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-presto1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-presto1018.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1018 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1018.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-presto1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-presto1018.eqiad.wmnet with OS bullseye executed with errors:

  • an-presto1018 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-presto1018.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-presto1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-presto1018.eqiad.wmnet with OS bullseye completed:

  • an-presto1018 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411181256_btullis_2978506_an-presto1018.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB