
Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234])
Closed, ResolvedPublic

Description

This task will track the receiving and installation of (8) 6.4TB NVMe PCIe SSDs to install into the text cp fleet in ulsfo.

Order was via parent task T359167.

cp40(3[789]|4[01234]) are text hosts.
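
For readers unfamiliar with the shorthand, the host pattern above can be expanded into explicit hostnames. The following is a minimal sketch, assuming the pattern is meant to be closed with a final parenthesis; the candidate range cp4030..cp4049 and the helper name `matching_hosts` are illustrative only.

```python
import re

# Host pattern from this task, with the closing parenthesis assumed.
HOST_PATTERN = re.compile(r"cp40(3[789]|4[01234])")


def matching_hosts(candidates):
    """Return the candidate names that fully match the text-host pattern."""
    return [name for name in candidates if HOST_PATTERN.fullmatch(name)]


# Candidate names are generated only for illustration.
candidates = [f"cp40{n:02d}" for n in range(30, 50)]
hosts = matching_hosts(candidates)
print(hosts)
```

The result should be the same eight hosts listed in the action checklist (cp4037 through cp4044).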

This will be worked on with RobH as on-site engineer and @BCornwall.

SSDs are not expected to arrive until 2024-05-15.

The work window and type proposed below are summarized from past IRC discussion with robh and ssingh, and are subject to correction by ssingh after this task is filed.

Scheduling

Cadence:

  • SSDs arrive, Rob updates this task with their arrival.
  • Rob and Brett determine the best date for SSD installation and set/announce a maintenance window.
  • Brett depools all traffic from ulsfo on scheduled date.
  • Rob goes on-site, gracefully shuts down each cp host, installs the PCIe SSD, and powers the host back up.
  • Brett & Rob ensure all hosts are back online and accessible.
  • Brett re-pools ulsfo for user traffic
  • Brett/Traffic will reimage the upgraded text cp hosts individually while the site is serving traffic. This will take place over the course of days, not within the initial maintenance window.

Maintenance window

Event window: 2024-06-12 at 15:00 UTC through 19:00 UTC
Scope: Full depool

Assumptions: it takes roughly 1 hour to depool a site without significant user impact. If it takes longer, either the 16:00 UTC power-off time must move later, or the scheduled on-site work must move from 17:00 UTC to later in the day, subject to Traffic approval.

Timeline

2024-06-12 @ 15:00 : @BCornwall and @CDobbins depool ulsfo to let traffic start routing to other DCs.
2024-06-12 @ 16:00 : @BCornwall and @CDobbins put the cp text hosts into downtime in Icinga and power them off in advance of on-site hands.
2024-06-12 @ 17:00 : Remote hands begin work, unplugging the cp hosts one at a time, installing the PCIe NVMe SSD, and plugging each host back in and bringing it to a fully accessible state before moving on to the next host. During this time, RobH will be online and will attempt to remotely connect to each host once its SSD is installed and confirm function. Estimate: roughly 2 hours for on-site hands to install 8 NVMe SSDs into 8 text cp hosts.
2024-06-12 @ 19:00 : Re-pooling of ulsfo

Post-maintenance window: Reinstallation/reimage of text cp hosts as required by either RobH or @BCornwall and @CDobbins
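
The dependency described under Assumptions (a slower depool pushes the later steps back) can be sketched as a small calculation. This is only an illustration: the function name and parameters are hypothetical, and it assumes the one-hour gap between power-off and on-site work and the two-hour installation estimate are both preserved when the depool runs long, which the task does not mandate.

```python
from datetime import datetime, timedelta, timezone


def maintenance_timeline(depool_minutes=60, settle_minutes=60, install_minutes=120):
    """Derive the window's milestones from the depool duration.

    All three durations are assumptions taken from the estimates in this task.
    """
    start = datetime(2024, 6, 12, 15, 0, tzinfo=timezone.utc)
    power_off = start + timedelta(minutes=depool_minutes)
    onsite_start = power_off + timedelta(minutes=settle_minutes)
    repool = onsite_start + timedelta(minutes=install_minutes)
    return {
        "depool": start,
        "power_off": power_off,
        "onsite_start": onsite_start,
        "repool": repool,
    }


for name, when in maintenance_timeline().items():
    print(f"{when:%Y-%m-%d %H:%M} UTC  {name}")
```

With the default values this reproduces the 15:00, 16:00, 17:00, and 19:00 UTC milestones above; passing `depool_minutes=90` moves every later step by 30 minutes.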

Action checklist

  • Depool ulsfo DC and verify traffic is switched to other DCs (at least 2h before scheduled intervention)
  • Downtime impacted hosts to be ready for power off
    • Extra: silence/ack any other alerts that arise
  • Power off impacted hosts
  • New SSD installation and hosts power on

Steps to be carried out after the disks are installed:

  • Verify host health and ATS cache (without using the new NVMe disk).
    • Run puppet agent
    • Check that metrics are ok in grafana (Host Overview)
    • Check Icinga status (all green for the host)
    • Check with lsblk and nvme list that the new disk is visible and has the correct naming (nvme1n1)
    • Check ATS status (traffic_server -C check)
    • Check for "sure hit" on ATS: curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Main_Page -o /dev/null
    • Check for both first miss and subsequent hit (eg. issuing two requests like curl -vv -H 'Host: en.wikipedia.org' http://localhost:3128/wiki/Caffe1ne -o /dev/null and checking the X-Cache-Status value)
      • cp4037
      • cp4038
      • cp4039
      • cp4040
      • cp4041
      • cp4042
      • cp4043
      • cp4044
  • Remove downtime for hosts
  • Remove downtime from Alertmanager
  • Manually repool hosts with conftool (auto depooled by PyBal?)
  • Repool ulsfo DC and verify the traffic
  • (in the following days) Merge the hiera config to add the new disk host by host: depool, merge, reimage, and repool hosts one at a time, with an appropriate interval between hosts to help warm the cache.
  • Remove the custom per-host hiera overrides and apply the configuration to the whole ulsfo DC
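
Two of the per-host checks above (the new disk being visible as nvme1n1, and a first miss followed by a hit) lend themselves to small helpers. The sketch below is hypothetical and not part of any existing tooling: it assumes the disk check is fed the output of `lsblk --json`, and that X-Cache-Status values begin with "miss" or "hit"; the sample inputs are invented for illustration.

```python
import json


def has_block_device(lsblk_json, name="nvme1n1"):
    """True if a top-level block device with this name appears in `lsblk --json` output."""
    devices = json.loads(lsblk_json).get("blockdevices", [])
    return any(dev.get("name") == name for dev in devices)


def miss_then_hit(first_status, second_status):
    """True if the first request was a cache miss and the second a cache hit.

    Assumes X-Cache-Status values start with "miss" or "hit".
    """
    first = first_status.strip().lower()
    second = second_status.strip().lower()
    return first.startswith("miss") and second.startswith("hit")


# Invented sample inputs, for illustration only.
sample = '{"blockdevices": [{"name": "nvme0n1"}, {"name": "nvme1n1"}]}'
print(has_block_device(sample))
print(miss_then_hit("miss", "hit-local"))
```

Keeping the checks as pure functions over captured output means they can be exercised without touching a production host.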

Reimaging Process

  • Depool the host you are going to work on
  • Merge the host's patch in the patchset
  • Merge on the Puppet master
  • Run the reimaging cookbook
  • Check that everything is fine after reimaging: Icinga, disks
  • Pool the host back
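
The one-host-at-a-time sequence above can be laid out as an ordered plan. This is a rough sketch for illustration rather than the team's tooling: the function name, the step labels, and the `interval_minutes` gap between hosts (standing in for the cache warm-up interval mentioned in the checklist) are all assumptions.

```python
# Step labels paraphrase the reimaging process; they are not real commands.
STEPS = [
    "depool",
    "merge hiera patch",
    "merge on Puppet master",
    "run reimage cookbook",
    "check Icinga and disks",
    "pool",
]


def reimage_plan(hosts, interval_minutes=60):
    """Return (start_offset_minutes, host, step) tuples, one host at a time."""
    plan = []
    for index, host in enumerate(hosts):
        # Each host starts only after the previous host's warm-up interval.
        offset = index * interval_minutes
        for step in STEPS:
            plan.append((offset, host, step))
    return plan


for offset, host, step in reimage_plan(["cp4037", "cp4038"]):
    print(f"+{offset:>3} min  {host}  {step}")
```

Every host finishes with a pool step before the next host's depool begins, so at most one host is out of rotation at a time.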

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2024-06-12T17:39:19Z] <brett> Remove downtime of cache_text/cp text servers in ulsfo - T364891

Mentioned in SAL (#wikimedia-operations) [2024-06-12T17:51:50Z] <brett> Repool ulsfo as A:cp-text nvme upgrades are complete (T364891)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T17:52:00Z] <brett> authdns-update run on dns1004 (T364891)

BCornwall changed the task status from Open to In Progress. Wed, Jun 12, 8:01 PM
BCornwall claimed this task.
BCornwall removed a project: SRE.
BCornwall updated the task description.
BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Change #1042366 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Set cp4037 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1042366

Change #1042366 merged by BCornwall:

[operations/puppet@production] Set cp4037 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1042366

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4037 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4037.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye completed:

  • cp4037 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406122144_brett_215232_cp4037.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4038 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4038.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4038 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4038.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Change #1043185 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Set cp4038 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043185

Change #1043185 merged by CDobbins:

[operations/puppet@production] Set cp4038 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043185

Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4038 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4038.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin2002 for host cp4038.ulsfo.wmnet with OS bullseye completed:

  • cp4038 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406132017_cdobbins_76216_cp4038.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
BCornwall updated the task description.
BCornwall added a subscriber: RobH.

Change #1043819 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Set cp4039 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043819

Change #1043819 merged by BCornwall:

[operations/puppet@production] Set cp4039 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043819

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4039.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4039.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4039 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4039.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4039.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4039.ulsfo.wmnet with OS bullseye completed:

  • cp4039 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406141625_brett_284624_cp4039.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1043865 had a related patch set uploaded (by CDobbins; author: CDobbins):

[operations/puppet@production] Set cp4040 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043865

Change #1043865 merged by CDobbins:

[operations/puppet@production] Set cp4040 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043865

Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4040 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4040.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye completed:

  • cp4040 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406141952_cdobbins_2629714_cp4040.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1043888 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Set cp4041 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043888

Change #1043888 merged by BCornwall:

[operations/puppet@production] Set cp4041 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1043888

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4041 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4041.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye completed:

  • cp4041 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406142227_brett_343689_cp4041.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1046755 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Set cp4042 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1046755

Change #1046755 merged by BCornwall:

[operations/puppet@production] Set cp4042 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1046755

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4042.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4042.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4042 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4042.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4042.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4042.ulsfo.wmnet with OS bullseye completed:

  • cp4042 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406171940_brett_977356_cp4042.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1046778 had a related patch set uploaded (by CDobbins; author: CDobbins):

[operations/puppet@production] Set cp4043 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1046778

Change #1046778 merged by CDobbins:

[operations/puppet@production] Set cp4043 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1046778

Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4043 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4043.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4043.ulsfo.wmnet with OS bullseye completed:

  • cp4043 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406172252_cdobbins_3208261_cp4043.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1046797 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Set cp4044 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1046797

Change #1046797 merged by BCornwall:

[operations/puppet@production] Set cp4044 hieradata to use dual NVMe disks

https://gerrit.wikimedia.org/r/1046797

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye executed with errors:

  • cp4044 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cp4044.ulsfo.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye completed:

  • cp4044 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406180034_brett_1025033_cp4044.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1046804 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] hiera: Unify ulsfo trafficserver storage elements

https://gerrit.wikimedia.org/r/1046804

Change #1046804 merged by BCornwall:

[operations/puppet@production] hiera: Unify ulsfo trafficserver storage elements

https://gerrit.wikimedia.org/r/1046804

BCornwall updated the task description.
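
The cookbook result lines in this timeline follow a regular shape, so run outcomes can be extracted mechanically when reviewing a long task. The parser below is a minimal sketch written against the wording seen in this task only; the regular expression and the `parse_result` helper are assumptions, not part of the cookbook or Phabricator tooling.

```python
import re

# Matches result lines such as:
# "Cookbook <name> started by <user>@<origin> for host <fqdn> with OS <os> completed:"
RESULT_LINE = re.compile(
    r"Cookbook (?P<cookbook>\S+) started by (?P<user>[^@\s]+)@(?P<origin>\S+) "
    r"for host (?P<host>\S+) with OS (?P<os>\S+) "
    r"(?P<outcome>completed|executed with errors):"
)


def parse_result(line):
    """Return a dict describing a cookbook result line, or None if it is not one."""
    match = RESULT_LINE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["passed"] = info["outcome"] == "completed"
    return info


line = (
    "Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 "
    "for host cp4037.ulsfo.wmnet with OS bullseye executed with errors:"
)
print(parse_result(line))
```

Lines announcing that a cookbook "was started" are deliberately not matched, since they carry no outcome.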