Reimage aqs1013
Closed, Resolved (Public)

Description

Having been unable to sort out recurring SSD failures (T362033), we're forced to try a complete re-image (w/o preserving the data partitions).

Steps:

  1. decommission (both instances)
  2. reimage
  3. bootstrap (both instances)
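
The ordering above can be sketched as a small plan builder. This is only an illustrative sketch, not the tooling actually used for this task: the instance labels "a" and "b", the `host:instance` target format, and the names `Action` and `build_plan` are all assumptions made for the example. The task itself only says "both instances".

```python
from __future__ import annotations

from dataclasses import dataclass

HOST = "aqs1013.eqiad.wmnet"
# Assumption: the two instances are labelled "a" and "b";
# the task description only says "both instances".
INSTANCES = ("a", "b")


@dataclass(frozen=True)
class Action:
    step: str    # "decommission", "reimage", or "bootstrap"
    target: str  # host name, optionally suffixed with a (hypothetical) instance label


def build_plan(host: str, instances: tuple[str, ...]) -> list[Action]:
    """Expand the three listed steps into an ordered list of actions."""
    # Step 1: decommission every instance on the host.
    plan = [Action("decommission", f"{host}:{i}") for i in instances]
    # Step 2: the reimage acts on the whole host; the task lists it after
    # the decommissions and before the bootstraps.
    plan.append(Action("reimage", host))
    # Step 3: bootstrap every instance again on the fresh OS.
    plan += [Action("bootstrap", f"{host}:{i}") for i in instances]
    return plan


if __name__ == "__main__":
    for n, action in enumerate(build_plan(HOST, INSTANCES), start=1):
        print(f"{n}. {action.step} {action.target}")
```

With two instances, the three steps expand to five actions: two decommissions, one host-wide reimage, and two bootstraps.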

Event Timeline

Eevans triaged this task as High priority. (May 7 2024, 6:37 PM)
Eevans created this task.

Mentioned in SAL (#wikimedia-operations) [2024-05-07T18:40:10Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422

Icinga downtime and Alertmanager silence (ID=397ec6a2-88d3-4fa3-b149-367bc8b4c353) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommissioning — T364422

aqs1013.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-05-07T18:40:24Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422

Change #1029206 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] Reimage aqs1013 w/o preserving data

https://gerrit.wikimedia.org/r/1029206

Change #1029206 merged by Eevans:

[operations/puppet@production] Reimage aqs1013 w/o preserving data

https://gerrit.wikimedia.org/r/1029206

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye executed with errors:

  • aqs1013 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" aqs1013.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye completed:

  • aqs1013 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405091457_eevans_1745024_aqs1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-05-09T15:31:04Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422

Icinga downtime and Alertmanager silence (ID=e110d57c-bacd-48ee-8333-fae55b264d8c) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T364422

aqs1013.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2024-05-09T15:31:18Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422

The reimage is complete, and both instances have been bootstrapped. Closing.