Having been unable to sort out recurring SSD failures (T362033), we're forced to try a complete re-image (w/o preserving the data partitions).
Steps:
- decommission (both instances)
- reimage
- bootstrap (both instances)
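
The ordering above can be sketched as a small dry-run plan. This is an illustration only, not the tooling used on this task: the instance names `a` and `b`, the `host-instance` target naming, and the `Step` and `reimage_plan` helpers are all hypothetical. It only shows that every instance is decommissioned before the single host-level reimage, and bootstrapped again afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # "decommission", "reimage", or "bootstrap"
    target: str  # an instance label (hypothetical naming) or the host itself


def reimage_plan(host, instances):
    """Return the ordered steps: decommission each instance,
    reimage the host once, then bootstrap each instance."""
    plan = [Step("decommission", f"{host}-{name}") for name in instances]
    plan.append(Step("reimage", host))
    plan.extend(Step("bootstrap", f"{host}-{name}") for name in instances)
    return plan


if __name__ == "__main__":
    # Instance names "a" and "b" are assumed for illustration.
    for number, step in enumerate(reimage_plan("aqs1013", ["a", "b"]), start=1):
        print(f"{number}. {step.action}: {step.target}")
```

Nothing here touches a real host; the plan is just data that could be reviewed before running the actual steps by hand.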
Subject | Repo | Branch | Lines +/-
---|---|---|---
Reimage aqs1013 w/o preserving data | operations/puppet | production | +3 -1
Status | Assigned | Task
---|---|---
Open | Jclark-ctr | T362033 Degraded RAID on aqs1013
Resolved | Eevans | T364422 Reimage aqs1013
Mentioned in SAL (#wikimedia-operations) [2024-05-07T18:40:10Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422
Icinga downtime and Alertmanager silence (ID=397ec6a2-88d3-4fa3-b149-367bc8b4c353) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Decommissioning — T364422
aqs1013.eqiad.wmnet
Mentioned in SAL (#wikimedia-operations) [2024-05-07T18:40:24Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Decommissioning — T364422
Change #1029206 had a related patch set uploaded (by Eevans; author: Eevans):
[operations/puppet@production] Reimage aqs1013 w/o preserving data
Change #1029206 merged by Eevans:
[operations/puppet@production] Reimage aqs1013 w/o preserving data
Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye executed with errors:
Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host aqs1013.eqiad.wmnet with OS bullseye completed:
Mentioned in SAL (#wikimedia-operations) [2024-05-09T15:31:04Z] <eevans@cumin1002> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422
Icinga downtime and Alertmanager silence (ID=e110d57c-bacd-48ee-8333-fae55b264d8c) set by eevans@cumin1002 for 30 days, 0:00:00 on 1 host(s) and their services with reason: Bootstrapping — T364422
aqs1013.eqiad.wmnet
Mentioned in SAL (#wikimedia-operations) [2024-05-09T15:31:18Z] <eevans@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on aqs1013.eqiad.wmnet with reason: Bootstrapping — T364422