I don't want to keep polluting the main setup task with all my setup failures from the hardware RAID testing, so I'm using this subtask.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | MoritzMuehlenhoff | T297913 Confirm support of PERC 750 raid controller
Unknown Object (Task) | | |
Open | | MoritzMuehlenhoff | T308027 private repo deployment - perccli implementation
Resolved | | ArielGlenn | T299443 Q3: rack/setup/install dumpsdata100[67]
Resolved | | RobH | T302937 datadumps1007 test installs
Resolved | | RobH | T302940 icinga config error for new rows E/R
Event Timeline
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed:
- dumpsdata1007 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203022321_robh_12658_dumpsdata1007.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
So this is installed now with a hwraid1 single-disk setup, just to see if it even works within the OS.
The OS boots and loads, but any megacli command hangs it.
15:52 mutante:
robh: icinga config does not like dumpsdata1007 right now ..because of: Error: 'lsw1-f1-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'dumpsdata1007'
cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: dumpsdata1007.eqiad.wmnet
- dumpsdata1007.eqiad.wmnet (WARN)
- Host not found on Icinga, unable to downtime it
- Found physical host
- Management interface not found on Icinga, unable to downtime it
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- COMMON_STEPS (FAIL)
- Failed to run Homer on lsw1-f1-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'lsw1-f1-eqiad.mgmt.eqiad.wmnet', 'commit', 'Host decommission - robh@cumin1001 - T302937']' returned non-zero exit status 1.
ERROR: some step on some host failed, check the bolded items above
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed:
- dumpsdata1007 (PASS)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203031839_robh_18601_dumpsdata1007.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> staged
OK, this is now installed. However, I have a single RAID1 of the 2 SSDs, and megacli doesn't see it.
It can read the controller info though, so the feedback is inconsistent.
robh@dumpsdata1007:~$ sudo megacli -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Adapter 0: No Virtual Drive Configured.
Exit Code: 0x00
robh@dumpsdata1007:~$ sudo megacli -PDList -aALL
Adapter #0

Enclosure Device ID: 64
Slot Number: 0
Enclosure position: 1
Device Id: 1
WWN: 5000039b18e1ee0d
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c0
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1ERF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 1
Enclosure position: 1
Device Id: 2
WWN: 5000039b18e1ecaf
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c1
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1C4F1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 2
Enclosure position: 1
Device Id: 6
WWN: 5000039b18e1ee17
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c2
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1EVF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 3
Enclosure position: 1
Device Id: 7
WWN: 5000039b18e1ef13
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c3
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1GMF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 4
Enclosure position: 1
Device Id: 5
WWN: 5000039b18e27f3f
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c4
Connected Port Number: 8(path0)
Inquiry Data: 81V0A03GF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 5
Enclosure position: 1
Device Id: 9
WWN: 5000039b18e1eeef
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c5
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1GDF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 6
Enclosure position: 1
Device Id: 3
WWN: 5000039b18e298cf
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c6
Connected Port Number: 8(path0)
Inquiry Data: 81V0A0FRF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 7
Enclosure position: 1
Device Id: 12
WWN: 5000039b18e1c6fb
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c7
Connected Port Number: 8(path0)
Inquiry Data: 81T0A0QSF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :29C (84.20 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 8
Enclosure position: 1
Device Id: 8
WWN: 5000039b18e1f788
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c8
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1LUF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 9
Enclosure position: 1
Device Id: 11
WWN: 5000039b18e1c703
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1c9
Connected Port Number: 8(path0)
Inquiry Data: 81T0A0QTF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 10
Enclosure position: 1
Device Id: 10
WWN: 5000039b18e1f736
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1ca
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1LHF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 11
Enclosure position: 1
Device Id: 13
WWN: 5000039b18e1ed07
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size: 512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1cb
Connected Port Number: 8(path0)
Inquiry Data: 81U0A1CNF1QFTOSHIBA MG06ACA800EY GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 12
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5ace42e0253abda0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 447.130 GB [0x37e436b0 Sectors]
Non Coerced Size: 446.630 GB [0x37d436b0 Sectors]
Coerced Size: 446.625 GB [0x37d40000 Sectors]
Sector Size: 512
Firmware state: Online, Spun Up
Device Firmware Level: DZ00
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1cc
Connected Port Number: 8(path0)
Inquiry Data: BNA9N7194I090A461HFS480G3H2X069N DZ00
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive Temperature :41C (105.80 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Enclosure Device ID: 64
Slot Number: 13
Drive's position: DiskGroup: 0, Span: 0, Arm: 1
Enclosure position: 1
Device Id: 4
WWN: 5ace42e0253abda1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 447.130 GB [0x37e436b0 Sectors]
Non Coerced Size: 446.630 GB [0x37d436b0 Sectors]
Coerced Size: 446.625 GB [0x37d40000 Sectors]
Sector Size: 512
Firmware state: Online, Spun Up
Device Firmware Level: DZ00
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x500056b327bfa1cd
Connected Port Number: 8(path0)
Inquiry Data: BNA9N7194I090A462HFS480G3H2X069N DZ00
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Solid State Device
Drive Temperature :43C (109.40 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

Exit Code: 0x00
robh@dumpsdata1007:~$
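A report this long is easier to compare against the -LDInfo result once it is condensed. The following is a minimal sketch, not part of the original testing: `pd_summary` is a hypothetical helper name, and it assumes megacli prints one `Field: value` pair per line, as it normally does on a terminal. In the output above it would show slots 0-11 as Unconfigured(good) and slots 12-13 as Online, which is the mismatch with "No Virtual Drive Configured".

```shell
#!/bin/sh
# Condense `megacli -PDList -aALL` output to one line per drive.
# Reads the report on stdin; assumes one "Field: value" pair per line and
# that "Media Type" follows "Slot Number" and "Firmware state" in each record.
pd_summary() {
    awk -F': *' '
        /^Slot Number:/    { slot = $2 }
        /^Firmware state:/ { state = $2 }
        /^Media Type:/     { printf "slot=%s state=\"%s\" media=\"%s\"\n", slot, state, $2 }
    '
}

# Usage (on the host):
#   sudo megacli -PDList -aALL | pd_summary
```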
Change 771442 had a related patch set uploaded (by RobH; author: RobH):
[operations/puppet@production] dumpsdata1006 setup info
Change 771442 merged by RobH:
[operations/puppet@production] dumpsdata1006 setup info
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Change 771679 had a related patch set uploaded (by RobH; author: RobH):
[operations/puppet@production] testing new partman recipe for h750
Change 771679 merged by RobH:
[operations/puppet@production] testing new partman recipe for h750
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1007 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Change 808047 had a related patch set uploaded (by RobH; author: RobH):
[operations/puppet@production] dumpsdata100[67] partman testing
Change 808047 merged by RobH:
[operations/puppet@production] dumpsdata100[67] partman testing
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1007 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run failed, asking the operator what to do
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1007 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1007 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Change 808060 had a related patch set uploaded (by RobH; author: RobH):
[operations/puppet@production] adding lvm wipe for recipe
Change 808060 merged by RobH:
[operations/puppet@production] adding lvm wipe for recipe
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1007 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
So when attempting to run the updated partman recipe I get the following:
│ Unable to automatically remove LVM data │
│ │
│ Because the volume group(s) on the selected device also consist of │
│ physical volumes on other devices, it is not considered safe to │
│ remove its LVM data automatically. If you wish to use this device for │
│ partitioning, please remove its LVM data first. │
I then dropped to a shell and zeroed the first sector of each disk with dd, which didn't fix it:
dd if=/dev/zero of=/dev/sda bs=512 count=1
dd if=/dev/zero of=/dev/sdb bs=512 count=1
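A likely reason the single-sector dd has no effect: LVM2 by default writes its label (which starts with the string LABELONE) in the second 512-byte sector of the physical volume, and if the PV is a partition rather than the whole disk, that sector is further into the disk still. The sketch below demonstrates the first point on a scratch file instead of a real device. The function name is made up for illustration, and the scratch file only imitates the label position, not a full PV.

```shell
#!/bin/sh
# Demo on a scratch file, not a real disk: place an LVM-style label marker in
# sector 1, zero sector 0 only, and report whether the marker is still there.
# Returns 0 if the marker survived the wipe, 1 if it was removed.
label_survives_sector0_wipe() {
    img="$(mktemp)" || return 2
    # Four zeroed sectors stand in for the start of a physical volume.
    dd if=/dev/zero of="$img" bs=512 count=4 2>/dev/null
    # LVM2 label signature, written at the start of sector 1 (byte offset 512).
    printf 'LABELONE' | dd of="$img" bs=512 seek=1 conv=notrunc 2>/dev/null
    # Wipe the first 512 bytes only. conv=notrunc is needed because this is a
    # regular file; on a block device dd never truncates.
    dd if=/dev/zero of="$img" bs=512 count=1 conv=notrunc 2>/dev/null
    if grep -q LABELONE "$img"; then result=0; else result=1; fi
    rm -f "$img"
    return "$result"
}
```

Calling `label_survives_sector0_wipe && echo "label still present"` shows that the marker is untouched by the one-sector wipe.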
I also appended all the LVM wipe options I know to the recipe:
# remove any LVM already on the disks
d-i partman-lvm/device_remove_lvm boolean true
d-i partman-lvm/confirm boolean true
d-i partman-lvm/confirm_nooverwrite boolean true
and I still get the "Unable to automatically remove LVM data" error. I also broke down and rebuilt the RAID array and did a quick init, but the LVM data is still there, so no luck. I could run a full multi-hour init to hopefully remove it, but surely there is a flag for this?
What other flags should I be using to get the old lvm data off?
I didn't attempt vgreduce or vgremove in the shell, since that cannot be automated (VG names and the like), so it's not really feasible at scale. If no one can assist, that will be my next step.
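One possible way around the VG-name problem is to discover the names at run time rather than hard-coding them. The sketch below is untested against the installer environment and assumes the LVM tools (`vgs`, `vgchange`, `vgremove`) are present there, which this task does not confirm. `lvm_wipe_plan` is a hypothetical helper that only prints the commands, so the plan can be reviewed before anything destructive runs.

```shell
#!/bin/sh
# Print (do not run) the commands that would deactivate and remove each VG.
# Reads VG names on stdin, one per line, in the form printed by
# `vgs --noheadings -o vg_name` (leading whitespace is stripped by read).
lvm_wipe_plan() {
    while read -r vg; do
        [ -n "$vg" ] || continue
        printf 'vgchange -an %s\n' "$vg"
        printf 'vgremove -f -y %s\n' "$vg"
    done
}

# Usage:
#   vgs --noheadings -o vg_name | lvm_wipe_plan         # review the plan
#   vgs --noheadings -o vg_name | lvm_wipe_plan | sh    # execute it
```

A volume group with missing physical volumes, which the installer message suggests may be the case here, might additionally need `vgreduce --removemissing` before `vgremove` succeeds.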
As mentioned at the SRE meeting @BTullis is also looking into this for DSE hosts (review at https://gerrit.wikimedia.org/r/c/operations/puppet/+/808870 and thus the cc).
I don't know if it is helpful at this point, but if wipefs is available you can remove the LVM metadata with:
wipefs -a /dev/<device>
Alternatively, obliterating the first few megabytes should do it too:
dd if=/dev/zero of=/dev/sda bs=1M count=50
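Where wipefs is not available, a quick way to check whether a wipe actually removed the label is to look for the LVM2 signature directly. This is a rough sketch under the assumption that the label sits in one of the first four 512-byte sectors of the physical volume, which is where LVM2 looks for it. `has_lvm_label` is a hypothetical name, and the target must be the PV itself (for example a partition), not necessarily the whole disk.

```shell
#!/bin/sh
# Return 0 if the LVM2 label signature "LABELONE" is found in the first four
# 512-byte sectors of the given device or file, non-zero otherwise.
has_lvm_label() {
    dd if="$1" bs=512 count=4 2>/dev/null | grep -q LABELONE
}

# Usage:
#   has_lvm_label /dev/sda2 && echo "LVM label still present"
```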
Change 809641 had a related patch set uploaded (by RobH; author: RobH):
[operations/puppet@production] testing h750 recipes
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
OK, I tried this and I get the same issue: the installer still says LVM data exists. I've even fully initialized and rebuilt the arrays, no luck. wipefs doesn't exist in our installer shell.
Since this is a one-off leftover from the testing, why not just boot with a rescue disk and wipe it manually?
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1007 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1007 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291719_robh_795554_dumpsdata1007.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details
Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:
- dumpsdata1006 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- The reimage failed, see the cookbook logs for the details