dumpsdata1007 test installs
Open, Needs Triage, Public

Description

I don't want to keep polluting the main setup task with all my setup failures from the hardware RAID testing, so I'm using this subtask instead.

Event Timeline

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed:

  • dumpsdata1007 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203022321_robh_12658_dumpsdata1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

So this is now installed with a hardware RAID1 single-disk setup, just to see whether it works at all within the OS.

When I then boot into the OS, it loads, but any megacli command hangs it.

15:52 mutante:
robh: icinga config does not like dumpsdata1007 right now ..because of: Error: 'lsw1-f1-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host 'dumpsdata1007'

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: dumpsdata1007.eqiad.wmnet

  • dumpsdata1007.eqiad.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run Homer on lsw1-f1-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'lsw1-f1-eqiad.mgmt.eqiad.wmnet', 'commit', 'Host decommission - robh@cumin1001 - T302937']' returned non-zero exit status 1.

ERROR: some step on some host failed, check the bolded items above

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed:

  • dumpsdata1007 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203031839_robh_18601_dumpsdata1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

OK, this is now installed. However, although I have a single RAID1 of the two SSDs, megacli doesn't seem to see it.

It can read the controller info, though, so the feedback is inconsistent.

robh@dumpsdata1007:~$ sudo megacli -LDInfo -Lall -aALL
                                     

Adapter 0 -- Virtual Drive Information:
Adapter 0: No Virtual Drive Configured.

Exit Code: 0x00
robh@dumpsdata1007:~$ sudo megacli -PDList -aALL
                                     
Adapter #0

Enclosure Device ID: 64
Slot Number: 0
Enclosure position: 1
Device Id: 1
WWN: 5000039b18e1ee0d
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c0
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1ERF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 1
Enclosure position: 1
Device Id: 2
WWN: 5000039b18e1ecaf
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c1
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1C4F1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 2
Enclosure position: 1
Device Id: 6
WWN: 5000039b18e1ee17
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c2
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1EVF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 3
Enclosure position: 1
Device Id: 7
WWN: 5000039b18e1ef13
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c3
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1GMF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 4
Enclosure position: 1
Device Id: 5
WWN: 5000039b18e27f3f
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c4
Connected Port Number: 8(path0) 
Inquiry Data:         81V0A03GF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 5
Enclosure position: 1
Device Id: 9
WWN: 5000039b18e1eeef
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c5
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1GDF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :27C (80.60 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 6
Enclosure position: 1
Device Id: 3
WWN: 5000039b18e298cf
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c6
Connected Port Number: 8(path0) 
Inquiry Data:         81V0A0FRF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 7
Enclosure position: 1
Device Id: 12
WWN: 5000039b18e1c6fb
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c7
Connected Port Number: 8(path0) 
Inquiry Data:         81T0A0QSF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :29C (84.20 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 8
Enclosure position: 1
Device Id: 8
WWN: 5000039b18e1f788
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c8
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1LUF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 9
Enclosure position: 1
Device Id: 11
WWN: 5000039b18e1c703
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1c9
Connected Port Number: 8(path0) 
Inquiry Data:         81T0A0QTF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 10
Enclosure position: 1
Device Id: 10
WWN: 5000039b18e1f736
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1ca
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1LHF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 11
Enclosure position: 1
Device Id: 13
WWN: 5000039b18e1ed07
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 7.277 TB [0x3a3812ab0 Sectors]
Non Coerced Size: 7.276 TB [0x3a3712ab0 Sectors]
Coerced Size: 7.276 TB [0x3a3700000 Sectors]
Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: GA09
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1cb
Connected Port Number: 8(path0) 
Inquiry Data:         81U0A1CNF1QFTOSHIBA MG06ACA800EY                        GA09
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :28C (82.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 12
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5ace42e0253abda0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 447.130 GB [0x37e436b0 Sectors]
Non Coerced Size: 446.630 GB [0x37d436b0 Sectors]
Coerced Size: 446.625 GB [0x37d40000 Sectors]
Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: DZ00
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1cc
Connected Port Number: 8(path0) 
Inquiry Data:    BNA9N7194I090A461HFS480G3H2X069N                             DZ00
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Solid State Device
Drive Temperature :41C (105.80 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No



Enclosure Device ID: 64
Slot Number: 13
Drive's position: DiskGroup: 0, Span: 0, Arm: 1
Enclosure position: 1
Device Id: 4
WWN: 5ace42e0253abda1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 447.130 GB [0x37e436b0 Sectors]
Non Coerced Size: 446.630 GB [0x37d436b0 Sectors]
Coerced Size: 446.625 GB [0x37d40000 Sectors]
Sector Size:  512
Firmware state: Online, Spun Up
Device Firmware Level: DZ00
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b327bfa1cd
Connected Port Number: 8(path0) 
Inquiry Data:    BNA9N7194I090A462HFS480G3H2X069N                             DZ00
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Solid State Device
Drive Temperature :43C (109.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : Enabled
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No




Exit Code: 0x00
robh@dumpsdata1007:~$
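The two outputs above disagree: -LDInfo reports no virtual drive configured, yet -PDList shows the SSDs in slots 12 and 13 as "Online" with a "Drive's position: DiskGroup: 0" line, while the twelve HDDs in slots 0-11 are "Unconfigured(good)". A minimal sketch (summarise_pdlist is a hypothetical helper) that condenses -PDList output to one line per drive, which makes that kind of mismatch easier to spot. It only relies on the field labels seen in the output above; other megacli versions may label fields differently.

```shell
#!/bin/sh
# Condense `megacli -PDList -aALL` output into one tab-separated line per
# physical drive: slot, media type, firmware state, disk-group position.
# Drives that belong to no disk group get "-" in the last column.
summarise_pdlist() {
    awk '
        # Return the text after the first colon of the current line,
        # with surrounding whitespace trimmed.
        function val(s) {
            s = $0
            sub(/^[^:]*:[ \t]*/, "", s)
            sub(/[ \t\r]+$/, "", s)
            return s
        }
        /^Slot Number:/       { slot = val(); pos = "-" }
        /^Drive.s position:/  { pos = val() }   # "." matches the apostrophe
        /^Firmware state:/    { state = val() }
        # "Media Type" comes after the fields above in each drive block.
        /^Media Type:/        { print slot "\t" val() "\t" state "\t" pos }
    '
}
```

Usage: `sudo megacli -PDList -aALL | summarise_pdlist`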

Change 771442 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] dumpsdata1006 setup info

https://gerrit.wikimedia.org/r/771442

Change 771442 merged by RobH:

[operations/puppet@production] dumpsdata1006 setup info

https://gerrit.wikimedia.org/r/771442

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 771679 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] testing new partman recipe for h750

https://gerrit.wikimedia.org/r/771679

Change 771679 merged by RobH:

[operations/puppet@production] testing new partman recipe for h750

https://gerrit.wikimedia.org/r/771679

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1006.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 808047 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] dumpsdata100[67] partman testing

https://gerrit.wikimedia.org/r/808047

Change 808047 merged by RobH:

[operations/puppet@production] dumpsdata100[67] partman testing

https://gerrit.wikimedia.org/r/808047

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed, asking the operator what to do
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Change 808060 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] adding lvm wipe for recipe

https://gerrit.wikimedia.org/r/808060

Change 808060 merged by RobH:

[operations/puppet@production] adding lvm wipe for recipe

https://gerrit.wikimedia.org/r/808060

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

When attempting to run the updated partman recipe, I get the following:

Screen Shot 2022-06-23 at 12.38.40 PM.png (738×982 px, 90 KB)

Unable to automatically remove LVM data

Because the volume group(s) on the selected device also consist of physical volumes on other devices, it is not considered safe to remove its LVM data automatically. If you wish to use this device for partitioning, please remove its LVM data first.

I then dropped to the installer shell and zeroed the first sector of each disk with dd, but that didn't fix it:

dd if=/dev/zero of=/dev/sda bs=512 count=1
dd if=/dev/zero of=/dev/sdb bs=512 count=1
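One possible reason this had no effect: dd bs=512 count=1 only zeroes sector 0, while pvcreate writes the LVM2 PV label (magic string "LABELONE") into sector 1 by default, and it may be anywhere in the first four sectors. The sketch below demonstrates this on a scratch image file rather than a real disk; has_lvm_label and demo are hypothetical helpers, and the fake label is only the 8-byte magic, not a full LVM header.

```shell
#!/bin/sh
# Demo on a scratch image file (never a real disk): zeroing only sector 0
# leaves an LVM2 PV label stored in sector 1 untouched.

# Succeeds if the LVM2 label magic "LABELONE" appears anywhere in the first
# four 512-byte sectors of the given file or block device.
has_lvm_label() {
    dd if="$1" bs=512 count=4 2>/dev/null | tr -d '\000' | LC_ALL=C grep -q LABELONE
}

demo() {
    img=$(mktemp)
    # 4 KiB of zeros standing in for the start of a disk.
    dd if=/dev/zero of="$img" bs=512 count=8 2>/dev/null
    # Fake label: just the 8-byte magic at offset 512 (sector 1), the
    # sector pvcreate uses by default. A real label header is longer.
    printf 'LABELONE' | dd of="$img" bs=512 seek=1 conv=notrunc 2>/dev/null

    # Same as the command tried above: zero sector 0 only.
    dd if=/dev/zero of="$img" bs=512 count=1 conv=notrunc 2>/dev/null
    if has_lvm_label "$img"; then echo "after count=1: label present"; else echo "after count=1: label gone"; fi

    # Zero all four candidate label sectors.
    dd if=/dev/zero of="$img" bs=512 count=4 conv=notrunc 2>/dev/null
    if has_lvm_label "$img"; then echo "after count=4: label present"; else echo "after count=4: label gone"; fi

    rm -f "$img"
}

demo
```

It should report the label as still present after count=1 and gone after count=4.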

I also appended all the LVM wipe options I know of to the recipe:

# remove any LVM already on the disks
d-i	partman-lvm/device_remove_lvm	boolean	true
d-i	partman-lvm/confirm		boolean	true
d-i	partman-lvm/confirm_nooverwrite	boolean true

and I still get the "Unable to automatically remove LVM data" error. I also tore down and rebuilt the RAID array and did a quick init, but the LVM data is still there, so no luck. I could run a full, multi-hour init in the hope that it removes the data, but surely there's a flag to get around this?

What other flags should I be using to get the old LVM data off?

I didn't attempt vgreduce or vgremove from the shell, since that can't be automated (VG names and the like), so it isn't really feasible at scale. If no one can assist, that will be my next step.
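For what it's worth, the VG and PV names don't have to be known in advance, since the LVM tools can list them. A minimal sketch, assuming vgs, vgchange, vgremove, pvs and pvremove are available in the installer shell (if only the lvm binary is present, the same subcommands can be run as `lvm vgs`, `lvm vgchange`, and so on). The run and purge_all_lvm helpers are hypothetical, and running this from the preseed's partman/early_command hook is an untested suggestion, not something tried in this task. It is destructive, so it only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Remove every LVM volume group and physical volume the LVM tools can see,
# without hardcoding any names. Hypothetical helper; destructive.
# Only prints the commands unless DRY_RUN=0 is set.

run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        echo "$*"
    fi
}

purge_all_lvm() {
    # --noheadings drops the header row; tr strips the padding spaces.
    # LVM names cannot contain spaces, so word splitting is safe here.
    for vg in $(vgs --noheadings -o vg_name 2>/dev/null | tr -d ' '); do
        run vgchange -an "$vg"    # deactivate its logical volumes
        run vgremove -f "$vg"     # remove the VG and its LVs without prompting
    done
    # Once the VGs are gone the PVs are orphans; wipe their labels.
    for pv in $(pvs --noheadings -o pv_name 2>/dev/null | tr -d ' '); do
        run pvremove -ff -y "$pv"
    done
}
```

Check the dry-run output of `purge_all_lvm` first, then run `DRY_RUN=0 purge_all_lvm`.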

As mentioned at the SRE meeting, @BTullis is also looking into this for DSE hosts (review at https://gerrit.wikimedia.org/r/c/operations/puppet/+/808870, hence the CC).

I don't know if it is helpful at this point, but if wipefs is available you can remove the LVM metadata with:

wipefs -a /dev/<device>

Alternatively, zeroing the first few megabytes should do it too:

dd if=/dev/zero of=/dev/sda bs=1M count=50
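Zeroing the start of /dev/sda only helps if the PV label lives there. If the old recipe put the PVs on partitions that start further into the disk (for example a hypothetical /dev/sda2), their labels survive, and per the installer dialog above the volume group also spans other devices. A hedged sketch (list_targets and wipe_starts are hypothetical names) that reads /proc/partitions and zeroes the first MiB of every sd* disk and partition. The 1 MiB figure assumes LVM's default layout, where the label and metadata area sit before the first extent at 1 MiB. It only prints the dd commands unless DRY_RUN=0 is set, and it would wipe every SCSI/SATA disk the installer can see, so it only suits a host being reinstalled from scratch.

```shell
#!/bin/sh
# Zero the first MiB of every sd* disk and partition listed in a
# /proc/partitions-style file. Hypothetical helper; destructive, wipes
# every SCSI/SATA disk visible to the installer.
# Only prints the dd commands unless DRY_RUN=0 is set.

# Print /dev paths for whole disks (sda) and partitions (sda1).
# The first two lines of /proc/partitions are a header and a blank line.
list_targets() {
    awk 'NR > 2 && $4 ~ /^sd[a-z]+[0-9]*$/ { print "/dev/" $4 }' "$1"
}

wipe_starts() {
    for dev in $(list_targets "${1:-/proc/partitions}"); do
        if [ "${DRY_RUN:-1}" = 0 ]; then
            # 1 MiB assumes the default LVM layout, where the label and
            # metadata area sit before the first extent at 1 MiB.
            dd if=/dev/zero of="$dev" bs=1M count=1
        else
            echo "dd if=/dev/zero of=$dev bs=1M count=1"
        fi
    done
}
```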

Change 809641 had a related patch set uploaded (by RobH; author: RobH):

[operations/puppet@production] testing h750 recipes

https://gerrit.wikimedia.org/r/809641

Change 809641 merged by RobH:

[operations/puppet@production] testing h750 recipes

https://gerrit.wikimedia.org/r/809641

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

OK, I tried the dd approach and get the same issue: the installer still says LVM data exists (the same dialog as in Screen Shot 2022-06-23 at 12.38.40 PM.png above).

I've even fully initialized and rebuilt the arrays, with no luck. wipefs doesn't exist in our installer shell.

Since this is a one-off leftover from the testing, why not just boot from a rescue disk and wipe it manually?

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye executed with errors:

  • dumpsdata1007 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202206291719_robh_795554_dumpsdata1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details