PROBLEM alert - cloudvirt1023/MegaRAID is CRITICAL
Closed, Duplicate | Public

Description

CRITICAL: 1 failed LD(s) (Degraded)

This is largely harmless, since we no longer use the SSD RAID in that host for storage. We can either ignore this alert indefinitely or rebuild the hardware RAID to exclude the broken drive.

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2022-09-30T14:15:16Z] <wm-bot2> Safe rebooting 'cloudvirt1023.eqiad.wmnet'. (T319025) - cookbook ran by andrew@buster

Mentioned in SAL (#wikimedia-cloud-feed) [2022-09-30T14:15:21Z] <wm-bot2> Draining 'cloudvirt1023.eqiad.wmnet'. (T319025) - cookbook ran by andrew@buster

Mentioned in SAL (#wikimedia-cloud-feed) [2022-09-30T14:15:58Z] <wm-bot2> Set cloudvirt 'cloudvirt1023.eqiad.wmnet' maintenance (downtime id: 64eac5c6-4b1d-4269-98fd-8e5bed42ce40, use this to unset). (T319025) - cookbook ran by andrew@buster

Mentioned in SAL (#wikimedia-cloud-feed) [2022-09-30T14:16:01Z] <wm-bot2> Drained 'cloudvirt1023.eqiad.wmnet'. (T319025) - cookbook ran by andrew@buster

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1023 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1023.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1023 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details