
Degraded RAID on cloudvirt1014
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host cloudvirt1014. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5, 2I:1:6 - Controller: OK - Battery count: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P440ar in Slot 0 (Embedded)

   array A

      Logical Drive: 1
         Size: 2.9 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 512 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: / 85.7 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path
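To cross-check the snapshot by script rather than by eye, the per-drive status can be pulled out of the `physicaldrive` lines. The following is a minimal sketch, not the logic of the actual `get-raid-status-hpssacli` plugin: the function names `drive_statuses` and `failed_drives` are hypothetical, and the regular expression assumes the exact line layout seen in this snapshot (status as the last comma-separated field inside the parentheses).

```python
import re

# Excerpt copied from the snapshot above; other controller models or
# hpssacli versions may format these lines differently (assumption).
SNAPSHOT = """\
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
"""

# Captures the drive ID and everything between the parentheses.
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\((.*)\)")


def drive_statuses(text):
    """Map each physical drive ID to its reported status string."""
    statuses = {}
    for match in DRIVE_RE.finditer(text):
        drive_id, details = match.groups()
        # The status is assumed to be the last comma-separated field.
        statuses[drive_id] = details.rsplit(",", 1)[-1].strip()
    return statuses


def failed_drives(text):
    """Return the IDs of drives whose status is anything other than OK."""
    return [d for d, s in drive_statuses(text).items() if s != "OK"]


if __name__ == "__main__":
    print(drive_statuses(SNAPSHOT))
    print(failed_drives(SNAPSHOT))
```

For the excerpt above, `failed_drives` returns an empty list, which matches the "OK" shown for every listed drive.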

Event Timeline

This seems to refer to a missing battery. The disks seem to be OK, right?
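One way to check that reading is to split the Icinga summary line from the description into its fields. This is only a sketch under stated assumptions: the field layout is inferred from this single alert line, `parse_summary` is a hypothetical helper, and none of this is taken from the real event handler.

```python
import re

# Summary line copied verbatim from the task description.
SUMMARY = (
    "CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5, 2I:1:6"
    " - Controller: OK - Battery count: 0"
)


def parse_summary(line):
    """Split the summary line into overall state, drives, controller, battery."""
    # Overall check state comes first, e.g. "CRITICAL".
    state, _, rest = line.partition(": ")
    fields = rest.split(" - ")

    # First field is assumed to look like "Slot 0: OK: 1I:1:1, 1I:1:2, ...".
    slot_match = re.match(r"Slot (\d+): (\w+): (.*)", fields[0])
    if slot_match is None:
        raise ValueError("unexpected summary format: %r" % line)
    slot, drive_status, drives = slot_match.groups()

    result = {
        "state": state,
        "slot": int(slot),
        "drive_status": drive_status,
        "drives": drives.split(", "),
    }
    # Remaining fields are "Key: value" pairs.
    for field in fields[1:]:
        key, value = field.split(": ", 1)
        if key == "Controller":
            result["controller"] = value
        elif key == "Battery count":
            result["battery_count"] = int(value)
    return result


if __name__ == "__main__":
    info = parse_summary(SUMMARY)
    print(info["drive_status"], info["controller"], info["battery_count"])
```

On this line the drives and controller both parse as OK and the only non-healthy-looking value is a battery count of 0, which is consistent with a missing or failed battery rather than a disk fault.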

wiki_willy moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Dec 28 2019, 9:05 PM
herron triaged this task as High priority. Jan 3 2020, 7:44 PM
aborrero added subscribers: Andrew, Bstorm, Aklapper.

@Jclark-ctr I believe this server may need the BBU checked/replaced, but I may be wrong.

BTW this server has active workloads (pooled) at the moment. @Jclark-ctr, please coordinate with WMCS before shutting the server down.

@aborrero It is out of warranty, but I do have a spare BBU and can replace it. @JHedden and I spoke briefly about this one last night. I am on site now, and if you would like to change it today, I can. However, it is the Friday before All Hands, so I would like to change it on my return unless you think this is urgent.

Change 572072 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: depool cloudvirt1022 and cloudvirt1014

https://gerrit.wikimedia.org/r/572072

Change 572072 merged by Andrew Bogott:
[operations/puppet@production] nova: depool cloudvirt1022 and cloudvirt1014

https://gerrit.wikimedia.org/r/572072

This host is now drained and ready for maintenance.

Change 575094 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] nova: add cloudvirt1014 to scheduler pool

https://gerrit.wikimedia.org/r/575094

Change 575094 merged by Jhedden:
[operations/puppet@production] nova: add cloudvirt1014 to scheduler pool

https://gerrit.wikimedia.org/r/575094

JHedden closed this task as Resolved. Feb 26 2020, 10:32 PM

Cloudvirt1014 is fixed and back online. Thanks @Jclark-ctr!

Replaced the failed BBU.