Degraded RAID on labstore1007
Closed, Resolved | Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labstore1007. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK --- Slot 3: bad transfer speed: 1E:2:1(12.0Gbps, Unknown), 1E:2:2(12.0Gbps, Unknown), 1E:2:3(12.0Gbps, Unknown), 1E:2:4(12.0Gbps, Unknown), 1E:2:5(12.0Gbps, Unknown), 1E:2:6(12.0Gbps, Unknown), 1E:2:7(12.0Gbps, Unknown), 1E:2:8(12.0Gbps, Unknown), 1E:2:9(12.0Gbps, Unknown), 1E:2:10(12.0Gbps, Unknown), 1E:2:11(12.0Gbps, Unknown), 1E:2:12(12.0Gbps, Unknown) - OK: 1E:1:1, 1E:1:3, 1E:1:5, 1E:1:7, 1E:1:9, 1E:1:11, 1E:2:1, 1E:2:2, 1E:2:3, 1E:2:4, 1E:2:5, 1E:2:6, 1E:2:7, 1E:2:8, 1E:2:9, 1E:2:10, 1E:2:11, 1E:2:12 - Failed: 1E:1:2, 1E:1:4, 1E:1:6, 1E:1:8, 1E:1:10, 1E:1:12 - Controller: OK - Battery/Capacitor: OK
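The summary line above packs every controller slot into one string, which makes it hard to read at a glance. As a rough aid, here is a minimal Python sketch that splits such a line into per-slot sections. The function name `parse_raid_summary` is hypothetical, the separators (" --- " between slots, " - " between sections) are inferred only from this one alert, and the sample is an abbreviated excerpt of the line above, so treat it as an illustration rather than the plugin's actual logic.

```python
import re

def parse_raid_summary(line):
    """Split a check summary into {slot number: {section name: [items]}}.

    Assumes the layout seen in this ticket: slots separated by " --- ",
    sections separated by " - ", each section written as "name: a, b, c".
    """
    # Drop the leading Icinga state word ("CRITICAL: ", "OK: ", ...) if present.
    body = re.sub(r"^(?:OK|WARNING|CRITICAL|UNKNOWN): (?=Slot )", "", line.strip())
    slots = {}
    for chunk in body.split(" --- "):
        m = re.match(r"Slot (\d+): (.*)", chunk)
        if not m:
            continue
        sections = {}
        for part in m.group(2).split(" - "):
            key, _, value = part.partition(": ")
            # Split on commas, but not on commas inside "(12.0Gbps, Unknown)".
            sections[key] = re.split(r",\s*(?![^()]*\))", value)
        slots[int(m.group(1))] = sections
    return slots

# Abbreviated excerpt of the alert above (drive lists shortened).
sample = (
    "CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK"
    " --- Slot 3: bad transfer speed: 1E:2:1(12.0Gbps, Unknown), 1E:2:2(12.0Gbps, Unknown)"
    " - OK: 1E:1:1, 1E:1:3 - Failed: 1E:1:2, 1E:1:4"
    " - Controller: OK - Battery/Capacitor: OK"
)
summary = parse_raid_summary(sample)
for slot, sections in sorted(summary.items()):
    print(slot, sections.get("Failed", []))
```

Run against the full line, the same function would list the drives under "Failed" and "bad transfer speed" for Slot 3 separately from the healthy Slot 1.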

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Smart Array P441 in Slot 3

   array A

      Logical Drive: 1
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: Failed
         MultiDomain Status: OK
         Caching:  Enabled
         Mirror Group 1:
            physicaldrive 1E:1:1 (port 1E:box 1:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 6001.1 GB, Failed)
            physicaldrive 1E:1:3 (port 1E:box 1:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:4 (port 1E:box 1:bay 4, SATA, 6001.1 GB, Failed)
            physicaldrive 1E:1:5 (port 1E:box 1:bay 5, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:6 (port 1E:box 1:bay 6, SATA, 6001.1 GB, Failed)
         Mirror Group 2:
            physicaldrive 1E:1:7 (port 1E:box 1:bay 7, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:8 (port 1E:box 1:bay 8, SATA, 6001.1 GB, Failed)
            physicaldrive 1E:1:9 (port 1E:box 1:bay 9, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:10 (port 1E:box 1:bay 10, SATA, 6001.1 GB, Failed)
            physicaldrive 1E:1:11 (port 1E:box 1:bay 11, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:12 (port 1E:box 1:bay 12, SATA, 6001.1 GB, Failed)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 931.5 GB
         Fault Tolerance: 1
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sda 
         Mount Points: /boot 953 MB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 2I:4:1 (port 2I:box 4:bay 1, SATA, 1 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:4:2 (port 2I:box 4:bay 2, SATA, 1 TB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array B

      Logical Drive: 2
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdb 
         Mount Points: None
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 6001.1 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SATA, 6001.1 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
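To tally the snapshot above mechanically, the following is a minimal Python sketch that groups non-OK physical drives by controller, logical drive, and mirror group. The function name `failed_by_mirror_group` and the regular expression are assumptions based only on the output shown in this task; other hpssacli versions may format these lines differently. The sample is an excerpt of the Slot 3 section.

```python
import re
from collections import defaultdict

# Matches e.g. "physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 6001.1 GB, Failed)"
DRIVE_RE = re.compile(r"physicaldrive (\S+) \(.*, (\w+)\)\s*$")

def failed_by_mirror_group(report):
    """Return {(controller, logical drive, mirror group): [non-OK drive ids]}."""
    failed = defaultdict(list)
    controller = ld = group = None
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("Smart Array"):
            controller, ld, group = line, None, None
        elif line.startswith("Logical Drive:"):
            ld, group = line.split(":", 1)[1].strip(), None
        elif line.startswith("Mirror Group"):
            group = line.rstrip(":")
        else:
            m = DRIVE_RE.match(line)
            if m and m.group(2) != "OK":
                failed[(controller, ld, group)].append(m.group(1))
    return dict(failed)

# Excerpt of the Slot 3 section of the snapshot above.
sample_report = """
Smart Array P441 in Slot 3
   array A
      Logical Drive: 1
         Status: Failed
         Mirror Group 1:
            physicaldrive 1E:1:1 (port 1E:box 1:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 6001.1 GB, Failed)
            physicaldrive 1E:1:3 (port 1E:box 1:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:4 (port 1E:box 1:bay 4, SATA, 6001.1 GB, Failed)
         Mirror Group 2:
            physicaldrive 1E:1:7 (port 1E:box 1:bay 7, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:8 (port 1E:box 1:bay 8, SATA, 6001.1 GB, Failed)
"""
tally = failed_by_mirror_group(sample_report)
for key, drives in tally.items():
    print(key, drives)
```

Applied to the full snapshot, the tally would contain only the Slot 3 logical drive, with the even-numbered bays of box 1 reported as Failed in both mirror groups; the Slot 1 arrays would contribute no entries.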

Event Timeline

chasemp assigned this task to Cmjohnson. Jun 28 2018, 3:14 PM
chasemp added subscribers: Bstorm, ArielGlenn, RobH and 2 others.

I don't quite understand this. Is this trying to say 6 failed drives?

chasemp triaged this task as High priority. Jun 28 2018, 3:15 PM

and labstore1006 as well? [from irc]

PROBLEM - Device not healthy -SMART- on labstore1006 is CRITICAL: cluster=misc device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1006:9100 job=node site=eqiad

Something seems off here.

chasemp added a subscriber: Volans. Jun 28 2018, 3:16 PM

@Volans can you help make sense of this?

Cmjohnson closed this task as Resolved. Jul 2 2018, 4:14 PM

The issue resulted from a disk shelf being added incorrectly. This has been fixed.