Degraded RAID on ms-be2021
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host ms-be2021. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Smart Array P840 in Slot 3

   array A

      Logical Drive: 1
         Size: 279.4 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: /srv/swift-storage/sda4 129.5 GB Partition Number 5, /srv/swift-storage/sda3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array B

      Logical Drive: 2
         Size: 279.4 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Disabled
         Disk Name: /dev/sdb 
         Mount Points: /srv/swift-storage/sdb4 129.5 GB Partition Number 5, /srv/swift-storage/sdb3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array C

      Logical Drive: 3
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: /srv/swift-storage/sdc1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array D

      Logical Drive: 4
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: /srv/swift-storage/sdd1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array E

      Logical Drive: 5
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sde 
         Mount Points: /srv/swift-storage/sde1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array F

      Logical Drive: 6
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdf 
         Mount Points: /srv/swift-storage/sdf1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array G

      Logical Drive: 7
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdg 
         Mount Points: /srv/swift-storage/sdg1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array H

      Logical Drive: 8
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdh 
         Mount Points: /srv/swift-storage/sdh1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array I

      Logical Drive: 9
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdi 
         Mount Points: /srv/swift-storage/sdi1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array J

      Logical Drive: 10
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdj 
         Mount Points: /srv/swift-storage/sdj1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array K

      Logical Drive: 11
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdk 
         Mount Points: /srv/swift-storage/sdk1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array L

      Logical Drive: 12
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdl 
         Mount Points: /srv/swift-storage/sdl1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array M

      Logical Drive: 13
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdm 
         Mount Points: /srv/swift-storage/sdm1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array N

      Logical Drive: 14
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdn 
         Mount Points: /srv/swift-storage/sdn1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Restricted Application added a subscriber: Aklapper. · Oct 26 2018, 8:57 PM
fgiunchedi assigned this task to Papaul. · Oct 29 2018, 7:53 AM
fgiunchedi added subscribers: Papaul, fgiunchedi.

Hi @Papaul,
It looks like this controller has a faulty battery, which will likely need to be replaced. What do you think? We have seen this issue before on a number of ms-be hosts in eqiad: https://phabricator.wikimedia.org/search/query/DukvNR_2zFog/#R
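
For reference, the reading above (a controller cache/battery problem rather than a failed disk) can be checked against the summary line in the description. The snippet below is only a rough sketch: the field names (`Controller`, `Cache`, `Battery/Capacitor`) are taken from this one alert line, the function names are made up for illustration, and the plugin's real output format may differ in other cases.

```python
import re

# Summary line copied verbatim from the task description.
SUMMARY = (
    "CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, "
    "1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - "
    "Controller: OK - Cache: Permanently Disabled - Cable Error - "
    "Battery/Capacitor: Recharging"
)

# Assumption: these three labels are the only per-component fields; they are
# inferred from the line above, not from the plugin's documentation.
KEYS = r"(?:Controller|Cache|Battery/Capacitor)"
FIELD_RE = re.compile(rf"({KEYS}): (.*?)(?= - {KEYS}: |$)")

# Physical drive identifiers such as "2I:4:1" (port:box:bay).
DRIVE_RE = re.compile(r"\b\dI:\d+:\d+\b")


def component_states(summary):
    """Return a dict mapping each component label to its reported state."""
    return dict(FIELD_RE.findall(summary))


def non_ok_components(summary):
    """Return only the components whose state is something other than OK."""
    return {k: v for k, v in component_states(summary).items() if v != "OK"}


if __name__ == "__main__":
    print("drives listed:", len(DRIVE_RE.findall(SUMMARY)))
    for name, state in non_ok_components(SUMMARY).items():
        print(f"{name}: {state}")
```

On this alert, every drive appears in the OK list and only the cache and battery/capacitor fields are in a non-OK state, which is consistent with a battery fault rather than a disk failure.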

Papaul triaged this task as Normal priority. · Oct 29 2018, 2:36 PM

@fgiunchedi we can try draining the power, unplugging and re-plugging the controller cable, and updating the server firmware as well. I have done this on some DB servers. Let me know what you think.

Mentioned in SAL (#wikimedia-operations) [2018-10-29T15:48:13Z] <godog> power off ms-be2021 for controller alarms troubleshooting - T208096

Papaul reassigned this task from Papaul to fgiunchedi. · Oct 29 2018, 5:09 PM

@fgiunchedi the battery is dead and needs to be replaced. I updated the firmware. The server is back up.

fgiunchedi reassigned this task from fgiunchedi to Papaul. · Oct 29 2018, 5:16 PM

Thanks @Papaul, the host is back up. Please proceed with ordering a replacement battery and let me know when you are ready to swap it.

Papaul mentioned this in Unknown Object (Task). · Oct 29 2018, 9:21 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-22T21:08:38Z] <godog> disable raid handler for ms-be2021 - T208096

Papaul reassigned this task from Papaul to fgiunchedi. · Mon, Dec 3, 4:50 PM

@fgiunchedi All looks good on the system. You can resolve the task once you have finished double-checking.

fgiunchedi closed this task as Resolved. · Tue, Dec 4, 8:08 AM

LGTM on my side too. I've re-enabled the event handler.