Page MenuHomePhabricator

Degraded RAID on ms-be1024
Closed, ResolvedPublic

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host ms-be1024. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P840 in Slot 3

   array A

      Logical Drive: 1
         Size: 186.3 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: /srv/swift-storage/sda4 36.3 GB Partition Number 5, /srv/swift-storage/sda3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array B

      Logical Drive: 2
         Size: 186.3 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sdb 
         Mount Points: /srv/swift-storage/sdb4 36.3 GB Partition Number 5, /srv/swift-storage/sdb3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array C

      Logical Drive: 3
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: /srv/swift-storage/sdc1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array D

      Logical Drive: 4
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: /srv/swift-storage/sdd1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array E

      Logical Drive: 5
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sde 
         Mount Points: /srv/swift-storage/sde1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array F

      Logical Drive: 6
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdf 
         Mount Points: /srv/swift-storage/sdf1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array G

      Logical Drive: 7
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdg 
         Mount Points: /srv/swift-storage/sdg1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array H

      Logical Drive: 8
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdh 
         Mount Points: /srv/swift-storage/sdh1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array I

      Logical Drive: 9
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdi 
         Mount Points: /srv/swift-storage/sdi1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array J

      Logical Drive: 10
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdj 
         Mount Points: /srv/swift-storage/sdj1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array K

      Logical Drive: 11
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdk 
         Mount Points: /srv/swift-storage/sdk1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array L

      Logical Drive: 12
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdl 
         Mount Points: /srv/swift-storage/sdl1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array M

      Logical Drive: 13
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdm 
         Mount Points: /srv/swift-storage/sdm1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array N

      Logical Drive: 14
         Size: 2.7 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdn 
         Mount Points: /srv/swift-storage/sdn1 2.7 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Event Timeline

Restricted Application added a subscriber: Marostegui. · View Herald TranscriptTue, Jul 14, 6:26 PM
wiki_willy added subscribers: Cmjohnson, wiki_willy.

@Cmjohnson - looks like this one is out of warranty. (purchased in June 2016) Let me know if you have any spares onsite or if you need a part ordered. Thanks, Willy

@wiki_willy checked storage we do not have any 3tb drives

Thanks @Jclark-ctr. Hi @Marostegui or @fgiunchedi - since it looks like we'll be testing out some new hardware this quarter and eventually refreshing this via T252216, let me know if you'd like us to proceed with purchasing a new 3tb drive for ms-be1024 or if we should hold off for the refresh. Thanks, Willy

Hey @wiki_willy I'm not a service owner for this kind of host :-)

This appears to be a BBU fault:

Cache Status: Permanently Disabled
Cache Status Details: Cache disabled; battery/capacitor is not attached

Do we have a BBU on site and/or can order a bunch ? I'm generally for replacing small/cheap parts until the refresh if at all doable, thanks!

@Jclark-ctr - it doesn't appear to be a drive issue. Do you have any more BBUs around?

This appears to be a BBU fault:

Cache Status: Permanently Disabled
Cache Status Details: Cache disabled; battery/capacitor is not attached

Do we have a BBU on site and/or can order a bunch ? I'm generally for replacing small/cheap parts until the refresh if at all doable, thanks!

@wiki_willy @fgiunchedi we do have 3 bbu remaining on site

@fgiunchedi if you would like bbu swapped i am available tomorrow morning eastern time

@fgiunchedi if you would like bbu swapped i am available tomorrow morning eastern time

For sure, LMK on IRC or here when available

Mentioned in SAL (#wikimedia-operations) [2020-07-21T15:09:04Z] <godog> poweroff ms-be1024 for bbu replacement - T257949

finished with bbu swap

fgiunchedi closed this task as Resolved.Wed, Jul 22, 7:17 AM

All good, thanks @Jclark-ctr

Cache Board Present: True
Cache Status: OK
Cache Ratio: 10% Read / 90% Write