Degraded RAID on ms-be1032
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host ms-be1032. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:1 - Controller: OK - Battery/Capacitor: OK
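The summary line above packs the state of every drive into one string. As a minimal sketch, assuming the fields are always separated by ` - ` and that drive IDs always look like `2I:2:1` (both inferred from this single sample, not from the plugin's source), it could be split up in Python as follows. The function name `parse_raid_summary` is hypothetical.

```python
import re

# Drive IDs in the sample look like "<port>:<box>:<bay>", e.g. "2I:2:1".
DRIVE_ID = re.compile(r"\d+[A-Z]:\d+:\d+")


def parse_raid_summary(line):
    """Split a check summary into severity, slot, drive lists and component states."""
    severity, _, rest = line.partition(": ")
    slot, _, rest = rest.partition(": ")
    drives, components = {}, {}
    for field in rest.split(" - "):
        key, _, value = field.partition(": ")
        items = [item.strip() for item in value.split(",")]
        if all(DRIVE_ID.fullmatch(item) for item in items):
            drives[key] = items          # e.g. "OK" or "Failed" -> list of drive IDs
        else:
            components[key] = value      # e.g. "Controller" -> "OK"
    return {"severity": severity, "slot": slot,
            "drives": drives, "components": components}


summary = (
    "CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, "
    "1I:1:7, 1I:1:8, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:1 "
    "- Controller: OK - Battery/Capacitor: OK"
)
parsed = parse_raid_summary(summary)
print(parsed["drives"]["Failed"])
```

For this ticket the only entry under `Failed` is `2I:2:1`, the bay that is later inspected with `pd 2I:2:1 show detail`.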

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P840 in Slot 3

   array A

      Logical Drive: 1
         Size: 447.1 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: /srv/swift-storage/sda4 297.2 GB Partition Number 5, /srv/swift-storage/sda3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array B

      Logical Drive: 2
         Size: 447.1 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sdb 
         Mount Points: /srv/swift-storage/sdb4 297.2 GB Partition Number 5, /srv/swift-storage/sdb3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array C

      Logical Drive: 3
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: /srv/swift-storage/sdc1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array D

      Logical Drive: 4
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: /srv/swift-storage/sdd1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array E

      Logical Drive: 5
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sde 
         Mount Points: /srv/swift-storage/sde1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array F

      Logical Drive: 6
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdf 
         Mount Points: /srv/swift-storage/sdf1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array G

      Logical Drive: 7
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdg 
         Mount Points: /srv/swift-storage/sdg1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array H

      Logical Drive: 8
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdh 
         Mount Points: /srv/swift-storage/sdh1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array I

      Logical Drive: 9
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdi 
         Mount Points: /srv/swift-storage/sdi1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array J

      Logical Drive: 10
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdj 
         Mount Points: /srv/swift-storage/sdj1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array K

      Logical Drive: 11
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: Failed
         MultiDomain Status: OK
         Caching:  Enabled
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array L

      Logical Drive: 12
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdl 
         Mount Points: /srv/swift-storage/sdl1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array M

      Logical Drive: 13
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdm 
         Mount Points: /srv/swift-storage/sdm1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array N

      Logical Drive: 14
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdn 
         Mount Points: /srv/swift-storage/sdn1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache
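With fourteen near-identical blocks, the one failed logical drive is easy to miss. The following is a rough Python sketch, assuming the report keeps the `array X` / `Logical Drive: N` / `Status: ...` layout seen above; `failed_logical_drives` and the shortened `report` excerpt are illustrative, not part of the plugin.

```python
import re


def failed_logical_drives(report):
    """Return (array, logical drive number, status) for every LD whose Status is not OK."""
    failed = []
    array = None
    ld = None
    for raw in report.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"array (\S+)", line)
        if match:
            array = match.group(1)
            continue
        match = re.fullmatch(r"Logical Drive: (\d+)", line)
        if match:
            ld = int(match.group(1))
            continue
        # Only the plain "Status:" field counts; "MultiDomain Status:" and
        # "OS Status:" start with a different prefix and are skipped.
        if ld is not None and line.startswith("Status: "):
            status = line[len("Status: "):]
            if status != "OK":
                failed.append((array, ld, status))
    return failed


# Shortened excerpt of the report above (arrays J and K only).
report = """
   array J

      Logical Drive: 10
         Size: 3.6 TB
         Status: OK
         MultiDomain Status: OK
         OS Status: LOCKED

   array K

      Logical Drive: 11
         Size: 3.6 TB
         Status: Failed
         MultiDomain Status: OK
"""
print(failed_logical_drives(report))
```

Each logical drive here has Fault Tolerance 0, so a single failed physical drive takes its whole logical drive down, which is why array K also lacks a disk name and mount point.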

Event Timeline

@Cmjohnson @Jclark-ctr the host is out of warranty (OOW); please replace the 4TB drive (its LED should be blinking).

jcrespo triaged this task as Medium priority. Jan 18 2021, 5:52 PM
jcrespo added a subscriber: jcrespo.

Assigning Medium priority to remove this from the SRE untriaged inbox; feel free to change it if you disagree.

This server is out of warranty, and I do not have any HP 4TB disks to replace it with. I may have some old Dell ones I can use. They should work.

I did swap it with a 4TB disk from a Dell server. Hopefully, this works.

Thank you @Cmjohnson! It doesn't look like the host likes the new disk :(

Once ms-be1046 is repaired in T272396, I'll start the decom of one host so there will be spare HP 4TB drives.

=> ld 11 modify reenable

Warning: Any previously existing data on the logical drive may not be valid or 
         recoverable. Continue? (y/n) y


Error: This operation is not supported with the current configuration. Use the 
       "show" command on devices to show additional details about the
       configuration.
Reason: Array status not ok

=> pd 2I:2:1 show detail

Smart Array P840 in Slot 3

   array K

      physicaldrive 2I:2:1
         Port: 2I
         Box: 2
         Bay: 1
         Status: Failed
         Last Failure Reason: Hot plug replacement too small
         Drive Type: Data Drive
         Interface Type: SATA
         Size: 4.1 GB
         Drive exposed to OS: False
         Native Block Size: 512
         Rotational Speed: 7200
         Firmware Revision: GA0A
         Serial Number: Z1Z3RK2Z
         Model: ATA     ST4000NM0033
         SATA NCQ Capable: True
         SATA NCQ Enabled: True
         Maximum Temperature (C): 38
         PHY Count: 1
         PHY Transfer Rate: 3.0Gbps
         Drive Authentication Status: Not Applicable
         Sanitize Erase Supported: False
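The reason the swap was rejected sits in the `Last Failure Reason` field of the detail output. As a small sketch, assuming every field is a `Key: value` pair on its own line as in the output above, the fields can be pulled into a dictionary in Python; `parse_pd_detail` is a hypothetical helper and the `detail` string is a shortened copy of the output.

```python
def parse_pd_detail(text):
    """Collect the 'Key: value' lines of a physical drive detail listing into a dict."""
    fields = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(": ")
        if sep:  # lines such as "array K" or "physicaldrive 2I:2:1" have no ": "
            fields[key] = value.strip()
    return fields


# Shortened copy of the detail output above.
detail = """
   array K

      physicaldrive 2I:2:1
         Status: Failed
         Last Failure Reason: Hot plug replacement too small
         Interface Type: SATA
         Size: 4.1 GB
         Model: ATA     ST4000NM0033
"""
pd = parse_pd_detail(detail)
if pd.get("Status") != "OK":
    print(pd.get("Last Failure Reason"))
```

Here the controller reports the replacement as too small, consistent with the later comments about sourcing a drive from a decommissioned HP host instead.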

@fgiunchedi I am resolving this but please open a decom task when you're ready to decommission this server. Thanks

[reopening]

This host isn't going to be decommissioned anytime soon. However, I am decom'ing hosts in T272836, which we can then use for spare parts to replace this disk. Hope that makes sense!

Alternatively, I'm OK with ordering a new disk if you'd rather not wait for the other decoms.

@fgiunchedi with ms-be1034 going down and out, I can use a disk from that server to fix this issue. Let me know if you want to do that?

Yes please, let's do that. cc @Jclark-ctr since he's handled ms-be1034's disks

Hi @fgiunchedi - let us know when you have the decom task for ms-be1034 submitted per our conversation on IRC, then we can pull one of the drives for this. Thanks, Willy

The decom task is T276522; the host is already out of service (OOS) for practical purposes (it is down, after all). I ran into a snag with the decom cookbook (T276524), but if we could already replace the disk here with one of ms-be1034's (or ms-be1017's now, as per https://phabricator.wikimedia.org/T274488#6850819), that'd be appreciated.

Replaced the failed disk with one from the decommissioned server ms-be1017.