
Degraded RAID on ms-be2035
Closed, Declined · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host ms-be2035. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P840 in Slot 3

   array A

      Logical Drive: 1
         Size: 447.1 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: /srv/swift-storage/sda4 297.2 GB Partition Number 5, /srv/swift-storage/sda3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array B

      Logical Drive: 2
         Size: 447.1 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sdb 
         Mount Points: /srv/swift-storage/sdb4 297.2 GB Partition Number 5, /srv/swift-storage/sdb3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array C

      Logical Drive: 3
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: /srv/swift-storage/sdc1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array D

      Logical Drive: 4
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: /srv/swift-storage/sdd1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array E

      Logical Drive: 5
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sde 
         Mount Points: /srv/swift-storage/sde1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array F

      Logical Drive: 6
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdf 
         Mount Points: /srv/swift-storage/sdf1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array G

      Logical Drive: 7
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdg 
         Mount Points: /srv/swift-storage/sdg1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array H

      Logical Drive: 8
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdh 
         Mount Points: /srv/swift-storage/sdh1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array I

      Logical Drive: 9
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdi 
         Mount Points: /srv/swift-storage/sdi1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array J

      Logical Drive: 10
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdj 
         Mount Points: /srv/swift-storage/sdj1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array K

      Logical Drive: 11
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdk 
         Mount Points: /srv/swift-storage/sdk1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array L

      Logical Drive: 12
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdl 
         Mount Points: /srv/swift-storage/sdl1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array M

      Logical Drive: 13
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdm 
         Mount Points: /srv/swift-storage/sdm1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array N

      Logical Drive: 14
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdn 
         Mount Points: /srv/swift-storage/sdn1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache
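
Note that the failure is in the alert summary ("Cache: Permanently Disabled - Battery count: 0") rather than in any logical drive, which all report Status: OK. For a quick manual check of the controller's cache and battery/capacitor state, a plain hpssacli invocation along these lines should work (slot 3 as in the output above; this is a generic sketch, not the exact invocation the Nagios wrapper uses, and field names vary slightly by firmware):

$ sudo hpssacli controller slot=3 show status
$ sudo hpssacli controller slot=3 show detail | grep -iE 'cache|battery|capacitor'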

Event Timeline

@fgiunchedi can you please take a look at this alert? I see only the Smart Storage Battery failed and no disk failed.

Thanks

Hi @Papaul, it looks like it might be the battery indeed; I'll let @MatthewVernon check/confirm.

Yep, it's the battery (like ms-be2032).

Please power down this server so I can disconnect the battery and reconnect it. Note the server is out of warranty.

@fgiunchedi @MatthewVernon this server and ms-be2032 were due for refresh last fiscal year with https://phabricator.wikimedia.org/T285809. Any reason we still have them in production?

Thanks

We do indeed have T294549 open to take this node out of production. Unfortunately, to do so we need to drain it out of the swift rings.

For that process to proceed, the cluster needs to be in a healthy state (otherwise we risk availability or data loss issues); there have been so many hardware failures in swift nodes in the recent past that this process is taking rather a long time (the config change to start it went in on 27 July).
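
For anyone following along, "draining" here means stepping the node's device weights down in the swift ring builder files so replication gradually moves its partitions elsewhere; only then can the devices be removed safely. A minimal sketch with the upstream swift-ring-builder tool (the builder file name and device id d74 are illustrative only; in practice the weights are driven from our config, per the change mentioned above, and are stepped down gradually rather than zeroed in one go, hence the long duration):

$ swift-ring-builder object.builder set_weight d74 0
$ swift-ring-builder object.builder rebalance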

This comment was removed by Papaul.
This comment was removed by Papaul.
Papaul triaged this task as Medium priority. Aug 25 2022, 4:04 PM

@MatthewVernon Did you decide on what you are going to do with this node?

It's still being drained - as per my note last month, it's taking a while...

Papaul lowered the priority of this task from Medium to Lowest. Sep 26 2022, 6:03 PM

There is a decommission task for this node (T318689), so I am declining this task.