Degraded RAID on ms-be2019
Closed, Resolved, Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host ms-be2019. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P840 in Slot 3

   array A

      Logical Drive: 1
         Size: 279.4 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: /srv/swift-storage/sda4 129.5 GB Partition Number 5, /srv/swift-storage/sda3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array B

      Logical Drive: 2
         Size: 279.4 GB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sdb 
         Mount Points: /srv/swift-storage/sdb4 129.5 GB Partition Number 5, /srv/swift-storage/sdb3 93.1 GB Partition Number 4
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

   array C

      Logical Drive: 3
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: /srv/swift-storage/sdc1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array D

      Logical Drive: 4
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: /srv/swift-storage/sdd1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array E

      Logical Drive: 5
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sde 
         Mount Points: /srv/swift-storage/sde1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array F

      Logical Drive: 6
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdf 
         Mount Points: /srv/swift-storage/sdf1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array G

      Logical Drive: 7
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdg 
         Mount Points: /srv/swift-storage/sdg1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array H

      Logical Drive: 8
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdh 
         Mount Points: /srv/swift-storage/sdh1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array I

      Logical Drive: 9
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdi 
         Mount Points: /srv/swift-storage/sdi1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array J

      Logical Drive: 10
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdj 
         Mount Points: /srv/swift-storage/sdj1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array K

      Logical Drive: 11
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdk 
         Mount Points: /srv/swift-storage/sdk1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array L

      Logical Drive: 12
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdl 
         Mount Points: /srv/swift-storage/sdl1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array M

      Logical Drive: 13
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdm 
         Mount Points: /srv/swift-storage/sdm1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array N

      Logical Drive: 14
         Size: 3.6 TB
         Fault Tolerance: 0
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdn 
         Mount Points: /srv/swift-storage/sdn1 3.6 TB Partition Number 2
         OS Status: LOCKED
         Drive Type: Data
         LD Acceleration Method: Controller Cache
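
For anyone who wants to inspect a snapshot like the one above programmatically, here is a minimal Python sketch that turns the "Key: Value" lines into one record per logical drive. The function name `parse_logical_drives` and the shortened `SAMPLE` text are illustrative assumptions; they are not part of the Icinga plugin or of hpssacli.

```python
import re

# Shortened excerpt of the snapshot above (assumption: real output has the same layout).
SAMPLE = """
Smart Array P840 in Slot 3

   array A

      Logical Drive: 1
         Size: 279.4 GB
         Fault Tolerance: 0
         Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda
         LD Acceleration Method: HPE SSD Smart Path

   array C

      Logical Drive: 3
         Size: 3.6 TB
         Fault Tolerance: 0
         Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc
         LD Acceleration Method: Controller Cache
"""


def parse_logical_drives(report):
    """Return a list of dicts, one per 'Logical Drive: N' section."""
    drives = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = re.fullmatch(r"Logical Drive:\s*(\d+)", line)
        if match:
            # A new logical drive section starts here.
            current = {"Logical Drive": int(match.group(1))}
            drives.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so values containing ':' survive.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return drives


drives = parse_logical_drives(SAMPLE)
# Logical drives that are not using the controller cache.
uncached = [d["Disk Name"] for d in drives if d["Caching"] == "Disabled"]
print(uncached)
```

In the full snapshot, only logical drives 1 and 2 (the SSD Smart Path drives) report `Caching: Disabled`; the twelve 3.6 TB drives rely on the controller cache.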

Event Timeline

@Papaul this looks like a BBU problem to me. Can we order/install a new battery? Thanks!

Papaul triaged this task as Medium priority. Sep 9 2020, 4:17 PM

@wiki_willy this host has been out of warranty since 2018, and I have no spare or decommissioned server onsite to pull a BBU from. I am requesting approval to purchase a new BBU.

Thanks

Hi @Papaul - when I look at Netbox, it shows that ms-be2019 was purchased 5 years ago.

https://netbox.wikimedia.org/dcim/devices/240/

@fgiunchedi - isn't this one going to be refreshed, as soon as you're done testing out the Dell 740xd2 server via T260188? I see the refresh initially budgeted for this in Q1.

Thanks,
Willy

That's correct, yeah! Realistically, we're a few weeks away from having the 740xd2 in production and this host ready for decom. FWIW, I think it makes sense to go ahead with the BBU order for now.

wiki_willy mentioned this in Unknown Object (Task). Sep 10 2020, 9:17 PM
wiki_willy added a subtask: Unknown Object (Task).

Thanks for the info @fgiunchedi. T262614 has been created to order the new part. Thanks, Willy

Papaul closed subtask Unknown Object (Task) as Resolved. Sep 22 2020, 12:18 AM

Thanks @Papaul! I just checked, and it seems the controller didn't like the new BBU (feel free to power down the host if you need to):

root@ms-be2019:~# hpssacli 'controller slot=3 show' | grep -i cache | grep -v Serial
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cache disabled; battery/capacitor is not attached
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.2 GB
   No-Battery Write Cache: Disabled
   Cache Module Temperature (C): 53

@fgiunchedi yes, that is the reason I left the task open, but I forgot to comment.
Thanks

@fgiunchedi all good now: I upgraded the iLO firmware and rebooted the server.

   Cache Board Present: True
   Cache Status: OK
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.2 GB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Cache Module Temperature (C): 46
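
The before/after comparison in the two outputs above can be sketched as a small Python check. This is only an illustration under stated assumptions: `parse_fields` and `cache_is_healthy` are hypothetical helper names, the two text blocks are shortened copies of the output pasted in this task, and treating `Cache Status: OK` as the only healthy state is an assumption rather than documented hpssacli behaviour.

```python
# Shortened copies of the two `controller slot=3 show` excerpts from this task.
BEFORE = """
   Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cache disabled; battery/capacitor is not attached
   Cache Ratio: 10% Read / 90% Write
   No-Battery Write Cache: Disabled
   Cache Module Temperature (C): 53
"""

AFTER = """
   Cache Board Present: True
   Cache Status: OK
   Cache Ratio: 10% Read / 90% Write
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Cache Module Temperature (C): 46
"""


def parse_fields(output):
    """Map each 'Key: Value' line to a dict entry, splitting on the first colon."""
    fields = {}
    for raw in output.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def cache_is_healthy(fields):
    """Assumption: the controller cache is usable only when the status is exactly 'OK'."""
    return fields.get("Cache Status") == "OK"


before = parse_fields(BEFORE)
after = parse_fields(AFTER)

for label, fields in (("before", before), ("after", after)):
    state = "healthy" if cache_is_healthy(fields) else "NOT healthy"
    detail = fields.get("Cache Status Details", "no details reported")
    print(f"{label}: cache {state} ({fields['Cache Status']}; {detail})")
```

The same two helpers could be pointed at live output captured from the `hpssacli 'controller slot=3 show'` command shown earlier in this task.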