
Degraded RAID on labsdb1009
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labsdb1009. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 1: Failed: 1I:1:9 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 1I:1:14, 1I:1:15, 1I:1:16 - Controller: OK - Battery/Capacitor: OK



$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 11.6 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 2048 KB
         Status: Interim Recovery Mode
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda
         Mount Points: / 37.2 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, Solid State SATA, 1600.3 GB, Failed)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:13 (port 1I:box 1:bay 13, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:14 (port 1I:box 1:bay 14, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:15 (port 1I:box 1:bay 15, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:16 (port 1I:box 1:bay 16, Solid State SATA, 1600.3 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path
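
For reference, the same drive state can be queried from the controller directly. A minimal sketch using standard hpssacli syntax (slot number and drive ID taken from the output above):

$ sudo hpssacli controller slot=1 physicaldrive 1I:1:9 show detail   # details for the failed drive
$ sudo hpssacli controller slot=1 physicaldrive all show status      # one-line status for every drive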

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui added a project: DBA.

@Cmjohnson can we get a new disk ordered? This host should be under warranty.

Thanks!

Support ticket created with HPE

Case ID: 5329764075
Case title: Failed Hard Drive
Severity 3-Normal
Product serial number: MXQ62005Z0
Product number: 767032-B21
Submitted: 5/29/2018 12:50:38 PM
Last updated: 5/29/2018 12:50:38 PM
Source: Web
Case status: Received by HPE

@Marostegui I need to access the server's Smart Storage Administrator, which requires booting into it during POST. When would be a good time for me to take the server down for 15 minutes? HP requires this to replace the SSD. There is an option to install their software on our server and get the report that way, if that's easier for you.

This is what HP has to say about it

Kindly provide us the Smart Wear Gauge report so that we can check whether the drive has been fully consumed and whether it qualifies for replacement.

Kindly install the SSADU CLI utility for generating the Smart Wear Gauge report. In case the utility is not installed, you can get the utility at

Once the utility is installed, go to the directory /opt/hp/hpssaducli/bin, /opt/hpe/hpessaducli/bin, or /opt/hpe/ssaducli/bin (whichever is applicable) and run the command below:

hpssaducli -ssdrpt -f ssd-report.zip or ssaducli -ssdrpt -f ssd-report.zip or hpessaducli -ssdrpt -f ssd-report.zip (whichever is applicable)

The SSD drives are consumable parts and qualify for replacement only if they are not completely worn out (kindly refer to the link https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c03312456 ).

Thanks
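
Condensed into commands, HPE's instructions amount to the sketch below (paths and binary names as quoted above; which variant applies depends on the installed utility version):

$ cd /opt/hp/hpssaducli/bin   # or /opt/hpe/hpessaducli/bin, or /opt/hpe/ssaducli/bin
$ sudo ./hpssaducli -ssdrpt -f ssd-report.zip   # writes the Smart Wear Gauge report to ssd-report.zip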

@Cmjohnson I can depool the server for you tomorrow. Does that work?

We can put labsdb1009 down, but for the future we should install the utilities on the appropriate hosts; we shouldn't have to restart a server just to be able to change a disk.

I thought about it, but there are no deb packages: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_b256f556f71b41cf99c67fc608&swEnvOid=4004#tab1
We can probably use alien to see what we get, but I wouldn't like to block this maintenance on us experimenting with that. I'd rather get the disk replaced, and then we can think about how to move forward with these HP requirements.

@Marostegui : hpssaducli is present in the thirdparty/hwraid component for stretch already.

Looks like it was easier than expected: I was able to extract the binary after converting the RPM to a deb.
I have run:

root@labsdb1009:/home/marostegui# ./hpssaducli -ssd -f ssd-report.zip
HP Smart Storage Diagnostics 2.20.11.0
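
For the record, the conversion itself would look roughly like this (the RPM filename is hypothetical; alien and dpkg-deb are the standard tools for this kind of extraction):

$ alien --to-deb hpssaducli-2.20-11.0.x86_64.rpm    # hypothetical filename from HPE's download page
$ dpkg-deb -x hpssaducli_2.20-11.0_amd64.deb tmp/   # unpack the converted .deb without installing it
$ cp tmp/opt/hp/hpssaducli/bin/hpssaducli .         # binary path per HPE's instructions above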

@Cmjohnson I have sent you an email with the .zip file

@Marostegui : hpssaducli is present in the thirdparty/hwraid component for stretch already.

And the component is enabled by default on all baremetal hosts.


I just saw it is in the repo, but it wasn't installed.
We should probably add it by default in the Puppet recipe for HP hosts.


Yeah, modules/raid/manifests/init.pp currently only installs hpssacli itself.
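
Until that Puppet change lands, a manual check and install on an affected HP host would be a sketch like the following (package name per the thirdparty/hwraid component mentioned above):

$ dpkg -l hpssaducli                # check whether the package is already installed
$ sudo apt-get install hpssaducli   # install it from the thirdparty/hwraid component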

The report was sent to HP yesterday; I have not heard back from them yet. If I don't get something in the next few hours, I will ping them.

This is regarding Case Number 5329764075 for the HPE ProLiant DL380 Gen9 8SFF Configure-to-order Server.

Issue: SCM_HW:Failed Hard Drive

Thanks a lot for sharing the Smart Wear Gauge report. The drive has not worn out, and we will ship the replacement:

Smart Array P840 in slot 1 : Internal Drive Cage at Port 1I : Box 1 : Physical Drive (1.6 TB SATA SSD) 1I:1:9 : SmartSSD Wear Gauge

Status                               FAILED
Supported                            TRUE
Log Full                             FALSE
Utilization                          1.160000
Power On Hours                       16569
Has Smart Trip SSD Wearout           FALSE
Remaining Days Until Wearout         58824
Has 56 Day Warning                   FALSE
Has Utilization Warning              NONE

@RobH Can you check if we have next-business-day support for defects with this hardware provider and purchase? They seem to not be honoring that, or are adding some deliberate delay.


Yeah, that is a good question, because as per T195690#4240202 the ticket with them was opened on the 29th of May, so that is over a week now.

We had next-day support until it expired on May 25th. However, if this case was opened before then, they should honor the warranty.


No, it was opened 29th: T195690#4240202


I misread the racktables entry; it's good until 2019: https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=3110

All servers are purchased with 3-year next-business-day warranty support.

The disk arrived and was replaced. Note that it is bigger than the other ones (1920.3 GB vs. 1600.3 GB).
It is rebuilding:

logicaldrive 1 (11.6 TB, RAID 1+0, Recovering, 2% complete)

 physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:9 (port 1I:box 1:bay 9, Solid State SATA, 1920.3 GB, Rebuilding)
 physicaldrive 1I:1:10 (port 1I:box 1:bay 10, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:11 (port 1I:box 1:bay 11, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:12 (port 1I:box 1:bay 12, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:13 (port 1I:box 1:bay 13, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:14 (port 1I:box 1:bay 14, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:15 (port 1I:box 1:bay 15, Solid State SATA, 1600.3 GB, OK)
 physicaldrive 1I:1:16 (port 1I:box 1:bay 16, Solid State SATA, 1600.3 GB, OK)
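
To keep an eye on the rebuild, something like the following sketch works (same controller slot as above, standard hpssacli syntax):

$ sudo hpssacli controller slot=1 logicaldrive 1 show status                  # e.g. "Recovering, 2% complete"
$ watch -n 300 'sudo hpssacli controller slot=1 logicaldrive 1 show status'   # re-check every 5 minutes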

All good! Thank you!!

logicaldrive 1 (11.6 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, Solid State SATA, 1920.3 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:13 (port 1I:box 1:bay 13, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:14 (port 1I:box 1:bay 14, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:15 (port 1I:box 1:bay 15, Solid State SATA, 1600.3 GB, OK)
physicaldrive 1I:1:16 (port 1I:box 1:bay 16, Solid State SATA, 1600.3 GB, OK)
Vvjjkkii renamed this task from Degraded RAID on labsdb1009 to h7baaaaaaa. Jul 1 2018, 1:08 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from h7baaaaaaa to Degraded RAID on labsdb1009. Jul 1 2018, 7:00 PM
Marostegui closed this task as Resolved.
Marostegui assigned this task to Cmjohnson.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description. (Show Details)