Degraded RAID on cloudvirt1020
Open, Normal · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labvirt1020. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: no logical drives --- Slot 0: no drives --- Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0
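For reference, the alert summary above is a single string with per-slot findings separated by " --- ". A minimal parsing sketch (illustrative only, not the actual Nagios plugin) that groups the findings by controller slot:

```python
# Illustrative sketch: split the alert summary on " --- " and
# group the findings per controller slot (abbreviated sample).
summary = ("CRITICAL: Slot 0: no logical drives --- Slot 0: no drives --- "
           "Slot 1: OK: 1I:1:5, 1I:1:6 - Controller: OK - Battery count: 0")

state, _, rest = summary.partition(": ")   # "CRITICAL" and the findings
findings = {}
for part in rest.split(" --- "):
    slot, _, detail = part.partition(": ")
    findings.setdefault(slot, []).append(detail)

print(state)               # CRITICAL
print(findings["Slot 0"])  # ['no logical drives', 'no drives']
```

Slot 0 (the embedded controller) reporting "no drives" is what trips the CRITICAL state here, while Slot 1 is healthy.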

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Error: The specified device does not have any logical drives.

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: / 85.7 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 1600.3 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path
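As a sanity check on the output above: ten 1600.3 GB SSDs in RAID 1+0 (two mirror groups of five) yield half the raw capacity as usable space, which matches the reported size:

```python
# Back-of-the-envelope check of the "Size: 7.3 TB" line above.
# RAID 1+0 mirrors the two groups, so usable capacity is half the raw total.
drives = 10
drive_gb = 1600.3                     # decimal GB per SSD, as reported
usable_gb = drives * drive_gb / 2     # 8001.5 decimal GB
usable_tib = usable_gb * 1e9 / 2**40  # convert to binary terabytes (TiB)
print(round(usable_tib, 1))           # ~7.3
```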
Restricted Application added a subscriber: Aklapper. · May 16 2018, 8:36 PM
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · May 24 2018, 3:19 PM

I need an HP SSD report. Could you please install hpssaducli, run the report, and email me the zip file?

Kindly provide us the Smart Wear Gauge report so that we can check whether the drive's write endurance has been fully consumed and whether it qualifies for replacement.

Kindly install the SSADU CLI utility to generate the Smart Wear Gauge report. If the utility is not installed, you can get it at

Once the utility is installed, change to the directory /opt/hp/hpssaducli/bin, /opt/hpe/hpessaducli/bin, or /opt/hpe/ssaducli/bin (whichever is applicable) and run one of the commands below:

hpssaducli -ssdrpt -f ssd-report.zip
ssaducli -ssdrpt -f ssd-report.zip
hpessaducli -ssdrpt -f ssd-report.zip
(whichever is applicable)
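The "whichever is applicable" choice above can be automated. A hedged sketch (binary names and flags are taken verbatim from the comment above; nothing here is verified against a live host) that runs the first diagnostics CLI found on the PATH:

```python
import shutil
import subprocess

# Candidate names for the SSA diagnostics CLI, per the vendor comment above.
CANDIDATES = ["hpssaducli", "ssaducli", "hpessaducli"]

def pick_utility(candidates, which=shutil.which):
    """Return the path of the first candidate found on the PATH, else None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None

def make_ssd_report(outfile="ssd-report.zip"):
    """Generate the Smart Wear Gauge report with whichever CLI exists."""
    util = pick_utility(CANDIDATES)
    if util is None:
        raise FileNotFoundError("no SSA diagnostics utility found on PATH")
    subprocess.run([util, "-ssdrpt", "-f", outfile], check=True)
```

The `which` parameter is injectable only to make the selection logic testable without the vendor tools installed.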

Cmjohnson moved this task from Up next to Being worked on on the ops-eqiad board. · Jun 5 2018, 6:15 PM
Bstorm added a subscriber: Bstorm. · Jun 8 2018, 4:20 PM

This is the same alert as on T196507 and seems about the same issue (no battery reported and one controller reports no drives). I'd ask why SSD wear would matter, but I know the ways of vendors can be strange. I don't think we have any drive failures reported on this, do we?
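To answer the "any drive failures?" question from a controller dump like the one in the description, the physicaldrive lines can be filtered for a non-OK trailing status. An illustrative sketch with made-up sample input (the "Failed" line is invented purely to demonstrate the filter):

```python
import re

# Sample lines in the same shape as the hpssacli output above; the
# "Failed" entry is fabricated only to show the filter working.
sample = """\
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 1600.3 GB, Failed)
"""

# Capture the drive address and the trailing status inside the parentheses.
pattern = re.compile(r"physicaldrive (\S+) \([^)]*,\s*(\w+)\)")
bad = [m.group(1) for m in pattern.finditer(sample) if m.group(2) != "OK"]
print(bad)  # ['2I:2:2']
```

Against the actual dump in the description, this would return an empty list, consistent with no drive failures being reported.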

Joe triaged this task as Normal priority. · Jun 18 2018, 8:44 AM

@Bstorm after reinstall please let me know if this is still an issue.

This is still some kind of an issue on both servers. The thing is, I'm not sure whether it is an actual problem or just an accurate description of reality (the embedded controller has no disks attached, and the installed controller doesn't report a battery).

Vvjjkkii renamed this task from Degraded RAID on labvirt1020 to oucaaaaaaa. · Jul 1 2018, 1:09 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot renamed this task from oucaaaaaaa to Degraded RAID on labvirt1020.
CommunityTechBot added a subscriber: Aklapper.
Bstorm added a comment. · Jul 5 2018, 9:50 PM

Disabled the unused RAID controller in the BIOS, which addresses at least half of this alert. However, the server is also missing a battery, which HP considers an optional purchase that we should have.

RobH assigned this task to Cmjohnson. · Jul 31 2018, 6:10 PM
RobH added a subscriber: RobH.

Ok, Dasher/HP states these shipped with battery systems already in place on the mainboard for the RAID controllers, and I have attached a file for review.

Since the pdf of the email has email address and contact info, I've had to set it to restricted view to members of the #acl*operations-team.

@Cmjohnson: Can you work to schedule downtime on labvirt1020 with @Bstorm and follow the PDF for checking for the physical existence of the raid controller battery? Dasher/HP states it shipped with the systems.

Please note that tasks T194855 (labvirt1020) & T196507 (labvirt1019) both are from the same order, same issues, and need the same checks done.

Andrew renamed this task from Degraded RAID on labvirt1020 to Degraded RAID on cloudvirt1020. · Sep 11 2018, 1:23 AM

Change 478115 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Disable alerting on cloudvirt1019 and 1020

https://gerrit.wikimedia.org/r/478115

Change 478115 merged by Andrew Bogott:
[operations/puppet@production] Disable alerting on cloudvirt1019 and 1020

https://gerrit.wikimedia.org/r/478115