
Degraded RAID on cloudvirt1020
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labvirt1020. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: no logical drives --- Slot 0: no drives --- Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0
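For reference, the event handler packs per-controller status into a single line, with slots separated by `---`. A minimal parsing sketch (a hypothetical helper for illustration, not the actual Icinga plugin):

```python
# Split an Icinga RAID alert payload into per-slot status fields.
alert = ("CRITICAL: Slot 0: no logical drives --- Slot 0: no drives --- "
         "Slot 1: OK: 1I:1:5, 1I:1:6 - Controller: OK - Battery count: 0")
slots = [part.strip() for part in alert.split("---")]
print(len(slots))   # 3
print(slots[1])     # Slot 0: no drives
```

Here the alert is CRITICAL because the slot-0 fields report no drives at all, while slot 1 is healthy; the "Battery count: 0" tail is what later turns out to matter.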

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Error: The specified device does not have any logical drives.

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: / 85.7 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 1600.3 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

Details

Related Gerrit Patches:
operations/puppet (production): Disable alerting on cloudvirt1019 and 1020


Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · May 16 2018, 8:36 PM
Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · May 24 2018, 3:19 PM

I need an SSD report for HP. Could you please install hpssaducli, run the report, and email me the zip file?

Kindly provide us the Smart Wear Gauge report so that we can check whether the drive has been fully consumed and qualifies for replacement.

Kindly install the SSADU CLI utility for generating the Smart Wear Gauge report. If the utility is not installed, you can get it at

Once the utility is installed, go to the directory /opt/hp/hpssaducli/bin, /opt/hpe/hpessaducli/bin, or /opt/hpe/ssaducli/bin (whichever is applicable) and run the command below:

hpssaducli -ssdrpt -f ssd-report.zip
ssaducli -ssdrpt -f ssd-report.zip
hpessaducli -ssdrpt -f ssd-report.zip
(whichever is applicable)
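The three install paths and binary names above can be probed in one pass. This wrapper is a sketch of my own (the function name and fallback message are assumptions, not part of the vendor's instructions):

```shell
# Probe the candidate install paths and binary names listed above, and run
# the first SSADU CLI variant found; complain if none is installed.
find_ssaducli() {
  local dir bin
  for dir in /opt/hp/hpssaducli/bin /opt/hpe/hpessaducli/bin /opt/hpe/ssaducli/bin; do
    for bin in hpssaducli hpessaducli ssaducli; do
      if [ -x "$dir/$bin" ]; then
        echo "$dir/$bin"
        return 0
      fi
    done
  done
  return 1
}

if tool=$(find_ssaducli); then
  sudo "$tool" -ssdrpt -f ssd-report.zip
else
  echo "no ssaducli variant found; install the SSADU CLI package first" >&2
fi
```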

Bstorm added a subscriber: Bstorm.Jun 8 2018, 4:20 PM

This is the same alert as on T196507 and seems about the same issue (no battery reported and one controller reports no drives). I'd ask why SSD wear would matter, but I know the ways of vendors can be strange. I don't think we have any drive failures reported on this, do we?

Joe triaged this task as Medium priority.Jun 18 2018, 8:44 AM

@Bstorm after reinstall please let me know if this is still an issue.

This is currently still some kind of issue on both servers. The thing is, I'm not sure whether it is an actual problem or just a description of reality (the embedded controller has no disks and the installed controller doesn't report a battery).

Vvjjkkii renamed this task from Degraded RAID on labvirt1020 to oucaaaaaaa.Jul 1 2018, 1:09 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from oucaaaaaaa to Degraded RAID on labvirt1020.Jul 2 2018, 3:18 PM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.
Bstorm added a comment.Jul 5 2018, 9:50 PM

Disabled the unused raid controller in the BIOS, which addresses at least half of this alert. However, the server is also missing a battery, which HP considers an optional purchase that we should have.

RobH assigned this task to Cmjohnson.Jul 31 2018, 6:10 PM
RobH added a subscriber: RobH.

Ok, Dasher/HP states these shipped with battery systems already in place on the mainboard for the raid controllers; I have attached a file for review.

Since the pdf of the email has email address and contact info, I've had to set it to restricted view to members of the #acl*operations-team.

@Cmjohnson: Can you work to schedule downtime on labvirt1020 with @Bstorm and follow the PDF for checking for the physical existence of the raid controller battery? Dasher/HP states it shipped with the systems.

Please note that tasks T194855 (labvirt1020) & T196507 (labvirt1019) both are from the same order, same issues, and need the same checks done.

Andrew renamed this task from Degraded RAID on labvirt1020 to Degraded RAID on cloudvirt1020.Sep 11 2018, 1:23 AM

Change 478115 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Disable alerting on cloudvirt1019 and 1020

https://gerrit.wikimedia.org/r/478115

Change 478115 merged by Andrew Bogott:
[operations/puppet@production] Disable alerting on cloudvirt1019 and 1020

https://gerrit.wikimedia.org/r/478115

@Cmjohnson cloudvirt1020 is reporting a disk missing:

=> ctrl slot=1 pd all show       

Smart Array P840 in Slot 1

   Unassigned

      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK)

Could you request a replacement (in case one has not been requested yet)?

It seems the disk in bay 5 might be faulty.

I've deleted the RAID configuration on this server before I realized the disk was missing (so I can't run Linux commands easily now). Please let me know if you need any other information.

cloudvirt1020 is also 5x slower to enter the BIOS menu (ESC+9) than cloudvirt1019. Not sure what that means.

The server is not doing anything right now, so feel free to play with it if you wish.

Andrew added a subscriber: Andrew.Feb 15 2019, 8:41 PM

Note that I just now had to enable virtualization and associated settings on cloudvirt1019, so 1020 might need this as well -- probably best to double-check hyperthreading too.

Checking cloudvirt1020:


So this should be ready to go from this point of view.

Mentioned in SAL (#wikimedia-operations) [2019-02-16T14:21:57Z] <arturo> T194855 cloudvirt1020 is poweroff, waiting for disk setup before installing

RobH added a comment.Feb 19 2019, 5:25 PM
This comment was removed by RobH.

I believe the supposed failed disk was a result of me working inside the server last week and putting it back together quickly. The cables that attach the raid card to the backplane for the SSDs are very touchy, and if one is slightly off it can show a disk offline. I confirmed this after checking the raid bios and noticing the card was only seeing 9 of the 10 disks. I then swapped a disk from a different slot, and the orange indicator light stayed with the slot. I opened the server up and reseated the cables; all 10 disks now show, but the raid had to be rebuilt. The server will need a full re-install.

Action taken:
- Connected the raid battery cable received from HPE to clear the raid battery status issue we've been having
- Disabled the p480 onboard raid card to match cloudvirt1019
- Ensured the virtualization settings were enabled
- Updated the firmware with the Service Pack I have from HPE
- Rebuilt the raid to Raid 10 with a 256k stripe

@Cmjohnson thank you!

RAID reconfigured with spares.

=> ctrl slot=1 create type=ld drives=1I:1:5,1I:1:6,1I:1:7,1I:1:8,2I:1:1,2I:1:2,2I:1:3,2I:1:4 raid=1+0 ss=64 forced
=> ctrl slot=1 pd all show                       

Smart Array P840 in Slot 1

   Array A

      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA SSD, 1.6 TB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:1 (port 2I:box 1:bay 1, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:2 (port 2I:box 1:bay 2, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:3 (port 2I:box 1:bay 3, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:1:4 (port 2I:box 1:bay 4, SATA SSD, 1.6 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA SSD, 1.6 TB, OK, spare)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA SSD, 1.6 TB, OK, spare)

=> ctrl slot=1 ld all show

Smart Array P840 in Slot 1

   Array A

      logicaldrive 1 (5.8 TB, RAID 1+0, OK)
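As a sanity check on the 5.8 TB figure (hpssacli labels binary TiB as "TB"): RAID 1+0 over the 8 non-spare drives mirrors in pairs, halving raw capacity. A quick arithmetic sketch:

```python
# Usable capacity of the rebuilt array: 8 data drives of 1.6 TB (decimal,
# as on the SSD label); the 2 spares don't add capacity.
drives, drive_tb = 8, 1.6
usable_tb = drives * drive_tb / 2          # mirroring halves raw capacity
usable_tib = usable_tb * 1e12 / 2**40      # decimal TB -> binary TiB
print(round(usable_tb, 1), round(usable_tib, 1))  # 6.4 5.8
```

The same arithmetic matches the original 10-drive layout in the task description: 10 × 1.6 / 2 = 8 TB ≈ 7.3 TiB, the "7.3 TB" shown there.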
GTirloni closed this task as Resolved.Feb 20 2019, 5:43 PM
Bstorm reopened this task as Open.Feb 21 2019, 5:46 PM

Unfortunately, the icinga error has returned. https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudvirt1020&service=HP+RAID

This is not true of cloudvirt1019, which has no issues. There's something odd still.

Andrew closed this task as Resolved.Feb 21 2019, 6:29 PM

Looks good now. Thanks @Cmjohnson !