
Degraded RAID on labvirt1019
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labvirt1019. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: no logical drives --- Slot 0: no drives --- Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Error: The specified device does not have any logical drives.

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: / 85.7 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 1600.3 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path
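
As a sanity check on the snapshot above, the reported stripe size and logical drive size can be re-derived from the drive list. This is a minimal sketch, assuming that in RAID 1+0 half of the ten drives hold unique data and that the tool reports "TB" in binary units (2**40 bytes); neither assumption is stated in the output itself.

```python
# Values copied from the RAID status snapshot.
STRIP_SIZE_KB = 256
DRIVE_COUNT = 10
DRIVE_SIZE_GB = 1600.3  # decimal gigabytes, as printed per physical drive

# Assumption: RAID 1+0 mirrors every strip, so only half the drives add capacity.
data_drives = DRIVE_COUNT // 2

# A full stripe is one strip on each data drive.
full_stripe_kb = STRIP_SIZE_KB * data_drives

# Assumption: the logical drive size is reported in binary terabytes.
usable_tib = data_drives * DRIVE_SIZE_GB * 1e9 / 2**40

print(full_stripe_kb, "KB full stripe")
print(round(usable_tib, 1), "TB usable")
```

Both values agree with the "Full Stripe Size: 1280 KB" and "Size: 7.3 TB" lines, which is consistent with the array itself being healthy.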

Event Timeline

Restricted Application added a subscriber: Aklapper. May 17 2018, 10:45 PM
Dzahn added a subscriber: Dzahn. May 18 2018, 1:50 AM

Duplicate of T194851.

The event handler created the ticket twice. That in itself might deserve another ticket.

Volans added a subscriber: Volans. May 18 2018, 9:42 AM

@Dzahn That usually happens if the alarm flaps on Icinga for some reason: the handler opens a new task for each CRITICAL/HARD state triggered by Icinga.

I'll check with the Cloud team though because in this case the disks are ok, but the error is:

Error: The specified device does not have any logical drives.

So I'll check what kind of work is being done on those two hosts. (See T194855 for the other one.)
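
To illustrate the duplicate-task behaviour described in this comment, here is a minimal sketch of one way a handler could avoid opening a second task when a check flaps. It is not the actual event handler: `TaskTracker`, `handle_event`, and the in-memory `open_tasks` map are hypothetical names, and a real handler would query the task tracker instead.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """Hypothetical in-memory stand-in for the task tracker."""

    open_tasks: dict = field(default_factory=dict)  # (host, service) -> task id
    next_id: int = 1

    def handle_event(self, host, service, state, state_type):
        """Return the id of a newly opened task, or None if none was opened."""
        # Only CRITICAL/HARD events open tasks; SOFT states and recoveries do not.
        if state != "CRITICAL" or state_type != "HARD":
            return None
        key = (host, service)
        # A task is already open for this host/service: a flap must not add another.
        if key in self.open_tasks:
            return None
        task_id = self.next_id
        self.next_id += 1
        self.open_tasks[key] = task_id
        return task_id


tracker = TaskTracker()
# A flapping check: CRITICAL/HARD, recovery, then CRITICAL/HARD again.
flap = [("CRITICAL", "HARD"), ("OK", "HARD"), ("CRITICAL", "HARD")]
opened = [tracker.handle_event("labvirt1019", "raid", s, t) for s, t in flap]
print(opened)
```

With this sketch only the first CRITICAL/HARD event opens a task. The trade-off is that the entry has to be cleared when the task is closed, otherwise a genuinely new failure would be ignored.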

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. May 24 2018, 3:19 PM

I've double-checked both the report script that populates this task and the Icinga check script that raised the alarm. The issue here seems to be that the controller in Slot 1 (the P840 actually in use) doesn't have or doesn't recognize the battery, hence the CRITICAL:

$ sudo /usr/sbin/hpssacli controller all show status

Smart Array P440ar in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: Not Configured
   Battery/Capacitor Status: OK

Smart Array P840 in Slot 1
   Controller Status: OK
   Cache Status: Not Configured

$ sudo /usr/sbin/hpssacli controller slot=1 show detail | grep -i battery
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0

Forgot to mention that the above message and output were taken on labvirt1020, as I cannot ssh to 1019 right now.
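
For illustration, the battery condition in this comment can be sketched as a small parser over the `show detail` output. This is an assumption about the logic, not the actual Icinga check script: `battery_count` and `battery_state` are hypothetical helpers, and treating a count of zero as CRITICAL is inferred from the "Battery count: 0" status line in the task description.

```python
import re

# Sample taken from the grep output in this comment.
SAMPLE_DETAIL = """\
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0
"""


def battery_count(detail_output):
    """Return the Battery/Capacitor Count as an int, or None if the line is absent."""
    match = re.search(
        r"^\s*Battery/Capacitor Count:\s*(\d+)\s*$", detail_output, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def battery_state(detail_output):
    """Map the battery count to a Nagios-style state (assumed mapping)."""
    count = battery_count(detail_output)
    if count is None:
        return "UNKNOWN"
    return "OK" if count > 0 else "CRITICAL"


print(battery_state(SAMPLE_DETAIL))
```

Under these assumptions the sample output maps to CRITICAL even though every physical drive reports OK.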

Bstorm added a subscriber: Bstorm. May 24 2018, 5:40 PM

Yes, there's network work being done on 1019 at the moment. That said, they are identical machines.

@Cmjohnson is this controller really missing the battery, or is it a software problem and the battery is just not being recognized?

@Volans, it's possible the battery is wrong. I disconnected it during the card upgrade, and I may have left the old battery in place instead of replacing it with the new one.

Cmjohnson closed this task as Resolved. Jun 5 2018, 6:08 PM
Cmjohnson claimed this task.

The battery was wrong and has been replaced.

Vvjjkkii renamed this task from Degraded RAID on labvirt1019 to 8scaaaaaaa. Jul 1 2018, 1:09 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii triaged this task as High priority.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot assigned this task to Cmjohnson.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot closed this task as Resolved.
CommunityTechBot renamed this task from 8scaaaaaaa to Degraded RAID on labvirt1019.
CommunityTechBot added a subscriber: Aklapper.