
Degraded RAID on labstore1006
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labstore1006. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK --- Slot 3: OK: 1E:1:1, 1E:1:10, 1E:1:11, 1E:1:12, 1E:1:2, 1E:1:3, 1E:1:4, 1E:1:5, 1E:1:6, 1E:1:7, 1E:1:8, 1E:1:9, 1E:2:1, 1E:2:11, 1E:2:12, 1E:2:2, 1E:2:3, 1E:2:4, 1E:2:5, 1E:2:6, 1E:2:7, 1E:2:8, 1E:2:9 - Failed: 1E:2:10 - Controller: OK - Battery/Capacitor: OK
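The check output above packs every controller into one line, so the failed bay is easy to miss. As a minimal sketch (not part of the Icinga handler itself), the line could be split per slot like this. The function name `failed_drives` is hypothetical, and the parsing assumes the line always follows the `Slot N: OK: ... - Failed: ... - Controller: ...` layout seen in this task.

```python
import re


def failed_drives(status_line: str) -> dict[str, list[str]]:
    """Map each controller slot to the drives the check lists as Failed."""
    result = {}
    # Drop the leading severity word ("CRITICAL: "), keep the rest.
    _, _, body = status_line.partition(": ")
    # Controllers are separated by " --- " in the check output.
    for section in body.split(" --- "):
        slot = re.match(r"Slot (\d+):", section)
        if not slot:
            continue
        # Drive IDs such as 1E:2:10 contain no hyphens, so stop at " - ".
        failed = re.search(r"Failed: ([^-]+?) - ", section)
        drives = [d.strip() for d in failed.group(1).split(",")] if failed else []
        result[f"Slot {slot.group(1)}"] = drives
    return result


if __name__ == "__main__":
    # Shortened version of the status line from this task.
    line = (
        "CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2 - Controller: OK - "
        "Battery/Capacitor: OK --- Slot 3: OK: 1E:2:9, 1E:2:11 - "
        "Failed: 1E:2:10 - Controller: OK - Battery/Capacitor: OK"
    )
    print(failed_drives(line))
```

For the line in this task, the only non-empty entry is Slot 3 with drive 1E:2:10.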

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P441 in Slot 3

   array A

      Logical Drive: 1
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: None
         Mirror Group 1:
            physicaldrive 1E:1:1 (port 1E:box 1:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:3 (port 1E:box 1:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:4 (port 1E:box 1:bay 4, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:5 (port 1E:box 1:bay 5, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:6 (port 1E:box 1:bay 6, SATA, 6001.1 GB, OK)
         Mirror Group 2:
            physicaldrive 1E:1:7 (port 1E:box 1:bay 7, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:8 (port 1E:box 1:bay 8, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:9 (port 1E:box 1:bay 9, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:10 (port 1E:box 1:bay 10, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:11 (port 1E:box 1:bay 11, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:12 (port 1E:box 1:bay 12, SATA, 6001.1 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array B

      Logical Drive: 2
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: Interim Recovery Mode
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: None
         Mirror Group 1:
            physicaldrive 1E:2:1 (port 1E:box 2:bay 1, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:2 (port 1E:box 2:bay 2, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:3 (port 1E:box 2:bay 3, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:4 (port 1E:box 2:bay 4, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:5 (port 1E:box 2:bay 5, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:6 (port 1E:box 2:bay 6, SAS, 6001.1 GB, OK)
         Mirror Group 2:
            physicaldrive 1E:2:7 (port 1E:box 2:bay 7, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:8 (port 1E:box 2:bay 8, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:9 (port 1E:box 2:bay 9, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:10 (port 1E:box 2:bay 10, SAS, 0 MB, Failed)
            physicaldrive 1E:2:11 (port 1E:box 2:bay 11, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:12 (port 1E:box 2:bay 12, SAS, 6001.1 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 931.5 GB
         Fault Tolerance: 1
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sda 
         Mount Points: /boot 953 MB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 2I:4:1 (port 2I:box 4:bay 1, SATA, 1 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:4:2 (port 2I:box 4:bay 2, SATA, 1 TB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array B

      Logical Drive: 2
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdb 
         Mount Points: None
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 6001.1 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SATA, 6001.1 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
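The snapshot above is long, and only two lines in it actually indicate a problem. A rough sketch of how the same text could be filtered down to those lines follows. The helper `find_problems` is hypothetical, and it assumes the report keeps the exact layout shown here (`Smart Array`, `array`, `Logical Drive:`, `Status:` and `physicaldrive` lines); it is not how the Nagios plugin itself works.

```python
import re


def find_problems(report: str) -> list[str]:
    """Return one entry per logical or physical drive whose status is not OK."""
    problems = []
    controller = array = logical = None
    for raw in report.splitlines():
        line = raw.strip()
        if line.startswith("Smart Array"):
            controller = line
        elif line.startswith("array "):
            array = line
        elif line.startswith("Logical Drive:"):
            logical = line.split(":", 1)[1].strip()
        elif line.startswith("Status:"):
            # Only the logical drive status line starts with "Status:";
            # "MultiDomain Status" and "OS Status" are skipped.
            status = line.split(":", 1)[1].strip()
            if status != "OK":
                problems.append(
                    f"{controller}, {array}, logical drive {logical}: {status}"
                )
        elif line.startswith("physicaldrive"):
            m = re.match(r"physicaldrive (\S+) \((.*)\)$", line)
            if m and not m.group(2).endswith(", OK"):
                # The drive state is the last comma-separated field.
                state = m.group(2).rsplit(", ", 1)[-1]
                problems.append(
                    f"{controller}, {array}, physicaldrive {m.group(1)}: {state}"
                )
    return problems


if __name__ == "__main__":
    import sys

    # Example: pipe the plugin output into this script.
    for problem in find_problems(sys.stdin.read()):
        print(problem)
```

Run against the snapshot in this task, this would report logical drive 2 on the P441 in Slot 3 (Interim Recovery Mode) and physicaldrive 1E:2:10 (Failed).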

Related Objects

Status	Assigned
Resolved	RobH
Resolved	Cmjohnson
Open	None
Resolved	Cmjohnson

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-11-20T09:26:21Z] <arturo> incinga downtime labstore1006 RAID checks for 10 days (T268281)

herron triaged this task as High priority. Nov 20 2020, 3:10 PM

Looking at the server, it's not abundantly clear which disk or disks are bad. I do know this server is out of warranty and a disk or two will need to be purchased. Looping in @wiki_willy to facilitate a disk purchase.

Cmjohnson moved this task from Backlog to Blocked on the ops-eqiad board.
RobH added a subscriber: RobH. Nov 24 2020, 7:30 PM

Bad Disk:

Physical Drive in Port 1E Box 2 Bay 10

    Status	 Failed
    Serial Number	ZA1A4KEH
    Model	MB6000JVYYV
    Media Type	HDD
    Capacity	0 GB
    Location	Port 1E Box 2 Bay 10
    Firmware Version	HPD2
    Drive Configuration	Configured
    Encryption Status	Not Encrypted

A working disk for an example:

Physical Drive in Port 1E Box 2 Bay 11

    Status	 OK
    Serial Number	ZA1AKET7
    Model	MB6000JVYYV
    Media Type	HDD
    Capacity	6000 GB
    Location	Port 1E Box 2 Bay 11
    Firmware Version	HPD2
    Drive Configuration	Configured
    Encryption Status	Not Encrypted

To determine the bad disk, I pulled up the HP HTTPS mgmt console, which displays storage information for the host: https://labstore1006.mgmt.eqiad.wmnet/.

The disk model is HPE MB6000JVYYV 6TB 7200rpm 3.5 Inch LFF SAS-12Gbps Midline SC Hot Swap Hard Drive (with tray). We can either go the HP route and get another identical disk from Dasher, or go with a third-party disk and hope its sector count meets or exceeds that of the current 6TB disks in the RAID array.
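To make the sector-count concern concrete, here is a small worked sketch. It uses the IDEMA LBA1-03 formula for 512-byte-sector drives; that these particular disks report 512-byte logical sectors and follow that formula is an assumption, not something confirmed in this task, and both function names are hypothetical. The real sector count should be read from the existing members before buying anything.

```python
def idema_lba_count(capacity_gb: int) -> int:
    """Sector count for a 512-byte-sector drive per the IDEMA LBA1-03 formula."""
    return 97_696_368 + 1_953_504 * (capacity_gb - 50)


def can_replace(candidate_sectors: int, member_sectors: int) -> bool:
    """A replacement must expose at least as many sectors as the disk it replaces."""
    return candidate_sectors >= member_sectors


if __name__ == "__main__":
    # Assumed sector count of the existing 6TB members (IDEMA formula).
    member = idema_lba_count(6000)
    print(member, "sectors,", member * 512, "bytes")

    # Hypothetical third-party disk that is one sector short: not usable.
    print(can_replace(member - 1, member))
    # Hypothetical disk with the same count: usable.
    print(can_replace(member, member))
```

Under these assumptions a 6TB member works out to 11,721,045,168 sectors, or about 6001.18 GB, which is in line with the 6001.1 GB shown in the RAID snapshot. A third-party disk labelled 6TB but exposing fewer sectors than that would be rejected by the check.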

I'll request a quote from Dasher on a linked procurement task shortly.

RobH mentioned this in Unknown Object (Task). Nov 24 2020, 7:34 PM
RobH added a subtask: Unknown Object (Task). Nov 24 2020, 7:37 PM
RobH reassigned this task from wiki_willy to Cmjohnson. Edited Nov 24 2020, 8:48 PM

It appears that the defective disk is located in box 2, which is array 2. Array 2 is under warranty until 2021-05-28, so this will need a normal HP warranty replacement case opened. While they do not offer self-dispatch, I've previously opened support cases successfully via the web rather than by calling.

A case has been opened with HPE: 5351787485.

@RobH Can you attempt to pull the ADU report off this server? I cannot get into the UI through mgmt. Let me know if you can and I will forward you the FTP site.

RobH added a comment. Dec 2 2020, 6:59 PM

@RobH Can you attempt to pull the ADU report off this server? I cannot get into the UI through mgmt. Let me know if you can and I will forward you the FTP site.

Do you mean the 'Active Health System Log', where I have to provide a date range and it has options for support case info? I was able to pull this (for November), but cannot upload it to Phabricator. Instead, it's been placed in Google Drive > Shared drives > Datacenter Operations > HP Logs > T268281.labstore1006.HPE_MXQ72106XS_20201202.ahs

After a few days of nonsensical back-and-forth emails with HPE, they are finally shipping the disk today.

RobH closed subtask Unknown Object (Task) as Invalid. Dec 8 2020, 4:24 PM
Cmjohnson closed this task as Resolved. Dec 9 2020, 5:39 PM

The disk has been replaced. I am not sure whether you have it set to auto-rebuild. Please check, and if the problem persists, re-open this task.