Degraded RAID on labstore1006
Closed, Resolved (Public)

Assigned To

Authored By

	ops-monitoring-bot
	Nov 19 2020, 9:31 PM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labstore1006. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK --- Slot 3: OK: 1E:1:1, 1E:1:10, 1E:1:11, 1E:1:12, 1E:1:2, 1E:1:3, 1E:1:4, 1E:1:5, 1E:1:6, 1E:1:7, 1E:1:8, 1E:1:9, 1E:2:1, 1E:2:11, 1E:2:12, 1E:2:2, 1E:2:3, 1E:2:4, 1E:2:5, 1E:2:6, 1E:2:7, 1E:2:8, 1E:2:9 - Failed: 1E:2:10 - Controller: OK - Battery/Capacitor: OK
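
The one-line summary above can be scanned for failed drives without reading the full snapshot. The following is only a minimal sketch: it assumes the layout seen in this single sample (controllers separated by " --- ", fields separated by " - ", failed drives after "Failed: "), and `failed_drives` is a hypothetical helper, not part of the Icinga plugin.

```python
import re


def failed_drives(summary):
    """Map controller slot number -> list of drive addresses reported as failed.

    Assumes the layout of the sample line in this task; other plugin
    versions may format the summary differently.
    """
    failed = {}
    # Each controller gets its own section in the summary line.
    for section in summary.split(" --- "):
        slot = re.search(r"Slot (\d+):", section)
        if slot is None:
            continue
        # Fields within a section are separated by " - ".
        for field in section.split(" - "):
            if field.startswith("Failed: "):
                drives = [d.strip() for d in field[len("Failed: "):].split(",")]
                failed[int(slot.group(1))] = drives
    return failed


# Shortened version of the summary line from this task.
summary = (
    "CRITICAL: Slot 1: OK: 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK"
    " --- Slot 3: OK: 1E:2:9, 1E:2:11 - Failed: 1E:2:10"
    " - Controller: OK - Battery/Capacitor: OK"
)
print(failed_drives(summary))  # {3: ['1E:2:10']}
```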

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Smart Array P441 in Slot 3

   array A

      Logical Drive: 1
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdc 
         Mount Points: None
         Mirror Group 1:
            physicaldrive 1E:1:1 (port 1E:box 1:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:3 (port 1E:box 1:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:4 (port 1E:box 1:bay 4, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:5 (port 1E:box 1:bay 5, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:6 (port 1E:box 1:bay 6, SATA, 6001.1 GB, OK)
         Mirror Group 2:
            physicaldrive 1E:1:7 (port 1E:box 1:bay 7, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:8 (port 1E:box 1:bay 8, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:9 (port 1E:box 1:bay 9, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:10 (port 1E:box 1:bay 10, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:11 (port 1E:box 1:bay 11, SATA, 6001.1 GB, OK)
            physicaldrive 1E:1:12 (port 1E:box 1:bay 12, SATA, 6001.1 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array B

      Logical Drive: 2
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: Interim Recovery Mode
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdd 
         Mount Points: None
         Mirror Group 1:
            physicaldrive 1E:2:1 (port 1E:box 2:bay 1, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:2 (port 1E:box 2:bay 2, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:3 (port 1E:box 2:bay 3, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:4 (port 1E:box 2:bay 4, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:5 (port 1E:box 2:bay 5, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:6 (port 1E:box 2:bay 6, SAS, 6001.1 GB, OK)
         Mirror Group 2:
            physicaldrive 1E:2:7 (port 1E:box 2:bay 7, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:8 (port 1E:box 2:bay 8, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:9 (port 1E:box 2:bay 9, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:10 (port 1E:box 2:bay 10, SAS, 0 MB, Failed)
            physicaldrive 1E:2:11 (port 1E:box 2:bay 11, SAS, 6001.1 GB, OK)
            physicaldrive 1E:2:12 (port 1E:box 2:bay 12, SAS, 6001.1 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 931.5 GB
         Fault Tolerance: 1
         Strip Size: 256 KB
         Full Stripe Size: 256 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sda 
         Mount Points: /boot 953 MB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 2I:4:1 (port 2I:box 4:bay 1, SATA, 1 TB, OK)
         Mirror Group 2:
            physicaldrive 2I:4:2 (port 2I:box 4:bay 2, SATA, 1 TB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache

   array B

      Logical Drive: 2
         Size: 32.7 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Enabled
         Disk Name: /dev/sdb 
         Mount Points: None
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 6001.1 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SATA, 6001.1 GB, OK)
            physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SATA, 6001.1 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
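
As a sanity check on the snapshot, the sizes reported for the 12-disk RAID 1+0 logical drives are consistent with six mirrored pairs of 6001.1 GB disks. This sketch assumes the tool reports drive sizes in decimal GB and logical drive sizes in binary units (TiB labelled as TB); that interpretation is inferred from the numbers above, not taken from HPE documentation.

```python
# Values copied from the snapshot above.
drives_per_array = 12
drive_gb = 6001.1      # per-drive size, assumed to be decimal GB
strip_kb = 256         # "Strip Size"

# RAID 1+0 stripes data across mirrored pairs, so half the drives hold unique data.
data_drives = drives_per_array // 2

usable_bytes = data_drives * drive_gb * 1e9
usable_tib = usable_bytes / 2**40          # assumed unit of the reported "32.7 TB"
full_stripe_kb = strip_kb * data_drives    # one strip per data drive

print(round(usable_tib, 1), full_stripe_kb)  # 32.7 1536
```

Both results match the "Size: 32.7 TB" and "Full Stripe Size: 1536 KB" lines reported for the 1+0 logical drives.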

Related Objects

Status	Assigned	Task
Resolved	RobH	T161311 Eqiad: Hardware request for labstore1006/7, dataset1002/3
		Unknown Object (Task)
Resolved	• Cmjohnson	T217473 labstore1006 spontaneous reboot
Resolved	• nskaggs	T268280 labstore1006 spontaneous reboot
Resolved	• Cmjohnson	T268281 Degraded RAID on labstore1006
		Unknown Object (Task)

Event Timeline

ops-monitoring-bot created this task. Nov 19 2020, 9:31 PM

• Bstorm added a parent task: T268280: labstore1006 spontaneous reboot. Nov 19 2020, 9:48 PM

• Bstorm added a project: cloud-services-team (Hardware).

Mentioned in SAL (#wikimedia-cloud) [2020-11-20T09:26:21Z] <arturo> incinga downtime labstore1006 RAID checks for 10 days (T268281)

herron triaged this task as High priority. Nov 20 2020, 3:10 PM

Looking at the server, it's not abundantly clear which disk or disks are bad. I do know this server is out of warranty and a disk or two will need to be purchased. Looping in @wiki_willy to facilitate a disk purchase.

• Cmjohnson assigned this task to wiki_willy. Nov 24 2020, 4:10 PM

• Cmjohnson moved this task from Backlog to Blocked on the ops-eqiad board.

Bad Disk:

Physical Drive in Port 1E Box 2 Bay 10

    Status	 Failed
    Serial Number	ZA1A4KEH
    Model	MB6000JVYYV
    Media Type	HDD
    Capacity	0 GB
    Location	Port 1E Box 2 Bay 10
    Firmware Version	HPD2
    Drive Configuration	Configured
    Encryption Status	Not Encrypted

A working disk for an example:

Physical Drive in Port 1E Box 2 Bay 11

    Status	 OK
    Serial Number	ZA1AKET7
    Model	MB6000JVYYV
    Media Type	HDD
    Capacity	6000 GB
    Location	Port 1E Box 2 Bay 11
    Firmware Version	HPD2
    Drive Configuration	Configured
    Encryption Status	Not Encrypted

Ok, so to determine the bad disk I pulled up the HP HTTPS mgmt console, which displays storage information for the host: https://labstore1006.mgmt.eqiad.wmnet/

The disk model is HPE MB6000JVYYV 6TB 7200 rpm 3.5-inch LFF SAS 12 Gbps Midline SC hot-swap hard drive (with tray). We can either go the HP route and get another identical disk from Dasher, or go with a third-party disk and hope its sector count meets or exceeds that of the current/older 6TB disks in the RAID array.

I'll request a quote from Dasher on a linked procurement task shortly.

RobH mentioned this in Unknown Object (Task). Nov 24 2020, 7:34 PM

RobH added a subtask: Unknown Object (Task). Nov 24 2020, 7:37 PM

It appears that the defective disk is located in box2, which is array2, and array2 is under warranty until 2021-05-28. So this will need a normal HP warranty replacement case opened. While they do not have self dispatch, I've successfully opened previous support cases via the web rather than calling.

A case has been opened with HPE: 5351787485.

• bd808 moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board. Nov 25 2020, 4:25 PM

@RobH Can you attempt to pull the ADU report off this server? I cannot get into the UI through mgmt. Let me know if you can and I will forward you the FTP site.

In T268281#6663112, @Cmjohnson wrote:

@RobH Can you attempt to pull the ADU report off this server? I cannot get into the UI through mgmt. Let me know if you can and I will forward you the FTP site.

Do you mean the 'Active Health System Log', where I have to provide a date range and it has options for support case info? I was able to pull this (for November), but cannot upload it to Phabricator. Instead, it has been placed in Google Drive > Shared drives > Datacenter Operations > HP Logs > T268281.labstore1006.HPE_MXQ72106XS_20201202.ahs

After a few days of back-and-forth, nonsensical emails with HPE, they are finally shipping the disk today.

RobH closed subtask Unknown Object (Task) as Invalid. Dec 8 2020, 4:24 PM

The disk has been replaced. I am not sure if you have it set for auto rebuild. Please check, and if the problem persists, re-open this task.
