Degraded RAID on db2049
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host db2049. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:1 - Controller: OK - Battery/Capacitor: OK

Smart Array P420i in Slot 0 (Embedded)

   array A

      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: Interim Recovery Mode
         Caching:  Enabled
         Unique Identifier: 600508B1001C4BB760F8A9895365C576
         Disk Name: /dev/sda 
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Logical Drive Label: A41E2B520014380337DD260ED1F
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
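
For triage, the snapshot above can be reduced to the drives that need attention. The following is a minimal sketch, not part of the RAID event handler: it assumes the `physicaldrive` lines keep the exact layout shown in this task, and the names `DRIVE_RE`, `failed_drives`, and `SAMPLE` are hypothetical, chosen for illustration only.

```python
import re
from typing import Dict, List

# Assumed line layout, copied from the snapshot in this task:
#   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
DRIVE_RE = re.compile(
    r"physicaldrive\s+(?P<id>\S+)\s+\("
    r"port (?P<port>[^:]+):box (?P<box>\d+):bay (?P<bay>\d+),\s*"
    r"(?P<iface>[^,]+),\s*(?P<size>[^,]+),\s*(?P<status>[^)]+)\)"
)


def failed_drives(snapshot: str) -> List[Dict[str, str]]:
    """Return one dict per physical drive whose status is not OK."""
    drives = [m.groupdict() for m in DRIVE_RE.finditer(snapshot)]
    return [d for d in drives if d["status"].strip() != "OK"]


# Two lines taken verbatim from the snapshot, used as sample input.
SAMPLE = """\
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
"""

if __name__ == "__main__":
    for drive in failed_drives(SAMPLE):
        print(drive["id"], "bay", drive["bay"], drive["status"])
```

Run against the full snapshot, this would report only drive 1I:1:1 in bay 1, which matches the `Failed: 1I:1:1` entry in the CRITICAL line.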

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 2 2017, 2:52 AM
Marostegui assigned this task to Papaul. · Jun 2 2017, 8:35 AM
Marostegui added a project: DBA.
Marostegui added subscribers: Papaul, Marostegui.

Hello @Papaul, please go ahead and replace the disk when you can!
Thanks!

Marostegui moved this task from Blocked external/Not db team to In progress on the DBA board.
jcrespo added a subscriber: jcrespo. · Jun 2 2017, 4:55 PM

I think this one is under warranty, but it should be checked.

RobH added a subscriber: RobH. · Jun 2 2017, 5:07 PM

This system is under warranty, detailed on the racktables page: https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=2685

It will remain in warranty until 2018-01-09.

This one is also showing the following alarm-

Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Power Unit 2 18-VR P2 = Critical]

I have no idea what this is (this is a new alarm). Is it complaining about a lack of BBU? The temperature of the BBU?

I thought that would be the PSU of the server, not the BBU.

Papaul added a comment. · Jun 4 2017, 6:05 PM

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5320223380
Status: Case is generated and in Progress

Product description: HP ProLiant DL380p Gen8 12 LFF Configure-to-order Server
Product number: 665552-B21
Serial number: 2M245205HN
Subject: dl380p gen8 - hard drive failed

Yours sincerely,
Hewlett Packard Enterprise

This one is also showing the following alarm-

Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Power Unit 2 18-VR P2 = Critical]

I have no idea what this is (this is a new alarm). Is it complaining about a lack of BBU? The temperature of the BBU?

After the fixes to the IPMI check (apparently we were alerting on old events: https://gerrit.wikimedia.org/r/#/c/357361/, which might correlate with T150876, who knows...), the alarm has recovered:

˜/icinga-wm 13:10> RECOVERY - IPMI Temperature on db2049 is OK: Sensor Type(s) Temperature Status: OK
Papaul reassigned this task from Papaul to Marostegui. · Jun 6 2017, 4:20 PM

Disk replacement complete

Rebuilding, will resolve once it is done.

jcrespo closed this task as Resolved. · Jun 6 2017, 6:40 PM