Degraded RAID on db2049
Closed, Resolved · Public

Assigned To

Authored By

	ops-monitoring-bot
	Jun 2 2017, 2:52 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host db2049. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:1 - Controller: OK - Battery/Capacitor: OK

Smart Array P420i in Slot 0 (Embedded)

   array A

      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: Interim Recovery Mode
         Caching:  Enabled
         Unique Identifier: 600508B1001C4BB760F8A9895365C576
         Disk Name: /dev/sda 
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Logical Drive Label: A41E2B520014380337DD260ED1F
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
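To pick the failed drive out of a snapshot like the one above without reading it by eye, the `physicaldrive` lines can be parsed mechanically. The following is a minimal sketch, not the event handler's actual code: the function names `parse_drives` and `failed_drives` are hypothetical, and the regular expression assumes the line layout shown in this snapshot (other hpssacli versions may format it differently).

```python
import re

# Sample lines copied from the snapshot above (abbreviated to three drives).
SNAPSHOT = """\
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
"""

# Assumed layout: physicaldrive <id> (port <location>, <interface>, <size>, <status>)
DRIVE_RE = re.compile(
    r"physicaldrive\s+(?P<id>\S+)\s+"
    r"\(port [^,]+, (?P<iface>[^,]+), (?P<size>[^,]+), (?P<status>[^)]+)\)"
)


def parse_drives(text):
    """Return one dict per physicaldrive line found in the text."""
    return [m.groupdict() for m in DRIVE_RE.finditer(text)]


def failed_drives(text):
    """Return the IDs of all drives whose status is anything other than OK."""
    return [d["id"] for d in parse_drives(text) if d["status"] != "OK"]


if __name__ == "__main__":
    print(failed_drives(SNAPSHOT))
```

Treating every status other than `OK` as a problem is a deliberate choice here, so that states such as a rebuilding drive are also surfaced rather than silently ignored.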

Related Objects

Mentioned Here: T150876: db2049 overheated and restarted

Event Timeline

ops-monitoring-bot added projects: ops-codfw, SRE. Jun 2 2017, 2:52 AM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Jun 2 2017, 2:52 AM

Hello @Papaul, please go ahead and replace the disk when you can!
Thanks!

• Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board. Jun 2 2017, 8:37 AM

• Marostegui moved this task from Blocked external/Not db team to In progress on the DBA board.

I think this one is under warranty, but it should be checked.

This system is under warranty, detailed on the racktables page: https://racktables.wikimedia.org/index.php?page=object&tab=default&object_id=2685

It will remain in warranty until 2018-01-09.

This one is also showing the following alarm:

Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Power Unit 2 18-VR P2 = Critical]

I have no idea what this is (this is a new alarm). Is it complaining about a lack of BBU? The temperature of the BBU?

I thought that would be the PSU of the server, not the BBU.

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5320223380
Status: Case is generated and in Progress

Product description: HP ProLiant DL380p Gen8 12 LFF Configure-to-order Server
Product number: 665552-B21
Serial number: 2M245205HN
Subject: dl380p gen8 - hard drive failed

Yours sincerely,
Hewlett Packard Enterprise

In T166853#3311393, @jcrespo wrote:
This one is also showing the following alarm-
Sensor Type(s) Temperature Status: Critical [Power Unit 2 18-VR P2 = Critical, Power Unit 2 18-VR P2 = Critical]
I have no idea what this is (this is a new alarm). Is it complaining about a lack of BBU? The temperature of the BBU?

After the fixes to the IPMI check (apparently we were alerting on old events: https://gerrit.wikimedia.org/r/#/c/357361/, which might correlate with T150876, who knows...), the temperature alert has recovered:

˜/icinga-wm 13:10> RECOVERY - IPMI Temperature on db2049 is OK: Sensor Type(s) Temperature Status: OK

Disk replacement complete.

The array is rebuilding; I will resolve this task once it is done.

jcrespo closed this task as Resolved. Jun 6 2017, 6:40 PM
