
Degraded RAID on db2067
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host db2067. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: Failed: 1I:1:10 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Smart Array P420i in Slot 0 (Embedded)

   array A

      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: Interim Recovery Mode
         Caching:  Enabled
         Disk Name: /dev/sda 
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
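
For reference, a snapshot like the one above can be scanned for drives that are not OK with a few lines of Python. This is only a sketch: the function name `find_unhealthy_drives` and the regular expression are assumptions made for illustration, based solely on the line format shown in this task, and are not taken from the actual Nagios plugin.

```python
import re

# Assumed line format, copied from the snapshot in this task:
#   physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed)
DRIVE_RE = re.compile(
    r"physicaldrive\s+(?P<id>\S+)\s+"
    r"\(port [^,]+, (?P<iface>[^,]+), (?P<size>[^,]+), (?P<status>[^)]+)\)"
)


def find_unhealthy_drives(report):
    """Return (drive_id, status) for every drive whose status is not 'OK'."""
    unhealthy = []
    for line in report.splitlines():
        match = DRIVE_RE.search(line)
        if match and match.group("status").strip() != "OK":
            unhealthy.append((match.group("id"), match.group("status").strip()))
    return unhealthy


# Two lines taken from the snapshot above.
sample = """\
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed)
"""
print(find_unhealthy_drives(sample))  # [('1I:1:10', 'Failed')]
```

Any status other than OK is reported, so states such as Rebuilding or Predictive Failure would also show up in the result.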

Event Timeline

Restricted Application added subscribers: Marostegui, Aklapper.May 8 2018, 1:00 AM
Marostegui assigned this task to Papaul.May 8 2018, 5:16 AM
Marostegui added a project: DBA.
Marostegui added a subscriber: Papaul.

@Papaul can we get a new disk for this one?
Thanks!

Marostegui moved this task from Triage to In progress on the DBA board.May 8 2018, 5:27 AM
Papaul reassigned this task from Papaul to Marostegui.May 8 2018, 2:55 PM

@Marostegui Disk replacement complete

Thanks!

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding)

The disk has failed to rebuild. Can we try another one?

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed)

Thanks!

Marostegui reassigned this task from Marostegui to Papaul.May 9 2018, 10:54 AM
Papaul reassigned this task from Papaul to Marostegui.May 10 2018, 2:41 PM

@Marostegui replaced the disk with another disk.

jcrespo claimed this task.May 10 2018, 2:42 PM
jcrespo added a subscriber: jcrespo.

Manuel is not around today, I am taking the task.

@Papaul can you check the disk that was used? It reports 300 GB, so either it is wrongly detected or the wrong disk was installed (this host needs 600 GB disks), and it failed because of that:

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, Failed)
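
A size mismatch like this one could be caught before waiting for a rebuild by comparing each drive's reported size against the most common size in the array. The following is a rough sketch under that assumption; `find_size_mismatches` and its regular expression are hypothetical helpers written for this comment, not part of hpssacli or the monitoring checks.

```python
import re
from collections import Counter

# Captures the drive id and the size field (e.g. "600 GB") from lines like:
#   physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, Failed)
SIZE_RE = re.compile(
    r"physicaldrive\s+(\S+)\s+\([^)]*?,\s*(\d+(?:\.\d+)?\s*[GT]B),"
)


def find_size_mismatches(report):
    """Return (drive_id, size, expected_size) for drives whose size differs
    from the most common size in the report."""
    sizes = {m.group(1): m.group(2) for m in SIZE_RE.finditer(report)}
    if not sizes:
        return []
    # Assumption: the majority size is the one the array expects.
    expected, _ = Counter(sizes.values()).most_common(1)[0]
    return [(drive, size, expected)
            for drive, size in sizes.items() if size != expected]


report = """\
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 300 GB, Failed)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
"""
print(find_size_mismatches(report))  # [('1I:1:10', '300 GB', '600 GB')]
```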
jcrespo reassigned this task from jcrespo to Papaul.May 10 2018, 5:06 PM
Papaul reassigned this task from Papaul to jcrespo.May 14 2018, 2:24 PM

@jcrespo disk replacement complete

Thanks!
Let's see how it goes

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding)
Marostegui closed this task as Resolved.May 14 2018, 7:20 PM

This time it worked fine
Thanks Papaul!

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
jcrespo reopened this task as Open.May 15 2018, 9:12 AM

Potential SMART errors on that device.

PROBLEM - Device not healthy -SMART- on db2067 is CRITICAL: cluster=mysql device=cciss,9 instance=db2067:9100 job=node site=codfw
Marostegui reassigned this task from jcrespo to Papaul.May 16 2018, 5:21 AM

It is indeed in predictive failure:

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Predictive Failure)

@Papaul can we replace it again?
Thanks!

Papaul reassigned this task from Papaul to Marostegui.May 17 2018, 2:37 PM

@Marostegui disk replacement complete.

Marostegui reassigned this task from Marostegui to Papaul.May 17 2018, 2:41 PM

That disk has failed :(

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Failed)
Papaul reassigned this task from Papaul to Marostegui.May 17 2018, 2:54 PM
Papaul triaged this task as Medium priority.

@Marostegui another one in place

Cross your fingers!

physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, Rebuilding)

This time it worked

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
Marostegui closed this task as Resolved.May 18 2018, 5:43 AM
Marostegui mentioned this in Unknown Object (Task).May 29 2018, 3:32 PM
Vvjjkkii renamed this task from Degraded RAID on db2067 to kfdaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Marostegui as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Vachovec1 renamed this task from kfdaaaaaaa to Degraded RAID on db2067.Jul 1 2018, 3:50 PM
Vachovec1 closed this task as Resolved.
Vachovec1 assigned this task to Marostegui.
Vachovec1 lowered the priority of this task from High to Medium.
Vachovec1 updated the task description.