
Degraded RAID on db2033
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host db2033. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Smart Array P420i in Slot 0 (Embedded)

   array A

      Logical Drive: 1
         Size: 3.3 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1536 KB
         Status: Interim Recovery Mode
         Caching:  Enabled
         Disk Name: /dev/sda 
         Mount Points: / 37.3 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
            physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
            physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
            physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
         Mirror Group 2:
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
            physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
            physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
            physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
            physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)
         Drive Type: Data
         LD Acceleration Method: Controller Cache
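
For triage, the failed bays can be pulled out of a snapshot like the one above programmatically. A minimal sketch (the helper name, regex, and sample text are mine, not part of the Icinga handler; it assumes the `physicaldrive` line format shown above):

```python
import re

def failed_drives(hpssacli_output):
    """Return IDs of physical drives whose reported status is not OK.

    Assumes lines of the form:
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
    """
    pattern = re.compile(
        r"physicaldrive (\S+) \(port [^,]+, [^,]+, [^,]+, ([^)]+)\)"
    )
    return [
        drive_id
        for drive_id, status in pattern.findall(hpssacli_output)
        if status.strip() != "OK"
    ]

sample = """\
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Failed)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
"""
print(failed_drives(sample))  # ['1I:1:1']
```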

Event Timeline

Restricted Application added subscribers: Marostegui, Aklapper. · Aug 11 2018, 12:28 PM
jcrespo closed this task as Declined. · Aug 13 2018, 7:02 AM
jcrespo added a subscriber: jcrespo.

The host will be decommissioned as soon as a new codfw x1 host is purchased: T184888

Marostegui reopened this task as Open. · Aug 23 2018, 2:45 PM
Marostegui assigned this task to Papaul.
Marostegui added a subscriber: Papaul.

I have been talking to @Papaul and we can re-use db2064's BBU (T195228) to replace db2033's (T184888).
Given that:

  1. This host is scheduled for decommission in a year (July 2019), so we can still use it until then
  2. I am not sure the host for x1 will arrive on time.

So I have agreed with @Papaul to:

  1. Replace this failed disk
  2. Once the RAID has rebuilt, power off this host and replace the BBU

Thanks @Papaul

Interestingly, the BBU no longer looks broken.
Maybe this is another case of a BBU recovering after a reboot:

root@db2033:~# hpssacli controller all show status

Smart Array P420i in Slot 0 (Embedded)
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

At least we know we can use db2064's BBU if needed.
I have reset the defaults back:

root@db2033:~#  hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Enabled
root@db2033:~# hpssacli ctrl slot=0 modify dwc=disable
root@db2033:~#  hpssacli controller all show detail | grep "Drive Write Cache"
   Drive Write Cache: Disabled
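
Checks like the grep above can be scripted when auditing several hosts; a small sketch (the helper is mine, assuming the `Drive Write Cache: Enabled|Disabled` line format shown above):

```python
def drive_write_cache_enabled(detail_output):
    """Return True/False from an 'hpssacli controller ... show detail' dump,
    based on its 'Drive Write Cache:' line."""
    for line in detail_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Drive Write Cache:"):
            return stripped.split(":", 1)[1].strip() == "Enabled"
    raise ValueError("no 'Drive Write Cache' line found")

detail = "   Drive Write Cache: Disabled\n"
print(drive_write_cache_enabled(detail))  # False
```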
Papaul reassigned this task from Papaul to Marostegui. · Aug 23 2018, 2:57 PM

Complete.

Thanks!

logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding)
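
Rebuild progress can be tracked by re-running the status command and parsing the percentage; a minimal sketch (the helper name and regex are mine, assuming the `Recovering, N% complete` format shown above):

```python
import re

def rebuild_progress(logicaldrive_line):
    """Extract the rebuild percentage from a line like
    'logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)'.
    Returns None once the logical drive no longer reports Recovering."""
    match = re.search(r"Recovering, ([\d.]+)% complete", logicaldrive_line)
    return float(match.group(1)) if match else None

line = "logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)"
print(rebuild_progress(line))  # 1.0
```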

Please upgrade the kernel and the MariaDB server version on reboot! Thanks.

Marostegui reassigned this task from Marostegui to Papaul. · Aug 25 2018, 12:27 PM

Can you pull the disk out, wait a couple of minutes, and insert it again? It failed to rebuild.

Thanks Papaul

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, Rebuilding)

Let's see if it goes well this time.
Thanks!

This finished fine!

logicaldrive 1 (3.3 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 600 GB, OK)
physicaldrive 1I:1:5 (port 1I:box 1:bay 5, SAS, 600 GB, OK)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, OK)
physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, OK)
physicaldrive 1I:1:8 (port 1I:box 1:bay 8, SAS, 600 GB, OK)
physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS, 600 GB, OK)
physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS, 600 GB, OK)
physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS, 600 GB, OK)
physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, OK)

I will repool the host and close this task once done.

Change 455764 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool db2033

https://gerrit.wikimedia.org/r/455764

Change 455764 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool db2033

https://gerrit.wikimedia.org/r/455764

Mentioned in SAL (#wikimedia-operations) [2018-08-28T06:57:21Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2033 - T201757 (duration: 00m 49s)

Marostegui closed this task as Resolved. · Aug 28 2018, 6:59 AM

Server repooled