
db2058: Broken storage
Closed, Declined · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host db2058. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

I/O input error

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-hpssacli

Update: hw logs show broken storage:

/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=07/31/2019
    time=16:51
    description=Drive Array Controller Failure (Slot 0)

Event Timeline

Restricted Application added subscribers: Marostegui, Aklapper. Wed, Jul 31, 5:10 PM
Volans triaged this task as Normal priority. Wed, Jul 31, 5:12 PM
Volans added a subscriber: DBA.

Change 526730 had a related patch set uploaded (by Volans; owner: Volans):
[operations/mediawiki-config@master] db-eqiad.php: depool db2058, I/O error

https://gerrit.wikimedia.org/r/526730

Volans added a subscriber: Volans. Wed, Jul 31, 5:17 PM

The hpssacli utility cannot run because of the I/O error. I've depooled the host in dbctl and will shortly depool it from db-codfw.php with the patch above. I'll look into the logs after that.

Change 526730 merged by Volans:
[operations/mediawiki-config@master] db-eqiad.php: depool db2058, I/O error

https://gerrit.wikimedia.org/r/526730

Mentioned in SAL (#wikimedia-operations) [2019-07-31T17:21:24Z] <volans@deploy1001> Synchronized wmf-config/db-codfw.php: depool db2058, I/O error, T229449 (duration: 00m 54s)

Host downtimed on Icinga until Friday ~15 UTC. I chatted with @Marostegui and the host is due for decommission, so there is no hurry; he'll take a look tomorrow.

Change 526836 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2058: Disable notifications

https://gerrit.wikimedia.org/r/526836

As expected, controller failure:

/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=07/31/2019
    time=16:51
    description=Drive Array Controller Failure (Slot 0)
  Verbs
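
For anyone who wants to check such records in a script rather than by eye, here is a minimal sketch that reads the key=value properties out of a record dump like the one above. The function name `parse_log_record` and the `RECORD` constant are made up for illustration; this is not part of any existing tooling, and it assumes the dump keeps the indented key=value layout shown here.

```python
def parse_log_record(text):
    """Return the key=value properties of a log record dump as a dict."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        # Only key=value lines carry properties; the record path and section
        # names such as "Targets", "Properties" and "Verbs" are skipped.
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


# The record pasted in this task, copied verbatim.
RECORD = """\
/system1/log1/record14
  Targets
  Properties
    number=14
    severity=Critical
    date=07/31/2019
    time=16:51
    description=Drive Array Controller Failure (Slot 0)
  Verbs
"""

record = parse_log_record(RECORD)
is_critical = record["severity"] == "Critical"
```

All values stay strings, so `record["number"]` is `"14"`, not an integer; convert only what a check actually needs.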

Change 526836 merged by Marostegui:
[operations/puppet@production] db2058: Disable notifications

https://gerrit.wikimedia.org/r/526836

Marostegui renamed this task from Degraded RAID on db2058 to db2058: Broken storage. Thu, Aug 1, 4:54 AM
Marostegui closed this task as Declined.
Marostegui edited projects, added DBA; removed Patch-For-Review, ops-codfw.
Marostegui updated the task description.

I am going to close this, as this host will be decommissioned (see T228258: Decommission db2043-db2069).

I rebooted the server and this is the boot message:

Slot 0  HP Smart Array P420i Controller        (1 GB, v6.00)  1 Logical Drive
1719-Slot 0 Drive Array - A controller failure event occurred prior to this
     power-up.  (Previous lock up code = 0x13)
1792-Slot 0 Drive Array - Valid Data Found in Write-Back Cache.
     Data will automatically be written to drive array.
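
If these boot messages need to be collected across hosts, the numbered lines can be pulled apart with a small parser. The sketch below is only an illustration: `parse_post_messages`, `lockup_code` and the regular expressions are hypothetical names and patterns, written against the exact text pasted above and not against any documented message format.

```python
import re

# Assumed shape of a numbered message line: "<4-digit code>-Slot <n> Drive Array - <text>".
POST_LINE = re.compile(r"^(\d{4})-Slot (\d+) Drive Array - (.*)$")
LOCKUP = re.compile(r"lock up code = (0x[0-9A-Fa-f]+)")


def parse_post_messages(text):
    """Split boot output into numbered messages, joining wrapped lines."""
    messages = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = POST_LINE.match(line)
        if match:
            code, slot, body = match.groups()
            messages.append({"code": int(code), "slot": int(slot), "text": body})
        elif messages:
            # Continuation of the previous message; collapse repeated spaces.
            joined = messages[-1]["text"] + " " + line
            messages[-1]["text"] = " ".join(joined.split())
        # Lines before the first numbered message (the controller banner) are ignored.
    return messages


def lockup_code(message):
    """Return the previous lock-up code as an int, or None if absent."""
    found = LOCKUP.search(message["text"])
    return int(found.group(1), 16) if found else None


BOOT = """\
Slot 0  HP Smart Array P420i Controller        (1 GB, v6.00)  1 Logical Drive
1719-Slot 0 Drive Array - A controller failure event occurred prior to this
     power-up.  (Previous lock up code = 0x13)
1792-Slot 0 Drive Array - Valid Data Found in Write-Back Cache.
     Data will automatically be written to drive array.
"""

messages = parse_post_messages(BOOT)
codes = [m["code"] for m in messages]
previous_lockup = lockup_code(messages[0])
```

The sketch only extracts the message numbers and the lock-up code; it does not interpret what a given lock-up code means.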