
frdb1001 has suffered a raid event resulting in /dev/sda going read only
Closed, ResolvedPublic

Description

At approximately 19:32 UTC, frdb1001 suffered a RAID event.

[Mon Oct 28 19:32:00 2019] hpsa 0000:08:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[Mon Oct 28 19:32:24 2019] hpsa 0000:08:00.0: Controller lockup detected: 0xffff0000 after 30
[Mon Oct 28 19:32:24 2019] hpsa 0000:08:00.0: controller lockup detected: LUN:0000004000000000 CDB:01040000000000000000000000000000
[Mon Oct 28 19:32:24 2019] hpsa 0000:08:00.0: Controller lockup detected during reset wait
[Mon Oct 28 19:32:24 2019] hpsa 0000:08:00.0: scsi 0:1:0:0: reset logical failed Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1

This resulted in the device being taken offline, which caused a service outage.

The hardware will need investigation and possible disk or controller replacement.
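
For anyone triaging a similar event, here is a minimal sketch of how the lockup lines could be pulled out of saved dmesg output. The pattern is based only on the hpsa messages quoted above; other driver versions may word them differently, and the function name `find_hpsa_lockups` is hypothetical.

```python
import re

# Matches lines shaped like the hpsa lockup messages quoted in this task.
# Assumption: human-readable timestamps in brackets, as in `dmesg -T` style output.
LOCKUP_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+hpsa\s+(?P<pci>[0-9a-fA-F:.]+):\s+"
    r"(?P<msg>.*lockup detected.*)$",
    re.IGNORECASE,
)


def find_hpsa_lockups(dmesg_text):
    """Return (timestamp, pci_address, message) for each lockup line found."""
    hits = []
    for line in dmesg_text.splitlines():
        match = LOCKUP_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("pci"), match.group("msg")))
    return hits
```

Run against the five lines in the description, this should pick out the three "lockup detected" messages and skip the two reset messages.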

Related Objects

Status: Resolved
Assigned: Dwisehaupt

Event Timeline

Dwisehaupt triaged this task as Unbreak Now! priority. Oct 28 2019, 8:49 PM
Dwisehaupt created this task.
Restricted Application added subscribers: Pcoombe, Liuxinyu970226, Aklapper. Oct 28 2019, 8:49 PM

Change 546762 had a related patch set uploaded (by Jgreen; owner: Jgreen):
[operations/puppet@production] switch fundraising queue monitoring from frdb1001 to frdb1002

https://gerrit.wikimedia.org/r/546762

Change 546762 merged by Jgreen:
[operations/puppet@production] switch fundraising queue monitoring from frdb1001 to frdb1002

https://gerrit.wikimedia.org/r/546762

The controller has tossed another error and sent the filesystems to read-only again. Attaching the dmesg output from that event also.
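
As a rough sketch of how the read-only condition could be spotted quickly, the snippet below parses text in the `/proc/mounts` format and lists block-device mounts whose options include `ro`. The function name `read_only_mounts` and the `/dev/` prefix filter are assumptions for illustration, not something used on frdb1001.

```python
def read_only_mounts(mounts_text, device_prefix="/dev/"):
    """Return (device, mount_point) pairs mounted read-only.

    Expects text in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # Only look at real block devices; pseudo filesystems are often ro by design.
        if not device.startswith(device_prefix):
            continue
        if "ro" in options.split(","):
            read_only.append((device, mount_point))
    return read_only
```

On a live host this could be fed from `open("/proc/mounts").read()`. Any filesystem on the affected logical volume would be expected to show up in the result after an event like this one.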

Jgreen mentioned this in Unknown Object (Task). Oct 29 2019, 11:51 AM
DStrine moved this task from Triage to FR-Ops on the Fundraising-Backlog board. Oct 29 2019, 3:36 PM
Jgreen added a subscriber: Jgreen. Oct 30 2019, 3:46 PM

In addition to the repair, we're looking at adding another database host to the cluster to expand capacity and redundancy. See T236920.

RobH assigned this task to Jclark-ctr. Oct 30 2019, 4:28 PM
RobH added subscribers: Cmjohnson, Jclark-ctr, RobH.

Please note the replacement RAID controller is now being purchased via T236779. Once it arrives, this task should become the highest-priority on-site repair task.

RobH added a subtask: Unknown Object (Task). Nov 4 2019, 7:55 PM

Replaced the RAID controller and BBU.

Jgreen closed subtask Unknown Object (Task) as Resolved. Nov 7 2019, 3:54 PM
Jgreen reassigned this task from Jclark-ctr to Dwisehaupt. Nov 7 2019, 4:04 PM

Reassigning to @Dwisehaupt since he's done the heavy lifting on the reimage/recommissions.

The database was recloned, with the bulk of the data coming from the dev host and the few remaining tables coming from frdb2001. MySQL is started, replication has caught up, and the host is ready to go back into rotation. There is still a little cleanup to do related to the GTID tables, but from a hardware and software standpoint the server is back online and usable as normal.
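
For reference, a minimal sketch of what a "replication caught up" check might look like, assuming the row returned by `SHOW SLAVE STATUS` has already been fetched into a dict. The field names are the standard ones from that statement; newer MySQL releases rename them under `SHOW REPLICA STATUS`. The helper `replication_caught_up` and its threshold parameter are hypothetical and are not the check that was actually run here.

```python
def replication_caught_up(status, max_lag_seconds=0):
    """Return True if both replication threads run and lag is within the limit.

    `status` is assumed to be one SHOW SLAVE STATUS row as a dict.
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")
    # A NULL (None) lag means the lag is unknown, e.g. a thread is stopped.
    if not (io_running and sql_running) or lag is None:
        return False
    return int(lag) <= max_lag_seconds
```

A non-zero `max_lag_seconds` may be a more realistic threshold on a busy host, since a replica is rarely at exactly zero seconds behind.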

Dwisehaupt closed this task as Resolved. Nov 8 2019, 10:34 PM