
restbase2003 has a broken disk (at least)
Closed, Resolved (Public)

Description

Before going into a kernel panic and refusing to reboot, restbase2003 was continuously writing "attempting to write to an offline device".

During the restart, I saw:

Embedded RAID : Smart Array P440ar Controller - (2048 MB, V2.52) 5 Logical
Drive(s) - Operation Failed
 - 1719-Slot 0 Drive Array  - A controller failure event occurred prior
   to this power-up. (Previous lock up code = 0x13)

The system is back up, but I don't intend to put it back into rotation until the hardware is inspected.

Event Timeline

the system is back up but I don't intend to put it back into rotation until hardware is inspected.

I guess "out of rotation" in this context probably means RESTBase(?); it appears Cassandra is online and participating in the storage cluster (and not exhibiting any errors at present).

From the SRE/CP Scalability meeting:

We should attempt to update the RAID firmware, and ensure that the host is capable of rebooting successfully without intervention. If things continue to look OK, we can close.

Mentioned in SAL (#wikimedia-operations) [2018-08-23T16:15:03Z] <godog> upgrade hp raid firmware on restbase2003 - T201804 T141756

fgiunchedi claimed this task.
fgiunchedi subscribed.

Firmware upgrade and reboot done; the reboot completed unattended. I'm tentatively resolving this, and we'll reopen if it occurs again.