
restbase2003 has a broken disk (at least)
Closed, Resolved · Public

Description

Before going into a kernel panic and refusing to reboot again, restbase2003 was continuously logging "attempting to write to an offline device".

During the restart, I saw:

Embedded RAID : Smart Array P440ar Controller - (2048 MB, V2.52) 5 Logical
Drive(s) - Operation Failed
 - 1719-Slot 0 Drive Array  - A controller failure event occurred prior
   to this power-up. (Previous lock up code = 0x13)

The system is back up, but I don't intend to put it back into rotation until the hardware is inspected.
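
For anyone scanning console logs for the same failure, here is a rough Python sketch that pulls the POST error code, slot number and lock-up code out of a message like the one above. The regexes and the `parse_post_message` name are assumptions based only on this one sample, not a documented message format, so treat it as a starting point.

```python
import re

# Sample copied from the console output in this task.
POST_MESSAGE = """\
Embedded RAID : Smart Array P440ar Controller - (2048 MB, V2.52) 5 Logical
Drive(s) - Operation Failed
 - 1719-Slot 0 Drive Array  - A controller failure event occurred prior
   to this power-up. (Previous lock up code = 0x13)
"""

# Assumed patterns, inferred from the single sample above.
POST_CODE_RE = re.compile(r"^\s*-\s*(\d{4})-Slot (\d+)", re.MULTILINE)
LOCKUP_RE = re.compile(r"Previous lock up code = (0x[0-9A-Fa-f]+)")


def parse_post_message(text):
    """Extract the POST code, slot and lock-up code; missing fields are None."""
    post = POST_CODE_RE.search(text)
    lockup = LOCKUP_RE.search(text)
    return {
        "post_code": post.group(1) if post else None,
        "slot": int(post.group(2)) if post else None,
        "lockup_code": int(lockup.group(1), 16) if lockup else None,
    }


if __name__ == "__main__":
    print(parse_post_message(POST_MESSAGE))
```

Fields that do not appear in the text come back as `None`, so the same function can be run over unrelated console output without raising.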

Event Timeline

Joe created this task. · Aug 13 2018, 6:53 AM
Restricted Application added a subscriber: Aklapper. · Aug 13 2018, 6:53 AM
Eevans added a subscriber: Eevans. · Aug 13 2018, 1:23 PM

> the system is back up but I don't intend to put it back into rotation until hardware is inspected.

I guess "out of rotation" in this context probably means RESTBase(?); it appears Cassandra is online and participating in the storage cluster (and not exhibiting any errors at present).

Eevans added a comment. · Edited · Aug 23 2018, 3:33 PM

From the SRE/CP Scalability meeting:

We should attempt to update the RAID firmware, and ensure that the host is capable of rebooting successfully without intervention. If things continue to look OK, we can close.

Mentioned in SAL (#wikimedia-operations) [2018-08-23T16:15:03Z] <godog> upgrade hp raid firmware on restbase2003 - T201804 T141756

fgiunchedi closed this task as Resolved. · Aug 23 2018, 4:50 PM
fgiunchedi claimed this task.
fgiunchedi added a subscriber: fgiunchedi.

Firmware upgrade and reboot done; the reboot completed unattended. I'm tentatively resolving; we'll reopen if it occurs again.