ms-be1022 smart storage battery failure; disk sdb possibly bad
Closed, Resolved · Public

Description

ms-be1022 hung. After power cycling it, the iLO log showed the following:

313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400.
Action: Restart system. Contact HPE support if condition persists.
</>hpiLO-> show /system1/log1/record41 

status=0
status_tag=COMMAND COMPLETED
Sun Nov 15 13:42:00 2020



/system1/log1/record41
  Targets
  Properties
    number=41
    severity=Caution
    date=11/15/2020
    time=13:15
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

We're also getting I/O errors on disk sdb, which is one of the mirrors for the root filesystem:

Nov 15 14:02:41 ms-be1022 kernel: [   11.586463] sd 0:1:0:1: [sdb] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 15 14:02:41 ms-be1022 kernel: [   11.586468] sd 0:1:0:1: [sdb] tag#13 Sense Key : Aborted Command [current] 
Nov 15 14:02:41 ms-be1022 kernel: [   11.586473] sd 0:1:0:1: [sdb] tag#13 Add. Sense: Information unit iuCRC error detected
Nov 15 14:02:41 ms-be1022 kernel: [   11.586478] sd 0:1:0:1: [sdb] tag#13 CDB: Read(10) 28 00 17 48 db f0 00 00 08 00
Nov 15 14:02:41 ms-be1022 kernel: [   11.586481] blk_update_request: I/O error, dev sdb, sector 390650864
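As a cross-check on the log above, the CDB bytes of the failed command can be decoded by hand. This is a minimal sketch assuming the standard SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian LBA, a 2-byte big-endian transfer length); the function name `decode_read10` is made up for illustration and is not part of any tooling used on this host.

```python
import struct


def decode_read10(cdb_hex: str) -> dict:
    """Decode a READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Byte layout: opcode, flags, LBA (4 bytes, big-endian),
    # group number, transfer length (2 bytes, big-endian), control.
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return {"opcode": opcode, "lba": lba, "blocks": blocks}


# CDB copied from the kernel log line for tag#13.
decoded = decode_read10("28 00 17 48 db f0 00 00 08 00")
print(decoded["lba"], decoded["blocks"])
```

The decoded LBA is 390650864, the same number as the sector in the `blk_update_request` line, and the request covers 8 blocks. The two matching would be consistent with 512-byte logical blocks on this device, though the task does not state the block size.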

Event Timeline

@fgiunchedi The server is out of warranty. I have some decom'd HP servers and can most likely steal a BBU from one of them. I also have a decom'd host w/ 3TB disks that we can take from. This server will require downtime. Also worth noting: the 4 new ms-be hosts are here and in the rack, and will be ready for you by the end of the week (at the latest), in case you want to decom ms-be1022.

Thank you for the update on the new ms-be hosts! Once one or two of those hosts are online we'll start decommissioning the old hosts. I'm not too worried about sdb (the SSD), but please proceed with the BBU swap so we have the host healthy again. The host can be powered down cleanly at any time; let me know on IRC when you'd like to do the work and I can power it off!

Cmjohnson added a subscriber: wiki_willy.

I swapped the BBU with one from a decom'd ms-be host. The server shut down during the boot process. I put the old BBU back in and the server booted okay. If @fgiunchedi needs this server then we need to purchase a new battery from HP. Assigning to @wiki_willy for the next steps.

Update: the host isn't coming back (neither mgmt nor ssh is reachable). Given we'll need a BBU for ms-be1030 (T268036) too, I'd say let's order some (?). The host will be fully decom'd in maybe 8-10 weeks, but I think we can keep the BBU anyway post-decom.

wiki_willy added a parent task: Unknown Object (Task).

Request for replacement BBU placed via T268061

Mentioned in SAL (#wikimedia-operations) [2020-11-24T15:07:25Z] <godog> swift eqiad-prod: decom ms-be1022 ssd from swift - T267870

I've disabled the handler to avoid further duplicate tasks. We'll need to remember to re-enable it post-maintenance: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-be1022&service=HP+RAID

@fgiunchedi The battery has been replaced. The SSD looks to be /dev/sda and is an SSD. What do you want to do about the failed disk?

We can live with the failed SSD until the host is fully decom'd in a few weeks' time.

Mentioned in SAL (#wikimedia-operations) [2020-12-10T16:28:28Z] <godog> power reset ms-be1022 - stuck after boot - T267870

The disk error did not come back.

Resolving this; if the error returns, please re-open.

Mentioned in SAL (#wikimedia-operations) [2021-01-26T09:32:26Z] <godog> disable mdadm check emails on ms-be1022 / known, and host is going to be decom'd - T267870