ms-be1023 crashed / Smart Storage Battery failure
Closed, ResolvedPublic

Description

Icinga reported the host down: it was unresponsive to ping and ssh, and there was nothing in the management console.
The last relevant iLO log entry is:

</>hpiLO-> show /system1/log1/record18

status=0
status_tag=COMMAND COMPLETED
Wed Apr  1 21:36:04 2020



/system1/log1/record18
  Targets
  Properties
    number=18
    severity=Caution
    date=04/01/2020
    time=21:27
    description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support
  Verbs
    cd version exit show

Related Objects

Status: Resolved
Assigned: Jclark-ctr

Event Timeline

Volans triaged this task as Medium priority. Apr 1 2020, 9:52 PM
Volans created this task.

Mentioned in SAL (#wikimedia-operations) [2020-04-01T21:53:14Z] <volans> force-rebooting ms-be1023, unresponsive - T249174

After the forced reboot the host is back up, but Icinga is reporting "Cache: Permanently Disabled - Battery count: 0" and the iLO logged an additional message:

</>hpiLO-> show /system1/log1/record19

status=0
status_tag=COMMAND COMPLETED
Wed Apr  1 21:58:09 2020



/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Caution
    date=04/01/2020
    time=21:54
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show

It looks like we need to replace the battery.
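
For anyone who wants to script this check, the `key=value` properties in the iLO records above can be extracted with a few lines of Python. This is only a minimal sketch: the function name `parse_ilo_record` is hypothetical, and it assumes the record text has already been captured as a string in the layout shown above (a `Properties` section followed by indented `key=value` lines).

```python
def parse_ilo_record(text):
    """Return the key=value pairs under 'Properties' as a dict (hypothetical helper)."""
    props = {}
    in_properties = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Properties":
            in_properties = True
            continue
        if line in ("Targets", "Verbs"):
            # Any other section header ends the Properties block.
            in_properties = False
            continue
        if in_properties and "=" in line:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = line.partition("=")
            props[key] = value
    return props


# Record 19 as pasted above.
sample = """\
/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Caution
    date=04/01/2020
    time=21:54
    description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
  Verbs
    cd version exit show
"""

record = parse_ilo_record(sample)
print(record["severity"], record["date"], record["time"])
print(record["description"])
```

The `Verbs` line (`cd version exit show`) is skipped because it sits outside the `Properties` section and contains no `=`.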

fgiunchedi renamed this task from ms-be1023 crashed to ms-be1023 crashed / Smart Storage Battery failure. Apr 2 2020, 9:04 AM
fgiunchedi added a subscriber: Jclark-ctr.

Thanks @Volans for taking a look! Indeed, it seems like the battery failed. Maybe we can try a reseat first, @Cmjohnson @Jclark-ctr, next time you get a chance? The host *seems* otherwise fine to me in terms of normal operation.

wiki_willy added a subtask: Unknown Object (Task). Apr 3 2020, 12:55 AM

T249296 created for @RobH to order a few spares. Thanks, Willy

@fgiunchedi @Volans I will be on site on 4/14/2020 at 10 am EST. We have limited time on site; can we schedule this?

Yes, 10 am EST today sounds good to me. Ping me on IRC (godog) when you're good to go and I'll power down the host.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Apr 14 2020, 2:41 PM

Looks like we're back, thanks @Jclark-ctr

Controller Status: OK
Hardware Revision: B
Firmware Version: 6.88
Rebuild Priority: High
Expand Priority: Medium
Surface Scan Delay: 3 secs
Surface Scan Mode: Idle
Parallel Surface Scan Supported: Yes
Current Parallel Surface Scan Count: 1
Max Parallel Surface Scan Count: 16
Queue Depth: Automatic
Monitor and Performance Delay: 60  min
Elevator Sort: Enabled
Degraded Performance Optimization: Disabled
Inconsistency Repair Policy: Disabled
Wait for Cache Room: Disabled
Surface Analysis Inconsistency Notification: Disabled
Post Prompt Timeout: 15 secs
Cache Board Present: True
Cache Status: OK
Cache Ratio: 10% Read / 90% Write
Drive Write Cache: Disabled
Total Cache Size: 4.0 GB
Total Cache Memory Available: 3.2 GB
No-Battery Write Cache: Disabled
SSD Caching RAID5 WriteBack Enabled: True
SSD Caching Version: 2
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
Spare Activation Mode: Activate on physical drive failure (default)
Controller Temperature (C): 55
Cache Module Temperature (C): 42
Number of Ports: 2 Internal only
Encryption: Disabled
Express Local Encryption: False
Driver Name: hpsa
Driver Version: 3.4.16
Driver Supports HPE SSD Smart Path: True
PCI Address (Domain:Bus:Device.Function): 0000:08:00.0
Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
Controller Mode: RAID
Controller Mode Reboot: Not Required
Latency Scheduler Setting: Disabled
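
The fields that matter for this task in the output above are `Cache Status`, `Battery/Capacitor Count` and `Battery/Capacitor Status`. As a rough sketch of how that check could be automated, the snippet below parses `Key: Value` lines into a dict and tests those three fields. The helper names `parse_controller_detail` and `cache_is_healthy` are hypothetical, and the health criteria are an assumption based on what changed in this task, not an official rule.

```python
def parse_controller_detail(text):
    """Parse 'Key: Value' lines into a dict (hypothetical helper)."""
    fields = {}
    for raw in text.splitlines():
        # Split on the first ': ' so keys such as
        # 'PCI Address (Domain:Bus:Device.Function)' stay intact.
        key, sep, value = raw.strip().partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def cache_is_healthy(fields):
    """Assumed criteria: cache OK, and at least one battery that reports OK."""
    return (
        fields.get("Cache Status") == "OK"
        and fields.get("Battery/Capacitor Status") == "OK"
        and int(fields.get("Battery/Capacitor Count", "0")) >= 1
    )


# A subset of the lines pasted above.
sample = """\
Controller Status: OK
Cache Status: OK
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
PCI Address (Domain:Bus:Device.Function): 0000:08:00.0
"""

fields = parse_controller_detail(sample)
print(cache_is_healthy(fields))
```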