ms-be2025 controller failure
Closed, Resolved · Public

Description

ms-be2025 has been down since 2016-11-19 10:07:29. The iLO log has the following entries:

severity=Critical
date=11/11/2016
time=22:18
description=Drive Array Controller Failure (Slot 3)

severity=Caution
date=11/11/2016
time=22:27
description=Option ROM POST Error: 1719-Slot 3 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13) Action: Install the latest controller firmware. If the problem persists, replace the controller.

severity=Caution
date=[NOT SET]
time=
description=Option ROM POST Error: 1792-Slot 3 Drive Array - Valid Data Found in Write-Back Cache. Data will automatically be written to drive array. Action: No action required.

severity=Caution
date=11/19/2016
time=10:05
description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support

Probably RMA time for the controller?
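
For anyone triaging similar entries, here is a minimal sketch of splitting iLO log text of the shape quoted above into records so the Critical ones can be filtered. It assumes the log has already been copied into a string as blank-line-separated key=value blocks; the function name `parse_ilo_entries` and the `SAMPLE` text are hypothetical and not part of any HP tool.

```python
def parse_ilo_entries(text):
    """Split blank-line-separated key=value blocks into a list of dicts."""
    entries = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current entry, if any.
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first "=" only; descriptions may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            current[key] = value
    if current:
        entries.append(current)
    return entries


# Hypothetical sample, abbreviated from the entries quoted in this task.
SAMPLE = """\
severity=Critical
date=11/11/2016
time=22:18
description=Drive Array Controller Failure (Slot 3)

severity=Caution
date=[NOT SET]
time=
description=Option ROM POST Error: 1792-Slot 3 Drive Array
"""

entries = parse_ilo_entries(SAMPLE)
critical = [e for e in entries if e["severity"] == "Critical"]
for e in critical:
    print(e["date"], e["time"], e["description"])
```

Entries with an unset date or an empty time are kept as-is, so they can still be reviewed by hand.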

Event Timeline

Restricted Application added subscribers: Southparkfan, Aklapper. Nov 21 2016, 10:54 AM

Mentioned in SAL (#wikimedia-operations) [2016-11-22T00:52:01Z] <godog> reboot ms-be2025 T151201

fgiunchedi assigned this task to Papaul. Nov 22 2016, 1:00 AM
fgiunchedi added a project: ops-codfw.
fgiunchedi added a subscriber: fgiunchedi.

Indeed, this looks like a battery/cache failure. I've rebooted ms-be2025 and it came up fine, apart from the POST errors noted above. hpssacli output:

Smart Array P840 in Slot 3
   Bus Interface: PCI
   Slot: 3
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 3.56
   Rebuild Priority: High
   Expand Priority: Medium
   Surface Scan Delay: 3 secs
   Surface Scan Mode: Idle
   Parallel Surface Scan Supported: Yes
   Current Parallel Surface Scan Count: 4
   Max Parallel Surface Scan Count: 16
   Queue Depth: Automatic
   Monitor and Performance Delay: 60  min
   Elevator Sort: Enabled
   Degraded Performance Optimization: Disabled
   Inconsistency Repair Policy: Disabled
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 15 secs
   Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cache disabled; battery/capacitor is not attached
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.8 GB
   No-Battery Write Cache: Disabled
   SSD Caching RAID5 WriteBack Enabled: True
   SSD Caching Version: 2
   Battery/Capacitor Count: 0
   SATA NCQ Supported: True
   Spare Activation Mode: Activate on physical drive failure (default)
   Controller Temperature (C): 79
   Cache Module Temperature (C): 58
   Number of Ports: 2 Internal only
   Encryption: Disabled
   Express Local Encryption: False
   Driver Name: hpsa
   Driver Version: 3.4.14
   Driver Supports HP SSD Smart Path: True
   PCI Address (Domain:Bus:Device.Function): 0000:08:00.0
   Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
   Controller Mode: SmartArray
   Controller Mode Reboot: Not Required
   Latency Scheduler Setting: Disabled
   Current Power Mode: MaxPerformance

Papaul triaged this task as High priority. Nov 23 2016, 1:47 AM

@Papaul I think the culprit might be a failing or failed cache module; see also:

Cache Status: Permanently Disabled
Cache Status Details: Cache disabled; battery/capacitor is not attached

It could also be the controller as a whole. The machine can be taken down at any time for diagnostics or reseating, but please do a graceful shutdown.
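
The check on the two cache lines above can be automated. The sketch below parses saved hpssacli controller detail text and flags a disabled cache or a missing battery. It assumes the output has been captured to a string and that a healthy controller reports `Cache Status: OK`; the helper names `parse_controller_detail` and `cache_problems` are hypothetical.

```python
def parse_controller_detail(text):
    """Turn 'Key: Value' lines of controller detail output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ": " so keys that contain bare colons, such as
        # "PCI Address (Domain:Bus:Device.Function)", stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def cache_problems(fields):
    """Return human-readable findings about the cache and battery."""
    problems = []
    # Assumption: a healthy controller reports "Cache Status: OK".
    status = fields.get("Cache Status", "OK")
    if status != "OK":
        details = fields.get("Cache Status Details", "no details")
        problems.append(f"cache status is {status!r} ({details})")
    if fields.get("Battery/Capacitor Count") == "0":
        problems.append("no battery/capacitor detected")
    return problems


# Abbreviated from the hpssacli output quoted in this task.
SAMPLE = """\
Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled
   Cache Status Details: Cache disabled; battery/capacitor is not attached
   Battery/Capacitor Count: 0
   PCI Address (Domain:Bus:Device.Function): 0000:08:00.0
"""

fields = parse_controller_detail(SAMPLE)
problems = cache_problems(fields)
for p in problems:
    print("WARNING:", p)
```

Note that `Controller Status: OK` alone is not enough to call the host healthy, as this task shows, which is why the sketch looks at the cache and battery fields separately.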

Papaul added a comment. Dec 1 2016, 5:47 PM

Dear Mr Papaul Tshibamba,

Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below.

Your request is being worked on under reference number 5315406363
Status: Case is generated and in Progress

Product description: HPE ProLiant DL380 Gen9 12LFF Configure-to-order Server
Product number: 719061-B21
Serial number: MXQ62300TV
Subject: HPE ProLiant DL380 Gen9 - cache permanently disabled

Yours sincerely,
Hewlett Packard Enterprise

Note: A tech from HP will be on site Monday.

Papaul added a comment. Dec 6 2016, 4:08 PM

HP didn't have the replacement part. They called me this morning to let me know that they have the part now, and a tech is scheduled to be onsite tomorrow, Dec. 7th, between 10AM and 1PM.
@fgiunchedi can you please set up a maintenance window for this server tomorrow between 10AM and 1PM?

Thanks

Papaul added a comment. Dec 7 2016, 5:05 PM
  • RAID Controller and battery replacement complete.
  • Cleared all logs.

Leaving this task open for now

Papaul lowered the priority of this task from High to Low. Dec 14 2016, 2:05 AM
Papaul closed this task as Resolved. Jan 25 2017, 1:55 AM

This system has been up and running with no problems for more than a month now, so I am resolving this task.