Debug HP raid cache disabled errors on ms-be1019/20/21
Closed, Resolved · Public

Description

  • reenable alert handler in icinga for ms-be1019 / HP RAID once this is resolved

There's a warning for HP RAID on ms-be1021, WARNING: Slot 3: Recharging: Battery/Capacitor - OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller

The battery status is indeed charging, though hpssacli detail also shows a cache error (cable error). Opening a task in case it is a real error and not just the battery charging; afaict this is the first time we've come across it on ms-be machines.

root@ms-be1021:~# hpssacli controller slot=3 show detail

Smart Array P840 in Slot 3
   Bus Interface: PCI
   Slot: 3
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 3.00
   Rebuild Priority: High
   Expand Priority: Medium
   Surface Scan Delay: 3 secs
   Surface Scan Mode: Idle
   Parallel Surface Scan Supported: Yes
   Current Parallel Surface Scan Count: 4
   Max Parallel Surface Scan Count: 16
   Queue Depth: Automatic
   Monitor and Performance Delay: 60  min
   Elevator Sort: Enabled
   Degraded Performance Optimization: Disabled
   Inconsistency Repair Policy: Disabled
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 15 secs
   Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cable Error
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.8 GB
   No-Battery Write Cache: Disabled
   SSD Caching RAID5 WriteBack Enabled: True
   SSD Caching Version: 2
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Recharging
   SATA NCQ Supported: True
   Spare Activation Mode: Activate on physical drive failure (default)
   Controller Temperature (C): 87
   Cache Module Temperature (C): 57
   Number of Ports: 2 Internal only
   Encryption: Disabled
   Express Local Encryption: False
   Driver Name: hpsa
   Driver Version: 3.4.0
   Driver Supports HP SSD Smart Path: False
   PCI Address (Domain:Bus:Device.Function): 0000:08:00.0
   Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
   Controller Mode: SmartArray
   Controller Mode Reboot: Not Required
   Latency Scheduler Setting: Disabled
   Current Power Mode: MaxPerformance

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 25 2017, 10:49 AM

Looks like now battery count is reported as zero and Cache Status: Permanently Disabled plus Cache Status Details: Cable Error are still active, though the hp raid check reports OK

Smart Array P840 in Slot 3
   Bus Interface: PCI
   Slot: 3
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 3.00
   Rebuild Priority: High
   Expand Priority: Medium
   Surface Scan Delay: 3 secs
   Surface Scan Mode: Idle
   Parallel Surface Scan Supported: Yes
   Current Parallel Surface Scan Count: 4
   Max Parallel Surface Scan Count: 16
   Queue Depth: Automatic
   Monitor and Performance Delay: 60  min
   Elevator Sort: Enabled
   Degraded Performance Optimization: Disabled
   Inconsistency Repair Policy: Disabled
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 15 secs
   Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cable Error
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.8 GB
   No-Battery Write Cache: Disabled
   SSD Caching RAID5 WriteBack Enabled: True
   SSD Caching Version: 2
   Battery/Capacitor Count: 0
   SATA NCQ Supported: True
   Spare Activation Mode: Activate on physical drive failure (default)
   Controller Temperature (C): 88
   Cache Module Temperature (C): 57
   Number of Ports: 2 Internal only
   Encryption: Disabled
   Express Local Encryption: False
   Driver Name: hpsa
   Driver Version: 3.4.0
   Driver Supports HP SSD Smart Path: False
   PCI Address (Domain:Bus:Device.Function): 0000:08:00.0
   Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
   Controller Mode: SmartArray
   Controller Mode Reboot: Not Required
   Latency Scheduler Setting: Disabled
   Current Power Mode: MaxPerformance
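
The mismatch described above (cache permanently disabled and battery count zero while the check still reports OK) can be caught by inspecting the cache fields directly. The following is a minimal sketch, not the actual check_hpssacli plugin: the function names `parse_detail` and `cache_problems` are hypothetical, and the field names are assumed to match the `show detail` output pasted in this task.

```python
# Sketch only: parses text shaped like `hpssacli controller slot=N show detail`.
# Function names are illustrative and not part of any existing plugin.

SAMPLE = """\
Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Board Present: True
   Cache Status: Permanently Disabled
   Cache Status Details: Cable Error
   Battery/Capacitor Count: 0
"""


def parse_detail(text):
    """Collect 'Key: Value' lines into a dict; lines without ': ' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def cache_problems(fields):
    """Return a list of human-readable cache/battery problems (empty if none)."""
    problems = []
    status = fields.get("Cache Status")
    if status is not None and status != "OK":
        detail = fields.get("Cache Status Details", "no details")
        problems.append(f"Cache Status: {status} ({detail})")
    if fields.get("Battery/Capacitor Count") == "0":
        problems.append("Battery/Capacitor Count: 0")
    battery = fields.get("Battery/Capacitor Status")
    if battery is not None and battery != "OK":
        problems.append(f"Battery/Capacitor Status: {battery}")
    return problems


for problem in cache_problems(parse_detail(SAMPLE)):
    print(problem)
```

Checking the cache status separately from the battery status means a missing battery line (as in the output above) does not hide the disabled cache.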

@fgiunchedi Do you need me to do anything with this yet?

@Cmjohnson yeah, it looks like a hw raid controller cache issue of some sort? We can coordinate some downtime, maybe next week after the switchover, to debug (or reseat?) further. Have you seen this error before?

Cache Board Present: True
Cache Status: Permanently Disabled
Cache Status Details: Cable Error
Cache Ratio: 10% Read / 90% Write
Drive Write Cache: Disabled

fgiunchedi moved this task from Backlog to Blocked on the User-fgiunchedi board. · May 3 2017, 9:48 AM

The same recharging status and icinga warning are showing on ms-be1020 now too

=> show status

Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled
   Battery/Capacitor Status: Recharging

Ditto on ms-be1019 now

Cache Status: Permanently Disabled
Cache Status Details: Cable Error
Cache Ratio: 10% Read / 90% Write
Drive Write Cache: Disabled
Total Cache Size: 4.0 GB
Total Cache Memory Available: 3.8 GB
No-Battery Write Cache: Disabled
SSD Caching RAID5 WriteBack Enabled: True
SSD Caching Version: 2
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: Recharging

@Cmjohnson have you seen this error before? namely:

root@ms-be1021:~# hpssacli controller slot=3 show status

Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled

and in show detail

Cache Board Present: True
Cache Status: Permanently Disabled
Cache Status Details: Cable Error
Cache Ratio: 10% Read / 90% Write
Drive Write Cache: Disabled

We can schedule downtime next week on ms-be1021 to debug/reseat/investigate

A case has been opened for this server. Let's work this one and then move on to the others...ms-be1016, 1019 and 1020 should be included in the list.

Your case was successfully submitted. Please note your Case ID: 5319707302 for future reference.
An email confirmation will be sent to the case contact. Hewlett Packard Enterprise will contact you to begin work on your problem based on your contract or warranty coverage.

fgiunchedi renamed this task from HP RAID icinga alert on ms-be1021 to Debug HP raid cache disabled errors on ms-be1019/20/21. · May 16 2017, 8:09 AM

Logs sent to HP, they're most likely going to want to do a f/w upgrade first.

faidon added a subscriber: faidon. · May 17 2017, 12:30 AM

@Cmjohnson I have heard of battery issues from other HPE users. Could you do a visual inspection of the battery on those systems and see whether they're swollen or look damaged in any other way? (not kidding, I've heard of this kind of thing happening…). If you have a multimeter, it may also be useful to measure them and see whether they're dead already.

HP is sending me a new battery and wants me to upgrade the f/w.
Part/s shipped: 871264-001
Part description: SPS-BATT PACK MC 96W V3
Carrier Name: UPSN
Tracking Number: 1ZA7Y0140158609823

@Cmjohnson sounds good! let's try that on ms-be1019 on Tues

@fgiunchedi Hey, missed that Tuesday... let's do this Thursday morning, but it has to be 1021: the ticket with HP is for 1021 and I have to be consistent in case they need logs.

@Cmjohnson sure, let's try ms-be1021 today, ping me on IRC

Mentioned in SAL (#wikimedia-operations) [2017-05-25T14:08:19Z] <godog> shut ms-be1021 for BBU replacement - T163777

After BBU replacement the error seems to be gone from ms-be1021:

root@ms-be1021:~# hpssacli controller slot=3 show detail

Smart Array P840 in Slot 3
   Bus Interface: PCI
   Slot: 3
   Serial Number: XXX
   Cache Serial Number: XXX
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 3.00
   Rebuild Priority: High
   Expand Priority: Medium
   Surface Scan Delay: 3 secs
   Surface Scan Mode: Idle
   Parallel Surface Scan Supported: Yes
   Current Parallel Surface Scan Count: 4
   Max Parallel Surface Scan Count: 16
   Queue Depth: Automatic
   Monitor and Performance Delay: 60  min
   Elevator Sort: Enabled
   Degraded Performance Optimization: Disabled
   Inconsistency Repair Policy: Disabled
   Wait for Cache Room: Disabled
   Surface Analysis Inconsistency Notification: Disabled
   Post Prompt Timeout: 15 secs
   Cache Board Present: True
   Cache Status: OK
   Cache Ratio: 10% Read / 90% Write
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.8 GB
   No-Battery Write Cache: Disabled
   SSD Caching RAID5 WriteBack Enabled: True
   SSD Caching Version: 2
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   SATA NCQ Supported: True
   Spare Activation Mode: Activate on physical drive failure (default)
   Controller Temperature (C): 61
   Cache Module Temperature (C): 44
   Number of Ports: 2 Internal only
   Encryption: Disabled
   Express Local Encryption: False
   Driver Name: hpsa
   Driver Version: 3.4.0
   Driver Supports HP SSD Smart Path: False
   PCI Address (Domain:Bus:Device.Function): 0000:08:00.0
   Negotiated PCIe Data Rate: PCIe 3.0 x8 (7880 MB/s)
   Controller Mode: SmartArray
   Controller Mode Reboot: Not Required
   Latency Scheduler Setting: Disabled
   Current Power Mode: MaxPerformance
   Host Serial Number: XXX

root@ms-be1021:~# hpssacli controller slot=3 show status

Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK
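
To see at a glance what changed across the BBU replacement, the two `show detail` outputs can be compared field by field. This is a rough sketch under the assumption that both outputs use the 'Key: Value' layout shown in this task; `parse_detail` and `diff_fields` are hypothetical helper names, and the sample strings are shortened excerpts of the outputs above.

```python
# Sketch only: compare two `show detail` style outputs field by field.

BEFORE = """\
   Firmware Version: 3.00
   Cache Status: Permanently Disabled
   Cache Status Details: Cable Error
   Battery/Capacitor Count: 0
   Controller Temperature (C): 88
"""

AFTER = """\
   Firmware Version: 3.00
   Cache Status: OK
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   Controller Temperature (C): 61
"""


def parse_detail(text):
    """Collect 'Key: Value' lines into a dict; lines without ': ' are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def diff_fields(before, after):
    """Map each field whose value differs to (old, new); None means absent."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


for key, (old, new) in diff_fields(parse_detail(BEFORE), parse_detail(AFTER)).items():
    print(f"{key}: {old} -> {new}")
```

Fields that are identical in both outputs, such as the firmware version, are left out, so only the cache, battery and temperature changes remain.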

Support cases for both ms-be1019 and 1020 have been opened with HPE

Your case was successfully submitted. Please note your Case ID: 5320104843 for future reference.

Your case was successfully submitted. Please note your Case ID: 5320104976 for future reference.

@fgiunchedi the batteries for ms-be1020 and 1019 are on-site...please let me know when you want to swap them

ms-be1020 had its bbu swapped, error cleared:

# /usr/local/lib/nagios/plugins/check_hpssacli 
OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK

We were getting duplicate alerts from ms-be1019 due to its hp raid check going unknown (I think). I've disabled the handler for hp raid on ms-be1019 though it'll need to be reenabled once this is fixed.

fgiunchedi updated the task description. · Jun 9 2017, 8:41 AM

@fgiunchedi the bbu finally shipped. i will ping you once it arrives to swap

Hewlett Packard Enterprise Reference Number: 5320104843

STATUS: Customer Self Repair Part has been shipped

Part/s shipped: 871264-001
Part description: SPS-BATT PACK MC 96W V3
Carrier Name: UPSN
Tracking Number: 1za7y0140158864388

Marostegui triaged this task as Medium priority. · Jun 14 2017, 7:10 AM

@Cmjohnson today sounds good, ping me here or on IRC

fgiunchedi closed this task as Resolved. · Jun 21 2017, 8:32 AM
fgiunchedi updated the task description. (Show Details)

All done, 1019 BBU was swapped yesterday by @Cmjohnson