Degraded RAID on cloudvirt1019
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (hpssacli) was detected on host labvirt1019. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: Slot 0: no logical drives --- Slot 0: no drives --- Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0

$ sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli

Error: The specified device does not have any logical drives.

Smart Array P840 in Slot 1

   array A

      Logical Drive: 1
         Size: 7.3 TB
         Fault Tolerance: 1+0
         Strip Size: 256 KB
         Full Stripe Size: 1280 KB
         Status: OK
         MultiDomain Status: OK
         Caching:  Disabled
         Disk Name: /dev/sda 
         Mount Points: / 85.7 GB Partition Number 2
         OS Status: LOCKED
         Mirror Group 1:
            physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:1 (port 2I:box 1:bay 1, Solid State SATA, 1600.3 GB, OK)
         Mirror Group 2:
            physicaldrive 2I:1:2 (port 2I:box 1:bay 2, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:3 (port 2I:box 1:bay 3, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:1:4 (port 2I:box 1:bay 4, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 1600.3 GB, OK)
            physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 1600.3 GB, OK)
         Drive Type: Data
         LD Acceleration Method: HPE SSD Smart Path

Event Timeline

I went ahead and disabled the unused RAID controller in the BIOS. I have confirmed that this is not enough to clear the monitoring alert; the lack of battery still reads as "critical".

Now that the spam from the last round of vandalism is done, @RobH, I am curious what can be done about the battery. There is some quirky history regarding the array here, but I figure we probably need to buy the battery either way. The raid cards for this server and its partner were shipped without the "optional" cache backup battery. This is at least one reason there are degraded RAID alerts for them.

I've emailed our Dasher/HP team to investigate why they didn't include raid controller batteries on these particular systems. Will update when I have a reply.

Right now it seems the systems work but show the warning, so this isn't causing critical downtime, but it is still high priority. If this isn't the case, please correct me!

Ok, Dasher/HP states these shipped with battery systems already in place on the mainboard for the raid controllers, and they have attached a file for review.

Since the PDF of the email has email addresses and contact info, I've had to restrict its visibility to members of #acl*operations-team.

@Cmjohnson: Can you work to schedule downtime on labvirt1019 with @Bstorm and follow the PDF to check for the physical existence of the raid controller battery? Dasher/HP states it shipped with the systems.

Please note that tasks T194855 (labvirt1020) & T196507 (labvirt1019) both are from the same order, same issues, and need the same checks done.

@RobH I am circling back to the labvirts: the new controllers did include batteries, and they are connected to the cards. They are the exact same battery as on the old card.

I created a ticket with HP....this should be fun

Case ID: 5331584481

HP is sending me a replacement battery...should be here sometime today or early tomorrow (8/10)

@Bstorm I have the new battery on-site...when is a good time for you to replace?

I can stop the VMs on labvirt1019 and 1020, silence alerts and shut them down whenever you like :) @Cmjohnson

@RobH I cannot download the service pack to upgrade the firmware for this server. Can you please try, and also reach out to our rep to link my HPE account with all of our servers so I can download firmware updates when needed? The download you need is SPP: http://h17007.www1.hpe.com/us/en/enterprise/servers/products/service_pack/spp/index.aspx The support case number is 5331584481. Thanks

This is downloaded and in my home directory on bast1002.wikimedia.org. You can copy it over to your personal directory or just sudo copy down from mine!

This comment was removed by Bstorm.
Andrew renamed this task from Degraded RAID on labvirt1019 to Degraded RAID on cloudvirt1019. Sep 11 2018, 1:23 AM

I renamed these servers but they're still complaining about missing batteries.

@andrewbogott and @Bstorm I ran the HP Service pack on this server, several things were updated including the raid card firmware. Please let me know if the problem with the battery not being present is fixed.

icinga shows it recharging

WARNING: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: Recharging

icinga still shows battery recharging....let's give it the weekend

Updated HP that it still is giving me the same "recharging" message 4 days later.

sent HP an updated AHS log at their request

Thank you for keeping in touch with HP about this. This is dumb :(

latest update from HP...they are sending a new cable

Hello Chris,

Thank you for uploading the AHS logs; below are the findings:

[2018-08-14 19:41:18] INFO: Smart Storage Battery state change: cable error (0x4, 0x0, 0x1)

Cache module detected, type=0x2
ROM Layout ID for PIC: 20
PIC version in ROM=04
Super-cap Status:
Super-cap not attached
PIC running firmware version 04
Flash Memory is present

Firmware version

P840 Array Controller in slot 1
Firmware: 6.60
10 x 1.6 TB SATA SSD Hard Drive(s)

Plan of action

We see that there is a cable error logged even after updating the controller firmware and replacing the battery. We believe that the part shipped could be DOA (dead on arrival).

We are shipping the smart storage battery again; please let us know the status of the server after replacing the battery.

Mentioned in SAL (#wikimedia-operations) [2018-09-27T14:48:48Z] <cmjohnson1> disabling checks on cloudvirt1019 to replace raid controller cable T196507

Mentioned in SAL (#wikimedia-operations) [2018-09-27T15:03:58Z] <arturo> T196507 2h downtime cloudvirt1019 in icinga

Sorry, they sent another new battery. I swapped the battery; let's see if it gets beyond the recharging status.

Updated HP that the status remains the same and that the 3rd battery they sent us still does not fix the problem.

Mentioned in SAL (#wikimedia-operations) [2018-10-10T15:55:08Z] <cmjohnson1> scheduled downtime for host cloudvirt1019 swap raid card T196507

Received the new raid controller and installed it; updating the firmware now. Initially it is showing as a failed raid.

F/W updated, and now I am getting new issues: missing several of the disks. I have to get another AHS report and send it to HP... the saga continues.

Here is where we are with this server....

  • initial order had the wrong raid controller, didn't see all 10 disks
  • received the new raid controller but then we started getting bad battery errors
  • swapped the battery 3 times from HPE
  • HPE sent a new raid card and now we're getting disk errors.

Below is the latest response from HPE on what to do next.

Hi Chris,

Please find our analysis and the action plan:
BIOS:
   System ROM: 05/21/2018
   BIOS Version: 2.60 (P89)

iLO4:
   Firmware Version: 2.60

Storage:
   Controller in slot 1: P840
   Firmware: 6.60

   Location  Port,Box,Bay  Model        Serial Number       Firmware  Capacity  Vendor
   Slot 1    2I,1,1        LK1600GEYMV  BTHC725202VG1P6PGN  HPG2      1.6 TB    Intel
   Slot 1    2I,1,2        LK1600GEYMV  BTHC72640DCE1P6PGN  HPG2      1.6 TB    Intel
   Slot 1    2I,1,3        LK1600GEYMV  BTHC72640DUX1P6PGN  HPG2      1.6 TB    Intel
   Slot 1    2I,1,4        LK1600GEYMV  BTHC72640DXP1P6PGN  HPG2      1.6 TB    Intel
   Slot 1    2I,2,1        LK1600GEYMV  BTHC725404TD1P6PGN  HPG2      1.6 TB    Intel
   Slot 1    2I,2,2        LK1600GEYMV  BTHC725202BL1P6PGN  HPG2      1.6 TB    Intel

IML logs:


Informational,521,271,0x000A,POST Message,,,10/10/2018 17:37:06,69: Option ROM POST Information: Action: Use HP SSA to identify and troubleshoot errors or find drives to replace.

Critical,521,1827,0x0013,Drive Array,,,10/10/2018 17:38:16,70: Internal Storage Enclosure Device Failure (Bay 5, Box 1, Port 1I, Slot 1)

Critical,521,1829,0x0013,Drive Array,,,10/10/2018 17:38:16,71: Internal Storage Enclosure Device Failure (Bay 6, Box 1, Port 1I, Slot 1)

Critical,521,1831,0x0013,Drive Array,,,10/10/2018 17:38:16,72: Internal Storage Enclosure Device Failure (Bay 7, Box 1, Port 1I, Slot 1)

Critical,521,1834,0x0013,Drive Array,,,10/10/2018 17:38:16,73: Internal Storage Enclosure Device Failure (Bay 8, Box 1, Port 1I, Slot 1)

Findings:

  • There are 4 drives showing as failed per the IML logs; they do not appear in the storage section.
  • However, the probability of 4 drives failing at the same time is very low.
  • The storage section from the AHS log is incomplete.
  • We would like to check HPE SSA (Smart Storage Administrator) offline to verify the status of the drives.

Action plan:

Downtime Required

  1. Reboot the server >> F10 >> Intelligent Provisioning
  2. Select Smart Storage Administrator and verify the drives.
  3. Generate the ADU/WearGauge report for further analysis.
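
For reference, an equivalent diagnostic report can usually also be generated from the running OS with the same hpssacli tooling used elsewhere in this task; a rough sketch (the diag subcommand and output path are assumptions, not part of HPE's instructions):

# Dump an ADU-style diagnostic report to a file (assumed syntax):
sudo hpssacli ctrl all diag file=/tmp/cloudvirt1019-adu.zip
# Quick check of physical drive status on the controller in slot 1:
sudo hpssacli ctrl slot=1 physicaldrive all show status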

An update: I received an email from HP last night; they are sending 4 new disks.

The latest update: HPE sent me four new SSDs. I replaced the SSDs, and the disks are showing up as bad in the raid cfg. Maybe a backplane change is required.

I have not heard back from HP yet; I pinged them again.

HP wanted me to reseat the SATA cables, which I did, and now all 10 disks are showing again, but we're back to the original issue of the raid battery not fully charging. The amount of time and energy spent going full circle is frustrating. We're no closer to having this solved than we were when we received the new raid controller cards.

Just got off the phone with HP; they state that they are not seeing any issues with the raid battery in the logs I have sent. They suggest the problem is our reporting tool.

Question: what is the warranty status of this server? Would it make sense to get a more complete replacement from HP (not just some spare pieces like disks and raid controllers)?

@aborrero Unfortunately it's not that simple. Once we take delivery of a server we then have to work through technical support. We may be at the point where the issue needs to be escalated, but I'm waiting on HPE.

Mentioned in SAL (#wikimedia-operations) [2018-11-29T15:35:19Z] <gtirloni> T196507 downtimed and powercycled cloudvirt1019

If the battery is installed and, as the HPE advisories suggest, the firmware has been updated _and_ we have many other servers with this controller that are working fine... it seems to come down to a battery defect.

cloudvirt1019:

# hpssacli ctrl all show detail | grep -Ei '(cache|battery)'
   Cache Serial Number: PEYFP0BRH9H6EP
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: Not Configured
   Cache Ratio: 100% Read / 0% Write
   Read Cache Size: 0 MB
   Write Cache Size: 0 MB
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.2 GB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Recharging <======== same status even after power cycling
   Cache Module Temperature (C): 45

cloudvirt1020:

# hpssacli ctrl all show detail | grep -Ei '(cache|battery)'
   Cache Serial Number: PEYFP0BRH80120
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: Not Configured
   Cache Ratio: 100% Read / 0% Write
   Read Cache Size: 0 MB
   Write Cache Size: 0 MB
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.8 GB
   No-Battery Write Cache: Disabled
   Battery/Capacitor Count: 0  <============= no battery?
   Cache Module Temperature (C): 36

Would it be possible to shut down both servers and swap their batteries to see if the problem changes sides?

Do we have a spare server with the same controller and a working battery to temporarily use on these cloudvirts as a test?

All of that failing, it would seem we need yet another battery replacement.

@RobH @Cmjohnson Can we get a technician from HP on site with various parts (cards, batteries, etc) to try and fix this?

@GTirloni This has been an ongoing thing since August; I have replaced the battery 3, maybe 4, times already, replaced the raid controller once, and replaced 4 SSDs.

Also, yesterday (5 Dec 2018) HPE sent a technician who replaced the motherboard, the backplane, and the battery again, and today (6 December) the battery status is still stuck in the "charging" state. I am not sure what else I can do at this point. I have spent months and replaced almost every part on the server. I am open to suggestions, but I am quite confident that it's not a faulty battery.

@faidon this probably needs to be escalated... can we get this server replaced? The issue stems from them selling us a server with a raid card that could not handle the number of disks installed, after which they had to send us a different raid card.

@Cmjohnson am I correct in understanding that cloudvirt1020 has the exact same issue? Or has that been resolved somehow?

@Andrew yes, you are correct it is the same exact issue. My goal was to work with one, figure out the issue and then go to HPE with a solution but that obviously is not working out so great.

Change 478058 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvirt1019: reimage with Stretch

https://gerrit.wikimedia.org/r/478058

Change 478058 merged by GTirloni:
[operations/puppet@production] cloudvirt1019: reimage with Stretch

https://gerrit.wikimedia.org/r/478058

> @Andrew yes, you are correct it is the same exact issue. My goal was to work with one, figure out the issue and then go to HPE with a solution but that obviously is not working out so great.

That sounds perfectly sensible to me -- just double-checking.

Mentioned in SAL (#wikimedia-operations) [2018-12-06T20:48:47Z] <gtirloni> reimaging cloudvirt1019 with stretch T196507

Stretch did not help; the battery continues showing as recharging.

Smart Array P440ar in Slot 0 (Embedded)
   Cache Serial Number: PDNLH0BRH8227E
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: Not Configured
   Drive Write Cache: Disabled
   Total Cache Size: 2.0 GB
   Total Cache Memory Available: 1.8 GB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   Cache Module Temperature (C): 32
   Driver Supports HPE SSD Smart Path: True
Smart Array P840 in Slot 1
   Cache Serial Number: PEYFP0BRH9H6EP
   Wait for Cache Room: Disabled
   Cache Board Present: True
   Cache Status: Not Configured
   Cache Ratio: 100% Read / 0% Write
   Read Cache Size: 0 MB
   Write Cache Size: 0 MB
   Drive Write Cache: Disabled
   Total Cache Size: 4.0 GB
   Total Cache Memory Available: 3.2 GB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: Recharging
   Cache Module Temperature (C): 47
   Driver Supports HPE SSD Smart Path: True

Change 478098 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvirt1019: reimage with Jessie

https://gerrit.wikimedia.org/r/478098

Change 478098 merged by GTirloni:
[operations/puppet@production] cloudvirt1019: reimage with Jessie

https://gerrit.wikimedia.org/r/478098

Mentioned in SAL (#wikimedia-operations) [2018-12-06T21:39:25Z] <gtirloni> reimaging cloudvirt1019 with jessie T196507

Change 478115 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Disable alerting on cloudvirt1019 and 1020

https://gerrit.wikimedia.org/r/478115

Change 478115 merged by Andrew Bogott:
[operations/puppet@production] Disable alerting on cloudvirt1019 and 1020

https://gerrit.wikimedia.org/r/478115

OK, I had a look at this. A few observations first of all:

  • While not 100% sure, I don't think this is related to the controller having been swapped before. I don't think it fits.
  • cloudvirt1019 & cloudvirt1020 exhibit different symptoms at the moment. 1019 (which @Cmjohnson has been focusing on) shows its battery count as 1 but its status as "recharging", while 1020 shows no battery at all (count = 0).

Other than that... I suspect this whole investigation may be moot. The battery is only useful for the cache (aka BBU), but this is a system that has only SSDs, so the cache is disabled by default in favor of HP SSD Smart Path. I confirmed this was the case: ctrl slot=1 show showed Cache Status: Not Configured and Cache Ratio: 100% Read / 0% Write. Our alerts handle the SSD Smart Path case already and don't alert on "Cache: Not Configured", but they do alert on battery errors (either battery count 0, or battery status "recharging") unless passed --no-battery, which no host currently uses.
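
For illustration only, here is a rough sketch of the battery half of that alerting rule as a shell snippet; this is not the actual get_raid_status_hpssacli plugin, just the behaviour described above, with field names taken from the hpssacli output pasted elsewhere in this task:

#!/bin/bash
# Sketch of the battery alerting rule described above (NOT the real plugin).
# Passing --no-battery skips the battery checks entirely.
no_battery=false
[ "${1:-}" = "--no-battery" ] && no_battery=true
detail=$(sudo hpssacli ctrl slot=1 show)
count=$(echo "$detail" | awk -F': *' '/Battery\/Capacitor Count/ {print $2}')
status=$(echo "$detail" | awk -F': *' '/Battery\/Capacitor Status/ {print $2}')
if ! $no_battery; then
    if [ "${count:-0}" -eq 0 ]; then
        echo "CRITICAL: Battery count: 0"; exit 2
    elif [ -n "$status" ] && [ "$status" != "OK" ]; then
        echo "WARNING: Battery/Capacitor: $status"; exit 1
    fi
fi
echo "OK: battery present and charged"; exit 0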

I had a theory that the battery may stay at "recharging" because the cache isn't being used. So I ran this on cloudvirt1019, to disable the SSD Smart Path and enable caching:

hpssacli ctrl slot=1 array A modify ssdsmartpath=disable
hpssacli ctrl slot=1 logicaldrive 1 modify caching=enable

The output now is:

Cache Status: Temporarily Disabled
Cache Status Details: Cable Error
Cache Ratio: 10% Read / 90% Write
[...]
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: Recharging

…which is at least more information ("cable error").

In terms of next steps I'd say:

  • Check cloudvirt1019's BBU and battery cables and/or replace the battery once again
  • Figure out why cloudvirt1020 reports no battery. Having two systems to test against may help understand this issue better and help exclude some of the causes.

The worst case scenario at this point is that we just get rid of the BBU and/or the battery and pass --no-battery to the check, which should have no ill effect given the cache is already implicitly disabled by default. But let's try figuring this out first.
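
If we do end up going that route, it would presumably amount to invoking the check with the flag, along these lines (a sketch only; whether the flag is passed to the plugin directly or wired in via the Icinga check definition in puppet is an assumption here):

sudo /usr/local/lib/nagios/plugins/get_raid_status_hpssacli --no-battery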

Other than that... please also construct a draft email which includes a timeline of the whole saga and email it to me. Either @Cmjohnson, @RobH, or I will then mail it to Dasher, and in any case I'll make sure to follow up and request immediate resolution. Regardless of whether we identify the issue ourselves or not, we should expect a better level of support from the vendor here.

@faidon, who is 'please also construct a draft email' directed to?

> @faidon, who is 'please also construct a draft email' directed to?

Sorry, re-reading this I can see now how it was confusing! I meant @Cmjohnson :)

*bump* These servers could still use some love.

I submitted the cable error finding to HPE and will see if they can send me new cables. When they came to replace all the parts, they sent the wrong cables for the controller, so that was not changed originally. I will update once they send the cable.

@faidon, I have a spare battery for cloudvirt1020 and will look at this when I return so you can compare the 2

@faidon and all, it looks like we were missing a connection from the raid card to the riser card. This was not anywhere in the instructions that came with the raid card. Fortunately, I still had one, but am missing one for cloudvirt1020. I have already started a ticket with HPE and expect to have one in the next day or two.

cmjohnson@cloudvirt1019:~$ sudo hpssacli ctrl all show detail | grep -Ei '(cache|battery)'

Cache Serial Number: PDNLH0BRH8227E
Wait for Cache Room: Disabled
Cache Board Present: True
Cache Status: Not Configured
Drive Write Cache: Disabled
Total Cache Size: 2.0 GB
Total Cache Memory Available: 1.8 GB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
Cache Module Temperature (C): 32
Cache Serial Number: PEYFP0BRH9H6EP
Wait for Cache Room: Disabled
Cache Board Present: True
Cache Status: OK
Cache Ratio: 10% Read / 90% Write
Drive Write Cache: Disabled
Total Cache Size: 4.0 GB
Total Cache Memory Available: 3.2 GB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
Cache Module Temperature (C): 45

Before these are delivered for implementation, let's make sure that the two systems have identical settings, especially given we've tested various things on them over the past few months. I reverted my SSD Smart Path setting on 1019, but there are still differences; the most important one that I noticed is that in cloudvirt1019 the P440ar is hidden (disabled in BIOS?) but in cloudvirt1020 it's visible. Maybe a factory reset and then manually reapplying the same settings in each?
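
One quick way to spot the remaining configuration differences is to diff the full controller detail from both hosts; a sketch, assuming shell access and passwordless sudo on both (serial numbers and temperatures will of course always differ):

diff <(ssh cloudvirt1019 sudo hpssacli ctrl all show detail) \
     <(ssh cloudvirt1020 sudo hpssacli ctrl all show detail)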

Also, I think you mentioned that in one of these a disk turned up faulty all of a sudden today; I assume this is also being dealt with alongside the cable, right?

To make it clear that we are now putting this HW into service, I'm closing this task. Please re-open if you have any objection.