ms-be1016 controller cache failure
Closed, Resolved, Public

Description

Got this while diagnosing the UNKNOWN check for ms-be1016

root@ms-be1016:~# hpssacli controller all show

Smart Array P840 in Slot 1                (sn: PDNNF0ARH7Z0QD)

CACHE STATUS PROBLEM DETECTED: The cache on this controller has a problem.
                               To prevent data loss, configuration changes to
                               this controller are not allowed.
                               Please replace the cache to be able to continue 
                               to configure this controller.

I suspect this needs physical investigation. While the data on the machine itself is redundant, if we plan any long downtime (a few days or more) we should rebalance the cluster and migrate data off the machine.
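For anyone who wants to spot this condition without reading the output by eye, here is a minimal sketch that scans the text printed by hpssacli controller all show for the banner above. The function name find_cache_problems is hypothetical, and the parsing assumes the output layout matches what is pasted in this task (a "Smart Array" header line followed by the banner); other hpssacli versions may print something different.

```python
from typing import List

# Banner text copied from the hpssacli output pasted in this task.
CACHE_PROBLEM_MARKER = "CACHE STATUS PROBLEM DETECTED"


def find_cache_problems(output: str) -> List[str]:
    """Return the header line of each controller that reports a cache problem.

    Assumption: each controller section starts with a line beginning with
    "Smart Array", as in the output pasted above.
    """
    problems = []
    current_controller = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Smart Array"):
            # Collapse the column padding so the header reads as one phrase.
            current_controller = " ".join(stripped.split())
        elif CACHE_PROBLEM_MARKER in stripped and current_controller is not None:
            problems.append(current_controller)
    return problems


if __name__ == "__main__":
    sample = """
Smart Array P840 in Slot 1                (sn: PDNNF0ARH7Z0QD)

CACHE STATUS PROBLEM DETECTED: The cache on this controller has a problem.
                               To prevent data loss, configuration changes to
                               this controller are not allowed.
"""
    for controller in find_cache_problems(sample):
        print("cache problem on:", controller)
```

On the host itself the same function could be fed the real command output (for example captured with subprocess.run), but that part is left out here since it needs root and the HP tooling installed.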

Event Timeline

@fgiunchedi I am going to need to take this server down and remove and re-assemble the controller. Let me know when I can do this.

Thanks

@Cmjohnson can be done at any time as long as a graceful shutdown is used, thanks!

@fgiunchedi I removed and reassembled the raid card and rebooted. Please take a look and lmk if you see anything unusual.

@Cmjohnson looks like that did it, thanks!

hpssacli controller all show shows OK now. In the process the sdc filesystem somehow went unhappy (xfs_admin -l was hanging and using all CPU), so I've reinitialized the disk and it is rebuilding.

@Cmjohnson looks like we're seeing this again on ms-be1016 :(

root@ms-be1016:~# hpssacli controller all show

Smart Array P840 in Slot 1                (sn: XXX)

CACHE STATUS PROBLEM DETECTED: The cache on this controller has a problem.
                               To prevent data loss, configuration changes to
                               this controller are not allowed.
                               Please replace the cache to be able to continue 
                               to configure this controller.

@RobH: The s/n MXQ50702Q5 for ms-be1016 is not showing up as having a contract or a warranty. Could you look into this please. Thanks

The S/N finally shows up. I submitted a case for this.

Your case was successfully submitted. Please note your Case ID: 5318424916 for future reference.

New disk controller arrived...spoke with @fgiunchedi and we'll take care of this in a couple of weeks when he gets back from vacation.

@fgiunchedi I would like to do this today if possible. HP is harassing me about returning the part

@Cmjohnson I'm ok to do this today, LMK when it is a good time for you

Mentioned in SAL (#wikimedia-operations) [2017-04-25T14:02:28Z] <godog> poweroff ms-be1016 for controller swap - T150206

Cmjohnson raised the priority of this task from High to Needs Triage. Apr 25 2017, 8:16 PM

Received the new controller card and installed it, but the server would not boot to the logical drives. I booted into the raid bios and saw that the card is there and the logical drives did not change. The error I received right before it tried to PXE boot was:

304-Keyboard or System Unit Error.

That error is ambiguous and doesn't suggest much other than a failed component (according to the HP website)

I reseated the card, verified legacy bios was selected and the raid card was the primary boot device. Still failed. I ended up putting the old raid card back in and the server booted without issue. This will require a phone call to HP because it could either be a bad card or I missed something.

I am returning the part they sent and need to tell HP that it did not work, so they can send a new part or a tech to look at it.

return shipping tracking UPS 1Z 422 2AR 90 5200 6397

@fgiunchedi The new raid battery is here...let me know when it's safe to turn off and replace.

The battery was replaced.