ms-be1016 controller cache failure
Closed, ResolvedPublic

Description

Got this while diagnosing the UNKNOWN check for ms-be1016

root@ms-be1016:~# hpssacli controller all show

Smart Array P840 in Slot 1                (sn: PDNNF0ARH7Z0QD)

CACHE STATUS PROBLEM DETECTED: The cache on this controller has a problem.
                               To prevent data loss, configuration changes to
                               this controller are not allowed.
                               Please replace the cache to be able to continue 
                               to configure this controller.

I suspect this needs physical investigation. The data on the machine itself is redundant, but if we plan any long downtime (several days or more) we should rebalance the cluster and migrate data off the machine first.
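As a rough illustration of how the cache status shown above could be detected automatically, here is a minimal Python sketch. It only looks for the marker string and the controller header line quoted in this task. The function names, the regular expression, and the assumption that the message arrives on stdout are all hypothetical and are not part of any existing monitoring check.

```python
import re
import subprocess

# Marker string copied from the hpssacli output quoted in this task.
CACHE_PROBLEM_MARKER = "CACHE STATUS PROBLEM DETECTED"

# Matches header lines such as "Smart Array P840 in Slot 1   (sn: XXX)".
# The pattern is an assumption based on the single sample in this task.
CONTROLLER_RE = re.compile(
    r"^(?P<name>.+?) in Slot (?P<slot>\d+)\s+\(sn: (?P<sn>[^)]+)\)"
)


def find_cache_problems(output):
    """Return slot numbers of controllers whose section reports a cache problem."""
    problems = []
    current_slot = None
    for line in output.splitlines():
        stripped = line.strip()
        match = CONTROLLER_RE.match(stripped)
        if match:
            # Remember which controller the following lines belong to.
            current_slot = int(match.group("slot"))
        elif CACHE_PROBLEM_MARKER in stripped and current_slot is not None:
            problems.append(current_slot)
    return problems


def run_check():
    """Hypothetical wrapper: run the command from this task and parse stdout."""
    result = subprocess.run(
        ["hpssacli", "controller", "all", "show"],
        capture_output=True,
        text=True,
        check=False,
    )
    return find_cache_problems(result.stdout)
```

Calling `find_cache_problems` on the output pasted above should flag slot 1. `run_check` needs to run as root on the host itself.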

Event Timeline

Restricted Application added subscribers: Southparkfan, Aklapper. · Nov 7 2016, 8:25 PM

@fgiunchedi I am going to need to take this server down and remove and re-assemble the controller. Let me know when I can do this.

Thanks

@Cmjohnson can be done at any time as long as a graceful shutdown is used, thanks!

@fgiunchedi I removed and reassembled the RAID card and rebooted. Please take a look and lmk if you see anything unusual.

fgiunchedi closed this task as Resolved. · Nov 17 2016, 6:28 PM

@Cmjohnson looks like that did it, thanks!

hpssacli controller all show reports OK now. In the process the sdc filesystem somehow went unhappy (xfs_admin -l was hanging and using all CPU), so I've reinitialized the disk and it is rebuilding.

fgiunchedi reopened this task as Open. · Dec 9 2016, 10:53 PM

@Cmjohnson looks like we're seeing this again on ms-be1016 :(

root@ms-be1016:~# hpssacli controller all show

Smart Array P840 in Slot 1                (sn: XXX)

CACHE STATUS PROBLEM DETECTED: The cache on this controller has a problem.
                               To prevent data loss, configuration changes to
                               this controller are not allowed.
                               Please replace the cache to be able to continue 
                               to configure this controller.
fgiunchedi triaged this task as High priority. · Dec 9 2016, 11:42 PM

okay... i will have to call HP.

Cmjohnson added a subscriber: RobH. · Jan 23 2017, 3:29 PM

@RobH: The S/N MXQ50702Q5 for ms-be1016 is not showing up as having a contract or a warranty. Could you look into this please? Thanks

The S/N finally shows up. I submitted a case for this.

Your case was successfully submitted. Please note your Case ID: 5318424916 for future reference.

New disk controller arrived...spoke with @fgiunchedi and we'll take care of this in a couple of weeks when he gets back from vacation.

@fgiunchedi I would like to do this today if possible. HP is harassing me about returning the part

@Cmjohnson I'm ok to do this today, LMK when it is a good time for you

fgiunchedi moved this task from Backlog to Blocked on the User-fgiunchedi board.

Mentioned in SAL (#wikimedia-operations) [2017-04-25T14:02:28Z] <godog> poweroff ms-be1016 for controller swap - T150206

Cmjohnson raised the priority of this task from High to Needs Triage. · Apr 25 2017, 8:16 PM

Received the new controller card and installed it, but the server would not boot to the logical drives. I booted into the RAID BIOS and saw that the card is detected and the logical drives did not change. The error I received right before it tried to PXE boot was

304-Keyboard or System Unit Error.

That error is ambiguous and doesn't suggest much other than a failed component (according to the HP website)

I reseated the card and verified that legacy BIOS was selected and the RAID card was the primary boot device. It still failed. I ended up putting the old RAID card back in and the server booted without issue. This will require a phone call to HP because either the card is bad or I missed something.

I am returning the part they sent and need to tell HP that it did not work, so they can send a new part or a tech to look at it.

return shipping tracking UPS 1Z 422 2AR 90 5200 6397

Cmjohnson moved this task from Backlog to Up next on the ops-eqiad board. · Apr 27 2017, 8:41 PM
Cmjohnson moved this task from Up next to Not urgent on the ops-eqiad board. · May 8 2017, 4:47 PM
fgiunchedi moved this task from Blocked to Radar on the User-fgiunchedi board. · Jun 22 2017, 9:25 AM

@fgiunchedi The new raid battery is here...let me know when it's safe to turn off and replace.

Cmjohnson closed this task as Resolved. · Aug 30 2017, 2:41 PM

The battery was replaced.