cp3055 crashed
Open, Medium, Public

Description

The host crashed on 20191210T19:33. We might want to apply the same firmware update that we applied to cp3053 in T239041. Perhaps just update all hosts? CC @MoritzMuehlenhoff @RobH.

Event Timeline

ema triaged this task as Medium priority. Dec 11 2019, 9:19 AM
ema created this task.

Given that hosts were still showing these symptoms even after the firmware updates, this wouldn't hurt, but I doubt it's a complete fix. I wrote up a proposal at https://phabricator.wikimedia.org/T238305#5731421, let's proceed there.

jcrespo assigned this task to ema. Dec 11 2019, 5:41 PM
jcrespo added a subscriber: jcrespo.

Proposing to merge this ticket into T238305 (or resolve it), unless there are host-specific tasks pending for cp3055, such as upgrading the firmware, in which case it should be assigned to someone who could do that (@RobH remotely?). Ema, thoughts on this?

RobH added a comment. Dec 11 2019, 5:47 PM

I was tagged into this, so I'm guessing the info is needed for firmware?

The server is running the following:

Bios 2.2.11 - this is very outdated, urgent flagged update currently is 2.4.8
ilom 3.34.34.34 - this is outdated, current version is 4.00.00.00.

This data is available using the service tag on support.dell.com without any login (anyone can pull this info, so I'm not a blocker on this ; )
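For anyone scripting this check across hosts, here is a minimal sketch of comparing the installed versions against the current ones. The version strings are the ones quoted above; the `firmware` dict and the helper names are made up for illustration and are not part of any Dell tooling.

```python
def parse_version(text):
    """Split a dotted version string such as '2.2.11' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_outdated(installed, available):
    """Return True when the installed version is older than the available one."""
    return parse_version(installed) < parse_version(available)


# (installed, current) pairs as reported in this comment.
firmware = {
    "bios": ("2.2.11", "2.4.8"),
    "ilom": ("3.34.34.34", "4.00.00.00"),
}

outdated = {name: is_outdated(cur, new) for name, (cur, new) in firmware.items()}
print(outdated)
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as 2.10.0 and 2.4.8.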

I'm not going to touch the firmware though unless a task is very specifically assigned to me to do so, as it seems troubleshooting is ongoing and I don't want to mess anything up.

jcrespo added a comment. Edited Dec 11 2019, 5:50 PM

@RobH Indeed, this would need owner confirmation and depooling; I'm not asking you to do anything. I was tagging you just to confirm that a remote upgrade is possible and reasonable for the 3xxx datacenter, given its particular location. It could be done by me or ema if you don't have any objection. Also, ema is the right person right now to decide on follow-up.

RobH added a comment. Dec 11 2019, 5:55 PM

We've had good results remotely flashing the bios and ilom via the https drac interface (via an ssh tunnel used as an https proxy). I upload the ilom update first via the server's https mgmt interface (since an updated mgmt interface usually results in better updates), and then update the bios. The entire .exe download file from support.dell.com is uploaded to the server (no unpacking required) and the mgmt interface handles it. Happy to help on this if/when the time comes.
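As a rough illustration of the tunnel step, the sketch below builds an ssh invocation that forwards a local port to the mgmt interface's https port. It assumes a plain local port forward (`-L`); the actual setup may use a different kind of proxy. The host names, the local port, and the `tunnel_command` helper are hypothetical.

```python
import shlex


def tunnel_command(mgmt_host, jump_host, local_port=8443, remote_port=443):
    """Build an ssh invocation that forwards a local port to the mgmt https port."""
    forward = f"{local_port}:{mgmt_host}:{remote_port}"
    # -N: do not run a remote command; -L: local port forwarding.
    return ["ssh", "-N", "-L", forward, jump_host]


# Hypothetical host names, used only for illustration.
cmd = tunnel_command("cp3055.mgmt.example", "bastion.example")
print(shlex.join(cmd))
```

With a tunnel like this running, the mgmt web interface would be reachable at https://localhost:8443/ in a local browser, and the firmware file can be uploaded from there.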

ema moved this task from Triage to Hardware on the Traffic board. Dec 13 2019, 8:46 AM

The host crashed again today. Nothing in racadm; I checked both getsel and lclog view.

Mentioned in SAL (#wikimedia-operations) [2019-12-21T23:12:13Z] <volans> powercycle cp3055 - T240425

Nothing in the host logs either. For the record, it crashed 7 minutes after cp3051 (see T241306), and both are part of the upload esams cluster.

Like cp3051, this one also logged a bunch of "kvm: disabled by bios" messages during boot.

Down again, 2020-01-07, 22:44:19ish based on icinga IRC message

Mentioned in SAL (#wikimedia-operations) [2020-01-07T23:02:37Z] <cdanis> cp3055.mgmt% racadm serveraction powercycle T240425

Nothing in racadm getsel or racadm lclog view (latter just has me logging in over SSH).
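Since the same two log checks get repeated after every crash, here is a small sketch that builds the corresponding ssh invocations for a given mgmt host. The racadm subcommands are the ones mentioned in this task; the `root` login, the host name, and the helper name are assumptions for illustration.

```python
# racadm subcommands referenced in this task for checking hardware logs.
RACADM_LOG_COMMANDS = ("racadm getsel", "racadm lclog view")


def log_check_commands(mgmt_host, user="root"):
    """Return one ssh argument list per racadm log command for the given mgmt host."""
    target = f"{user}@{mgmt_host}"
    return [["ssh", target, command] for command in RACADM_LOG_COMMANDS]


# Hypothetical mgmt host name, used only for illustration.
for argv in log_check_commands("cp3055.mgmt.example"):
    print(" ".join(argv))
```

Each argument list could be passed to `subprocess.run` to collect the output, which makes it easier to compare logs across crashes and across hosts such as cp3051.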