The host crashed on 20191210T19:33. We might want to apply the same firmware update that we applied to cp3053 in T239041. Perhaps just update all hosts? CC @MoritzMuehlenhoff @RobH .
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T238305 Servers freezing across the caching cluster
Duplicate | | ema | T240425 cp3055 crashed
Event Timeline
Given that hosts with the firmware update applied were still showing these symptoms, this wouldn't hurt, but I doubt it's a complete fix. I wrote up a proposal at https://phabricator.wikimedia.org/T238305#5731421, let's proceed there.
I was tagged into this, so I'm guessing the info is needed for firmware?
The server is running the following:
BIOS 2.2.11 - this is very outdated; the current update, flagged urgent, is 2.4.8.
ilom 3.34.34.34 - this is outdated; the current version is 4.00.00.00.
This data is available using the service tag on support.dell.com without any login (anyone can pull this info, so I'm not a blocker on this ;)
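For anyone scripting the same check across several hosts, here is a minimal sketch of comparing the installed versions against the current ones. The version strings are the ones quoted above; the function names and the dictionary layout are hypothetical, and it assumes the versions are purely numeric dotted strings.

```python
def parse_version(text):
    """Split a dotted version string such as '2.2.11' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_outdated(installed, available):
    """Return True when the installed version is older than the available one."""
    return parse_version(installed) < parse_version(available)


# (installed, current) pairs as quoted in this comment.
firmware = {
    "bios": ("2.2.11", "2.4.8"),
    "ilom": ("3.34.34.34", "4.00.00.00"),
}

outdated = {name: is_outdated(cur, new) for name, (cur, new) in firmware.items()}
print(outdated)
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as 2.10 and 2.9.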
I'm not going to touch the firmware though unless a task is very specifically assigned to me to do so, as it seems troubleshooting is ongoing and I don't want to mess anything up.
@RobH Indeed, this would need owner confirmation and depooling; I'm not asking you to do anything. I was tagging you just to confirm that a remote upgrade is possible and reasonable for the 3xxx datacenter, given its particular location. It could be done by me or ema if you don't have any objection. Also, ema is the right person right now to decide on the follow-up.
We've had good results remotely flashing the bios and ilom via the https drac interface (via ssh tunnel for https proxy). I upload the ilom update first via the server's https mgmt interface (since an updated mgmt interface usually results in better updates), and then update the bios. The entire .exe download file from support.dell.com is uploaded to the server (no unpacking required) and the mgmt interface handles it. Happy to help on this if/when the time comes.
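As a rough illustration of the ssh tunnel step, the sketch below builds an `ssh` command that forwards a local port to the mgmt interface's HTTPS port. It is only one possible setup: the bastion and mgmt hostnames, the local port, and the function name are all hypothetical, and the setup described above may well use a SOCKS proxy (`ssh -D`) instead of a plain local forward.

```python
import shlex


def tunnel_command(jump_host, mgmt_host, local_port=8443, remote_port=443):
    """Build an ssh argv that forwards a local port to the mgmt HTTPS port."""
    forward = f"{local_port}:{mgmt_host}:{remote_port}"
    # -N: run no remote command, forward only; -L: local port forwarding.
    return ["ssh", "-N", "-L", forward, jump_host]


# Hypothetical hostnames, for illustration only.
cmd = tunnel_command("bastion.example.org", "cp3055.mgmt.example.org")
print(shlex.join(cmd))
```

With such a tunnel open, the mgmt web interface would be reachable at https://localhost:8443/ in a local browser, and the .exe can be uploaded from there.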
Mentioned in SAL (#wikimedia-operations) [2019-12-21T23:12:13Z] <volans> powercycle cp3055 - T240425
Nothing on the host logs either. For the record it crashed 7 minutes after cp3051 (see T241306) and both are part of the upload esams cluster.
Like cp3051, this one also logged a bunch of "kvm: disabled by bios" messages during boot.
Mentioned in SAL (#wikimedia-operations) [2020-01-07T23:02:37Z] <cdanis> cp3055.mgmt% racadm serveraction powercycle T240425
Nothing in racadm getsel or racadm lclog view (latter just has me logging in over SSH).