
an-presto1004 down
Closed, Resolved · Public

Description

an-presto1004 seems completely unresponsive. I tried a power cycle, but it timed out :(

-------------------------------------------------------------------------------
Record:      14
Date/Time:   05/23/2020 03:32:18
Source:      system
Severity:    Critical
Description: CPU 1 has a thermal trip (over-temperature) event.
-------------------------------------------------------------------------------

Event Timeline

elukey created this task. May 23 2020, 8:17 AM
Restricted Application added a project: Operations. May 23 2020, 8:17 AM

I submitted a ticket with Dell for a replacement CPU. SR1025619583

elukey moved this task from Incoming to Radar on the Analytics board. May 26 2020, 7:35 AM

There is a larger issue with this server. I replaced the CPU but noticed that both power supplies have failed. I could also smell burning in the server. I swapped the power supplies with decom spares, and the PSUs started smoking; something definitely burned inside the power supply. This will be down for a while.

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:33 AM

What's the status here? Any feedback from Dell on replacements, etc.?

@wiki_willy do we have a high-level timeline for when we could have the host back in service? We are not in a hurry, but it has been down since the end of May :(

Let me check with @Cmjohnson. He's tied up with PDU upgrades this week, and he's out on vacation half of next week. But let's see if we can at least get a timeframe for you. Thanks, Willy

Cmjohnson reassigned this task from Cmjohnson to RobH. Sep 28 2020, 4:56 PM
Cmjohnson added a subscriber: RobH.

@RobH I did not see any signs of burning inside the chassis.

RobH added a comment. Sep 29 2020, 4:06 PM

Self-dispatch SR1038108849 entered with Chris as the contact. They should call him to schedule the on-site work.

Since this has 'undefined' broken parts, it is easier overall to schedule the Dell tech (since there may be more than one bad part) than to try half a dozen parts dispatches over the course of weeks.

RobH reassigned this task from RobH to Cmjohnson. Oct 1 2020, 5:00 PM

Reassigning to Chris, as I listed him as the contact on the self-dispatch, so the Dell tech can contact him and arrange a time for the on-site work.

The Dell tech came today and replaced the board but did not bring new power supplies. Anyway, with the board swapped, the power supplies still burned up.

The Dell tech is back today with new power supplies. He stripped the system down to the bare minimum and slowly started adding components back; once he connected the backplane, smoke came from it. He removed the connection and added everything else back without incident. Dell will need to send a backplane.

RobH removed a subscriber: RobH. Oct 5 2020, 9:36 PM

I am not sure why the backplane is not here yet. I am calling Dell to follow up.

Spoke with the Dell tech, Chris Bennet, today. Dell dropped the ball: nobody ordered the new part, and our case was left open with no owner. Today a new case for the backplane was opened, and it is being escalated to L3 because it could be a safety issue, since we did have smoke inside the server. The outcome could be anything from a part replacement to a system exchange. Enterprise Service Request 84193619

I am a little bit disappointed; Dell seems to be lagging a lot. Again, this host has been down since May.

Cmjohnson closed this task as Resolved. Tue, Nov 3, 7:45 PM

@elukey the an-presto1004 motherboard and backplane have been replaced. Everything came back up as normal, except that I am not able to SSH into the server, so a fresh install may be needed. While it was down, I updated the iDRAC and BIOS. I am resolving this as the on-site work has been completed. Please reopen if there is still a problem.

Mentioned in SAL (#wikimedia-operations) [2020-11-04T06:47:03Z] <elukey> set an-presto1004's netbox status as "active" (was: failed) after hw maintenance - T253438