
an-worker1165: Broken RAM
Closed, Resolved | Public

Description

an-worker1165 went down a few hours ago. According to the SEL (System Event Log), it has broken RAM:

Record:      603
Date/Time:   10/29/2024 06:23:40
Source:      system
Severity:    Critical
Description: A critical diagnostic event occurred in the memory device at B4. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
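
For anyone triaging similar events, here is a minimal sketch of how the failing slot could be pulled out of a SEL record like the one above. It assumes the record arrives as plain "Key: value" text in exactly this layout. The helper names `parse_sel_record` and `memory_slot` are hypothetical and are not part of any existing tooling.

```python
import re

# Sample record copied from the task description above.
SEL_RECORD = """\
Record:      603
Date/Time:   10/29/2024 06:23:40
Source:      system
Severity:    Critical
Description: A critical diagnostic event occurred in the memory device at B4. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
"""


def parse_sel_record(text):
    """Split 'Key: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps and the
        # description text keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def memory_slot(description):
    """Return the DIMM slot named in the description, or None.

    Assumes the wording 'memory device at <slot>' seen in the record above.
    """
    match = re.search(r"memory device at ([A-Z]\d+)", description)
    return match.group(1) if match else None


record = parse_sel_record(SEL_RECORD)
slot = memory_slot(record["Description"])
print(record["Severity"], slot)
```

The regular expression is tied to the message wording in this one record, so other event types or firmware versions would likely need their own patterns.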

Details

Other Assignee
bking

Event Timeline

DC Ops, this host is hard down. Feel free to replace the RAM or take any other action needed to restore it to working condition at your convenience (this is not an emergency).

Assigning to @VRiley-WMF per our IRC conversation. Feel free to hit me back if you have any further questions (my handle is inflatador).

bking updated Other Assignee, added: bking.
bking added a subscriber: VRiley-WMF.

Dell currently has a service request open for this. We should receive the replacement part tomorrow.

Service request number: 200116199
Work order number: 455501551
Replacement part shipped: 1 x DIMM,32GB,3200,2RX8,16G,DDR4,R.

Mentioned in SAL (#wikimedia-operations) [2024-10-29T19:55:45Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on an-worker1165.eqiad.wmnet with reason: T378454

Mentioned in SAL (#wikimedia-operations) [2024-10-29T19:56:01Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on an-worker1165.eqiad.wmnet with reason: T378454

VRiley-WMF changed the task status from Open to In Progress. Oct 30 2024, 4:40 PM

@bking and @MoritzMuehlenhoff We have received the memory and I will replace it very soon. I will update this task when the replacement is completed.

On the DPE side, I've confirmed that the host is back up and part of the cluster using these instructions (which I just added myself). Moving to "done" on our workboard...