
cp1085 memory errors on DIMM A5
Closed, Resolved (Public)

Description

from racadm getsel:

Record:      118
Date/Time:   03/07/2022 14:54:26
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.
-------------------------------------------------------------------------------
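The record above follows a simple `Key: value` layout, so the affected slot can be pulled out mechanically. The following is a minimal sketch, not part of any tooling referenced in this task: the helper names `parse_sel_record` and `faulty_dimms` are hypothetical, and the `DIMM_<letter><number>` pattern is assumed from the single record shown here.

```python
import re

# The SEL record exactly as quoted in this task's description.
SEL_RECORD = """\
Record:      118
Date/Time:   03/07/2022 14:54:26
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A5.
-------------------------------------------------------------------------------
"""


def parse_sel_record(text):
    """Hypothetical helper: turn 'Key: value' lines into a dict.

    Splits on the first colon only, so the time in 'Date/Time' stays intact.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the dashed separator and blank lines
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def faulty_dimms(record):
    """Hypothetical helper: DIMM slots named in a Critical record.

    The slot pattern is an assumption based on the one record above.
    """
    if record.get("Severity") != "Critical":
        return []
    return re.findall(r"DIMM_[A-Z]\d+", record.get("Description", ""))


record = parse_sel_record(SEL_RECORD)
print(record["Record"], faulty_dimms(record))
```

Splitting on the first colon only is the one design choice worth noting, since the timestamp value itself contains colons.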

Related Objects

Status: Resolved
Assigned: Cmjohnson

Event Timeline

Vgutierrez triaged this task as Medium priority. (Mar 7 2022, 3:01 PM)
Vgutierrez created this task.

Mentioned in SAL (#wikimedia-operations) [2022-03-07T15:39:58Z] <vgutierrez@cumin1001> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp1085.eqiad.wmnet with reason: HW issues see T303183

Icinga downtime set by vgutierrez@cumin1001 for 30 days, 0:00:00 1 host(s) and their services with reason: HW issues see T303183

cp1085.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-03-07T15:40:03Z] <vgutierrez@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp1085.eqiad.wmnet with reason: HW issues see T303183

Could we replace the faulty DIMM somehow? Missing one server on text@eqiad is far from an ideal scenario.

wiki_willy added a subtask: Unknown Object (Task). (Mar 7 2022, 5:14 PM)
wiki_willy added a subscriber: RobH.

No problem @Vgutierrez. I just created T303203 with @RobH to procure a replacement DIMM.

Thanks,
Willy

@Vgutierrez the new DIMM is here. Please let me know when I can make the swap.

Received the DIMM and replaced it. Resolving this task.

Cmjohnson closed subtask Unknown Object (Task) as Resolved. (Mar 22 2022, 4:11 PM)