cp5002 memory errors on DIMM A4
Closed, Resolved, Public

Description

cp5002.eqsin.wmnet reports the following error from racadm getsel:

Record:      36
Date/Time:   04/05/2022 01:10:38
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      37
Date/Time:   04/05/2022 01:10:38
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A4.
-------------------------------------------------------------------------------
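As an illustration of how the SEL output above can be triaged, here is a minimal Python sketch that splits `getsel`-style text into records and lists the DIMM slots named in Critical entries. The function names `parse_sel` and `critical_dimms` are hypothetical, and the sketch assumes the `Key: value` layout and dashed separators shown in this task; it is not part of any existing Wikimedia tooling.

```python
import re

# Sample text copied from the racadm getsel output in this task.
SAMPLE = """\
Record:      36
Date/Time:   04/05/2022 01:10:38
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      37
Date/Time:   04/05/2022 01:10:38
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A4.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split getsel-style output into one dict per record (hypothetical helper)."""
    records = []
    # Assumption: records are separated by lines consisting only of dashes.
    for chunk in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        record = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so timestamps stay intact.
            key, sep, value = line.partition(":")
            if sep and key.strip():
                record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def critical_dimms(records):
    """Return DIMM locations mentioned in records with Severity 'Critical'."""
    dimms = []
    for record in records:
        if record.get("Severity") == "Critical":
            dimms.extend(re.findall(r"DIMM_[A-Z]\d+", record.get("Description", "")))
    return dimms


records = parse_sel(SAMPLE)
print(critical_dimms(records))  # ['DIMM_A4']
```

Run against the two records above, this reports only `DIMM_A4`, since record 36 has severity Ok.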

Related Objects

Status: Resolved
Assigned: ssingh

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-04-05T01:59:26Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423

Mentioned in SAL (#wikimedia-operations) [2022-04-05T01:59:28Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423

wiki_willy subscribed.

Hi @ssingh - since this server is out of warranty and due to be refreshed in a few quarters, do you still want us to purchase a replacement DIMM to keep it up and running in the meantime or are you able to wait it out? Thanks, Willy

RobH triaged this task as Medium priority. Apr 11 2022, 2:19 PM
RobH moved this task from Backlog to Hardware Failure / Repair on the ops-eqsin board.

Hi @wiki_willy: Sorry for the delayed response; I wanted to discuss this with the rest of the team. We decided that it makes sense to proceed with the replacement DIMM for now to keep the server running.

Thank you for checking!

Thanks @ssingh. Rob's working on sourcing the replacement DIMM, so we should have that sorted out soon, and will keep you in the loop via an adjacent procurement task. Thanks, Willy

RobH added a subtask: Unknown Object (Task). Apr 12 2022, 7:51 PM

Jin will be onsite on May 4th @ 9AM Singapore Time to swap this memory out

Order Number - 1-216864938761

RobH subscribed.

This host has had its RAM replaced and booted into the OS successfully, detecting all memory without errors.

When we were replacing the memory, the host lost its BIOS time/date because the CR battery on the mainboard has discharged. This is only an issue if power is lost entirely, which requires the date/time to be set again. I'm not sure that it's worth the downtime to swap out mainboard batteries when these hosts are due for replacement in Q3 of next fiscal year.

@ssingh: Would you be the one to return this host to service? If so, can you do so and resolve this task when done? Thank you!

Thanks very much for your help @RobH! I have returned the host to service and we can consider this resolved. Much appreciated!

@ssingh the host still has "failed" as its status in Netbox:
https://netbox.wikimedia.org/dcim/devices/1611/

Thanks for letting me know @PPaul; I am resolving some errors on the host and will then change the status in Netbox.

RobH closed subtask Unknown Object (Task) as Resolved. May 18 2022, 10:36 PM