codfw: ml-serve2001 memory issue DIMM A2
Closed, Resolved, Public

Description

The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.

Please reboot the server to see if that clears the error.

Thanks.

Event Timeline

Volans triaged this task as High priority. Jul 26 2022, 6:23 PM
Volans added a project: Machine-Learning-Team.
Volans added subscribers: klausman, elukey.

@Papaul host rebooted! It is not running any K8s pods at the moment so if any maintenance is needed, feel free to downtime and go ahead :)

For the ML-Team - the node is cordoned, please uncordon before closing :)

Icinga downtime and Alertmanager silence (ID=b087dff3-f32b-4842-9f10-401f09f59c0c) set by klausman@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: memtest86+ run

ml-serve2001.codfw.wmnet

Ok, the machine is booted and sitting in GRUB. @Papaul I can't seem to run memtest86+ via idrac (I just get a black screen). Can you check whether it works with direct access? Alternatively, do you know how to run it so that console redirection works? Thanks!

The reboot fixed the DIMM error for now:

	The self-heal operation successfully completed at DIMM DIMM_A2. 	Wed 27 Jul 2022 09:06:24
	The self-heal operation successfully completed at DIMM DIMM_B1. 	Wed 27 Jul 2022 09:06:24
	The self-heal operation successfully completed at DIMM DIMM_B2. 	Wed 27 Jul 2022 09:06:24
	The self-heal operation successfully completed at DIMM DIMM_B2. 	Wed 27 Jul 2022 09:06:24
Papaul claimed this task.