Reboot lvs1019 for memory self-healing
Closed, Resolved · Public

Description

I came across this warning in lvs1019's iDRAC interface, set since Sun Apr 12 2026 04:32:43:

The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.

I'm not seeing any kernel errors related to memory... Regardless, let's depool/reboot lvs1019 so it can self-heal.
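
For reference, here is a minimal sketch of one way to double-check for memory errors from the OS side. It assumes the Linux EDAC driver is loaded and exposes per-controller counters under /sys/devices/system/edac/mc. The function name `read_edac_counts` is hypothetical and this is not necessarily how the check was done for this task.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per memory controller.

    "ce" is the corrected-error count and "ue" the uncorrected-error count.
    The default root is an assumption; EDAC may not be loaded on every host.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC sysfs tree: nothing to report (this is not proof of health).
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this controller.
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

Zero counters here would be consistent with the kernel reporting nothing, but the iDRAC memory health monitor may flag degradation before the OS sees any errors, so the reboot is still warranted.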

Further reading: https://www.dell.com/support/kbdoc/en-us/000053203/what-is-ddr4-self-healing-on-dell-poweredge-servers-with-intel-xeon-scalable-processors

Dell recommends updating BIOS before rebooting:

update BIOS to the latest revision that includes many memory Self-healing capabilities and ongoing enhancements

Event Timeline

BCornwall triaged this task as Medium priority. May 12 2026, 6:05 PM

Once we reboot for T426585, we can consider this resolved as well.

@ssingh The Dell docs mention updating the BIOS:

update BIOS to the latest revision that includes many memory Self-healing capabilities and ongoing enhancements

However, that would put our version ahead of the rest of the fleet. Is that acceptable or should I just keep it as-is?

I think it should be fine as a one-off. Also, we will be doing away with these in Q1/Q2 as part of the eqiad LVS refresh, so they will eventually be upgraded.

Mentioned in SAL (#wikimedia-operations) [2026-05-28T18:34:42Z] <brett> Stopping pybal/puppet/downtiming lvs1019.eqiad.wmnet for reboot and BIOS update/memory self-healing - T426109

BIOS updated to 2.27.0.

After a reboot:

The self-heal operation successfully completed at DIMM DIMM_B2.
A problem was detected during Power-On Self-Test (POST).
The self-heal operation successfully completed at DIMM DIMM_B2.

I did a second, cold reboot just in case, given the scary message in the middle there, but all seems well.