
Broken memory/CPU on mw1275
Closed, Resolved, Public


mw1275 went down while being reimaged. "racadm getsel" lists a couple of errors, but I'm unsure whether the CPU error is fallout from the broken DIMM or whether both are faulty.

The server is depooled; you can power it off for maintenance at any time.

Record:      2
Date/Time:   04/24/2018 09:53:04
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
Record:      3
Date/Time:   04/24/2018 09:53:13
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
Record:      4
Date/Time:   04/24/2018 10:22:07
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
Record:      5
Date/Time:   04/24/2018 10:22:36
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
Record:      6
Date/Time:   04/24/2018 10:22:36
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
Record:      7
Date/Time:   04/24/2018 12:18:26
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
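Grouping the SEL entries by component makes it easier to see how many point at DIMM_B1 versus CPU 2. The following is a minimal Python sketch, assuming the "Key: value" text layout shown above; `parse_sel` and `group_by_component` are hypothetical helper names, not part of racadm, and the sample is a short excerpt of the log.

```python
import re
from collections import defaultdict

# Excerpt of the SEL output from this task, used as sample input.
SAMPLE_SEL = """\
Record:      2
Date/Time:   04/24/2018 09:53:04
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
Record:      4
Date/Time:   04/24/2018 10:22:07
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
Record:      6
Date/Time:   04/24/2018 10:22:36
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
"""


def parse_sel(text):
    """Turn 'Key: value' lines into one dict per record.

    Assumption: every record starts with a 'Record:' line.
    """
    records = []
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Record":
            records.append({})
        if records:
            records[-1][key] = value
    return records


def group_by_component(records):
    """Group records by the DIMM slot or CPU named in the description."""
    groups = defaultdict(list)
    for rec in records:
        match = re.search(r"DIMM_[A-Z]\d+|CPU \d+", rec.get("Description", ""))
        groups[match.group(0) if match else "other"].append(rec)
    return dict(groups)


if __name__ == "__main__":
    for component, recs in group_by_component(parse_sel(SAMPLE_SEL)).items():
        severities = ", ".join(r["Severity"] for r in recs)
        print(f"{component}: {len(recs)} event(s) [{severities}]")
```

Grouping does not say which component is at fault; it only summarizes what the log already reports.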

Event Timeline

I swapped the DIMM with the one in A1, cleared the SEL, and powered the server back on. Let's see if the error returns and/or moves.
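The reasoning behind the swap can be sketched as follows: if new errors name DIMM_A1, the fault followed the module; if they still name DIMM_B1, it stayed with the slot; if neither appears, reseating may have been enough. This is a rough Python illustration only; `classify_swap_result` is a hypothetical helper, and it assumes the SEL was cleared after the swap and that descriptions name slots as DIMM_<letter><number>.

```python
import re


def classify_swap_result(descriptions, old_slot="DIMM_B1", new_slot="DIMM_A1"):
    """Interpret post-swap SEL descriptions.

    old_slot: slot that originally reported errors.
    new_slot: slot the suspect module was moved to.
    """
    seen = set()
    for description in descriptions:
        seen.update(re.findall(r"DIMM_[A-Z]\d+", description))

    if new_slot in seen and old_slot in seen:
        return "errors in both slots"
    if new_slot in seen:
        return "error moved with the module"
    if old_slot in seen:
        return "error stayed with the slot"
    return "no DIMM errors logged"


if __name__ == "__main__":
    # Hypothetical post-swap entry, for illustration only.
    print(classify_swap_result(
        ["Correctable memory error rate exceeded for DIMM_A1."]
    ))
    # An empty log after the swap.
    print(classify_swap_result([]))
```

An empty result only means nothing has been logged so far, so it is worth checking the SEL again after the host has been under load for a while.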

@MoritzMuehlenhoff The error has not returned, so go ahead and re-install. The error was correctable, so moving and reseating the DIMM may have fixed the issue.

The server has been reimaged and is currently serving production traffic without any issues, closing.