
mw2264 went down
Closed, Resolved, Public

Description

mw2264 went down today and is also unreachable via the serial console. Can you please have a look?

It has been set as inactive in conftool so that it doesn't interfere with deployments.

Event Timeline

Record:      5
Date/Time:   08/30/2021 10:22:49
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   08/30/2021 10:43:31
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   09/02/2021 09:25:49
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      8
Date/Time:   09/02/2021 09:25:49
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   09/02/2021 09:25:49
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      10
Date/Time:   09/02/2021 09:25:49
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------

@Papaul Hi, this host went down as described and I pasted the relevant entries from 'racadm getsel' above.

As you can see, it looks like DIMM B1 is broken.

It is in rack B3, and the purchase date was 2018-02-20.

Can we get the DIMM replaced under warranty? Thank you.
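
For anyone who wants to summarize entries like the ones pasted above, here is a minimal sketch that groups SEL records by DIMM slot. It assumes the text layout shown in this task ("Field: value" lines, records separated by a line of dashes). The function names `parse_sel` and `dimm_error_counts` are hypothetical, and the sample holds only two of the records above.

```python
import re
from collections import Counter

# Two records copied from the 'racadm getsel' output pasted in this task.
SAMPLE_SEL = """\
Record:      5
Date/Time:   08/30/2021 10:22:49
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   09/02/2021 09:25:49
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into records, one dict of field name -> value each."""
    records = []
    # Records are separated by a line consisting only of dashes.
    for chunk in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            # Split on the first colon only, so the time stamp stays intact.
            key, sep, value = line.partition(":")
            if sep and key.strip():
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def dimm_error_counts(records):
    """Count records per DIMM slot, skipping entries with Severity 'Ok'."""
    counts = Counter()
    for rec in records:
        if rec.get("Severity") == "Ok":
            continue
        for slot in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            counts[slot] += 1
    return counts


if __name__ == "__main__":
    for slot, n in dimm_error_counts(parse_sel(SAMPLE_SEL)).items():
        print(f"{slot}: {n} non-Ok record(s)")
```

Run against the full log above, every non-Ok record with a slot name would point at DIMM_B1, which is what suggests a single faulty module or slot.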

Dzahn triaged this task as Medium priority. Thu, Sep 2, 1:16 PM

@Dzahn First, let us swap A1 with B1 and see if we still have the error on A1. Memory swap complete, and iDRAC upgraded from 2.50 to 2.80. I will leave the task open until next week.

Thanks
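
The reasoning behind the swap can be written down as a small decision helper. This is only an illustrative sketch of the usual swap test, not a tool used in this task, and the function name `interpret_swap` and its return strings are hypothetical. It assumes the suspect module was moved from its original slot into the other slot and that the slots reporting errors afterwards are known.

```python
def interpret_swap(original_slot, new_slot, slots_with_errors_after):
    """Interpret a DIMM swap test.

    original_slot: slot that reported errors before the swap (e.g. "B1").
    new_slot: slot the suspect module was moved into (e.g. "A1").
    slots_with_errors_after: slots reporting errors after the swap.
    """
    after = set(slots_with_errors_after)
    if new_slot in after and original_slot not in after:
        # The errors followed the module to its new slot.
        return "module suspect"
    if original_slot in after and new_slot not in after:
        # The errors stayed with the slot despite a different module.
        return "slot or board suspect"
    if not after:
        # Nothing reported yet; keep monitoring before drawing conclusions.
        return "no errors observed"
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical outcomes for a B1 <-> A1 swap.
    print(interpret_swap("B1", "A1", ["A1"]))
    print(interpret_swap("B1", "A1", ["B1"]))
    print(interpret_swap("B1", "A1", []))
```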

@Dzahn I checked the server today and there are no errors showing on A1, so I am closing this task. If we see the error again, please reopen the task.

Thanks

Thank you, @Papaul. I will repool the server.

Mentioned in SAL (#wikimedia-operations) [2021-09-07T13:49:41Z] <mutante> mw2264 - scap pulled and repooled after T290242