
Memory correctable errors -EDAC- elastic1029
Closed, Resolved (Public)

Description

There's a warning alert in Icinga for "Memory correctable errors -EDAC-" on elastic1029.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 21 2019, 9:31 AM
Mathew.onipe closed this task as Invalid. Jan 23 2019, 8:54 AM

I'm closing this task as invalid; I no longer see any error.

Dzahn reopened this task as Open. May 11 2019, 1:48 AM
Dzahn added a subscriber: Dzahn.

It's back:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic1029&service=Memory+correctable+errors+-EDAC-

Current Status: CRITICAL
(for 1d 10h 32m 40s)
Status Information: 4.001 ge 4
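For context, the status line appears to compare a measured value (4.001) against a critical threshold (4) using a "greater than or equal" test. The sketch below shows one way such a comparison could be reproduced on the host by reading the kernel's EDAC correctable-error counters from sysfs. It is an illustration only: the function names and the default threshold are assumptions, and the actual Icinga check may take its value from a different source, such as a metrics system.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {memory_controller: correctable_error_count} from EDAC sysfs.

    Each memory controller directory (mc0, mc1, ...) exposes a ce_count file.
    Controllers whose counter cannot be read or parsed are skipped.
    """
    counts = {}
    for ce_file in sorted(Path(edac_root).glob("mc*/ce_count")):
        try:
            counts[ce_file.parent.name] = int(ce_file.read_text().strip())
        except (OSError, ValueError):
            continue
    return counts


def check_status(value, critical=4.0):
    """Mimic a 'value ge threshold' test. The threshold of 4 is taken from
    the status line above; the function itself is hypothetical."""
    return "CRITICAL" if value >= critical else "OK"


if __name__ == "__main__":
    counts = read_ce_counts()
    total = sum(counts.values())
    print(counts, total, check_status(total))
```

Running this on the affected host before and after a counter reset would show which memory controller, if any, is accumulating correctable errors.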

ArielGlenn triaged this task as Normal priority. May 15 2019, 10:41 AM

Error reset as documented in Monitoring/Memory.

@Cmjohnson this seems to happen often enough that we probably need to have a look at those memory modules. What do you need from me to move forward?

Note that you can downtime and shutdown this server whenever you need.

Gehel added a comment. Jun 11 2019, 8:37 AM

@Cmjohnson any news on this? Do you need anything from our side?

@Gehel you will need to take the server offline for a day so I can reseat the DIMM. The server logs do not indicate any memory errors. If you want to downtime it for Wednesday or Thursday let me know.

Mentioned in SAL (#wikimedia-operations) [2019-06-11T15:41:10Z] <gehel> shutting down elastic1029 for investigation - T214283

Gehel added a comment. Jun 11 2019, 3:42 PM

@Cmjohnson elastic1029 is shut down and downtimed in icinga, do whatever you need to do and restart whenever it is done.

Mentioned in SAL (#wikimedia-operations) [2019-06-18T11:22:35Z] <akosiaris> set elastic1029 as inactive in all conftool data. Command was sudo confctl select "name=elastic1029.eqiad.wmnet" set/pooled=inactive T214283

The DIMM has been reseated and swapped to the opposite sides.

Cmjohnson closed this task as Resolved. Jun 19 2019, 4:10 PM
Cmjohnson claimed this task.

Closing this for now; let me know if there is another issue. Keep in mind this server is out of warranty.

Mentioned in SAL (#wikimedia-operations) [2019-06-19T16:23:29Z] <onimisionipe> pooling elastic1029 - T214283

Papaul removed a subscriber: Papaul. Aug 29 2019, 3:13 AM
RobH added a comment. Aug 29 2019, 3:45 PM

This shows no memory errors in the system event log:

/admin1-> racadm getsel
Record:      1
Date/Time:   10/06/2014 10:01:16
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   10/23/2014 13:56:59
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   10/23/2014 13:57:04
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   08/03/2017 14:22:24
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/03/2017 14:22:30
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   08/16/2018 16:19:51
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   08/16/2018 16:19:57
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   06/19/2019 15:44:46
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   06/19/2019 15:44:51
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
/admin1->

The next step would be to reboot the machine into the Dell ePSA tool and run memtest via that tool. Can this system be taken offline for this work?

I don't see any SEL output pasted into this task showing the original errors. The log is long and goes back years, yet the original error doesn't appear in it either.
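As a rough illustration of how a long SEL dump like the one above could be scanned for memory events, here is a minimal sketch that parses the `Record` / `Date/Time` / `Source` / `Severity` / `Description` layout and filters on keywords. The function names and the keyword list are assumptions made for this example, not part of racadm; real memory events may be worded differently.

```python
FIELDS = ("Record", "Date/Time", "Source", "Severity", "Description")

# Hypothetical keyword list; adjust to the wording your hardware uses.
MEMORY_KEYWORDS = ("memory", "dimm", "ecc")


def parse_sel(text):
    """Split pasted `racadm getsel` output into a list of record dicts."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("---"):  # separator line closes a record
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            current[key.strip()] = value.strip()
    if current:  # tolerate a final record with no trailing separator
        records.append(current)
    return records


def memory_records(records):
    """Return records whose description mentions a memory-related keyword."""
    return [
        r for r in records
        if any(k in r.get("Description", "").lower() for k in MEMORY_KEYWORDS)
    ]


if __name__ == "__main__":
    sample = """\
Record:      8
Date/Time:   06/19/2019 15:44:46
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   06/19/2019 15:44:51
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------
"""
    records = parse_sel(sample)
    print(len(records), "records,", len(memory_records(records)), "memory-related")
```

Applied to the nine records pasted above, a filter like this would return nothing, which matches the observation that the SEL only contains chassis open/close events.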

RobH added a comment. Aug 29 2019, 3:47 PM

Also, in the future, please open a new task for hardware troubleshooting and follow all directions on:

https://phabricator.wikimedia.org/maniphest/task/edit/form/55/

Dzahn removed a subscriber: Dzahn. Aug 29 2019, 4:27 PM
debt closed this task as Resolved. Sep 5 2019, 6:39 PM
debt added a subscriber: debt.

Also, in the future, please open a new task for hardware troubleshooting and follow all directions on:
https://phabricator.wikimedia.org/maniphest/task/edit/form/55/

Hi @Mathew.onipe - can you create a new ticket for the 'new' errors, please? Thanks!