pc2006 down
Closed, Resolved · Public

Description

We do not know if it is network or it is really down (crashed).

Sadly we cannot connect to the management interface to debug it.

Event Timeline

Restricted Application added subscribers: Zppix, Southparkfan, Aklapper.

The system was stuck because of a memory error.
The log shows multi-bit memory errors on DIMM B7 and B8. I pulled the power for 5 minutes and plugged it back in. The system booted and did not get stuck this time. I am leaving this task open for now to see whether the same DIMM errors recur in the next few days.

idrac-7D3H282 - iDRAC8 - Logs - Mozilla Firefox_008.png (873×1 px, 187 KB)

The memory errors will likely recur, and this system is under warranty, so I'd advise not waiting for it to happen again. Any time this happens, running racadm getsel in the iDRAC CLI will give a text dump of the log:

/admin1-> racadm getsel
Record:      1
Date/Time:   12/18/2015 20:35:18
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   12/24/2015 11:53:28
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   12/24/2015 11:53:28
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   12/24/2015 12:12:34
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      5
Date/Time:   12/24/2015 12:12:34
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/05/2016 15:35:52
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/05/2016 15:35:52
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   05/08/2016 22:56:15
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      9
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   06/18/2016 23:08:48
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   07/03/2016 13:19:55
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------

We can see an old memory error on B2 that has not recurred. The B7 and B8 errors, however, keep happening over the course of a few months.
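
To make that per-DIMM pattern easier to see, the getsel text can be grouped by slot and date. The following is a minimal sketch, not an existing tool: the function names parse_sel and dimm_error_dates are hypothetical, the sample is a handful of records copied from the output above, and it assumes the log keeps the "Key: value" layout with dashed separator lines shown there.

```python
import re
from collections import defaultdict
from datetime import datetime

# A few records copied from the racadm getsel output above (not the full log).
SEL_TEXT = """\
Record:      7
Date/Time:   01/05/2016 15:35:52
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   06/18/2016 23:08:48
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
"""

DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split getsel text into one dict per record, keyed by field name."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):  # dashed line closes a record
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def dimm_error_dates(records):
    """Map each DIMM slot to the sorted dates on which it logged a Critical error."""
    dates = defaultdict(set)
    for rec in records:
        if rec.get("Severity") != "Critical":
            continue
        when = datetime.strptime(rec["Date/Time"], "%m/%d/%Y %H:%M:%S").date()
        for dimm in DIMM_RE.findall(rec.get("Description", "")):
            dates[dimm].add(when)
    return {dimm: [d.isoformat() for d in sorted(ds)] for dimm, ds in dates.items()}


for dimm, days in sorted(dimm_error_dates(parse_sel(SEL_TEXT)).items()):
    print(dimm, days)
```

On this sample, B2 shows a single date while B7 and B8 each show two dates months apart, which matches the reading of the log given above.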

I'd advise swapping the bad DIMMs in B7 and B8 via Dell warranty repair. Since this is a lease, the hardware update needs to be tracked on the appropriate tracking sheet.

I could be mistaken, though, as Jaime has pointed out a longstanding memory issue with some systems (involving BIOS updates and the like). That background makes these errors a bit odd.

I am going to restart this system once, then check the logs and follow up with papaul for a part replacement as a start. We will see how it behaves.

It could be just a simple hardware failure, or it could be related to T130702.

Mentioned in SAL [2016-07-05T16:44:36Z] <jynus> rebooting pc2006 T139283

No memory errors after restart (but that is the worst of the outcomes, because it doesn't confirm anything).

Hi Papaul,

This server (7D3H282) has pro support, so technically I cannot help you with it, but if you update the BIOS to 2.1.7:

http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=V99PP&fileId=3549897043&osCode=W12R2&productCode=poweredge-r630&languageCode=en&categoryId=BI

This should fix it. All 13th-generation servers need their BIOS updated to keep them from getting memory errors.

Papaul, it is not certain that this is going to happen, but could we generate a list of all potentially affected servers that need the update? Maybe there are very few and we can actually do them all.

If support gives us any crap about pro versus basic on this, we can pull in our account reps and make them help us. (We spend far more than enough with Dell to get the help we need.)

Mentioned in SAL [2016-07-07T16:47:25Z] <jynus> stopping pc2006 for hardware maintenance T139283

BIOS updated from 1.5.4 to 2.1.7.
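
When checking other hosts, the installed BIOS version can be compared against 2.1.7 numerically rather than as a string. This is only a sketch: the helper names are hypothetical, and treating 2.1.7 as a minimum (so that any later release also counts as fixed) is an assumption, since the thread only names that one version.

```python
def parse_version(version):
    """Turn a dotted version such as '1.5.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def needs_bios_update(installed, target="2.1.7"):
    """Return True if the installed version is older than the target.

    Assumption: the target is a minimum, so newer versions are treated as fixed.
    """
    return parse_version(installed) < parse_version(target)


print(needs_bios_update("1.5.4"))  # version pc2006 had before the update
print(needs_bios_update("2.1.7"))  # version pc2006 has after the update
```

Comparing integer tuples avoids the string-comparison trap where "2.10.0" would sort before "2.1.7".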

/admin1-> racadm getsel
Record:      1
Date/Time:   12/18/2015 20:35:18
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   12/24/2015 11:53:28
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   12/24/2015 11:53:28
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   12/24/2015 12:12:34
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      5
Date/Time:   12/24/2015 12:12:34
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/05/2016 15:35:52
Source:      system
Severity:    Ok
Description: A problem was detected in Memory Reference Code (MRC).
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/05/2016 15:35:52
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   05/08/2016 22:56:15
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      9
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   05/08/2016 21:57:47
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   06/18/2016 23:08:48
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   07/03/2016 13:19:55
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B8.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   07/03/2016 13:21:41
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B7.
-------------------------------------------------------------------------------

I see no new errors on the web interface either.

Should we plan a general upgrade of all affected machines, or should we wait in case it fails again?

I think we can plan a general upgrade, since the upgrade takes no more than 5 minutes per system. I will check and see how many systems are affected.

Please see below for the servers that we need to upgrade.

This affects all PowerEdge R730 and R630 systems.

es2011
es2012
es2013
es2014
es2015
es2016
es2017 done
es2018
es2019 done
pc2004
pc2005
pc2006 done
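
The checklist above can be split into finished and pending hosts with a few lines of Python. This is a minimal sketch: the helper name split_by_status is hypothetical, and it assumes each line is a hostname optionally followed by the word "done", exactly as in the list above.

```python
# Host list copied from the checklist above.
HOST_LIST = """\
es2011
es2012
es2013
es2014
es2015
es2016
es2017 done
es2018
es2019 done
pc2004
pc2005
pc2006 done
"""


def split_by_status(text):
    """Return (done, pending) hostname lists from 'host [done]' lines."""
    done, pending = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        host, rest = parts[0], parts[1:]
        (done if rest == ["done"] else pending).append(host)
    return done, pending


done, pending = split_by_status(HOST_LIST)
print(f"{len(done)} upgraded, {len(pending)} pending")
print("pending:", ", ".join(pending))
```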