thumbor1004 memory errors
Open, Normal, Public

Description

It appears that thumbor1004.eqiad.wmnet is having memory issues:

[Wed Feb  6 12:53:56 2019] mce: [Hardware Error]: Machine check events logged
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: TSC 0
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: ADDR cc68f4000
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: MISC 90840800080208c
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549457515 SOCKET 1 APIC 20
[Wed Feb  6 12:53:56 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
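
For anyone triaging similar reports, the last line of the log above carries most of the useful detail (error type, DIMM label, page). A minimal Python sketch for pulling those fields out follows. The regular expression and the `parse_edac_line` helper are hypothetical and written only against this one message layout; they are not a general EDAC parser. The page-to-address check assumes 4 KiB pages.

```python
import re

# Sample line copied from the kernel log in this task description.
LINE = (
    "[Wed Feb  6 12:53:56 2019] EDAC MC1: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 "
    "grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 "
    "channel_mask:2 rank:1)"
)

# Hypothetical pattern matching only the message layout shown above.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<msg>.+?) on "
    r"(?P<label>\S+) \((?P<details>.*)\)"
)


def parse_edac_line(line):
    """Return a dict of fields from one EDAC error line, or None if no match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    details = {}
    for token in event.pop("details").split():
        # Split on the first colon only, so err_code:0008:00c1 keeps its full value.
        if ":" in token:
            key, value = token.split(":", 1)
            details[key] = value
    event["details"] = details
    return event


event = parse_edac_line(LINE)
# Assuming 4 KiB pages, page and offset reproduce the ADDR line (cc68f4000).
addr = (int(event["details"]["page"], 16) << 12) + int(event["details"]["offset"], 16)
print(event["kind"], event["label"], hex(addr))
```

Here `CE` marks a correctable error, which is consistent with the host continuing to serve traffic.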

The server has not been depooled, as it is not causing service issues for the time being.

Event Timeline

jijiki created this task. Wed, Feb 6, 1:35 PM
jijiki triaged this task as Normal priority.
jijiki updated the task description.
jijiki added a subscriber: CDanis. Thu, Feb 7, 4:32 PM
RobH claimed this task. Mon, Feb 11, 5:02 PM
RobH reassigned this task from RobH to jijiki. Mon, Feb 11, 5:11 PM

Ok, so DIMM B1 is reporting as bad:

7 $> ssh root@thumbor1004.mgmt.eqiad.wmnet
root@thumbor1004.mgmt.eqiad.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   11/23/2017 21:13:29
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   11/29/2017 11:39:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/21/2018 17:39:06
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/26/2018 13:32:42
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   03/10/2018 09:40:41
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   03/18/2018 07:29:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
/admin1->
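
To make the history easier to attach to a vendor case, the SEL dump can be summarized per DIMM. The following Python sketch assumes the `racadm getsel` text layout shown above (key/value lines with dashed separators); `parse_sel` and `dimm_severities` are hypothetical helper names, and the sample is abbreviated to the first three records.

```python
from collections import Counter

# Abbreviated sample: the first three records from the output above.
SEL_TEXT = """\
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into one dict per record, keyed by field name."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):
            # A dashed line closes the current record.
            if current:
                records.append(current)
            current = {}
        elif ":" in line:
            # Split on the first colon only; Date/Time values contain colons.
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def dimm_severities(records, dimm):
    """Count SEL severities for records whose description mentions the DIMM."""
    return Counter(
        r.get("Severity", "unknown")
        for r in records
        if dimm in r.get("Description", "")
    )


records = parse_sel(SEL_TEXT)
print(dimm_severities(records, "DIMM_B1"))
```

Run against the full log above, the same helpers would cover records 2 through 9, which all name DIMM_B1.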

The next steps are as follows:

  • offline the system from production use
  • update firmware across system BIOS/mainboard
  • reseat the DIMM, return the system to service, and see if the issue recurs

If the error happens again, the DIMM can be relocated on the board to see if the error follows it. If it does, then we can RMA it. Dell will require a firmware update and a failure AFTER the update to easily send a replacement. (We may be able to argue and get it sent before the firmware update, but it's typically not worth the hassle, and a firmware update does resolve a large percentage of these issues.)

@jijiki: What is the process to depool a single thumbor server? I don't see any specific directions on the https://wikitech.wikimedia.org/wiki/Thumbor wikitech page, but perhaps there is a runbook elsewhere?

Basically, dc-ops needs this depooled before we can work on it; we just need to know the best method to do it. Please advise and assign back to me!

Mentioned in SAL (#wikimedia-operations) [2019-02-11T18:50:20Z] <robh> thumbor1004 rebooted and updated firmware T215411

RobH closed this task as Resolved. Mon, Feb 11, 6:58 PM

Ok, updated firmware to System BIOS Version = 2.6.0 (revision date 28 Jun 2018).

Cleared the SEL; if it alerts again, we now have a troubleshooting history to attach for Dell.

If it recurs, Chris will need to swap memory slots around and see if the error follows the DIMM.

RobH added a comment. Mon, Feb 11, 6:59 PM

@jijiki: pinged you on IRC as well; can you return this system to service?

@RobH Server has been repooled

Mentioned in SAL (#wikimedia-operations) [2019-02-11T19:08:20Z] <jijiki> Repooled thumbor1004 - T215411

jijiki reopened this task as Open. Tue, Feb 12, 9:57 AM
[Tue Feb 12 06:13:31 2019] mce: [Hardware Error]: Machine check events logged
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: TSC 0
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: ADDR cc68f4000
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: MISC 90840800080208c
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549952008 SOCKET 1 APIC 20
[Tue Feb 12 06:13:31 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)

The server still reports memory issues.
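
For reference, the two logged events (Feb 6 and Feb 12) can be compared directly. The Python sketch below is illustrative only: it assumes the `TIME` field is Unix epoch seconds (which lines up with the bracketed log timestamps to within a couple of minutes), and the `EVENTS` list is hand-copied from the `TIME` and `ADDR` lines in this task.

```python
from datetime import datetime, timezone

# TIME and ADDR values hand-copied from the two log excerpts in this task.
EVENTS = [
    {"time": 1549457515, "addr": "cc68f4000"},  # before the firmware update
    {"time": 1549952008, "addr": "cc68f4000"},  # after the firmware update
]


def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# True when every event reports the same physical address.
same_address = len({e["addr"] for e in EVENTS}) == 1
# Elapsed time between the first and second event.
gap = to_utc(EVENTS[1]["time"]) - to_utc(EVENTS[0]["time"])

for e in EVENTS:
    print(to_utc(e["time"]).isoformat(), e["addr"])
print("same address:", same_address, "| gap:", gap)
```

Both events report the same address and DIMM label, so the error recurred at the same location after the BIOS update.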

jijiki reassigned this task from jijiki to RobH. Tue, Feb 12, 9:58 AM
jijiki moved this task from Backlog to Doing on the serviceops board. Tue, Feb 12, 10:02 AM

@RobH How should we proceed?