
(OoW) thumbor1004 memory errors
Closed, Declined (Public)

Description

It appears that thumbor1004.eqiad.wmnet is having memory issues:

[Wed Feb  6 12:53:56 2019] mce: [Hardware Error]: Machine check events logged
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: TSC 0
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: ADDR cc68f4000
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: MISC 90840800080208c
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549457515 SOCKET 1 APIC 20
[Wed Feb  6 12:53:56 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
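For anyone triaging similar alerts, the final EDAC line above carries the useful location fields (socket, channel, slot, page). A minimal Python sketch for pulling them out follows; the function name `parse_edac_ce` and the regular expression are illustrative assumptions fitted to the log line quoted in this task, not part of any existing tooling.

```python
import re

# Pattern fitted to the "EDAC MCn: N CE ... on <label> (key:value ...)" line
# quoted in this task; other kernel versions may format it differently.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE (?P<kind>.+?) "
    r"on (?P<label>\S+) \((?P<details>.*)\)"
)


def parse_edac_ce(line):
    """Return a dict of fields from a corrected-error (CE) line, or None."""
    match = EDAC_CE.search(line)
    if match is None:
        return None
    # The parenthesised part is a run of key:value pairs separated by spaces.
    fields = dict(re.findall(r"(\w+):(\S+)", match.group("details")))
    fields.update(
        mc=match.group("mc"),
        count=match.group("count"),
        kind=match.group("kind"),
        label=match.group("label"),
    )
    return fields


sample = (
    "[Wed Feb  6 12:53:56 2019] EDAC MC1: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 "
    "offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 "
    "socket:1 ha:0 channel_mask:2 rank:1)"
)

event = parse_edac_ce(sample)
print(event["socket"], event["channel"], event["slot"], event["page"])
```

Comparing `socket`, `channel`, `slot` and `page` across events is a quick way to see whether repeated errors hit the same location.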

The server has not been depooled, as it is not causing service issues for the time being.

Event Timeline

jijiki triaged this task as Normal priority.Feb 6 2019, 1:35 PM
jijiki created this task.
jijiki updated the task description. (Show Details)
jijiki added a subscriber: CDanis.Feb 7 2019, 4:32 PM
RobH claimed this task.Feb 11 2019, 5:02 PM
RobH reassigned this task from RobH to jijiki.Feb 11 2019, 5:11 PM

Ok, so DIMM B1 is reporting as bad:

7 $> ssh root@thumbor1004.mgmt.eqiad.wmnet
root@thumbor1004.mgmt.eqiad.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   11/23/2017 21:13:29
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   11/29/2017 11:39:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/21/2018 17:39:06
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/26/2018 13:32:42
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   03/10/2018 09:40:41
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   03/18/2018 07:29:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
/admin1->
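To summarise a long SEL like the one above, the records can be tallied per DIMM. This is a rough Python sketch that assumes the `Key: value` layout shown in this paste; `parse_sel` and `dimm_error_counts` are hypothetical helper names, and the sample text is a shortened copy of the records above.

```python
import re
from collections import Counter


def parse_sel(text):
    """Split 'racadm getsel' style output into a list of record dicts."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # separator rows of dashes, prompts, blank lines
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Record" and current:
            # A new "Record:" line closes the previous record.
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def dimm_error_counts(records):
    """Count SEL entries per DIMM label found in the Description field."""
    counts = Counter()
    for record in records:
        match = re.search(r"DIMM_[A-Z]\d+", record.get("Description", ""))
        if match:
            counts[match.group(0)] += 1
    return counts


sample_sel = """\
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
"""

sel_records = parse_sel(sample_sel)
print(dimm_error_counts(sel_records))
```

Run against the full log above, every error entry would land on DIMM_B1, which is what points at that one module or its slot.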

The next steps are as follows:

  • offline the system from production use
  • update firmware across system bios/mainboard
  • reseat dimm and see if issue re-occurs, return to service

If the error happens again, the DIMM can be relocated on the board to see whether the error follows it. If it does, we can RMA it. Dell will require a firmware update and a failure AFTER the update before it will readily send a replacement. (We may be able to argue and get one sent before the firmware update, but it's typically not worth the hassle, and a firmware update does resolve a large percentage of these issues.)

@jijiki: What is the process to depool a single thumbor server? I don't see any specific directions on the https://wikitech.wikimedia.org/wiki/Thumbor wikitech page, but perhaps there is a runbook elsewhere?

Basically dc-ops needs this depooled for us to work on it, just need to know the best method to do it! Please advise and assign back to me!

Mentioned in SAL (#wikimedia-operations) [2019-02-11T18:50:20Z] <robh> thumbor1004 rebooted and updated firmware T215411

RobH closed this task as Resolved.Feb 11 2019, 6:58 PM

Ok, updated firmware to System BIOS Version = 2.6.0 revision date of 28 Jun 2018

Cleared the SEL; if it alerts again, we now have a history of troubleshooting to attach for Dell.

If it reoccurs, Chris will need to swap memory slots around and see if the error follows the dimm.

RobH added a comment.Feb 11 2019, 6:59 PM

@jijiki pinged you in irc as well, can you return this system to service?

@RobH Server has been repooled

Mentioned in SAL (#wikimedia-operations) [2019-02-11T19:08:20Z] <jijiki> Repooled thumbor1004 - T215411

jijiki reopened this task as Open.Feb 12 2019, 9:57 AM
[Tue Feb 12 06:13:31 2019] mce: [Hardware Error]: Machine check events logged
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: TSC 0
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: ADDR cc68f4000
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: MISC 90840800080208c
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549952008 SOCKET 1 APIC 20
[Tue Feb 12 06:13:31 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)

Server still reports memory issues

jijiki reassigned this task from jijiki to RobH.Feb 12 2019, 9:58 AM
jijiki moved this task from Backlog to Doing on the serviceops board.Feb 12 2019, 10:02 AM

@RobH How should we proceed?

RobH added a comment.Mar 12 2019, 6:43 PM
This comment was removed by RobH.

Mentioned in SAL (#wikimedia-operations) [2019-03-12T19:07:09Z] <robh> rebooting thumbor1004 for memory troubleshooting via T215411

RobH reassigned this task from RobH to Cmjohnson.Mar 12 2019, 7:27 PM

Ok, updating this after IRC discussion for clarity.

This system has had repeated memory errors reported above, and the SEL previously showed errors on dimm b1.

I ran the Dell ePSA test, and it resulted in ANOTHER B1 DIMM failure:

Record: 2
Date/Time: 03/12/2019 19:22:09
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

So, now the next step is for @Cmjohnson to swap dimm b1 and b2 around, and see if the error moves to b2 (bad dimm), or stays in the slot (bad mainboard).

RobH added a comment.Mar 12 2019, 7:38 PM

Ok, system is powered back on for now. The next steps are as follows:

  • @Cmjohnson to sync with @jijiki tomorrow before powering down system.
    • this typically only involves silencing alerts on icinga, running 'sudo -i depool' on thumbor1004, waiting for 5 minutes, and then powering off.
  • swap (possibly defective) dimm from b1 to b2
  • clear the SEL before running hw tests
  • rerun the dell epsa tests, since that showed the memory failure previously.
    • error should re-occur in either b2 (due to bad memory) or in b1 (due to bad slot)
    • if it is in b2, that means bad memory; if it is in b1, that means a bad mainboard. A replacement from Dell should then be requested
  • return system to service if it is only a bad dimm while waiting on replacement, since a single bad dimm isn't worth keeping the entire host offline
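The swap test in the steps above comes down to one comparison: does the error follow the module or stay with the slot? A small Python sketch of that reasoning follows; the function `diagnose_after_swap` and its return strings are made up for illustration, and the slot names are the ones used in this task.

```python
def diagnose_after_swap(original_slot, new_slot, reported_slot):
    """Interpret where the error shows up after moving the suspect DIMM.

    original_slot: slot the suspect DIMM was in when it first alerted
    new_slot:      slot the suspect DIMM was moved to
    reported_slot: slot named by the next error after the move
    """
    if reported_slot == new_slot:
        # The error travelled with the module.
        return "bad DIMM"
    if reported_slot == original_slot:
        # The error stayed behind with the slot.
        return "bad slot or mainboard"
    # Anything else does not fit the simple two-way test.
    return "inconclusive"


print(diagnose_after_swap("B1", "B2", "B2"))
print(diagnose_after_swap("B1", "B2", "B1"))
```

The test only gives a clean answer if the SEL is cleared before rerunning the hardware tests, so that old B1 entries are not mistaken for new ones.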

Mentioned in SAL (#wikimedia-operations) [2019-03-13T16:15:24Z] <jijiki> Depool thumbor1004 to investigate memory issues - T215411

Server has been depooled and downtimed on icinga for 48 hours, @Cmjohnson you can power it down any time, tx :)

I moved DIMM from B side to A side and cleared the log...let's give it a day or so and see if the error follows.

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:30:29Z] <robh> thumbor1004 memtest in progress via T215411

RobH added a comment.EditedMar 13 2019, 6:35 PM

So attempting to boot this system shows:

Error: Memory initialization warning detected.

I cannot tell which DIMM has the error, since the message is overwritten on the serial console. Chris will need to crash cart it, determine whether the error moved with the DIMM or stayed in the slot, and open a support case with Dell.

The SEL doesn't show which slot, but the crash cart output should.

RobH added a comment.Mar 13 2019, 6:36 PM
Error: Memory initialization warning detected.
Management Engine Mode                : Active
ManagementeEngineeFirmwaretVersiongent: 0002.0001
Copyright (C) 2000-2014 BroPatch Corpo:a0005
Alllrightslreserved. BIOS VBuildn 1.0.: 008B
CStrikerthenF1okeyttoYcontinue,tF2ntoerun the system setup program
RobH added a comment.Mar 13 2019, 6:38 PM

Next steps:

  • Chris attach crash cart
    • output on crash cart won't be overwritten like serial console, so the POST error will denote which dimm is bad.
    • if it is the same dimm that Chris just moved, he can open a case to have it replaced by Dell
    • if it is the slot, not the dimm, Chris can open a case to have the mainboard replaced by Dell
  • either dispatch is next-day; @RobH recommends we leave this system depooled until the new part arrives.

DIMM A1 is now showing as bad, so it looks like a DIMM replacement is needed.

Should've checked this but thumbor1004 is out of warranty.

We are still having errors:

[Thu Mar 14 14:42:19 2019] mce: [Hardware Error]: Machine check events logged
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: TSC 0
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: ADDR 4c68f4000
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: MISC 90840800080208c
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1552574533 SOCKET 0 APIC 0
[Thu Mar 14 14:42:19 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4c68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:1)
RobH claimed this task.Mar 14 2019, 5:08 PM

So this has a memory error and is out of warranty.

This means we should look at decommissioning this host and ordering a replacement.

@jijiki: So there isn't an entry for replacing these this fiscal year, but we may be able to order anyhow. I'll file a procurement task and add you to it shortly.

Dzahn added a subscriber: Dzahn.Apr 4 2019, 12:45 PM

There has been a fresh Icinga alert for this for about 23 hours.

@RobH do we have an update?

jijiki moved this task from Backlog/Radar to St on the User-jijiki board.Apr 4 2019, 9:25 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-05T07:34:16Z] <jijiki> Repooling thumbor1004 until we replace its memory - T215411

The server is out of warranty, can we get a replacement or use a spare replacement?

Dzahn added a comment.Apr 16 2019, 6:47 PM

The server is out of warranty, can we get a replacement or use a spare replacement?

Yes, per Robh:

So this has a memory error and is out of warranty.
This means we should look at decommissioning this host and ordering a replacement.

RobH mentioned this in Unknown Object (Task).Apr 16 2019, 6:52 PM
RobH added a subtask: Unknown Object (Task).
jijiki moved this task from Backlog to Doing on the Thumbor board.Jun 18 2019, 9:52 PM
jijiki moved this task from Doing to Next up on the serviceops board.Jun 24 2019, 3:40 PM
wiki_willy renamed this task from thumbor1004 memory errors to (OoW) thumbor1004 memory errors.Jul 2 2019, 9:43 PM
jijiki moved this task from Next up to Backlog on the serviceops board.Jul 5 2019, 9:27 AM
Cmjohnson closed this task as Declined.Jul 11 2019, 11:35 PM

Declining the task since the server is out of warranty.