(OoW) thumbor1004 memory errors
Closed, DeclinedPublic

Description

It appears that thumbor1004.eqiad.wmnet is having memory issues:

[Wed Feb  6 12:53:56 2019] mce: [Hardware Error]: Machine check events logged
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: TSC 0
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: ADDR cc68f4000
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: MISC 90840800080208c
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549457515 SOCKET 1 APIC 20
[Wed Feb  6 12:53:56 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
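The final EDAC line in the excerpt above carries the useful location fields (socket, channel, slot, page). As a rough sketch, and not a tool used in this task, those fields could be pulled out with a small Python helper. The function name `parse_edac_ce` is hypothetical, and the address calculation assumes 4 KiB pages.

```python
import re

# Sample line copied from the kernel log above (timestamp prefix dropped).
LINE = ("EDAC MC1: 1 CE memory scrubbing error on "
        "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 "
        "offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 "
        "socket:1 ha:0 channel_mask:2 rank:1)")


def parse_edac_ce(line):
    """Return the key:value fields of an EDAC CE line as a dict, or None."""
    head = re.search(r"EDAC MC(\d+): (\d+) CE .* on (\S+) \((.*)\)", line)
    if head is None:
        return None
    # The parenthesised part is a space-separated list of key:value pairs.
    fields = dict(re.findall(r"(\w+):(\S+)", head.group(4)))
    fields.update(mc=head.group(1), count=head.group(2), label=head.group(3))
    return fields


fields = parse_edac_ce(LINE)
# Assumption: 4 KiB pages, so the physical address is page * 4096 + offset.
address = (int(fields["page"], 16) << 12) + int(fields["offset"], 16)
print(fields["socket"], fields["channel"], fields["slot"], hex(address))
```

Under the 4 KiB page assumption, the computed address matches the `ADDR cc68f4000` value logged a few lines earlier.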

The server has not been depooled, as it is not causing service issues for the time being.

Event Timeline

jijiki triaged this task as Medium priority.Feb 6 2019, 1:35 PM
jijiki created this task.
jijiki updated the task description.

Ok, so DIMM B1 is reporting as bad:

7 $> ssh root@thumbor1004.mgmt.eqiad.wmnet
root@thumbor1004.mgmt.eqiad.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   11/23/2017 21:13:29
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   11/29/2017 11:39:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/21/2018 17:39:06
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/26/2018 13:32:42
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   03/10/2018 09:40:41
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   03/18/2018 07:29:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
/admin1->
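To summarize a long SEL like the one above, the records can be tallied per DIMM and severity. The following is only an illustrative sketch: `count_dimm_events` is a hypothetical helper, and it assumes the text has already been captured from `racadm getsel` in the `Key: value` layout shown above.

```python
import re
from collections import Counter

# Shortened sample in the same layout as the SEL output above.
SAMPLE = """\
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
"""


def count_dimm_events(sel_text):
    """Count SEL records per (DIMM, severity) pair."""
    counts = Counter()
    severity = None
    for line in sel_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Severity":
            severity = value  # remembered until the matching Description
        elif key == "Description":
            match = re.search(r"\bDIMM_(\w+)", value)
            if match:
                counts[(match.group(1), severity)] += 1
    return counts


for (dimm, severity), n in sorted(count_dimm_events(SAMPLE).items()):
    print(dimm, severity, n)
```

Applied to the full log above, this kind of tally would show four Non-Critical and four Critical records, all for DIMM_B1.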

The next steps are as follows:

  • offline the system from production use
  • update firmware across system bios/mainboard
  • reseat dimm and see if issue re-occurs, return to service

If the error happens again, then the DIMM can be relocated on the board to see if the error follows it. If it does, then we can RMA it. Dell will require a firmware update and a failure AFTER the update to easily send a replacement. (We may be able to argue and get it sent before the firmware update, but it's typically not worth the hassle, and a firmware update does resolve a large % of the issues.)

@jijiki: What is the process to depool a single thumbor server? I don't see any specific directions on the https://wikitech.wikimedia.org/wiki/Thumbor wikitech page, but perhaps there is a runbook elsewhere?

Basically dc-ops needs this depooled for us to work on it, just need to know the best method to do it! Please advise and assign back to me!

Mentioned in SAL (#wikimedia-operations) [2019-02-11T18:50:20Z] <robh> thumbor1004 rebooted and updated firmware T215411

Ok, updated firmware to System BIOS Version = 2.6.0 revision date of 28 Jun 2018

cleared the SEL and if it alerts again, we now have history of troubleshooting to attach for Dell.

If it reoccurs, Chris will need to swap memory slots around and see if the error follows the dimm.

@jijiki pinged you in irc as well, can you return this system to service?

[Tue Feb 12 06:13:31 2019] mce: [Hardware Error]: Machine check events logged
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: TSC 0
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: ADDR cc68f4000
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: MISC 90840800080208c
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549952008 SOCKET 1 APIC 20
[Tue Feb 12 06:13:31 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)

Server still reports memory issues

This comment was removed by RobH.

Mentioned in SAL (#wikimedia-operations) [2019-03-12T19:07:09Z] <robh> rebooting thumbor1004 for memory troubleshooting via T215411

Ok, updating this after IRC discussion for clarity.

This system has had repeated memory errors reported above, and the SEL previously showed errors on dimm b1.

I ran the Dell ePSA test, and it resulted in ANOTHER B1 DIMM failure:

Record: 2
Date/Time: 03/12/2019 19:22:09
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

So, now the next step is for @Cmjohnson to swap dimm b1 and b2 around, and see if the error moves to b2 (bad dimm), or stays in the slot (bad mainboard).

Ok, system is powered back on for now. The next steps are as follows:

  • @Cmjohnson to sync with @jijiki tomorrow before powering down system.
    • this typically only involves silencing alerts on icinga, running 'sudo -i depool' on thumbor1004, waiting for 5 minutes, and then powering off.
  • swap (possibly defective) dimm from b1 to b2
  • clear the SEL before running hw tests
  • rerun the dell epsa tests, since that showed the memory failure previously.
    • error should re-occur in either b2 (due to bad memory) or in b1 (due to bad slot)
    • if it is in b2, the memory is bad; if it is in b1, the mainboard is bad. Either way, a replacement from Dell should be requested
  • return system to service if it is only a bad dimm while waiting on replacement, since a single bad dimm isn't worth keeping the entire host offline
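The swap test in the steps above boils down to a simple decision rule. A minimal sketch of that rule follows; the function name `diagnose` and its return strings are made up for illustration and are not part of any dc-ops tooling.

```python
def diagnose(original_slot, new_slot, error_slot):
    """Interpret where the error shows up after moving the suspect DIMM.

    original_slot: slot the suspect DIMM was in before the swap (e.g. "B1")
    new_slot:      slot the suspect DIMM was moved to (e.g. "B2")
    error_slot:    slot reported by the hardware test after the swap
    """
    if error_slot == new_slot:
        # The error followed the DIMM, so the DIMM itself is suspect.
        return "bad DIMM"
    if error_slot == original_slot:
        # The error stayed with the slot, so the mainboard is suspect.
        return "bad mainboard"
    # Anything else does not fit the two expected outcomes.
    return "inconclusive"


print(diagnose("B1", "B2", "B2"))
print(diagnose("B1", "B2", "B1"))
```

The third branch is there because a report from an unrelated slot, or no report at all, would not support either conclusion.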

Mentioned in SAL (#wikimedia-operations) [2019-03-13T16:15:24Z] <jijiki> Depool thumbor1004 to investigate memory issues - T215411

Server has been depooled and downtimed on icinga for 48 hours, @Cmjohnson you can power it down any time, tx :)

I moved the DIMM from the B side to the A side and cleared the log. Let's give it a day or so and see if the error follows.

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:30:29Z] <robh> thumbor1004 memtest in progress via T215411

So attempting to boot this system shows:

Error: Memory initialization warning detected.

I cannot tell which DIMM has the error, since the message is overwritten on the serial console. Chris will need to crash cart it and determine if the DIMM error moved with the DIMM or stayed in the slot, and open a support case with Dell.

The SEL doesn't show which slot, but the crash cart output should.

Error: Memory initialization warning detected.
Management Engine Mode                : Active
ManagementeEngineeFirmwaretVersiongent: 0002.0001
Copyright (C) 2000-2014 BroPatch Corpo:a0005
Alllrightslreserved. BIOS VBuildn 1.0.: 008B
CStrikerthenF1okeyttoYcontinue,tF2ntoerun the system setup program

Next steps:

  • Chris attach crash cart
    • output on crash cart won't be overwritten like serial console, so the POST error will denote which dimm is bad.
    • if it is the same dimm that Chris just moved, he can open a case to have it replaced by Dell
    • if it is the slot, not the dimm, Chris can open a case to have the mainboard replaced by Dell
  • either dispatch is next day, @RobH recommends we leave this system depooled until new part arrives.

DIMM A1 is now showing bad, so it looks like a DIMM replacement is needed.

Should've checked this but thumbor1004 is out of warranty.

We are still having errors

[Thu Mar 14 14:42:19 2019] mce: [Hardware Error]: Machine check events logged
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: TSC 0
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: ADDR 4c68f4000
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: MISC 90840800080208c
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1552574533 SOCKET 0 APIC 0
[Thu Mar 14 14:42:19 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4c68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:1)
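It can be useful to compare the `ADDR` value in this log with the one from the February logs, taken before the DIMM was moved. The sketch below only does the arithmetic; the interpretation in the note after it is an assumption, not something confirmed in this task.

```python
ADDR_BEFORE = 0xCC68F4000  # ADDR on MC1 / socket 1 (Feb 6 and Feb 12 logs)
ADDR_AFTER = 0x4C68F4000   # ADDR on MC0 / socket 0 (Mar 14 log)


def address_shift(before, after):
    """Return the difference between two addresses in bytes and in GiB."""
    delta = before - after
    return delta, delta / 2**30


delta, gib = address_shift(ADDR_BEFORE, ADDR_AFTER)
print(hex(delta), gib)

# Apart from bit 35, the two addresses are identical.
low_mask = (1 << 35) - 1
print((ADDR_BEFORE & low_mask) == (ADDR_AFTER & low_mask))
```

The two addresses differ by exactly 32 GiB and are otherwise identical. One possible reading, offered as an assumption only, is that the same location within the DIMM is failing and is now mapped under socket 0 instead of socket 1, which would be consistent with the error following the DIMM.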

So this has a memory error and is out of warranty.

This means we should look at decommissioning this host and ordering a replacement.

@jijiki: So there isn't an entry for replacing these this fiscal year, but we may be able to order anyhow. I'll file a procurement task and add you to it shortly.

There has been a fresh Icinga alert for this for about 23 hours.

Mentioned in SAL (#wikimedia-operations) [2019-04-05T07:34:16Z] <jijiki> Repooling thumbor1004 until we replace its memory - T215411

The server is out of warranty, can we get a replacement or use a spare replacement?

Yes, per Robh:

So this has a memory error and is out of warranty.

This means we should look at decommissioning this host and ordering a replacement.

RobH mentioned this in Unknown Object (Task).Apr 16 2019, 6:52 PM
RobH added a subtask: Unknown Object (Task).
wiki_willy renamed this task from thumbor1004 memory errors to (OoW) thumbor1004 memory errors.Jul 2 2019, 9:43 PM

Declining the task since the server is out of warranty.

jijiki changed the status of subtask Unknown Object (Task) from Open to Stalled.Sep 11 2019, 2:07 PM

server still alerting nowadays

I have disabled this check for now

So this has a memory error and is out of warranty.

This means we should look at decommissioning this host and ordering a replacement.

Created decom ticket per Server Lifecycle

Using the special Phabricator form for decom

As part of that we should run the decom script which will remove it from Icinga properly.

@Dzahn I do not know yet when this server will be decommissioned, we have quite some work ahead of us before moving thumbor to k8s

@jijiki I assumed it is broken anyways. Can it run despite the memory error?

Never mind then, I declined the decom ticket again.

We have not noticed anything weird so far, I reckon it should be ok for a little longer

jijiki closed subtask Unknown Object (Task) as Invalid.Nov 1 2019, 4:21 PM
Jelto reopened subtask Unknown Object (Task) as Open.Aug 18 2021, 2:42 PM
Jelto closed subtask Unknown Object (Task) as Invalid.Aug 18 2021, 2:49 PM