(OoW) thumbor1004 memory errors
Closed, Declined (Public)

Assigned To

Authored By

	jijiki
	Feb 6 2019, 1:35 PM

Description

It appears that thumbor1004.eqiad.wmnet is having memory issues:

[Wed Feb  6 12:53:56 2019] mce: [Hardware Error]: Machine check events logged
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: TSC 0
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: ADDR cc68f4000
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: MISC 90840800080208c
[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549457515 SOCKET 1 APIC 20
[Wed Feb  6 12:53:56 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
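
For anyone triaging similar alerts, here is a minimal sketch of pulling the location fields out of the EDAC "CE" line above and decoding the `TIME` value. The helper names `parse_ce_line` and `parse_event_time` are hypothetical, and treating `TIME` as Unix epoch seconds is an assumption on my part, not something this task confirms.

```python
import re
from datetime import datetime, timezone

# Sample lines copied from the dmesg output above.
CE_LINE = (
    "[Wed Feb  6 12:53:56 2019] EDAC MC1: 1 CE memory scrubbing error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 "
    "offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 "
    "socket:1 ha:0 channel_mask:2 rank:1)"
)
TIME_LINE = (
    "[Wed Feb  6 12:53:56 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 "
    "TIME 1549457515 SOCKET 1 APIC 20"
)


def parse_ce_line(line):
    """Return the key:value fields from the parenthesised part of a CE line."""
    match = re.search(r"\((.*)\)", line)
    if match is None:
        return {}
    # Split on the first colon only, so "err_code:0008:00c1" stays intact.
    # Tokens without a colon (the stray "-") are skipped.
    return dict(
        token.split(":", 1) for token in match.group(1).split() if ":" in token
    )


def parse_event_time(line):
    """Decode the TIME field, assuming it holds Unix epoch seconds (UTC)."""
    match = re.search(r"\bTIME (\d+)\b", line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


fields = parse_ce_line(CE_LINE)
print(fields["socket"], fields["channel"], fields["slot"], fields["rank"])
print(parse_event_time(TIME_LINE).isoformat())
```

All values are kept as strings, so the hex fields (`page`, `offset`) need `int(value, 16)` before any arithmetic.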

The server has not been depooled, as it is not causing service issues for the time being.

Related Objects

Status	Subtype	Assigned	Task
Declined		RobH	T215411 (OoW) thumbor1004 memory errors
			Unknown Object (Task)
Declined	Request	None	T233827 decommission thumbor1004

Event Timeline

jijiki triaged this task as Medium priority. Feb 6 2019, 1:35 PM

jijiki created this task.

jijiki updated the task description.

jijiki added subscribers: RobH, • Cmjohnson. Feb 6 2019, 1:42 PM

jijiki added a subscriber: CDanis. Feb 7 2019, 4:32 PM

RobH claimed this task. Feb 11 2019, 5:02 PM

RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. Feb 11 2019, 5:05 PM

Ok, so the dimm B1 is reporting bad:

7 $> ssh root@thumbor1004.mgmt.eqiad.wmnet
root@thumbor1004.mgmt.eqiad.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   11/23/2017 21:13:29
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   11/29/2017 11:39:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/21/2018 17:39:06
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/26/2018 13:32:42
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   03/10/2018 09:40:41
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   03/18/2018 07:29:04
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
/admin1->
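
As a rough sketch of how the SEL dump above could be summarised per DIMM: the snippet below parses the `Record:` blocks from saved `racadm getsel` text and counts how often each DIMM is named. It only assumes the layout shown in this task (key/value lines, dashed separators); `parse_sel` and `dimm_counts` are hypothetical helper names, and the sample is just the first three records.

```python
import re
from collections import Counter

# First three records from the SEL output above.
SEL_TEXT = """\
Record:      1
Date/Time:   11/12/2014 05:00:19
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/28/2017 00:35:09
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split SEL text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):
            # A dashed line closes the current record.
            if current:
                records.append(current)
            current = {}
        elif ":" in line:
            # Split on the first colon only; timestamps contain more colons.
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def dimm_counts(records):
    """Count records whose description names a DIMM (e.g. DIMM_B1)."""
    counts = Counter()
    for record in records:
        match = re.search(r"DIMM_\w+", record.get("Description", ""))
        if match:
            counts[match.group(0)] += 1
    return counts


records = parse_sel(SEL_TEXT)
print(len(records), dict(dimm_counts(records)))
```

Run against the full output, every memory record here points at DIMM_B1, which is what makes the slot-swap test worthwhile.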

The next steps are as follows:

offline the system from production use
update firmware across system bios/mainboard
reseat dimm and see if issue re-occurs, return to service

If the error happens again, the DIMM can be relocated on the board to see whether the error follows it. If it does, then we can RMA it. Dell will require a firmware update and a failure AFTER the update to easily send a replacement. (We may be able to argue and get it sent before the firmware update, but it's typically not worth the hassle, and a firmware update does resolve a large % of the issues.)

@jijiki: What is the process to depool a single thumbor server? I don't see any specific directions on the https://wikitech.wikimedia.org/wiki/Thumbor wikitech page, but perhaps there is a runbook elsewhere?

Basically dc-ops needs this depooled for us to work on it, just need to know the best method to do it! Please advise and assign back to me!

Mentioned in SAL (#wikimedia-operations) [2019-02-11T18:50:20Z] <robh> thumbor1004 rebooted and updated firmware T215411

Ok, updated firmware to System BIOS Version = 2.6.0 revision date of 28 Jun 2018

Cleared the SEL; if it alerts again, we now have a history of troubleshooting to attach for Dell.

If it reoccurs, Chris will need to swap memory slots around and see if the error follows the dimm.

@jijiki pinged you in irc as well, can you return this system to service?

@RobH Server has been repooled

Mentioned in SAL (#wikimedia-operations) [2019-02-11T19:08:20Z] <jijiki> Repooled thumbor1004 - T215411

[Tue Feb 12 06:13:31 2019] mce: [Hardware Error]: Machine check events logged
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: TSC 0
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: ADDR cc68f4000
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: MISC 90840800080208c
[Tue Feb 12 06:13:31 2019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1549952008 SOCKET 1 APIC 20
[Tue Feb 12 06:13:31 2019] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)

Server still reports memory issues

jijiki reassigned this task from jijiki to RobH. Feb 12 2019, 9:58 AM

jijiki moved this task from Incoming 🐫 to Doing 😎 on the serviceops board. Feb 12 2019, 10:02 AM

CDanis merged a task: T207721: Broken memory on thumbor1004. Feb 14 2019, 2:08 PM

CDanis added subscribers: MoritzMuehlenhoff, • Gilles.

@RobH How should we proceed?

RobH added a comment. Mar 12 2019, 6:43 PM

This comment was removed by RobH.

Mentioned in SAL (#wikimedia-operations) [2019-03-12T19:07:09Z] <robh> rebooting thumbor1004 for memory troubleshooting via T215411

Ok, updating this after IRC discussion for clarity.

This system has had repeated memory errors reported above, and the SEL previously showed errors on dimm b1.

I ran the Dell ePSA test, and it resulted in ANOTHER B1 DIMM failure:

Record: 2
Date/Time: 03/12/2019 19:22:09
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.

So, now the next step is for @Cmjohnson to swap dimm b1 and b2 around, and see if the error moves to b2 (bad dimm), or stays in the slot (bad mainboard).

Ok, system is powered back on for now. The next steps are as follows:

@Cmjohnson to sync with @jijiki tomorrow before powering down system.
- this typically only involves silencing alerts on icinga, running 'sudo -i depool' on thumbor1004, waiting for 5 minutes, and then powering off.
swap (possibly defective) dimm from b1 to b2
clear the SEL before running hw tests
rerun the dell epsa tests, since that showed the memory failure previously.
- error should re-occur in either b2 (due to bad memory) or in b1 (due to bad slot)
- if it is in b2, the memory is bad; if it stays in b1, the mainboard is bad. Either way, a replacement from Dell should be requested
return system to service if it is only a bad dimm while waiting on replacement, since a single bad dimm isn't worth keeping the entire host offline
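
The swap test in the steps above boils down to a small decision rule, sketched here for clarity. The function name and the returned strings are made up for illustration; slot names follow the ones used in this task.

```python
def diagnose_after_swap(original_slot, new_slot, failing_slot):
    """Interpret a DIMM swap test.

    original_slot: slot the suspect DIMM was in before the swap (e.g. "b1")
    new_slot:      slot the suspect DIMM was moved to (e.g. "b2")
    failing_slot:  slot reported in the SEL / POST error after the swap
    """
    if failing_slot == new_slot:
        # The error followed the DIMM to its new slot.
        return "bad DIMM"
    if failing_slot == original_slot:
        # The error stayed in the slot even with a different DIMM in it.
        return "bad slot or mainboard"
    # Any other slot means the test did not isolate the fault.
    return "inconclusive"


print(diagnose_after_swap("b1", "b2", "b2"))
print(diagnose_after_swap("b1", "b2", "b1"))
```

Clearing the SEL before rerunning the hardware tests matters for this rule, since old B1 records would otherwise be read as a fresh failure in the original slot.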

jijiki added a project: User-jijiki. Mar 13 2019, 4:15 PM

Mentioned in SAL (#wikimedia-operations) [2019-03-13T16:15:24Z] <jijiki> Depool thumbor1004 to investigate memory issues - T215411

Server has been depooled and downtimed on icinga for 48 hours, @Cmjohnson you can power it down any time, tx :)

I moved DIMM from B side to A side and cleared the log...let's give it a day or so and see if the error follows.

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:30:29Z] <robh> thumbor1004 memtest in progress via T215411

So attempting to boot this system shows:

Error: Memory initialization warning detected.

I cannot tell which DIMM has the error, since the message is overwritten on the serial console. Chris will need to crash cart it and determine if the DIMM error moved with the DIMM or stayed in the slot, and open a support case with Dell.

The SEL doesn't show what slot, so crash cart output should.

Error: Memory initialization warning detected.
Management Engine Mode                : Active
ManagementeEngineeFirmwaretVersiongent: 0002.0001
Copyright (C) 2000-2014 BroPatch Corpo:a0005
Alllrightslreserved. BIOS VBuildn 1.0.: 008B
CStrikerthenF1okeyttoYcontinue,tF2ntoerun the system setup program

Next steps:

Chris attach crash cart
- output on crash cart won't be overwritten like serial console, so the POST error will denote which dimm is bad.
- if it is the same dimm that Chris just moved, he can open a case to have it replaced by Dell
- if it is the slot, not the dimm, Chris can open a case to have the mainboard replaced by Dell
either dispatch is next day, @RobH recommends we leave this system depooled until new part arrives.

DIMM A1 is now showing bad, so it looks like a DIMM replacement is needed.

Should've checked this but thumbor1004 is out of warranty.

We are still having errors

[Thu Mar 14 14:42:19 2019] mce: [Hardware Error]: Machine check events logged
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c000050000800c1
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: TSC 0
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: ADDR 4c68f4000
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: MISC 90840800080208c
[Thu Mar 14 14:42:19 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1552574533 SOCKET 0 APIC 0
[Thu Mar 14 14:42:19 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4c68f4 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:1)
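
A quick comparison of this event with the Feb 6 one, as a sketch. The values are copied by hand from the two log excerpts in this task. The channel, slot, and rank are unchanged while the socket went from 1 to 0, which is consistent with the error having followed the DIMM when it was moved from the B side to the A side. The code only does the comparison and the address arithmetic; what the address difference means for this board's memory map is not something I can confirm from the logs.

```python
# Fields copied from the Feb 6 and Mar 14 EDAC events above.
before = {"socket": 1, "channel": 1, "slot": 0, "rank": 1, "addr": 0xCC68F4000}
after = {"socket": 0, "channel": 1, "slot": 0, "rank": 1, "addr": 0x4C68F4000}

# Same position within the socket?
same_position = all(before[key] == after[key] for key in ("channel", "slot", "rank"))

# Did the reporting socket change?
socket_changed = before["socket"] != after["socket"]

# Difference between the two reported physical addresses.
addr_delta = before["addr"] - after["addr"]

print(same_position, socket_changed)
print(hex(addr_delta), addr_delta // 2**30, "GiB")
```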

So this has a memory error and is out of warranty.

This means we should look at decommissioning this host and ordering a replacement.

@jijiki: So there isn't an entry for replacing these this fiscal year, but we may be able to order anyhow. I'll file a procurement task and add you to it shortly.

RobH mentioned this in T218323: reallocate former image scaler to thumbor use. Mar 14 2019, 5:22 PM

There has been a fresh Icinga alert for this for about 23 hours.

@RobH do we have an update?

jijiki moved this task from Inbox 🐅 to St on the User-jijiki board. Apr 4 2019, 9:25 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-05T07:34:16Z] <jijiki> Repooling thumbor1004 until we replace its memory - T215411

The server is out of warranty, can we get a replacement or use a spare replacement?

In T215411#5116050, @Cmjohnson wrote:

The server is out of warranty, can we get a replacement or use a spare replacement?

Yes, per Robh:

In T215411#5024717, @RobH wrote:

So this has a memory error and is out of warranty.

This means we should look at decommissioning this host and ordering a replacement.

RobH mentioned this in Unknown Object (Task). Apr 16 2019, 6:52 PM

RobH added a subtask: Unknown Object (Task).

• Cmjohnson moved this task from Hardware Failure / Troubleshoot to Stalled on the ops-eqiad board. Jun 11 2019, 4:16 PM

jijiki moved this task from Backlog to Doing on the Thumbor board. Jun 18 2019, 9:52 PM

jijiki moved this task from Doing 😎 to API Gateway 🥌 on the serviceops board. Jun 24 2019, 3:40 PM

wiki_willy renamed this task from thumbor1004 memory errors to (OoW) thumbor1004 memory errors. Jul 2 2019, 9:43 PM

jijiki moved this task from API Gateway 🥌 to Incoming 🐫 on the serviceops board. Jul 5 2019, 9:27 AM

Declining the task since the server is out of warranty.

server is still alerting https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=thumbor1004&service=Memory+correctable+errors+-EDAC-

needs a decom task instead then

jijiki changed the status of subtask Unknown Object (Task) from Open to Stalled. Sep 11 2019, 2:07 PM

server still alerting nowadays

I have disabled this check for now

In T215411#5024717, @RobH wrote:

So this has a memory error and is out of warranty.

This means we should look at decommissioning this host and ordering a replacement.

Created decom ticket per Server Lifecycle

Using the special Phabricator form for decom

As part of that we should run the decom script which will remove it from Icinga properly.

Dzahn added a subtask: T233827: decommission thumbor1004. Sep 25 2019, 5:05 PM

Dzahn mentioned this in T233827: decommission thumbor1004.

@Dzahn I do not know yet when this server will be decommissioned, we have quite some work ahead of us before moving thumbor to k8s

@jijiki I assumed it is broken anyways. Can it run despite the memory error?

Never mind then, I declined the decom ticket again.

We have not noticed anything weird so far, I reckon it should be ok for a little longer

Ok, I misunderstood then.

jijiki closed subtask Unknown Object (Task) as Invalid. Nov 1 2019, 4:21 PM

Jelto reopened subtask Unknown Object (Task) as Open. Aug 18 2021, 2:42 PM

Jelto closed subtask Unknown Object (Task) as Invalid. Aug 18 2021, 2:49 PM
