db1114 crashed (HW memory issues)
Closed, Resolved (Public)

Description

2019-01-25T18:04:26-0600  SYS1001  System is turning off.
2019-01-25T18:04:19-0600  SYS1003  System CPU Resetting.
2019-01-25T18:04:18-0600  RAC0703  Requested system hardreset.
2019-01-25T18:04:18-0600  SYS1003  System CPU Resetting.
2019-01-25T18:04:17-0600  PWR2262  The Intel Management Engine has reported an internal system error.
    Log Sequence Number: 136
    Detailed Description: The Intel Management Engine is unable to utilize the PECI over DMI facility.
    Recommended Action: Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.
2019-01-25T18:04:17-0600  CPU0000  Internal error has occurred check for additional logs.
2019-01-25T18:04:17-0600  LOG007   The previous log entry was repeated 1 times.

Event Timeline

Intel - "The Intel Management Engine is unable to utilize the PECI over DMI facility"

"We have a Dell PowerEdge R630 Server with Intel Xeon E5-2640 v4 CPUs and we are getting exactly the same error after updating from BIOS 2.3.4 to 2.6.0"

"We have a systemic issue; we have experienced this over 100 times. Dell is engaged but we have not made much progress. We were told by Dell to move from 2.4.3 to 2.6.0, but the condition still persisted."

Reddit - Dell Poweredge R630 massive stability problems

db1114 - HW type: Dell PowerEdge R630

While a CPU failure should be "clean" thanks to gtid and binlog_sync, the host should still be checked or reimaged before being repooled.

Change 486533 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash

https://gerrit.wikimedia.org/r/486533

Change 486533 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash

https://gerrit.wikimedia.org/r/486533

@jcrespo +1 to reimage/reclone from an existing host (or mariabackup!)

@Cmjohnson I guess we need to open a case with Dell about this host and upgrade BIOS/firmware?

The revision table was not affected:

root@cumin1001:~$ ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki revision rev_id db1067 db1114 --step=100000
[...]
2019-01-28T15:13:27.821882: row id 879900001/880627895, ETA: 00m04s, 0 chunk(s) found different
Execution ended, no differences found.

Checking other tables now.
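
For readers unfamiliar with this kind of check, here is a minimal sketch of a chunked table comparison in the spirit of the command above. It is not the actual compare.py implementation: the in-memory dictionaries stand in for the two hosts, and the names `chunk_digest` and `differing_chunks` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Tuple  # a row is a tuple; the dict key is its integer primary key


def chunk_digest(rows: List[Row]) -> str:
    """Hash a list of rows (already ordered by primary key) into one digest."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()


def differing_chunks(
    table_a: Dict[int, Row], table_b: Dict[int, Row], step: int
) -> List[Tuple[int, int]]:
    """Return the [start, end) id ranges whose contents differ between tables."""
    ids = list(table_a) + list(table_b)
    if not ids:
        return []
    different = []
    for start in range(min(ids), max(ids) + 1, step):
        end = start + step
        # Collect the rows of this id range from each side, in id order.
        rows_a = [table_a[i] for i in sorted(table_a) if start <= i < end]
        rows_b = [table_b[i] for i in sorted(table_b) if start <= i < end]
        if chunk_digest(rows_a) != chunk_digest(rows_b):
            different.append((start, end))
    return different


# Assumed toy data: two copies of a table, then one row modified on one side.
host_a = {1: (1, "x"), 2: (2, "y"), 150: (150, "z")}
host_b = dict(host_a)
print(differing_chunks(host_a, host_b, step=100))
host_b[150] = (150, "changed")
print(differing_chunks(host_a, host_b, step=100))
```

Comparing one digest per id range keeps the amount of data exchanged small and narrows any difference down to a single chunk, which is presumably why the real tool takes a --step argument.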

I checked the core tables at https://gerrit.wikimedia.org/r/486872

I propose to repool the host, as persistent replication stats worked as expected:
https://gerrit.wikimedia.org/r/486525

Go for it!
Thanks for checking it!

@jcrespo I can update f/w if you need but you will need to depool the host again

It is ok, will that be next week, e.g. Tuesday?

Change 489189 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/mediawiki-config@master] Depool db1114 - host down

https://gerrit.wikimedia.org/r/489189

The server went down at 12:16, with a number of memory errors logged in SEL:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/08/2019 12:16:05
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/08/2019 12:16:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/08/2019 12:17:50
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/08/2019 12:17:50
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
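
As an aside, a log in this layout is easy to summarise per DIMM. The following is a rough sketch, assuming the "Key: value" record layout shown above; the function names `parse_sel` and `worst_severity_by_dimm` and the severity ranking are my own assumptions, not part of any Dell tooling.

```python
import re
from typing import Dict, List

# Four of the records from the SEL output above, used as sample input.
SEL_SAMPLE = """\
Record:      2
Date/Time:   02/08/2019 12:16:05
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/08/2019 12:16:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
"""


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split SEL text into one dict per record, keyed by field name."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"-"}:
            continue  # skip blank lines and dashed separators
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Record" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def worst_severity_by_dimm(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each DIMM named in a description to the worst severity seen for it."""
    rank = {"Ok": 0, "Non-Critical": 1, "Critical": 2}  # assumed ordering
    worst: Dict[str, str] = {}
    for rec in records:
        severity = rec.get("Severity", "Ok")
        for dimm in re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", "")):
            if dimm not in worst or rank.get(severity, 0) > rank.get(worst[dimm], 0):
                worst[dimm] = severity
    return worst


print(worst_severity_by_dimm(parse_sel(SEL_SAMPLE)))
```

On the sample records both DIMM_B3 and DIMM_B7 end up at Critical, which matches what the log shows by eye.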

Change 489189 merged by Elukey:
[operations/mediawiki-config@master] Depool db1114 - host down

https://gerrit.wikimedia.org/r/489189

Change 489200 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load

https://gerrit.wikimedia.org/r/489200

Change 489210 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1114: Disable notifications

https://gerrit.wikimedia.org/r/489210

Change 489210 merged by Marostegui:
[operations/puppet@production] db1114: Disable notifications

https://gerrit.wikimedia.org/r/489210

Change 489213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles

https://gerrit.wikimedia.org/r/489213

Change 489214 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Allow full reimage of db1114

https://gerrit.wikimedia.org/r/489214

Change 489214 merged by Jcrespo:
[operations/puppet@production] install_server: Allow full reimage of db1114

https://gerrit.wikimedia.org/r/489214

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1118.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201902081510_jynus_47902.log.

Change 489213 merged by Jcrespo:
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles

https://gerrit.wikimedia.org/r/489213

Completed auto-reimage of hosts:

['db1118.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-02-08T16:09:00Z] <jynus> stopping s1 replication on dbstore1001 to speed up cloning T214720

Change 489200 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load

https://gerrit.wikimedia.org/r/489200

The transfer took 1h 20m, probably sped up because I stopped replication (which avoids having to replay many changes).

Change 489280 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable notifications for db1118

https://gerrit.wikimedia.org/r/489280

Change 489281 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight

https://gerrit.wikimedia.org/r/489281

Change 489282 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight

https://gerrit.wikimedia.org/r/489282

Except for the above 3 patches, db1118 should be ready to go (I am not deploying them so late in the week, for obvious reasons).

Change 489280 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable notifications for db1118

https://gerrit.wikimedia.org/r/489280

Change 489647 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts

https://gerrit.wikimedia.org/r/489647

Change 489647 merged by Jcrespo:
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts

https://gerrit.wikimedia.org/r/489647

Change 489281 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight

https://gerrit.wikimedia.org/r/489281

Change 489282 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight

https://gerrit.wikimedia.org/r/489282

Mentioned in SAL (#wikimedia-operations) [2019-02-13T17:49:46Z] <marostegui> Stop MYSQL on db1114 for onsite maintenance - T214720

I updated the BIOS to the latest version as of February 11, 2019 (v2.9.1) and the iDRAC to the latest version (2.61.60.60).

@Cmjohnson should we also try to exchange the DIMM modules listed at T214720#4937872 and see if they fail again?

Marostegui renamed this task from db1114 crashed to db1114 crashed (HW memory issues). Feb 21 2019, 2:37 PM

@Marostegui I will need to swap DIMM B3 and B7 to the A side. LMK when the server is down and ready

I will take it down now (it is out of service, so I only need to downtime it on icinga).

racadm log before the DIMM swap:

/admin1-> racadm getsel
Record: 1
Date/Time: 11/04/2017 15:21:07
Source: system
Severity: Ok
Description: Log cleared.

Record: 2
Date/Time: 02/08/2019 12:16:05
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 3
Date/Time: 02/08/2019 12:16:08
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.

Record: 4
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 5
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

Record: 6
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.

Record: 7
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

Record: 8
Date/Time: 02/11/2019 14:02:13
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 9
Date/Time: 02/11/2019 14:02:19
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

@jynus @Marostegui I swapped DIMM B3 to A3 and B7 to A7 and cleared the idrac log. Please put some stress on the server and let's monitor.

Thanks, I have left it warming the buffer pool and replicating; tomorrow I will create a backup to touch all memory space.

@jcrespo maybe we can leave a mydumper running 24x7 in a loop for days on that host: dump everything, delete the backup files, dump everything again, and so forth.

I am creating a snapshot right now for testing purposes, will run a dumping process next.
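
A rough sketch of the "dump, delete, repeat" loop suggested above might look as follows. The helper names `run_dump_loop` and `demo_command` are hypothetical; the actual mydumper invocation used on the host is not shown in this task, so the command is passed in as a parameter and the demo uses a harmless stand-in.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def run_dump_loop(
    build_command: Callable[[Path], Sequence[str]], iterations: int
) -> int:
    """Run a dump command repeatedly, deleting its output after each pass.

    Returns the number of passes that completed. A failing pass raises
    subprocess.CalledProcessError, which stops the loop.
    """
    completed = 0
    for _ in range(iterations):
        out_dir = Path(tempfile.mkdtemp(prefix="dump-loop-"))
        try:
            subprocess.run(list(build_command(out_dir)), check=True)
            completed += 1
        finally:
            # Always remove the dump so the disk does not fill up.
            shutil.rmtree(out_dir, ignore_errors=True)
    return completed


def demo_command(out_dir: Path) -> Sequence[str]:
    """Stand-in for the real dump command: writes one small file."""
    script = "import sys, pathlib; pathlib.Path(sys.argv[1], 'dump.sql').write_text('x')"
    return [sys.executable, "-c", script, str(out_dir)]


print(run_dump_loop(demo_command, iterations=3))
```

On the real host one would pass a function that builds the actual dump command line and use a much larger iteration count, so that the whole dataset is read again and again and memory stays under load.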

@Cmjohnson db1114 crashed again with the same memory errors on the same slots, so it looks like the mainboard memory slots aren't healthy?

Record:      1
Date/Time:   02/21/2019 19:30:12
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/23/2019 21:25:36
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/23/2019 21:25:37
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/23/2019 21:25:58
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
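
The reasoning behind that conclusion can be written down as a small sketch: if the errors stay on the same slot after the modules have been moved, the module is unlikely to be the cause. The function name `classify_fault` and the verdict strings are illustrative assumptions, and real diagnosis would of course involve more than this.

```python
from typing import Dict, Set

SLOT_SIDE = "slot side (errors stayed on the slot, not the module)"
MODULE = "module (errors followed the DIMM to its new slot)"
INCONCLUSIVE = "inconclusive (errors did not reproduce)"


def classify_fault(
    errors_before: Set[str], errors_after: Set[str], moved: Dict[str, str]
) -> Dict[str, str]:
    """Guess what is faulty for each slot that reported errors before a swap.

    `moved` maps a module's original slot to the slot it was moved to.
    """
    verdict: Dict[str, str] = {}
    for slot in sorted(errors_before):
        new_slot = moved.get(slot)
        if slot in errors_after:
            verdict[slot] = SLOT_SIDE
        elif new_slot is not None and new_slot in errors_after:
            verdict[slot] = MODULE
        else:
            verdict[slot] = INCONCLUSIVE
    return verdict


# Values taken from this task: B3 -> A3 and B7 -> A7, errors again on B3 and B7.
result = classify_fault(
    errors_before={"B3", "B7"},
    errors_after={"B3", "B7"},
    moved={"B3": "A3", "B7": "A7"},
)
for slot, guess in result.items():
    print(slot, "->", guess)
```

With the values from this task both slots come out as a slot-side problem, which is consistent with suspecting the mainboard rather than the DIMMs.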

It crashed again less than 30 minutes after I started generating load:

2019-02-24T13:07:00-0600  USR0030   Successfully logged in using root, from 10.64.32.25 and GUI.
2019-02-24T12:29:47-0600  UEFI0081  Memory size has changed from the last time the system was started.
2019-02-24T12:29:47-0600  LOG007    The previous log entry was repeated 1 times.
2019-02-24T12:29:47-0600  UEFI0107  One or more memory errors have occurred on memory slot: B3.
2019-02-24T12:29:41-0600  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
2019-02-24T12:29:41-0600  PST0091   A problem was detected in Memory Reference Code (MRC).
2019-02-24T12:29:40-0600  MEM0001   Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
2019-02-24T12:29:35-0600  UEFI0058  An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning.
2019-02-24T12:29:35-0600  PST0091   A problem was detected in Memory Reference Code (MRC).
2019-02-24T12:28:26-0600  SYS1003   System CPU Resetting.
2019-02-24T12:28:26-0600  RAC0703   Requested system hardreset.
2019-02-24T12:28:26-0600  SYS1003   System CPU Resetting.
2019-02-24T12:28:25-0600  PWR2262   The Intel Management Engine has reported an internal system error.
2019-02-24T12:28:25-0600  CPU0000   Internal error has occurred check for additional logs.
2019-02-24T12:28:22-0600  PWR2262   The Intel Management Engine has reported an internal system error.
2019-02-24T12:27:58-0600  USR0032   The session for root from 10.64.32.25 using GUI is logged off.
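
Since the lifecycle log lists entries newest first, it can help to reorder them when reading the crash sequence. This is a minimal sketch, assuming each entry has already been reduced to a (timestamp, code, message) tuple; the helper names are hypothetical and the sample holds only four of the entries above.

```python
from datetime import datetime
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (timestamp, event code, message)

# A few entries copied from the log above, in the log's newest-first order.
ENTRIES: List[Entry] = [
    ("2019-02-24T12:29:40-0600", "MEM0001",
     "Multi-bit memory errors detected on a memory device at location(s) DIMM_B3."),
    ("2019-02-24T12:29:35-0600", "UEFI0058",
     "An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning."),
    ("2019-02-24T12:28:26-0600", "RAC0703", "Requested system hardreset."),
    ("2019-02-24T12:28:22-0600", "PWR2262",
     "The Intel Management Engine has reported an internal system error."),
]


def parse_ts(ts: str) -> datetime:
    """Parse a timestamp such as 2019-02-24T12:28:22-0600 (with UTC offset)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z")


def chronological(entries: List[Entry]) -> List[Entry]:
    """Return the entries ordered oldest first."""
    return sorted(entries, key=lambda e: parse_ts(e[0]))


def seconds_between(entries: List[Entry], first_code: str, second_code: str) -> float:
    """Seconds from the first occurrence of one code to the first of another."""
    ordered = chronological(entries)
    start = next(parse_ts(ts) for ts, code, _ in ordered if code == first_code)
    end = next(parse_ts(ts) for ts, code, _ in ordered if code == second_code)
    return (end - start).total_seconds()


for ts, code, message in chronological(ENTRIES):
    print(ts, code, message)
print(seconds_between(ENTRIES, "PWR2262", "MEM0001"))
```

Read oldest first, the sample shows the Management Engine error, then the hard reset, and only afterwards the memory errors reported during the following boot.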

This will most likely need a new motherboard. I requested one through Dell:

You have successfully submitted request SR986942076.

A new motherboard arrives tomorrow (28/2/2019) for the replacement.

@Cmjohnson remember this host has MySQL down already, so you can just power it off yourself whenever you are ready for the mainboard replacement.
Thanks

Mentioned in SAL (#wikimedia-operations) [2019-02-28T15:15:42Z] <cmjohnson1> powering off db1114 to replace motherboard T214720

The motherboard has been replaced, and the iDRAC and BIOS have been updated to the latest version. Resolving the task; reopen if there are any problems.

@Marostegui I've chosen not to reimage the server because it is currently a backup-testing host, so I think it is OK if it doesn't have the right enwiki data for now. Feel free to use it for dump testing. I have brought it up, caught up replication, and started a backup process in a loop (in a screen session) to check that it doesn't go down again, as in T214720#4978949

If you recover a snapshot to it, remember to kill the screen first.

No problem! Let's leave the loop there for a few days to see if it crashes.
Thank you!

Hi,

@Cmjohnson The remote IPMI password was out of sync. Just mentioning to add it on the to do list for motherboard changes (this and reviewing the boot order, which you did, thank you!). Not a huge issue, just a heads up to prevent people from bothering you (I was able to correct it from localhost).