
db1114 crashed (HW memory issues)
Open, Normal, Public

Description

2019-01-25T18:04:26-0600  SYS1001
System is turning off.

2019-01-25T18:04:19-0600  SYS1003
System CPU Resetting.

2019-01-25T18:04:18-0600  RAC0703
Requested system hardreset.

2019-01-25T18:04:18-0600  SYS1003
System CPU Resetting.

2019-01-25T18:04:17-0600  PWR2262
The Intel Management Engine has reported an internal system error.
Log Sequence Number: 136
Detailed Description:
The Intel Management Engine is unable to utilize the PECI over DMI facility.
Recommended Action:
Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.

2019-01-25T18:04:17-0600  CPU0000
Internal error has occurred check for additional logs.

2019-01-25T18:04:17-0600  LOG007
The previous log entry was repeated 1 times.

Event Timeline

jcrespo created this task. · Fri, Jan 25, 7:29 PM
Restricted Application added a project: Operations. · Fri, Jan 25, 7:29 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-25T19:36:04Z] <jynus> powercycle db1114 T214720

Dzahn added a subscriber: Dzahn.Fri, Jan 25, 7:39 PM

Intel - "The Intel Management Engine is unable to utilize the PECI over DMI facility"

"We have a Dell PowerEdge R630 Server with Intel Xeon E5-2640 v4 CPUs and we are getting exactly the same error after updating from BIOS 2.3.4 to 2.6.0::

"We have a systemic issue we have experienced this over 100 times. Dell is engaged but we have not made much process. We were told to move from 2.4.3 to 2.6.0 by Dell, the condition still persisted. "

Reddit - Dell Poweredge R630 massive stability problems

db1114 - HW type: Dell PowerEdge R630

While a CPU failure should be "clean" thanks to GTID and binlog syncing, the host should be checked or reimaged before being repooled.
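(For reference, a quick way to check the crash-safety settings the previous comment alludes to; the variable names are standard MariaDB ones, and the expected values on db1114 are an assumption, not confirmed in this task.)

# Check the durability/GTID settings that make a crash "clean" for a replica.
# Credentials/defaults-file omitted.
mysql -e "SELECT @@sync_binlog, @@innodb_flush_log_at_trx_commit,
                 @@gtid_strict_mode, @@relay_log_recovery\G"
# The usual crash-safe expectation: sync_binlog=1 and
# innodb_flush_log_at_trx_commit=1 (binlog and redo log fsynced at commit),
# plus GTID so the replica can reconnect at the exact transaction after a restart.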

Change 486533 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash

https://gerrit.wikimedia.org/r/486533

Change 486533 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash

https://gerrit.wikimedia.org/r/486533

@jcrespo +1 to reimage/reclone from an existing host (or mariabackup!)
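(For readers outside WMF, a minimal sketch of what a mariabackup-based reclone looks like; the paths and the copy step are placeholders, not the tooling actually used here.)

# On a healthy source replica: take a consistent physical backup
# (connection credentials omitted).
mariabackup --backup --target-dir=/srv/backup
# Apply the redo log so the copy is consistent on its own.
mariabackup --prepare --target-dir=/srv/backup
# Then copy /srv/backup to db1114 (rsync or similar), move it into the datadir,
# chown it to mysql, start MariaDB, and point replication at the GTID position
# recorded in xtrabackup_binlog_info.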

@Cmjohnson I guess we need to open a case with Dell about this host and upgrade BIOS/firmware?

Marostegui moved this task from Triage to In progress on the DBA board. · Sat, Jan 26, 3:06 AM

The revision table was not affected:

root@cumin1001:~$ ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki revision rev_id db1067 db1114 --step=100000
[...]
2019-01-28T15:13:27.821882: row id 879900001/880627895, ETA: 00m04s, 0 chunk(s) found different
Execution ended, no differences found.

Checking other tables now.
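(For context, the idea behind compare.py is roughly this: walk the table in primary-key chunks and compare a per-chunk checksum on both hosts. The shell sketch below is illustrative only; credentials are omitted and the real script's chunking, retry and ETA logic is more involved.)

# Checksum enwiki.revision in 100k rev_id chunks on db1067 and db1114 and
# report any chunk where the two hosts disagree.
for start in $(seq 0 100000 880700000); do
    end=$((start + 100000))
    q="SELECT COUNT(*), COALESCE(SUM(CRC32(CONCAT_WS('#', rev_id, rev_page, rev_timestamp))), 0)
       FROM enwiki.revision WHERE rev_id > $start AND rev_id <= $end"
    a=$(mysql -h db1067.eqiad.wmnet -BN -e "$q")
    b=$(mysql -h db1114.eqiad.wmnet -BN -e "$q")
    [ "$a" != "$b" ] && echo "chunk $start-$end differs: '$a' vs '$b'"
done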

I checked the core tables at https://gerrit.wikimedia.org/r/486872

I propose to repool the host, as the persistent (crash-safe) replication state worked as expected:
https://gerrit.wikimedia.org/r/486525

Go for it!
Thanks for checking it!

db1114 is repooled.

@jcrespo I can update the f/w if you need, but you will need to depool the host again.

That is ok; could that be next week, e.g. Tuesday?

Marostegui triaged this task as Normal priority. · Tue, Feb 5, 6:53 AM

Change 489189 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/mediawiki-config@master] Depool db1114 - host down

https://gerrit.wikimedia.org/r/489189

The server went down at 12:16, with a number of memory errors logged in SEL:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/08/2019 12:16:05
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/08/2019 12:16:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/08/2019 12:17:50
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/08/2019 12:17:50
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
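(For reference, these SEL records can be read either in-band or from the iDRAC; both commands below are standard, and the second is the one shown later in this task.)

ipmitool sel elist   # in-band from the host, needs ipmitool and the IPMI kernel modules
racadm getsel        # from the iDRAC console, as in the "Before DIMM Swap" log further down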

Change 489189 merged by Elukey:
[operations/mediawiki-config@master] Depool db1114 - host down

https://gerrit.wikimedia.org/r/489189

Change 489200 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load

https://gerrit.wikimedia.org/r/489200

Change 489210 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1114: Disable notifications

https://gerrit.wikimedia.org/r/489210

Change 489210 merged by Marostegui:
[operations/puppet@production] db1114: Disable notifications

https://gerrit.wikimedia.org/r/489210

Change 489213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles

https://gerrit.wikimedia.org/r/489213

Change 489214 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Allow full reimage of db1114

https://gerrit.wikimedia.org/r/489214

Change 489214 merged by Jcrespo:
[operations/puppet@production] install_server: Allow full reimage of db1114

https://gerrit.wikimedia.org/r/489214

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1118.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201902081510_jynus_47902.log.

Change 489213 merged by Jcrespo:
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles

https://gerrit.wikimedia.org/r/489213

Completed auto-reimage of hosts:

['db1118.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-02-08T16:09:00Z] <jynus> stopping s1 replication on dbstore1001 to speed up cloning T214720

Change 489200 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load

https://gerrit.wikimedia.org/r/489200

The transfer took 1h 20m, probably sped up because I stopped replication (avoiding having to replay many changes).
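(The speed-up comes from the source not having to replay writes during the copy; a hedged sketch of the stop/start around the transfer, assuming dbstore1001 uses named multi-source replication connections as the SAL entry suggests.)

# On the clone source, pause the s1 replication connection for the duration of
# the transfer, then resume it (MariaDB multi-source syntax).
mysql -e "STOP SLAVE 's1';"
# ... run the clone/transfer ...
mysql -e "START SLAVE 's1';"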

Change 489280 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable notifications for db1118

https://gerrit.wikimedia.org/r/489280

Change 489281 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight

https://gerrit.wikimedia.org/r/489281

Change 489282 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight

https://gerrit.wikimedia.org/r/489282

Except for the above 3 patches, db1118 should be ready to go (not deploying them this late in the week for obvious reasons).

Change 489280 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable notifications for db1118

https://gerrit.wikimedia.org/r/489280

Change 489647 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts

https://gerrit.wikimedia.org/r/489647

Change 489647 merged by Jcrespo:
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts

https://gerrit.wikimedia.org/r/489647

Change 489281 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight

https://gerrit.wikimedia.org/r/489281

Change 489282 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight

https://gerrit.wikimedia.org/r/489282

Mentioned in SAL (#wikimedia-operations) [2019-02-13T17:49:46Z] <marostegui> Stop MYSQL on db1114 for onsite maintenance - T214720

I updated the BIOS to the latest version as of February 11, 2019 (v2.9.1)
and updated the iDRAC to the latest version (2.61.60.60).

Thanks, Chris!

@Cmjohnson should we also try to exchange the DIMM modules listed at T214720#4937872 and see if they fail again?

Marostegui renamed this task from db1114 crashed to db1114 crashed (HW memory issues). · Thu, Feb 21, 2:37 PM

@Marostegui I will need to swap DIMM B3 and B7 to the A side. LMK when the server is down and ready

I will put it down now (it is out of service; I only need to downtime it on Icinga).

Mentioned in SAL (#wikimedia-operations) [2019-02-21T18:46:36Z] <jynus> shutting down db1114 T214720

Before DIMM Swap racadm log

/admin1-> racadm getsel
Record: 1
Date/Time: 11/04/2017 15:21:07
Source: system
Severity: Ok
Description: Log cleared.

Record: 2
Date/Time: 02/08/2019 12:16:05
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 3
Date/Time: 02/08/2019 12:16:08
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.

Record: 4
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 5
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

Record: 6
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.

Record: 7
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

Record: 8
Date/Time: 02/11/2019 14:02:13
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 9
Date/Time: 02/11/2019 14:02:19
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.

@jynus @Marostegui I swapped DIMM B3 to A3 and B7 to A7 and cleared the idrac log. Please put some stress on the server and let's monitor.

Thanks. I have left it warming the buffer pool and replicating; tomorrow I will create a backup to touch all of the memory space.

@jcrespo maybe we can leave mydumper running 24x7 in a loop for days on that host: dump everything, delete the backup files, dump everything again, and so forth.
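(Something like the following loop would do it; a hedged sketch only, with mydumper flags taken from its upstream documentation and placeholder paths/thread counts, not an agreed procedure.)

# Stress the re-seated DIMMs by dumping all user databases in a loop and
# discarding the output each iteration. Credentials omitted.
while true; do
    mydumper --host=localhost --threads=16 --compress \
             --regex='^(?!(mysql\.|information_schema\.|performance_schema\.))' \
             --outputdir=/srv/tmp/stress-dump
    rm -rf /srv/tmp/stress-dump
done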

I am creating a snapshot right now for testing purposes, will run a dumping process next.