db1114 crashed (HW memory issues)
Closed, ResolvedPublic
Authored By

	• jcrespo
	Jan 25 2019, 7:29 PM

Description

2019-01-25T18:04:26-0600 SYS1001
System is turning off.

2019-01-25T18:04:19-0600 SYS1003
System CPU Resetting.

2019-01-25T18:04:18-0600 RAC0703
Requested system hardreset.

2019-01-25T18:04:18-0600 SYS1003
System CPU Resetting.

2019-01-25T18:04:17-0600 PWR2262
The Intel Management Engine has reported an internal system error.
Log Sequence Number: 136
Detailed Description:
The Intel Management Engine is unable to utilize the PECI over DMI facility.
Recommended Action:
Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.

2019-01-25T18:04:17-0600 CPU0000
Internal error has occurred check for additional logs.

2019-01-25T18:04:17-0600 LOG007
The previous log entry was repeated 1 times.

Details

Subject	Repo	Branch	Lines +/-
mariadb: Pool db1118 with full weight	operations/mediawiki-config	master	+3 -3
mariadb: Introduce and pool db1118 with low weight	operations/mediawiki-config	master	+6 -6
install_server: Remove db1118 from the list of automatic reimage hosts	operations/puppet	production	+1 -1
mariadb: Enable notifications for db1118	operations/puppet	production	+0 -1
mariadb: Pool rc slaves with higher weight to rebalance load	operations/mediawiki-config	master	+3 -3
mariadb: Switch db1114 and db1118 roles	operations/puppet	production	+6 -6
install_server: Allow full reimage of db1114	operations/puppet	production	+1 -1
db1114: Disable notifications	operations/puppet	production	+1 -0
Depool db1114 - host down	operations/mediawiki-config	master	+2 -2
mariadb: Pool db1106 as an extra api host after db1114 crash	operations/mediawiki-config	master	+2 -1

Related Objects

Mentioned In: T215611: MediaWiki errors overloading logstash

Event Timeline

• jcrespo created this task.Jan 25 2019, 7:29 PM

Restricted Application added a project: SRE.Jan 25 2019, 7:29 PM

• jcrespo added subscribers: Marostegui, • Cmjohnson.Jan 25 2019, 7:30 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-25T19:36:04Z] <jynus> powercycle db1114 T214720

Intel - "The Intel Management Engine is unable to utilize the PECI over DMI facility"

"We have a Dell PowerEdge R630 Server with Intel Xeon E5-2640 v4 CPUs and we are getting exactly the same error after updating from BIOS 2.3.4 to 2.6.0."

"We have a systemic issue we have experienced this over 100 times. Dell is engaged but we have not made much process. We were told to move from 2.4.3 to 2.6.0 by Dell, the condition still persisted. "

Reddit - Dell Poweredge R630 massive stability problems

db1114 - HW type: Dell PowerEdge R630

Although a crash caused by a CPU failure should be "clean" thanks to gtid and binlog_sync, the host should be checked or reimaged before being repooled.

Change 486533 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash

https://gerrit.wikimedia.org/r/486533

gerritbot added a project: Patch-For-Review.Jan 25 2019, 7:56 PM

Change 486533 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash

https://gerrit.wikimedia.org/r/486533

@jcrespo +1 to reimage/reclone from an existing host (or mariabackup!)

@Cmjohnson I guess we need to open a case with Dell about this host and upgrade BIOS/firmware?

Marostegui moved this task from Triage to In progress on the DBA board.Jan 26 2019, 3:06 AM

revision was not affected:

root@cumin1001:~$ ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki revision rev_id db1067 db1114 --step=100000
[...]
2019-01-28T15:13:27.821882: row id 879900001/880627895, ETA: 00m04s, 0 chunk(s) found different
Execution ended, no differences found.
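
For readers unfamiliar with this kind of check, the idea behind a chunked table comparison can be sketched roughly as below. This is an illustration only and not the actual `compare.py` implementation: the names `chunk_digest` and `compare_in_chunks`, the in-memory dicts standing in for the two hosts, and the use of one MD5 digest per chunk are all assumptions made for the example.

```python
import hashlib


def chunk_digest(rows, lo, hi):
    """Hash every (id, value) pair whose id falls in the range [lo, hi)."""
    h = hashlib.md5()
    for row_id in sorted(k for k in rows if lo <= k < hi):
        h.update(f"{row_id}:{rows[row_id]}".encode())
    return h.hexdigest()


def compare_in_chunks(source, replica, step):
    """Return the (lo, hi) id ranges whose contents differ.

    `source` and `replica` are hypothetical stand-ins for the same table on
    two hosts, modelled as dicts mapping a numeric primary key to row data.
    """
    if not source and not replica:
        return []
    all_ids = list(source) + list(replica)
    different = []
    for lo in range(min(all_ids), max(all_ids) + 1, step):
        hi = lo + step
        if chunk_digest(source, lo, hi) != chunk_digest(replica, lo, hi):
            different.append((lo, hi))
    return different


if __name__ == "__main__":
    good = {i: f"rev-{i}" for i in range(1, 1001)}
    bad = dict(good)
    bad[250] = "corrupted"  # simulate one damaged row on the replica
    print(compare_in_chunks(good, good, step=100))
    print(compare_in_chunks(good, bad, step=100))
```

Comparing one digest per id range keeps the work per step bounded and narrows any mismatch down to a single chunk, which can then be inspected row by row.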

Checking other tables now.

I checked the core tables at https://gerrit.wikimedia.org/r/486872

I propose to repool the host, as the persistent replication stats worked as expected:
https://gerrit.wikimedia.org/r/486525

Go for it!
Thanks for checking it!

db1114 is repooled.

• jcrespo moved this task from In progress to Blocked external/Not db team on the DBA board.Jan 29 2019, 8:24 AM

@jcrespo I can update f/w if you need but you will need to depool the host again

• Cmjohnson moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.Jan 30 2019, 9:55 PM

It is ok, will that be next week, e.g. Tuesday?

Marostegui triaged this task as Medium priority.Feb 5 2019, 6:53 AM

Change 489189 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/mediawiki-config@master] Depool db1114 - host down

https://gerrit.wikimedia.org/r/489189

The server went down at 12:16, with a number of memory errors logged in SEL:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/08/2019 12:16:05
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/08/2019 12:16:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   02/08/2019 12:17:50
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   02/08/2019 12:17:50
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
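
A quick way to summarise a SEL dump like the one above is to count events per DIMM and severity. The following is a minimal sketch, assuming the text layout shown in this task (dashed separators and `Field: value` lines); the function names and the `SAMPLE_SEL` excerpt are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

SEPARATOR = re.compile(r"^-{5,}\s*$", re.MULTILINE)
FIELD = re.compile(
    r"^(Record|Date/Time|Source|Severity|Description):[ \t]*(.*)$",
    re.MULTILINE,
)
DIMM = re.compile(r"DIMM_[A-Z]\d+")

# Excerpt copied from the SEL output in this task.
SAMPLE_SEL = """\
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/08/2019 12:16:05
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/08/2019 12:16:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   02/08/2019 12:16:12
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split the dump on separator lines and return one dict per record."""
    records = []
    for block in SEPARATOR.split(text):
        fields = {name: value.strip() for name, value in FIELD.findall(block)}
        if "Record" in fields:
            records.append(fields)
    return records


def errors_per_dimm(records):
    """Count events keyed by (DIMM location, severity)."""
    counts = Counter()
    for rec in records:
        for dimm in DIMM.findall(rec.get("Description", "")):
            counts[(dimm, rec.get("Severity", ""))] += 1
    return counts


def uncorrectable_dimms(records):
    """Return the sorted DIMM locations that reported multi-bit errors."""
    hit = set()
    for rec in records:
        desc = rec.get("Description", "")
        if "Multi-bit" in desc:
            hit.update(DIMM.findall(desc))
    return sorted(hit)


if __name__ == "__main__":
    recs = parse_sel(SAMPLE_SEL)
    print(errors_per_dimm(recs))
    print(uncorrectable_dimms(recs))
```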

Change 489189 merged by Elukey:
[operations/mediawiki-config@master] Depool db1114 - host down

https://gerrit.wikimedia.org/r/489189

Change 489200 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load

https://gerrit.wikimedia.org/r/489200

Marostegui mentioned this in T215611: MediaWiki errors overloading logstash.Feb 8 2019, 1:47 PM

Change 489210 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1114: Disable notifications

https://gerrit.wikimedia.org/r/489210

Change 489210 merged by Marostegui:
[operations/puppet@production] db1114: Disable notifications

https://gerrit.wikimedia.org/r/489210

Change 489213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles

https://gerrit.wikimedia.org/r/489213

Change 489214 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Allow full reimage of db1114

https://gerrit.wikimedia.org/r/489214

Change 489214 merged by Jcrespo:
[operations/puppet@production] install_server: Allow full reimage of db1114

https://gerrit.wikimedia.org/r/489214

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1118.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201902081510_jynus_47902.log.

Change 489213 merged by Jcrespo:
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles

https://gerrit.wikimedia.org/r/489213

Completed auto-reimage of hosts:

['db1118.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-02-08T16:09:00Z] <jynus> stopping s1 replication on dbstore1001 to speed up cloning T214720

Change 489200 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load

https://gerrit.wikimedia.org/r/489200

The transfer took 1h 20m, probably sped up because I stopped replication (so fewer changes had to be replayed).

Change 489280 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable notifications for db1118

https://gerrit.wikimedia.org/r/489280

Change 489281 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight

https://gerrit.wikimedia.org/r/489281

Change 489282 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight

https://gerrit.wikimedia.org/r/489282

Except for the above 3 patches, db1118 should be ready to go (not done so late in the week for obvious reasons).

Change 489280 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable notifications for db1118

https://gerrit.wikimedia.org/r/489280

Change 489647 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts

https://gerrit.wikimedia.org/r/489647

Change 489647 merged by Jcrespo:
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts

https://gerrit.wikimedia.org/r/489647

Change 489281 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight

https://gerrit.wikimedia.org/r/489281

Change 489282 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight

https://gerrit.wikimedia.org/r/489282

Mentioned in SAL (#wikimedia-operations) [2019-02-13T17:49:46Z] <marostegui> Stop MYSQL on db1114 for onsite maintenance - T214720

I updated the BIOS to the latest version as of February 11, 2019 (v2.9.1) and the iDRAC to the latest version (2.61.60.60).

Thanks, Chris!

@Cmjohnson should we also try to exchange the DIMM modules listed at T214720#4937872 and see if they fail again?

Marostegui renamed this task from db1114 crashed to db1114 crashed (HW memory issues).Feb 21 2019, 2:37 PM

@Marostegui I will need to swap DIMM B3 and B7 to the A side. LMK when the server is down and ready

I will shut it down now (it is out of service, so I only need to downtime it on icinga)

Mentioned in SAL (#wikimedia-operations) [2019-02-21T18:46:36Z] <jynus> shutting down db1114 T214720

Before DIMM Swap racadm log

/admin1-> racadm getsel
Record: 1
Date/Time: 11/04/2017 15:21:07
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 02/08/2019 12:16:05
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 3
Date/Time: 02/08/2019 12:16:08
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B3.

Record: 4
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 5
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

Record: 6
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 7
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Critical

Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

Record: 8
Date/Time: 02/11/2019 14:02:13
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B7.

Record: 9
Date/Time: 02/11/2019 14:02:19
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_B7.

@jynus @Marostegui I swapped DIMM B3 to A3 and B7 to A7 and cleared the idrac log. Please put some stress on the server and let's monitor.

Thanks, I have left it warming the buffer pool/replicating; tomorrow I will create a backup to touch all memory space.

@jcrespo maybe we can leave a mydumper running 24x7 in a loop for days on that host: dump everything, delete the backup files, dump everything again, and so forth.
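
Such a dump-and-delete loop could look roughly like the sketch below. It is only an illustration under stated assumptions: `dump_loop` and `mydumper_command` are hypothetical helper names, mydumper is assumed to be installed and to read its credentials from its defaults file, and the only mydumper option used is `--outputdir`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def dump_loop(build_command, iterations, base_dir=None):
    """Run a dump command repeatedly, deleting its output after each pass.

    `build_command` is a callable that receives the output directory and
    returns the argv list to execute. Returns the number of completed passes.
    A failing command raises CalledProcessError and stops the loop, which is
    itself a useful signal during a stress test.
    """
    completed = 0
    for _ in range(iterations):
        out_dir = Path(tempfile.mkdtemp(prefix="dump-test-", dir=base_dir))
        try:
            subprocess.run(build_command(out_dir), check=True)
            completed += 1
        finally:
            # Always remove the dump so the disk does not fill up.
            shutil.rmtree(out_dir, ignore_errors=True)
    return completed


def mydumper_command(out_dir):
    """Build a mydumper invocation (assumption: credentials come from the
    client defaults file, so no connection options are passed here)."""
    return ["mydumper", "--outputdir", str(out_dir)]


if __name__ == "__main__":
    # Three passes as a demonstration; raise the count for a multi-day run.
    print(dump_loop(mydumper_command, iterations=3))
```

Passing the command builder as a parameter keeps the loop independent of the dump tool, so the same loop can be tried with a harmless command first.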

I am creating a snapshot right now for testing purposes, will run a dumping process next.

@Cmjohnson db1114 crashed again with the same memory errors on the same slots, so it looks like the mainboard memory slots aren't healthy?

Record:      1
Date/Time:   02/21/2019 19:30:12
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/23/2019 21:25:36
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/23/2019 21:25:37
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   02/23/2019 21:25:58
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
-------------------------------------------------------------------------------
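
The reasoning behind the swap test can be written down as a small decision rule: if errors stay at the original location after the modules are moved, the location side (slot, channel, mainboard or memory controller) is suspect; if they follow the module to its new location, the module is suspect. The sketch below is a simplified illustration with a hypothetical function name. It cannot tell apart the different location-side components, and the swap mapping is taken from the comment above (B3 to A3, B7 to A7).

```python
def classify_fault(errors_before, errors_after, swaps):
    """Classify each originally failing location after a DIMM swap.

    errors_before / errors_after: sets of locations reporting errors.
    swaps: maps the old location of a module to its new location.
    """
    result = {}
    for slot in sorted(errors_before):
        moved_to = swaps.get(slot)
        stayed = slot in errors_after
        followed = moved_to is not None and moved_to in errors_after
        if stayed and followed:
            result[slot] = "slot and module"
        elif stayed:
            result[slot] = "slot"  # error did not move with the DIMM
        elif followed:
            result[slot] = "module"  # error moved with the DIMM
        else:
            result[slot] = "inconclusive"
    return result


if __name__ == "__main__":
    swaps = {"DIMM_B3": "DIMM_A3", "DIMM_B7": "DIMM_A7"}
    before = {"DIMM_B3", "DIMM_B7"}
    after = {"DIMM_B3", "DIMM_B7"}  # locations seen again after the swap
    print(classify_fault(before, after, swaps))
```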

It crashed again in less than 30 minutes after generating load:

2019-02-24T13:07:00-0600 USR0030
Successfully logged in using root, from 10.64.32.25 and GUI.

2019-02-24T12:29:47-0600 UEFI0081
Memory size has changed from the last time the system was started.

2019-02-24T12:29:47-0600 LOG007
The previous log entry was repeated 1 times.

2019-02-24T12:29:47-0600 UEFI0107
One or more memory errors have occurred on memory slot: B3.

2019-02-24T12:29:41-0600 MEM0001
Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

2019-02-24T12:29:41-0600 PST0091
A problem was detected in Memory Reference Code (MRC).

2019-02-24T12:29:40-0600 MEM0001
Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.

2019-02-24T12:29:35-0600 UEFI0058
An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning.

2019-02-24T12:29:35-0600 PST0091
A problem was detected in Memory Reference Code (MRC).

2019-02-24T12:28:26-0600 SYS1003
System CPU Resetting.

2019-02-24T12:28:26-0600 RAC0703
Requested system hardreset.

2019-02-24T12:28:26-0600 SYS1003
System CPU Resetting.

2019-02-24T12:28:25-0600 PWR2262
The Intel Management Engine has reported an internal system error.

2019-02-24T12:28:25-0600 CPU0000
Internal error has occurred check for additional logs.

2019-02-24T12:28:22-0600 PWR2262
The Intel Management Engine has reported an internal system error.

2019-02-24T12:27:58-0600 USR0032
The session for root from 10.64.32.25 using GUI is logged off.

This will most likely need a new motherboard. I requested one through Dell.

You have successfully submitted request SR986942076.

Excellent! Thank you Chris!

A new motherboard arrives tomorrow, 28/2/2019, for the replacement.

@Cmjohnson remember this host has MySQL down already, so you can just power it off yourself whenever you are ready for the mainboard replacement.
Thanks

Mentioned in SAL (#wikimedia-operations) [2019-02-28T15:15:42Z] <cmjohnson1> powering off db1114 to replace motherboard T214720

The motherboard has been replaced, and the iDRAC and BIOS have been updated to the latest versions. Resolving the task; reopen if there are any problems.

@Marostegui I've chosen not to reimage the server because it is currently a backup testing host; I think it is OK if it doesn't have the right enwiki data for now. Feel free to use it for dump testing. I have brought it up, caught up replication, and started a backup process in a loop (in a screen session) to check that it doesn't go down again, like in T214720#4978949

If you recover a snapshot to it, remember to kill the screen first.

No problem! let's leave the loop there for a few days to see if it crashes
Thank you!

Hi,

@Cmjohnson The remote IPMI password was out of sync. Just mentioning it so it can be added to the to-do list for motherboard changes (this and reviewing the boot order, which you did, thank you!). Not a huge issue, just a heads up to prevent people from bothering you (I was able to correct it from localhost).
