2019-01-25T18:04:26-0600 SYS1001 System is turning off.
2019-01-25T18:04:19-0600 SYS1003 System CPU Resetting.
2019-01-25T18:04:18-0600 RAC0703 Requested system hardreset.
2019-01-25T18:04:18-0600 SYS1003 System CPU Resetting.
2019-01-25T18:04:17-0600 PWR2262 The Intel Management Engine has reported an internal system error.
    Log Sequence Number: 136
    Detailed Description: The Intel Management Engine is unable to utilize the PECI over DMI facility.
    Recommended Action: Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.
2019-01-25T18:04:17-0600 CPU0000 Internal error has occurred check for additional logs.
2019-01-25T18:04:17-0600 LOG007 The previous log entry was repeated 1 times.
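The recommended action above boils down to checking whether a PWR2264 "normal system operation" entry follows the PWR2262 one. A minimal sketch of that check against an exported Lifecycle Log, assuming the log has been dumped to a plain-text file (the sample entries and the file path here are illustrative; a real export would come from the iDRAC, e.g. via racadm):

```shell
# Illustrative Lifecycle Log excerpt; in reality this file would be an
# export from the iDRAC rather than a hand-written sample.
cat > /tmp/lclog.txt <<'EOF'
2019-01-25T18:04:17-0600 PWR2262 The Intel Management Engine has reported an internal system error.
2019-01-25T18:04:47-0600 PWR2264 The Intel Management Engine has reported normal system operation.
EOF

# Did the IME report recovery after the internal error?
if grep -q 'PWR2264' /tmp/lclog.txt; then
    echo "IME recovered: PWR2264 present"
else
    echo "No PWR2264 entry: disconnect/reconnect power and re-check"
fi
```

If PWR2264 never shows up, the vendor guidance above (drain power, power back on, then escalate) applies.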
Description
Details
Related Objects
- Mentioned In
- T215611: MediaWiki errors overloading logstash
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2019-01-25T19:36:04Z] <jynus> powercycle db1114 T214720
Intel - "The Intel Management Engine is unable to utilize the PECI over DMI facility"
"We have a Dell PowerEdge R630 server with Intel Xeon E5-2640 v4 CPUs and we are getting exactly the same error after updating from BIOS 2.3.4 to 2.6.0."
"We have a systemic issue; we have experienced this over 100 times. Dell is engaged but we have not made much progress. We were told by Dell to move from 2.4.3 to 2.6.0; the condition still persisted."
Reddit - Dell Poweredge R630 massive stability problems
db1114 - HW type: Dell PowerEdge R630
While a CPU failure should be "clean" with GTID and binlog sync, the host should still be checked or reimaged before being repooled.
Change 486533 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash
Change 486533 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1106 as an extra api host after db1114 crash
@jcrespo +1 to reimage/reclone from an existing host (or mariabackup!)
@Cmjohnson I guess we need to open a case with Dell about this host and upgrade BIOS/firmware?
revision was not affected:
root@cumin1001:~$ ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki revision rev_id db1067 db1114 --step=100000
[...]
2019-01-28T15:13:27.821882: row id 879900001/880627895, ETA: 00m04s, 0 chunk(s) found different
Execution ended, no differences found.
Checking other tables now.
I checked the core tables at https://gerrit.wikimedia.org/r/486872
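The single-table check above can be looped over the other core tables. A hedged sketch of such a loop, with the compare.py call stubbed out so the loop logic runs standalone (the table list and the COMPARE stub are illustrative, not the actual command used for the gerrit check):

```shell
# COMPARE is a stand-in; in production it would be an invocation like
# ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki "$table" <pk> db1067 db1114
COMPARE=${COMPARE:-"echo checked"}

fail=0
for table in revision page user logging; do
    if $COMPARE "$table"; then
        echo "$table: no differences found"
    else
        echo "$table: DIFFERENCES FOUND"
        fail=1
    fi
done
[ "$fail" -eq 0 ] && echo "all tables matched"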
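The single-table check above can be looped over the other core tables. A hedged sketch of such a loop, with the compare.py call stubbed out so the loop logic runs standalone (the table list and the COMPARE stub are illustrative, not the actual command used for the gerrit check):

```shell
# COMPARE is a stand-in; in production it would be an invocation like
# ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki "$table" <pk> db1067 db1114
COMPARE=${COMPARE:-"echo checked"}

fail=0
for table in revision page user logging; do
    if $COMPARE "$table"; then
        echo "$table: no differences found"
    else
        echo "$table: DIFFERENCES FOUND"
        fail=1
    fi
done
[ "$fail" -eq 0 ] && echo "all tables matched"
```

Exiting non-zero on a mismatch (instead of just printing) would let this run unattended from cron or a cookbook.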
I propose to repool the host, as persistent replication state worked as expected:
https://gerrit.wikimedia.org/r/486525
Change 489189 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/mediawiki-config@master] Depool db1114 - host down
The server went down at 12:16, with a number of memory errors logged in SEL:
Record: 2
Date/Time: 02/08/2019 12:16:05
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
Record: 3
Date/Time: 02/08/2019 12:16:08
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
Record: 4
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
Record: 5
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
Record: 6
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
Record: 7
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
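Since the same slots keep reappearing in these SEL dumps, a quick per-DIMM tally is useful when eyeballing an export. A minimal sketch over a `racadm getsel`-style text dump (the sample lines and file path below are illustrative, mirroring the records above):

```shell
# Illustrative SEL excerpt; a real dump would come from `racadm getsel`.
cat > /tmp/sel.txt <<'EOF'
Description: Correctable memory error rate exceeded for DIMM_B7.
Description: Correctable memory error rate exceeded for DIMM_B3.
Description: Correctable memory error rate exceeded for DIMM_B7.
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
EOF

# Count memory-error records per DIMM slot, most-affected first.
grep -o 'DIMM_[A-Z][0-9]*' /tmp/sel.txt | sort | uniq -c | sort -rn
```

On the sample above this prints two hits each for DIMM_B3 and DIMM_B7, matching the pattern seen in the crash.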
Change 489189 merged by Elukey:
[operations/mediawiki-config@master] Depool db1114 - host down
Change 489200 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load
Change 489210 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1114: Disable notifications
Change 489210 merged by Marostegui:
[operations/puppet@production] db1114: Disable notifications
Change 489213 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles
Change 489214 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Allow full reimage of db1114
Change 489214 merged by Jcrespo:
[operations/puppet@production] install_server: Allow full reimage of db1114
Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:
['db1118.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201902081510_jynus_47902.log.
Change 489213 merged by Jcrespo:
[operations/puppet@production] mariadb: Switch db1114 and db1118 roles
Mentioned in SAL (#wikimedia-operations) [2019-02-08T16:09:00Z] <jynus> stopping s1 replication on dbstore1001 to speed up cloning T214720
Change 489200 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool rc slaves with higher weight to rebalance load
Transfer took 1h 20m, probably sped up because I stopped replication (avoiding replaying many changes).
Change 489280 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Enable notifications for db1118
Change 489281 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight
Change 489282 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight
Except for the above 3 patches, db1118 should be ready to go (not pooling it this late in the week for obvious reasons).
Change 489280 merged by Jcrespo:
[operations/puppet@production] mariadb: Enable notifications for db1118
Change 489647 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts
Change 489647 merged by Jcrespo:
[operations/puppet@production] install_server: Remove db1118 from the list of automatic reimage hosts
Change 489281 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Introduce and pool db1118 with low weight
Change 489282 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Pool db1118 with full weight
Mentioned in SAL (#wikimedia-operations) [2019-02-13T17:49:46Z] <marostegui> Stop MYSQL on db1114 for onsite maintenance - T214720
I updated the BIOS to the latest version as of February 11, 2019 (v2.9.1),
and updated the iDRAC to the latest version, 2.61.60.60.
@Cmjohnson should we also try to exchange the DIMM modules listed at T214720#4937872 and see if they fail again?
@Marostegui I will need to swap DIMM B3 and B7 to the A side. LMK when the server is down and ready
Mentioned in SAL (#wikimedia-operations) [2019-02-21T18:46:36Z] <jynus> shutting down db1114 T214720
Before DIMM Swap racadm log
/admin1-> racadm getsel
Record: 1
Date/Time: 11/04/2017 15:21:07
Source: system
Severity: Ok
Description: Log cleared.
Record: 2
Date/Time: 02/08/2019 12:16:05
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
Record: 3
Date/Time: 02/08/2019 12:16:08
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
Record: 4
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
Record: 5
Date/Time: 02/08/2019 12:16:12
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
Record: 6
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
Record: 7
Date/Time: 02/08/2019 12:17:50
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
Record: 8
Date/Time: 02/11/2019 14:02:13
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
Record: 9
Date/Time: 02/11/2019 14:02:19
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
@jynus @Marostegui I swapped DIMM B3 to A3 and B7 to A7 and cleared the idrac log. Please put some stress on the server and let's monitor.
Thanks, I have left it warming the buffer pool/replicating, tomorrow I will create a backup to touch all memory space.
@jcrespo maybe we can leave a mydumper running 24x7 on a loop for days on that host: dumping everything, deleting the backup files, dumping everything again, and so forth.
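The proposed stress loop could be sketched roughly as below. The dump command, directory, and iteration cap are placeholders, not the values actually used on db1114; in production DUMP_CMD would be a mydumper invocation (e.g. `mydumper --database enwiki --outputdir "$DUMP_DIR"`), and the loop would run unbounded inside a screen session:

```shell
# Placeholder settings; swap DUMP_CMD for a real mydumper invocation.
DUMP_DIR=${DUMP_DIR:-/tmp/stress_dump}
DUMP_CMD=${DUMP_CMD:-"touch $DUMP_DIR/dump.done"}
MAX_RUNS=${MAX_RUNS:-3}   # 0 = loop forever, as suggested above

run=0
while [ "$MAX_RUNS" -eq 0 ] || [ "$run" -lt "$MAX_RUNS" ]; do
    mkdir -p "$DUMP_DIR"
    sh -c "$DUMP_CMD"        # dump everything
    rm -rf "$DUMP_DIR"       # delete the backup files
    run=$((run + 1))
done
echo "completed $run dump cycles"
```

The point of the exercise is simply to keep touching as much memory as possible for days, so any dump workload that reads the full dataset would do.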
I am creating a snapshot right now for testing purposes, will run a dumping process next.
@Cmjohnson db1114 crashed again with the same memory errors on the same slots, so it looks like the mainboard memory slots aren't healthy?
Record: 1
Date/Time: 02/21/2019 19:30:12
Source: system
Severity: Ok
Description: Log cleared.
Record: 2
Date/Time: 02/23/2019 21:25:36
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
Record: 3
Date/Time: 02/23/2019 21:25:37
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B3.
Record: 4
Date/Time: 02/23/2019 21:25:58
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_B7.
It crashed again in less than 30 minutes after generating load:
2019-02-24T13:07:00-0600 USR0030 Successfully logged in using root, from 10.64.32.25 and GUI.
2019-02-24T12:29:47-0600 UEFI0081 Memory size has changed from the last time the system was started.
2019-02-24T12:29:47-0600 LOG007 The previous log entry was repeated 1 times.
2019-02-24T12:29:47-0600 UEFI0107 One or more memory errors have occurred on memory slot: B3.
2019-02-24T12:29:41-0600 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
2019-02-24T12:29:41-0600 PST0091 A problem was detected in Memory Reference Code (MRC).
2019-02-24T12:29:40-0600 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
2019-02-24T12:29:35-0600 UEFI0058 An uncorrectable Memory Error has occurred because a Dual Inline Memory Module (DIMM) is not functioning.
2019-02-24T12:29:35-0600 PST0091 A problem was detected in Memory Reference Code (MRC).
2019-02-24T12:28:26-0600 SYS1003 System CPU Resetting.
2019-02-24T12:28:26-0600 RAC0703 Requested system hardreset.
2019-02-24T12:28:26-0600 SYS1003 System CPU Resetting.
2019-02-24T12:28:25-0600 PWR2262 The Intel Management Engine has reported an internal system error.
2019-02-24T12:28:25-0600 CPU0000 Internal error has occurred check for additional logs.
2019-02-24T12:28:22-0600 PWR2262 The Intel Management Engine has reported an internal system error.
2019-02-24T12:27:58-0600 USR0032 The session for root from 10.64.32.25 using GUI is logged off.
This will most likely need a new motherboard. I requested one through Dell.
You have successfully submitted request SR986942076.
@Cmjohnson remember this host has MySQL down already, so you can just power it off yourself whenever you are ready for the mainboard replacement.
Thanks
Mentioned in SAL (#wikimedia-operations) [2019-02-28T15:15:42Z] <cmjohnson1> powering off db1114 to replace motherboard T214720
The motherboard has been replaced, and the iDRAC and BIOS have been updated to the latest versions. Resolving the task; reopen if there are any problems.
@Marostegui I've chosen not to reimage the server because it is currently a backup testing host, so I think it is OK if it doesn't have the right enwiki data for now. Feel free to use it for dump testing. I have brought it up, caught up replication, and started a backup process in a loop (in a screen session) to check it doesn't go down again, like in T214720#4978949
If you recover a snapshot to it, remember to kill the screen first.
No problem! let's leave the loop there for a few days to see if it crashes
Thank you!
Hi,
@Cmjohnson The remote IPMI password was out of sync. Just mentioning to add it on the to do list for motherboard changes (this and reviewing the boot order, which you did, thank you!). Not a huge issue, just a heads up to prevent people from bothering you (I was able to correct it from localhost).