Needs investigation by #DBAs first to find out why the host crashed; the causes can be mitigated later.
Event Timeline
Change 429573 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2081, crashed
Change 429573 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2081, crashed
Change 429575 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable notifications for db2081, crashed
Change 429575 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable notifications for db2081, crashed
Same error we experienced in T175973#3615656:
PWR2262: The Intel Management Engine has reported an internal system error.
2018-04-28T15:16:13-0500
Log Sequence Number: 154
Detailed Description: The Intel Management Engine is unable to utilize the PECI over DMI facility.
Recommended Action: Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.
That error was only visible via the iDRAC web interface. There was nothing in the iDRAC shell, the controller logs, syslog, remote syslog, the kernel log, etc.
@Papaul can we check and upgrade the BIOS/firmware and then do a power disconnection+connection?
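For reference, the check described in the PWR2262 recommended action (look for a PWR2264 entry shortly after the PWR2262 one) could be sketched roughly as below. This is only an illustration: the `(timestamp, message ID)` entry format, the function names and the PWR2264 sample entry are assumptions, not the real iDRAC Lifecycle Log export format or anything seen on this host.

```python
from datetime import datetime, timedelta


def parse_ts(text):
    # Timestamp layout as it appears in the PWR2262 entry, e.g. 2018-04-28T15:16:13-0500
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S%z")


def recovered_after_error(entries, error_id="PWR2262", ok_id="PWR2264",
                          window=timedelta(seconds=30)):
    """For each error entry, report whether an ok entry followed within `window`.

    `entries` is a list of (timestamp, message_id) tuples (hypothetical format).
    Returns a list of (error_timestamp, recovered) tuples.
    """
    results = []
    for ts, msg_id in entries:
        if msg_id != error_id:
            continue
        recovered = any(
            other_id == ok_id and ts <= other_ts <= ts + window
            for other_ts, other_id in entries
        )
        results.append((ts, recovered))
    return results


if __name__ == "__main__":
    crash = parse_ts("2018-04-28T15:16:13-0500")
    # Only the PWR2262 entry is known from the log; no PWR2264 entry is assumed here.
    for ts, ok in recovered_after_error([(crash, "PWR2262")]):
        action = "no action needed" if ok else "disconnect and reconnect power"
        print(ts.isoformat(), action)
```

If no PWR2264 entry shows up inside the window, the recommended action is the power disconnection+connection requested above.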
As a side note: db1098 (T193331), db2081 (T193325) and db1100 (T175973) all came from the same batch of purchases (T162159 and T162233), and all crashed with the same error.
db1100 never crashed again after doing the power drain.
Mentioned in SAL (#wikimedia-operations) [2018-04-30T14:26:34Z] <marostegui> Power off db2081 for HW maintenance - T193325
Change 429816 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2081: Reenable notifications
1- power disconnection+connection
2- update BIOS from 2.5.5 to 2.7.1
3- update iDRAC from 2.50 to 2.52
Thanks @Papaul - I have started MySQL to let it replicate for a couple of days before closing this.
I will leave the host depooled too, just in case.
MySQL and kernel have been upgraded too.
Change 429816 merged by Marostegui:
[operations/puppet@production] db2081: Reenable notifications
I am leaving a check running on wikidatawiki on some codfw hosts to prove no data was lost.
The check detected some differences, but they could be false positives, checking again.
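As a rough illustration of why a second pass helps: a row-by-row comparison between two hosts can flag rows that simply changed between the two reads, so only differences that show up again on a re-check are worth investigating. The sketch below is not the tool used for the actual check; it works on in-memory dictionaries keyed by primary key, and all function names and sample data are hypothetical. The idea that transient differences come from writes still being replicated is likewise an assumption.

```python
import hashlib


def row_checksum(row):
    # Digest of the row's values; repr() keeps None and '' distinct.
    return hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()


def diff_keys(rows_a, rows_b):
    """Return sorted primary keys whose rows differ or exist on one side only."""
    keys = set(rows_a) | set(rows_b)
    return sorted(
        k for k in keys
        if k not in rows_a
        or k not in rows_b
        or row_checksum(rows_a[k]) != row_checksum(rows_b[k])
    )


def confirmed_diffs(first_pass, second_pass):
    """Keep only keys reported by both passes; one-off hits are likely false positives."""
    return sorted(set(first_pass) & set(second_pass))


if __name__ == "__main__":
    # Hypothetical sample data: {primary_key: row_values}
    host_a = {1: ("foo", 10), 2: ("bar", 20)}
    host_b_first = {1: ("foo", 10), 2: ("bar", 21), 3: ("baz", 30)}
    host_b_second = {1: ("foo", 10), 2: ("bar", 20), 3: ("baz", 30)}

    first = diff_keys(host_a, host_b_first)
    second = diff_keys(host_a, host_b_second)
    print("first pass:", first)
    print("still different after re-check:", confirmed_diffs(first, second))
```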
Change 430894 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db2081 after crash & check
Change 430894 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db2081 after crash & check