db2081 crashed/rebooted, probably due to hardware failure
Closed, Resolved · Public

Description

Needs investigation by #DBAs first to determine why it crashed; mitigating the causes can follow later.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 429573 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2081, crashed

https://gerrit.wikimedia.org/r/429573

jcrespo triaged this task as High priority. Apr 28 2018, 3:45 PM

Change 429573 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2081, crashed

https://gerrit.wikimedia.org/r/429573

Change 429575 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable notifications for db2081, crashed

https://gerrit.wikimedia.org/r/429575

Change 429575 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable notifications for db2081, crashed

https://gerrit.wikimedia.org/r/429575

Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a subscriber: Papaul.

Same error we experienced at: T175973#3615656

PWR2262: The Intel Management Engine has reported an internal system error.
 2018-04-28T15:16:13-0500
Log Sequence Number: 154
Detailed Description:
The Intel Management Engine is unable to utilize the PECI over DMI facility.
Recommended Action:
Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.

That error was only visible via the iDRAC web interface. Nothing on the iDRAC shell, and nothing in the controller logs, syslog, remote syslog, kernel log, etc.
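
For reference, the recommended action in the PWR2262 entry above (look for a PWR2264 message within roughly 30 seconds) could be checked with something like the following. This is only a rough sketch: it assumes the Lifecycle Log entries have already been exported by hand into `(message_id, timestamp)` pairs, and the function name `unrecovered_errors` and that input shape are made up for illustration, not part of any iDRAC tooling.

```python
from datetime import datetime, timedelta

# Timestamp layout as shown in the Lifecycle Log entry, e.g. 2018-04-28T15:16:13-0500
TS_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def unrecovered_errors(entries, window_seconds=30):
    """Return timestamps of PWR2262 entries with no PWR2264 follow-up.

    entries: list of (message_id, timestamp_string) tuples in log order.
    This input shape is an assumption made for this sketch.
    """
    parsed = [(mid, datetime.strptime(ts, TS_FORMAT)) for mid, ts in entries]
    window = timedelta(seconds=window_seconds)
    missing = []
    for i, (mid, ts) in enumerate(parsed):
        if mid != "PWR2262":
            continue
        # Look only at later entries that fall inside the time window.
        recovered = any(
            later_id == "PWR2264" and timedelta(0) <= later_ts - ts <= window
            for later_id, later_ts in parsed[i + 1:]
        )
        if not recovered:
            missing.append(ts.isoformat())
    return missing
```

With only the entry from this task as input, `unrecovered_errors([("PWR2262", "2018-04-28T15:16:13-0500")])` returns `['2018-04-28T15:16:13-05:00']`, i.e. the case where the power disconnection is the next step.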

@Papaul can we check and upgrade the BIOS/firmware and then do a power disconnection + reconnection?

Okay will check that on Monday.

As a side note, db1098 (T193331), db2081 (T193325) and db1100 (T175973) all came from the same batch of purchases (T162159 and T162233), and all crashed with the same errors:

PWR2262: The Intel Management Engine has reported an internal system error.
 2018-04-28T15:16:13-0500
Log Sequence Number: 154
Detailed Description:
The Intel Management Engine is unable to utilize the PECI over DMI facility.
Recommended Action:
Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.

db1100 never crashed again after doing the power drain.

Mentioned in SAL (#wikimedia-operations) [2018-04-30T14:26:34Z] <marostegui> Power off db2081 for HW maintenance - T193325

Change 429816 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2081: Reenable notifications

https://gerrit.wikimedia.org/r/429816

1- power disconnection+connection
2- update BIOS from 2.5.5 to 2.7.1
3- update IDRAC from 2.50 to 2.52

Thanks @Papaul - I have started MySQL to let it replicate for a couple of days before closing this.
I will leave the host depooled too, just in case.

MySQL and kernel have been upgraded too.

Change 429816 merged by Marostegui:
[operations/puppet@production] db2081: Reenable notifications

https://gerrit.wikimedia.org/r/429816

I am leaving a check running on wikidatawiki on some codfw hosts to prove no data was lost.

The check detected some differences, but they could be false positives; checking again.
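
To illustrate the kind of comparison involved (this is not the tool actually used here, which is not named in this task), a minimal sketch of a chunked checksum comparison between two hosts is below. It assumes the rows have already been fetched from each host sorted by an integer primary key in the first column; `chunk_checksums` and `differing_chunks` are hypothetical names.

```python
import hashlib


def chunk_checksums(rows, chunk_size=1000):
    """Checksum rows in primary-key ranges of chunk_size.

    rows: iterable of tuples whose first element is an integer primary
    key, sorted by that key (an assumption of this sketch).
    """
    digests = {}
    for row in rows:
        chunk = row[0] // chunk_size  # chunk id derived from the key range
        digests.setdefault(chunk, hashlib.sha256()).update(
            repr(row).encode("utf-8")
        )
    return {chunk: d.hexdigest() for chunk, d in digests.items()}


def differing_chunks(rows_a, rows_b, chunk_size=1000):
    """Return the sorted chunk ids whose checksums differ between hosts."""
    a = chunk_checksums(rows_a, chunk_size)
    b = chunk_checksums(rows_b, chunk_size)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

One plausible source of false positives in a comparison like this is reading the two hosts at slightly different replication positions, so re-running it only on the chunks returned by `differing_chunks` is a cheap way to confirm whether a difference is real.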

Was it a false positive in the end?

Marostegui lowered the priority of this task from High to Medium. May 4 2018, 8:29 AM

Yes, a second run confirmed there were no differences. I will repool the server now.

Change 430894 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db2081 after crash & check

https://gerrit.wikimedia.org/r/430894

Change 430894 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db2081 after crash & check

https://gerrit.wikimedia.org/r/430894

Vvjjkkii renamed this task from db2081 crashed/rebooted, probably due to hardware failure to 60daaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Marostegui as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 60daaaaaaa to db2081 crashed/rebooted, probably due to hardware failure. Jul 1 2018, 7:29 AM
Marostegui closed this task as Resolved.
Marostegui claimed this task.
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.
Marostegui added subscribers: Aklapper, GerritBot.