
db1098 crashed and got rebooted
Closed, Resolved · Public

Description

I didn't see any relevant log on the host. We've depooled the host from all mediawiki sections: https://gerrit.wikimedia.org/r/#/c/429642/

Here are the relevant hardware logs from racadm getraclog:

--------------------------------------------------------------------------------
SeqNumber       = 167
Message ID      = PWR2262
Category        = Audit
AgentID         = iDRAC
Severity        = Warning
Timestamp       = 2018-04-29 00:13:17
Message         = The Intel Management Engine has reported an internal system error.
Message Arg   1 = Error Code = [0x03]
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 168
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2018-04-29 00:13:18
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
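
For anyone who wants to scan this kind of output across several hosts, here is a minimal sketch of a parser for the record layout shown above. It assumes the output has already been saved as text; the function name `parse_raclog` and the `SAMPLE` constant are hypothetical and not part of any Dell or Wikimedia tooling.

```python
import re

# A record separator is a line made only of dashes.
SEPARATOR = re.compile(r"^-{10,}\s*$")
# "Key = Value": the key runs up to the first "=", the value is the rest.
FIELD = re.compile(r"^(?P<key>[^=]+?)\s*=\s*(?P<value>.*)$")


def parse_raclog(text):
    """Split getraclog-style text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            current[match.group("key").strip()] = match.group("value").strip()
    if current:
        records.append(current)
    return records


# Sample copied from the two records in the task description.
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber       = 167
Message ID      = PWR2262
Severity        = Warning
Timestamp       = 2018-04-29 00:13:17
Message         = The Intel Management Engine has reported an internal system error.
Message Arg   1 = Error Code = [0x03]
--------------------------------------------------------------------------------
SeqNumber       = 168
Message ID      = RAC0703
Severity        = Information
Timestamp       = 2018-04-29 00:13:18
Message         = Requested system hardreset.
--------------------------------------------------------------------------------
"""

records = parse_raclog(SAMPLE)
me_errors = [r for r in records if r["Message ID"] == "PWR2262"]
for record in me_errors:
    print(record["Timestamp"], record["Message"])
```

Filtering on `Message ID` makes it easy to check whether other hosts from the same batch have logged the same PWR2262 entry.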

Event Timeline

Volans created this task. Apr 28 2018, 11:55 PM
Restricted Application added a subscriber: Aklapper. Apr 28 2018, 11:55 PM
Volans triaged this task as High priority. Apr 28 2018, 11:57 PM

I've downtimed db1098 on Icinga until Wed mid EU day and disabled notifications.

This is the same error as db2081 earlier today: T193325

The Intel Management Engine has recovered the ability to utilize the PECI over DMI facility.

If the PWR2262 "internal system error" was logged within the previous 30 seconds, ignore the message. No response action required.

db1100 suffered it too (T175973#3615656), and it is from the same batch as db1098:

2018-04-28T23:28:04-0500  LOG007   The previous log entry was repeated 1 times.
2018-04-29T00:13:43-0500  SYS1003  System CPU Resetting.
2018-04-29T00:13:42-0500  SYS1000  System is turning on.
2018-04-29T00:13:34-0500  SYS1003  System CPU Resetting.
2018-04-29T00:13:34-0500  SYS1001  System is turning off.
2018-04-29T00:13:30-0500  PWR2264  The Intel Management Engine has reported normal system operation.
2018-04-29T00:13:18-0500  RAC0703  Requested system hardreset.
2018-04-29T00:13:17-0500  PWR2262  The Intel Management Engine has reported an internal system error.
2018-04-29T00:13:17-0500  CPU0000  Internal error has occurred check for additional logs.

Marostegui added a subscriber: Cmjohnson.

@Cmjohnson can we do the same thing we did to db1100 (which has not crashed again since)?

  • Check if there are BIOS/firmware updates available
  • Power drain the host (T175973#3617612)

Let's try to do this on Monday if possible, as Tuesday is a public holiday and we shouldn't leave the host out and off for so long.

Marostegui moved this task from Triage to In progress on the DBA board. Apr 29 2018, 6:45 AM
Marostegui added a project: ops-eqiad.

Change 429652 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1098.yaml: Disable notifications

https://gerrit.wikimedia.org/r/429652

Change 429652 merged by Marostegui:
[operations/puppet@production] db1098.yaml: Disable notifications

https://gerrit.wikimedia.org/r/429652

I have started MySQL on db1098 to:

  • Make sure nothing is corrupted and replication can flow
  • Avoid letting the host fall behind on replication for 2 days.

I will stop MySQL before @Cmjohnson wakes up, so it is all ready for him to take over on Monday without waiting on us to stop MySQL again.

It will remain depooled till maintenance has happened on it.

As a side note: db1098 (T193331), db2081 (T193325) and db1100 (T175973) all came from the same batch of purchases (T162159 and T162233), and all crashed with the same errors.

PWR2262: The Intel Management Engine has reported an internal system error.
 2018-04-28T15:16:13-0500
Log Sequence Number: 154
Detailed Description:
The Intel Management Engine is unable to utilize the PECI over DMI facility.
Recommended Action:
Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider.
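
The recommended action above boils down to a simple check: for each PWR2262 entry, look for a PWR2264 entry within the following 30 seconds. A minimal sketch of that check, assuming the Lifecycle Log entries are already available as (timestamp, message ID) pairs; the function name `unrecovered_me_errors` and the `ENTRIES` list are hypothetical, with the values copied from the db1098 log earlier in this task.

```python
from datetime import datetime, timedelta

# The vendor text says the recovery message may take up to 30 seconds.
RECOVERY_WINDOW = timedelta(seconds=30)


def unrecovered_me_errors(entries, window=RECOVERY_WINDOW):
    """Return timestamps of PWR2262 entries not followed by PWR2264 within `window`."""
    parsed = sorted(
        (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z"), message_id)
        for ts, message_id in entries
    )
    recoveries = [ts for ts, message_id in parsed if message_id == "PWR2264"]
    return [
        ts
        for ts, message_id in parsed
        if message_id == "PWR2262"
        and not any(ts <= rec <= ts + window for rec in recoveries)
    ]


# (timestamp, message ID) pairs from the db1098 Lifecycle Log.
ENTRIES = [
    ("2018-04-29T00:13:17-0500", "CPU0000"),
    ("2018-04-29T00:13:17-0500", "PWR2262"),
    ("2018-04-29T00:13:18-0500", "RAC0703"),
    ("2018-04-29T00:13:30-0500", "PWR2264"),
    ("2018-04-29T00:13:34-0500", "SYS1001"),
]

unrecovered = unrecovered_me_errors(ENTRIES)
print("PWR2262 entries without recovery:", len(unrecovered))
```

On db1098 the PWR2264 message arrived 13 seconds after PWR2262, so by the vendor's criterion the Management Engine recovered, even though the host had already been hard reset by then.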

db1100 never crashed again after doing the power drain.

Mentioned in SAL (#wikimedia-operations) [2018-04-30T13:26:04Z] <marostegui> Stop MySQL on db1098 - T193331

Mentioned in SAL (#wikimedia-operations) [2018-04-30T13:30:17Z] <marostegui> Poweroff db1098 for HW maintenance - T193331

Drained flea power, updated BIOS and iDRAC firmware to the versions below, and powered back on:

BIOS Version 2.7.1
Firmware Version 2.52.52.52

Thank you @Cmjohnson
I have started MySQL to let it catch up and replicate for a day before repooling it.

MySQL and kernel have been upgraded too

This can be merged once we are ready to repool: https://gerrit.wikimedia.org/r/#/c/430032/

The s6 tables listed in main_tables.txt have been checked and no errors were found. Now checking the s7 instance:

$ cat s7.dblist | while read db; do cat main_tables.txt | while read table index; do echo "$db.$table..."; ./compare.py $db $table $index db1062.eqiad.wmnet db1098.eqiad.wmnet:3317 db1090.eqiad.wmnet:3317 || break; done; done
arwiki.archive...
Starting comparison between id 1 and 3593418
2018-05-02T10:38:08.069550: row id 990001/3593418, ETA: 00m31s, 0 chunk(s) found different
2018-05-02T10:38:19.449771: row id 1990001/3593418, ETA: 00m18s, 0 chunk(s) found different
2018-05-02T10:38:29.727766: row id 2990001/3593418, ETA: 00m06s, 0 chunk(s) found different
Execution ended, no differences found.
arwiki.logging...
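
For readers unfamiliar with this kind of check, the output above reports progress per range of row ids and counts "chunks found different". The following is only an illustrative sketch of chunk-wise comparison on in-memory data, not the actual implementation of compare.py; the function names, the chunk size and the use of SHA-256 are all assumptions.

```python
import hashlib


def chunk_digest(rows, start, end):
    """Hash all rows whose id is in [start, end), in id order."""
    digest = hashlib.sha256()
    for row_id in range(start, end):
        if row_id in rows:
            digest.update(repr((row_id, rows[row_id])).encode("utf-8"))
    return digest.hexdigest()


def differing_chunks(reference, candidate, min_id, max_id, chunk_size=1000):
    """Return (first_id, last_id) for every chunk whose digests differ."""
    different = []
    for start in range(min_id, max_id + 1, chunk_size):
        end = min(start + chunk_size, max_id + 1)
        if chunk_digest(reference, start, end) != chunk_digest(candidate, start, end):
            different.append((start, end - 1))
    return different


# Toy tables keyed by primary key; one row is altered on the candidate.
reference = {i: ("title%d" % i,) for i in range(1, 2501)}
candidate = dict(reference)
candidate[1500] = ("corrupted",)

print(differing_chunks(reference, dict(reference), 1, 2500))
print(differing_chunks(reference, candidate, 1, 2500))
```

Comparing per-chunk digests instead of individual rows keeps the amount of data exchanged small, and a mismatch narrows the problem down to a bounded id range that can then be inspected row by row.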
Cmjohnson moved this task from Backlog to Blocked on the ops-eqiad board. May 2 2018, 2:34 PM

Change 430391 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Repool db1098 after being checked for consistancy issues

https://gerrit.wikimedia.org/r/430391

Change 430391 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Repool db1098 after being checked for consistancy issues

https://gerrit.wikimedia.org/r/430391

No consistency issues found on s6 and s7, repooling.

jcrespo added a subscriber: RobH. May 2 2018, 3:26 PM

@RobH the specific incident for this host has been taken care of. Should we centralize the recurring issue into a separate task? If yes, I would close this as resolved for now.

jcrespo moved this task from In progress to Done on the DBA board. May 2 2018, 3:26 PM
jcrespo lowered the priority of this task from High to Normal. May 2 2018, 3:44 PM
Marostegui added a comment. Edited May 2 2018, 4:50 PM

I vote for closing this and if it happens again on any other host - open a general task and a case with the vendor.

db1100 crashed half a year ago and never again after doing the upgrade (which I still think is a consequence of it).

Marostegui closed this task as Resolved. May 4 2018, 5:34 AM

Let's create a general task if this happens again and consider this fixed for now and for this host.

Vvjjkkii renamed this task from db1098 crashed and got rebooted to 00daaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Cmjohnson as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Marostegui renamed this task from 00daaaaaaa to db1098 crashed and got rebooted. Jul 1 2018, 6:22 PM
Marostegui closed this task as Resolved.
Marostegui assigned this task to Cmjohnson.
Marostegui lowered the priority of this task from High to Normal.
Marostegui updated the task description.
Marostegui added subscribers: Aklapper, GerritBot.