
db1100 crashed
Closed, Resolved · Public

Description

This is a just-bought server (the issues have started early this time :-)):

2017-09-15T00:56:06-0500  LOG007   The previous log entry was repeated 1 times.
2017-09-15T00:24:35-0500  SYS1003  System CPU Resetting.
2017-09-15T00:24:34-0500  SYS1000  System is turning on.
2017-09-15T00:24:25-0500  SYS1003  System CPU Resetting.
2017-09-15T00:24:25-0500  SYS1001  System is turning off.
2017-09-15T00:24:21-0500  PWR2264  The Intel Management Engine has reported normal system operation.
2017-09-15T00:24:10-0500  RAC0703  Requested system hardreset.
2017-09-15T00:24:09-0500  PWR2262  The Intel Management Engine has reported an internal system error.
2017-09-15T00:24:09-0500  CPU0000  Internal error has occurred check for additional logs.

Event Timeline

jcrespo created this task. Sep 15 2017, 1:05 AM
jcrespo created this object with visibility "WMF-NDA (Project)".
jcrespo created this object with edit policy "WMF-NDA (Project)".
Restricted Application added a project: Operations. Sep 15 2017, 1:05 AM
Restricted Application added a subscriber: Aklapper.

Making NDA-only for now, based on extreme, paranoid-level caution, until @MoritzMuehlenhoff or @Cmjohnson consider whether we should worry about https://en.wikipedia.org/wiki/Intel_Active_Management_Technology#Known_vulnerabilities_and_exploits

When the AMT vulnerability was first announced, Dell was contacted and they confirmed the PowerEdge line of servers is not affected by this (only some of their desktop/workstation offerings). This seems like a standard hardware issue.

Thanks, I just wanted to double-check.

jcrespo changed the visibility from "WMF-NDA (Project)" to "All Users". Sep 15 2017, 7:28 AM
jcrespo changed the edit policy from "WMF-NDA (Project)" to "All Users".
jcrespo changed the visibility from "All Users" to "Public (No Login Required)".
jcrespo added a subscriber: RobH. Sep 15 2017, 8:07 AM

@Cmjohnson @RobH I assume there is not much left to do here at dc/provider level except keep a record of the crash and complain if it repeats? This is one of the latest models bought.

However, it is quite a coincidence that it crashed just hours after being pooled and taking some load: https://gerrit.wikimedia.org/r/378003 (it had been idle for weeks before). I would like to generate some CPU load to make sure this isn't repeatable.
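A minimal sketch of how such CPU load could be generated, assuming Python 3 is available on the host. The helper names `burn` and `load_all_cores` and the 60-second duration are hypothetical, chosen for illustration only; a dedicated stress tool would serve equally well.

```python
import multiprocessing
import time
from typing import List, Optional


def burn(seconds: float) -> int:
    """Busy-loop for roughly `seconds`; return the number of iterations done."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def load_all_cores(seconds: float, workers: Optional[int] = None) -> List[int]:
    """Run one busy-loop worker per CPU (or `workers` if given) in parallel."""
    workers = workers or multiprocessing.cpu_count()
    with multiprocessing.Pool(workers) as pool:
        return pool.map(burn, [seconds] * workers)


if __name__ == "__main__":
    # Assumption: one minute of full load is enough for a first attempt.
    counts = load_all_cores(60.0)
    print(f"{len(counts)} workers finished")
```

While this runs, the Lifecycle Log can be watched for a new CPU0000 or PWR2262 entry.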

jcrespo triaged this task as Low priority. Sep 15 2017, 8:16 AM

Low after being depooled.

jcrespo moved this task from Triage to In progress on the DBA board. Sep 15 2017, 11:13 AM
RobH added a comment. Sep 18 2017, 4:34 PM

We should update the firmware versions (as they are likely not the latest). If we open a case with Dell, it will be the first thing they recommend.

Updating the BIOS firmware requires a reboot when completed, so please advise when it would be a good time to do this.

Mentioned in SAL (#wikimedia-operations) [2017-09-18T16:46:32Z] <jynus> shuting down db1100 T175973

db1100 is depooled, I have downtime'ed it for a week so the BIOS update can happen at any time.

And it is now powered down. CC @Cmjohnson.

RobH assigned this task to Cmjohnson. Sep 18 2017, 5:57 PM

Actually, it turns out this is already running the newest BIOS firmware:

http://www.dell.com/support/home/us/en/19/product-support/servicetag/jcb8hh2/drivers

BIOS Version: 2.4.3 is the latest version already.

The ilom isn't quite the latest, so I took the opportunity to go ahead and update it. However, it should not have anything to do with the CPU and Intel Systems Management error.

We'll need to open a trouble ticket with Dell about this. Not sure if they will want us to move the CPU and see if the issue follows, or replace the mainboard.

RobH added a comment (edited). Sep 18 2017, 6:02 PM

Some googling:

http://www.dell.com/support/manuals/uk/en/ukdhs1/dell-opnmang-sw-v8.2/eemi_13g_v1.3-v2/pwr-event-messages?guid=guid-5bc564bd-a527-4c29-828b-ff4720644565&lang=en-us

Message
    The Intel Management Engine has reported an internal system error. 
Detailed Description
    The Intel Management Engine is unable to utilize the PECI over DMI facility. 
Recommended Response Action
    Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider. 
Category
    Audit (PWR = Power Usage) 
Severity
    Severity 2 (Warning)

We should remove all power, then restore it, put the server back into service, and see if the issue persists.
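The check Dell recommends (a PWR2264 entry following each PWR2262 entry) can be sketched against the Lifecycle Log entries from the task description. This is an illustration only: the function name `unrecovered_errors` is hypothetical, the log is assumed to be available as (timestamp, message ID) pairs, and the 30-second window is taken from Dell's "may take 30 seconds" wording rather than from any documented hard limit.

```python
from datetime import datetime, timedelta

# (timestamp, message ID) pairs copied from the task description.
ENTRIES = [
    ("2017-09-15T00:24:35-0500", "SYS1003"),
    ("2017-09-15T00:24:34-0500", "SYS1000"),
    ("2017-09-15T00:24:25-0500", "SYS1003"),
    ("2017-09-15T00:24:25-0500", "SYS1001"),
    ("2017-09-15T00:24:21-0500", "PWR2264"),
    ("2017-09-15T00:24:10-0500", "RAC0703"),
    ("2017-09-15T00:24:09-0500", "PWR2262"),
    ("2017-09-15T00:24:09-0500", "CPU0000"),
]


def parse(ts):
    """Parse a Lifecycle Log timestamp such as 2017-09-15T00:24:09-0500."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z")


def unrecovered_errors(entries, window=timedelta(seconds=30)):
    """Return times of PWR2262 entries with no PWR2264 within `window` after them."""
    events = [(parse(ts), code) for ts, code in entries]
    recoveries = [t for t, code in events if code == "PWR2264"]
    return [
        t
        for t, code in events
        if code == "PWR2262"
        and not any(t <= r <= t + window for r in recoveries)
    ]


if __name__ == "__main__":
    print(unrecovered_errors(ENTRIES))
```

For the entries above, PWR2264 was logged 12 seconds after PWR2262, so the function returns an empty list, meaning the Management Engine reported normal operation within the assumed window.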

RobH added a comment. Sep 18 2017, 6:04 PM

@jcrespo: Was the error output in the task description from the OS? We don't see any errors in the idrac/ilom event log.

The error in the description was from the Lifecycle Log. It gave the same description that you googled.

@jcrespo I pulled power and reset it. Power on at your convenience.

Will do! Thanks. Please give me a heads up if any maintenance happens here. Unless you tell me otherwise, I will put it back into production. We can take it down at any time later, but I do not want it down for a long time (replication keeps going forward :-)); I just need to depool it beforehand.

No, go ahead and put it back into production.

@jcrespo anything else with this? Feel free to resolve. If an issue comes back, please re-open.

jcrespo closed this task as Resolved. Sep 23 2017, 4:32 PM