
db1100 crashed
Closed, Resolved (Public)

Description

This is a newly purchased server (the issues are starting early this time :-)):

2017-09-15T00:56:06-0500  LOG007   The previous log entry was repeated 1 times.
2017-09-15T00:24:35-0500  SYS1003  System CPU Resetting.
2017-09-15T00:24:34-0500  SYS1000  System is turning on.
2017-09-15T00:24:25-0500  SYS1003  System CPU Resetting.
2017-09-15T00:24:25-0500  SYS1001  System is turning off.
2017-09-15T00:24:21-0500  PWR2264  The Intel Management Engine has reported normal system operation.
2017-09-15T00:24:10-0500  RAC0703  Requested system hardreset.
2017-09-15T00:24:09-0500  PWR2262  The Intel Management Engine has reported an internal system error.
2017-09-15T00:24:09-0500  CPU0000  Internal error has occurred check for additional logs.

Event Timeline

jcrespo created this object with visibility "WMF-NDA (Project)".
jcrespo created this object with edit policy "WMF-NDA (Project)".
Restricted Application added a subscriber: Aklapper.

When the AMT vulnerability was first announced, Dell was contacted and they confirmed the PowerEdge line of servers is not affected by this (only some of their desktop/workstation offerings). This seems like a standard hardware issue.

Thanks, I just wanted to double-check.

jcrespo changed the visibility from "WMF-NDA (Project)" to "All Users". Sep 15 2017, 7:28 AM
jcrespo changed the edit policy from "WMF-NDA (Project)" to "All Users".
jcrespo changed the visibility from "All Users" to "Public (No Login Required)".

@Cmjohnson @RobH I assume there is not much left to do here at the dc/provider level except keeping a record of the crash and complaining if it repeats? This is one of the latest models bought.

However, it is quite a coincidence that it crashed just hours after being pooled and taking some load: https://gerrit.wikimedia.org/r/378003 (it had been idle for weeks before). I would like to generate some CPU load to make sure this isn't repeatable.
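A minimal sketch of one way to generate CPU load for such a test, assuming Python 3 is available on the host. The names `burn_cpu` and `WORKER_CODE` are hypothetical and not part of any existing tooling; in practice a dedicated stress tool may be a better fit.

```python
import subprocess
import sys

# Code run by each worker process: spin in a busy loop until the deadline passes.
WORKER_CODE = (
    "import sys, time\n"
    "deadline = time.monotonic() + float(sys.argv[1])\n"
    "while time.monotonic() < deadline:\n"
    "    pass\n"
)


def burn_cpu(seconds, workers):
    """Run `workers` busy-loop processes for `seconds` and return their exit codes."""
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKER_CODE, str(seconds)])
        for _ in range(workers)
    ]
    # Wait for every worker to finish; an exit code of 0 means it ran to completion.
    return [proc.wait() for proc in procs]
```

For example, `burn_cpu(seconds=600, workers=os.cpu_count())` would keep every core busy for ten minutes (after `import os`); the duration and worker count here are arbitrary illustrative values.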

Load is low after being depooled.

We should update the firmware versions (as they are likely not the latest). If we open a case with Dell, it will be the first thing they recommend.

Updating the BIOS firmware requires a reboot when completed, so please advise when it would be a good time to do this.

db1100 is depooled, I have downtime'ed it for a week so the BIOS update can happen at any time.

Actually, it turns out this is already running the newest BIOS firmware:

http://www.dell.com/support/home/us/en/19/product-support/servicetag/jcb8hh2/drivers

BIOS Version: 2.4.3 is the latest version already.

The ilom isn't quite the latest, so I took the opportunity to go ahead and update it. However, it should not have anything to do with the CPU and Intel Systems Management error.

We'll need to open a trouble ticket with Dell about this. Not sure if they will want us to move the CPU and see if the issue follows, or replace the mainboard.

Some googling:

http://www.dell.com/support/manuals/uk/en/ukdhs1/dell-opnmang-sw-v8.2/eemi_13g_v1.3-v2/pwr-event-messages?guid=guid-5bc564bd-a527-4c29-828b-ff4720644565&lang=en-us

Message
    The Intel Management Engine has reported an internal system error. 
Detailed Description
    The Intel Management Engine is unable to utilize the PECI over DMI facility. 
Recommended Response Action
    Look for the PWR2264 "normal system operation" message in the Lifecycle Log after the PWR2262 entry. It may take 30 seconds for the message to be logged. If the PWR2264 message does not appear, do the following: Disconnect power from the server and reconnect power. Turn on the server. If the issue persists, contact your service provider. 
Category
    Audit (PWR = Power Usage) 
Severity
    Severity 2 (Warning)
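The recommended check above can be sketched as a small script: for each PWR2262 entry, look for a PWR2264 entry within 30 seconds. This is only an illustration under stated assumptions: the function name `recovered_after_error` is hypothetical, the entries are copied by hand from the task description rather than read from the iDRAC, and treating the 30 seconds as a hard window is an assumption.

```python
from datetime import datetime, timedelta

# Lifecycle Log entries from the task description, as (timestamp, message ID) pairs.
ENTRIES = [
    ("2017-09-15T00:24:09-0500", "CPU0000"),
    ("2017-09-15T00:24:09-0500", "PWR2262"),
    ("2017-09-15T00:24:10-0500", "RAC0703"),
    ("2017-09-15T00:24:21-0500", "PWR2264"),
]


def recovered_after_error(entries, window_seconds=30):
    """Return True if every PWR2262 entry is followed by a PWR2264 within the window."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z"), msg_id)
        for ts, msg_id in entries
    ]
    window = timedelta(seconds=window_seconds)
    for error_time, msg_id in parsed:
        if msg_id != "PWR2262":
            continue
        # Look for a "normal system operation" entry logged soon after the error.
        if not any(
            other_id == "PWR2264" and error_time <= other_time <= error_time + window
            for other_time, other_id in parsed
        ):
            return False
    return True


print(recovered_after_error(ENTRIES))
```

In the log from the description, PWR2264 was logged 12 seconds after PWR2262, so the check passes for this crash.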

We should remove all power, then restore it, push the server back into service, and see if the issue persists.

@jcrespo: Was the error output in the task description from the OS? We don't see any errors in the idrac/ilom event log.

The error in the description was from the Lifecycle Log. It gave the same description that you googled.

@jcrespo: pulled power and reset. Power on at your convenience.

Will do! Thanks. Please give me a heads-up if any maintenance happens here. Unless you tell me otherwise, I will put it back into production. We can take it down at any time later, but I do not want it down for a long time (replication keeps going forward :-)); I just need to depool it beforehand.

No, go ahead and put it back into production.

@jcrespo anything else with this? Feel free to resolve; if the issue comes back, please re-open.