
mw2182 crash
Closed, Resolved (Public)

Description

On boot, blocked on:

Alert!  System fatal error during previous boot
Cache and Core Box, Last Level Cache Error
Press F1 to continue, F2 for setup.

Logs:

2018-05-16T14:18:15-0500  CPU0000  CPU 1 has an internal error (IERR).
2018-05-16T14:18:14-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:18:14-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:18:14-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:18:13-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:18:13-0500  CPU0704  CPU 1 machine check error detected.
2018-05-16T14:17:39-0500  SYS1003  System CPU Resetting.
2018-05-16T14:17:11-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:11-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:11-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:10-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:10-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:09-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:09-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:09-0500  CPU9000  An OEM diagnostic event occurred.
2018-05-16T14:17:08-0500  CPU0704  CPU 1 machine check error detected.


Log Sequence Number: 140
Detailed Description:
System event log and OS logs may indicate that the exception is external to the processor.
Recommended Action:
1) Check system and operating system logs for exceptions. If no exceptions are found, continue.
2) Turn system off and remove input power for one minute. Re-apply input power and turn system on.
3) Make sure the processor is seated correctly.
4) If the issue still persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
Comment: root
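
For readers skimming the event log above, here is a minimal sketch that tallies the entries by message ID and lists the non-repetitive ones in chronological order. The (timestamp, ID) pairs are transcribed from the log. Treating CPU9000 ("An OEM diagnostic event occurred.") as background noise is an assumption made only for this illustration, and the names `EVENTS`, `tally` and `notable` are hypothetical.

```python
from collections import Counter

# Message text for each ID, as it appears in the log above.
MESSAGES = {
    "CPU0000": "CPU 1 has an internal error (IERR).",
    "CPU0704": "CPU 1 machine check error detected.",
    "CPU9000": "An OEM diagnostic event occurred.",
    "SYS1003": "System CPU Resetting.",
}

# (timestamp, message ID) pairs transcribed from the log, newest first.
EVENTS = [
    ("2018-05-16T14:18:15-0500", "CPU0000"),
    ("2018-05-16T14:18:14-0500", "CPU9000"),
    ("2018-05-16T14:18:14-0500", "CPU9000"),
    ("2018-05-16T14:18:14-0500", "CPU9000"),
    ("2018-05-16T14:18:13-0500", "CPU9000"),
    ("2018-05-16T14:18:13-0500", "CPU0704"),
    ("2018-05-16T14:17:39-0500", "SYS1003"),
    ("2018-05-16T14:17:11-0500", "CPU9000"),
    ("2018-05-16T14:17:11-0500", "CPU9000"),
    ("2018-05-16T14:17:11-0500", "CPU9000"),
    ("2018-05-16T14:17:10-0500", "CPU9000"),
    ("2018-05-16T14:17:10-0500", "CPU9000"),
    ("2018-05-16T14:17:09-0500", "CPU9000"),
    ("2018-05-16T14:17:09-0500", "CPU9000"),
    ("2018-05-16T14:17:09-0500", "CPU9000"),
    ("2018-05-16T14:17:08-0500", "CPU0704"),
]


def tally(events):
    """Count how many times each message ID occurs."""
    return Counter(msg_id for _, msg_id in events)


def notable(events, noise_id="CPU9000"):
    """Return events oldest first, skipping the assumed-noise ID."""
    return [event for event in reversed(events) if event[1] != noise_id]


if __name__ == "__main__":
    for timestamp, msg_id in notable(EVENTS):
        print(timestamp, msg_id, MESSAGES[msg_id])
```

Read oldest first, the filtered sequence is: machine check on CPU 1, CPU reset, a second machine check on CPU 1, then the IERR.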

Event Timeline

I did a scap pull, but this probably needs more hardware investigation. I'm not removing/depooling it, as there is no danger of problems on codfw for now.

The server has been out of warranty since January. @Papaul: Do we have any decommissioned servers from which we could swap the broken CPU?

@MoritzMuehlenhoff Yes, we do have some decommissioned servers. This could also be a bad main board. What we can do first is swap the CPU positions. Since the error is showing on CPU1, we can move CPU1 to CPU0 and CPU0 to CPU1. If we then get the error on CPU0, the CPU that was in CPU1 is bad; if we still get the error on CPU1, we need to replace the main board.

Let me know if you have any questions.
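
The swap test described in this comment boils down to a small decision rule, sketched below. The function name `diagnose_after_swap` and its return strings are hypothetical. The third outcome (no error recurs after the swap) is not part of the comment and is added here as an assumption so the function covers every case.

```python
from typing import Optional


def diagnose_after_swap(failing_socket: int, socket_after_swap: Optional[int]) -> str:
    """Interpret where the error shows up once the two CPUs have traded sockets.

    failing_socket:    socket that reported the error before the swap (0 or 1).
    socket_after_swap: socket that reports the error after the swap,
                       or None if no error has recurred (assumed extra case).
    """
    if socket_after_swap is None:
        # Not covered by the comment: nothing to conclude yet, keep monitoring.
        return "inconclusive: no error recurred"
    if socket_after_swap == failing_socket:
        # The error stayed with the socket, so the board is the suspect.
        return "replace the main board"
    # The error followed the chip into the other socket.
    return f"replace the CPU originally in socket {failing_socket}"


if __name__ == "__main__":
    # In this task the error was first reported on CPU1.
    print(diagnose_after_swap(1, 0))     # error moved with the chip
    print(diagnose_after_swap(1, 1))     # error stayed on the socket
    print(diagnose_after_swap(1, None))  # nothing recurred
```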

Sounds good to me, let's do that! I have marked downtime for mw2182 and powered it off.

@MoritzMuehlenhoff

  • Update iDRAC and BIOS
  • Clean log
  • Swap CPU1 with CPU0

Let's see what happens.

Thanks, I've repooled the server. I'll keep an eye on it throughout the week to see whether it now holds up.

Vvjjkkii renamed this task from mw2182 crash to 8ucaaaaaaa. Jul 1 2018, 1:09 AM
Vvjjkkii removed MoritzMuehlenhoff as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
ArielGlenn renamed this task from 8ucaaaaaaa to mw2182 crash. Jul 1 2018, 1:07 PM
ArielGlenn assigned this task to MoritzMuehlenhoff.
ArielGlenn lowered the priority of this task from High to Medium.
ArielGlenn updated the task description.
ArielGlenn added a subscriber: Aklapper.

The server has been running fine for a while; closing the task.