
es2015 crashed with no OS logs (kernel logs or other software ones) - it suddenly went down
Closed, Resolved (Public)

Description

<robh> Record:      2
<robh> Date/Time:   10/10/2016 03:52:20
<robh> Source:      system
<robh> Severity:    Critical
<robh> Description: CPU 1 has an internal error (IERR).
<robh> so thats not good

The issue with those servers is not solved.

Event Timeline

Change 315042 had a related patch set uploaded (by Jcrespo):
mariadb: Promote es2016 as the new es2 master of codfw

https://gerrit.wikimedia.org/r/315042

Change 315043 had a related patch set uploaded (by Jcrespo):
mariadb: Depool es2015 (master, crashed); replaced by es2016

https://gerrit.wikimedia.org/r/315043

Change 315042 merged by Jcrespo:
mariadb: Promote es2016 as the new es2 master of codfw

https://gerrit.wikimedia.org/r/315042

Change 315043 merged by Jcrespo:
mariadb: Depool es2015 (master, crashed); replaced by es2016

https://gerrit.wikimedia.org/r/315043

Seems the Dell tech is asking Papaul for hardware logs:

Syslog shows nothing for the hard crash:

Oct 10 03:29:34 es2015 puppet-agent[172665]: Retrieving pluginfacts
Oct 10 03:29:34 es2015 puppet-agent[172665]: Retrieving plugin
Oct 10 03:29:34 es2015 puppet-agent[172665]: Loading facts
Oct 10 03:29:39 es2015 puppet-agent[172665]: Caching catalog for es2015.codfw.wmnet
Oct 10 03:29:40 es2015 puppet-agent[172665]: Applying configuration version '1476069921'
Oct 10 03:29:54 es2015 puppet-agent[172665]: Finished catalog run in 14.54 seconds
Oct 10 03:35:01 es2015 CRON[174405]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 10 03:45:01 es2015 CRON[175097]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 10 03:57:04 es2015 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1483" x-info="http://www.rsyslog.com"] start
Oct 10 03:57:04 es2015 systemd[1]: Started Load/Save Random Seed.
Oct 10 03:57:04 es2015 systemd[1]: Started Apply Kernel Variables.
Oct 10 03:57:04 es2015 systemd[1]: Started Create Static Device Nodes in /dev.
Oct 10 03:57:04 es2015 systemd[1]: Starting udev Kernel Device Manager...
Oct 10 03:57:04 es2015 systemd[1]: Starting Local File Systems (Pre).
Oct 10 03:57:04 es2015 systemd[1]: Reached target Local File Systems (Pre).
Oct 10 03:57:04 es2015 systemd[1]: Started udev Kernel Device Manager.
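The silence between the last cron entry and the rsyslogd restart is the only trace the OS left of the crash. A minimal sketch for locating such a gap in a syslog excerpt, assuming the classic "Mon DD HH:MM:SS" prefix, an English locale for month names, and a known year (syslog lines carry none); `largest_gap` is a hypothetical helper name, not an existing tool:

```python
from datetime import datetime

def largest_gap(lines, year=2016):
    """Return (gap, before, after) for the widest gap between consecutive
    syslog timestamps, or None if fewer than two timestamps parse."""
    stamps = []
    for line in lines:
        # The first 15 characters hold "Oct 10 03:29:34"; the year is assumed.
        try:
            stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue  # not a timestamped syslog line
    if len(stamps) < 2:
        return None
    return max(((b - a, a, b) for a, b in zip(stamps, stamps[1:])), key=lambda t: t[0])

excerpt = [
    "Oct 10 03:29:54 es2015 puppet-agent[172665]: Finished catalog run in 14.54 seconds",
    "Oct 10 03:35:01 es2015 CRON[174405]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)",
    "Oct 10 03:45:01 es2015 CRON[175097]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)",
    "Oct 10 03:57:04 es2015 systemd[1]: Started Load/Save Random Seed.",
]
gap, before, after = largest_gap(excerpt)
print(f"No log entries for {gap} between {before.time()} and {after.time()}")
```

For the excerpt above, the widest gap is the 12 minutes 3 seconds between 03:45:01 and 03:57:04, which brackets the 03:52:20 IERR event recorded by the hardware logs.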

Original task description shows the output from racadm getsel:

Record: 2
Date/Time: 10/10/2016 03:52:20
Source: system
Severity: Critical
Description: CPU 1 has an internal error (IERR).
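For keeping a history of these events across the batch of servers, the getsel text can be turned into records. A minimal sketch, assuming the output keeps the "Key: value" layout shown above and that each record starts with a "Record:" line; `parse_sel` and `critical` are hypothetical helper names:

```python
def parse_sel(text):
    """Split 'racadm getsel'-style text into a list of dicts, one per record."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and prompts such as "/admin1->"
        # Split on the first colon only, so "03:52:20" stays inside the value.
        key, value = line.split(":", 1)
        key = key.strip()
        if key == "Record" and current:
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records

def critical(records):
    """Keep only the records whose Severity is Critical."""
    return [r for r in records if r.get("Severity") == "Critical"]

sample = """\
Record: 2
Date/Time: 10/10/2016 03:52:20
Source: system
Severity: Critical
Description: CPU 1 has an internal error (IERR).
"""
for rec in critical(parse_sel(sample)):
    print(rec["Date/Time"], rec["Description"])
```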

From the IDRAC 8 web console:

Log:
Normal	Mon Feb 08 2016 16:08:44	Log cleared.
Critical	Mon Oct 10 2016 03:52:20	CPU 1 has an internal error (IERR).

Lifecycle Log:

2016-10-10T03:52:20-0500  CPU0000  CPU 1 has an internal error (IERR).
2016-10-10T03:52:20-0500  LOG007   The previous log entry was repeated 1 times.
2016-10-10T03:52:06-0500  SYS1003  System CPU Resetting.
2016-10-10T03:52:04-0500  SYS1000  System is turning on.
2016-10-10T03:51:57-0500  RAC0703  Requested system hardreset.
2016-10-10T03:51:56-0500  SYS1003  System CPU Resetting.
2016-10-10T03:51:56-0500  SYS1001  System is turning off.
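The Lifecycle Log is listed newest first, which makes the sequence harder to follow. A minimal sketch that reorders the entries above chronologically, assuming they have already been split into (timestamp, code, message) tuples; the LOG007 repeat marker is left out, and `chronological` is a hypothetical helper name:

```python
from datetime import datetime

entries = [
    ("2016-10-10T03:52:20-0500", "CPU0000", "CPU 1 has an internal error (IERR)."),
    ("2016-10-10T03:52:06-0500", "SYS1003", "System CPU Resetting."),
    ("2016-10-10T03:52:04-0500", "SYS1000", "System is turning on."),
    ("2016-10-10T03:51:57-0500", "RAC0703", "Requested system hardreset."),
    ("2016-10-10T03:51:56-0500", "SYS1003", "System CPU Resetting."),
    ("2016-10-10T03:51:56-0500", "SYS1001", "System is turning off."),
]

def chronological(entries):
    """Return entries oldest first, with timestamps parsed as aware datetimes."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z"), code, msg)
        for ts, code, msg in entries
    ]
    # The input is newest first; reversing before the stable sort keeps
    # same-second entries in their original logging order.
    return sorted(reversed(parsed), key=lambda e: e[0])

for when, code, msg in chronological(entries):
    print(when.strftime("%H:%M:%S"), code, msg)
```

Read in this order, the log starts with the system turning off at 03:51:56 and ends with the IERR entry at 03:52:20.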

CPU error detail:

CPU0000: CPU 1 has an internal error (IERR).
 2016-10-10T03:52:20-0500
Log Sequence Number: 236
Detailed Description:
System event log and OS logs may indicate that the exception is external to the processor.
Recommended Action:
Review System Event Log and Operating System Logs. If the issue persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
Comment: root
jcrespo renamed this task from es2015 crashed with no logs to es2015 crashed with no OS logs (kernel logs or other software ones) - it suddenly went down. Oct 11 2016, 4:37 PM

Enterprise Service Request

Hello Papaul,

Thank you for contacting Dell! This issue has been assigned to Service Request 937591198.

I will be the best person to contact until your issue is resolved.

If you need to contact me in the future please respond to this e-mail.

Thank you for choosing Dell,

Seymour Fletcher
Enterprise Technical Support Specialist

Dell | Enterprise

Toll Free 1-800-822-8965 ( 1-800-387-5757 Dell Canada )

My work schedule is Mon-Fri 9:30AM - 6:30PM EST

Our support queue is open 8am - 8pm Monday - Friday Eastern Standard Time

Customer feedback | How am I doing? Please contact my manager US_ENT_Manager@Dell.com

BIOS: 2.2.5

http://downloads.dell.com/FOLDER03917193M/1/BIOS_PFWCY_WN32_2.2.5.EXE

iDRAC-LC fw:

iDRAC:

http://downloads.dell.com/FOLDER03526198M/2/iDRAC-with-Lifecycle-Controller_Firmware_5GCHC_WN32_2.30.30.30_A00.EXE

Fletcher Seymour
Enterprise Technical Support Specialist
Dell | Enterprise
Toll Free 1-800-822-8965 (1-800-387-5757 Dell Canada)
Office Hours: Mon-Fri 9:30AM - 6:30PM EST

Today, October 11th, I called Dell Support about this issue.

Call time: 10:52 am
Call duration: 54 min 3 sec

I told the engineer that this is the second time we are calling about the same issue: back in August we had the same problem on es2017 and es2019. On those systems the memory was replaced, but the problem came back afterwards, and we had to run a complete hardware firmware upgrade on all 3rd generation servers (R730), including es2015. Unfortunately, one of those systems crashed yesterday.

I sent the log file that we had. According to the engineer, it does not give enough information to determine what caused the server to crash, so he asked us to provide an OS log file. I chatted with Rob and Jaime on IRC and asked for any OS log file, but we didn't find a helpful one.

I also gave him the BIOS version on the server, and he sent me a link to update the BIOS. I also brought up replacing the motherboard and the CPU; he said that this error message alone is not enough to go ahead and replace those parts.

Chris,

I'll escalate this to our account team, but can you dispatch over a replacement mainboard and CPU in the meantime? Past thread history shows we've already rolled bios updates, so I'm not convinced their idea of doing it again is any good.

Please see if you can use the self-dispatch to send over those two parts, and then assign back to me for followup with our sales reps.

Thanks!

es2014 froze this morning; it complained about the filesystem. The hardware logs confirm:

2016-10-17T08:39:42-0500  USR0032  The session for root from 208.80.154.149 using SSH is logged off.
2016-10-17T08:08:49-0500  PDR8     Disk 11 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:49-0500  PDR8     Disk 10 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:49-0500  PDR8     Disk 9 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:49-0500  PDR8     Disk 8 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:49-0500  PDR8     Disk 7 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:49-0500  PDR8     Disk 6 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:48-0500  PDR8     Disk 5 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:48-0500  PDR8     Disk 4 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:48-0500  PDR8     Disk 3 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:48-0500  PDR8     Disk 2 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:48-0500  PDR8     Disk 1 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:08:48-0500  PDR8     Disk 0 in Backplane 1 of Integrated RAID Controller 1 is inserted.
2016-10-17T08:05:19-0500  USR0030  Successfully logged in using root, from 208.80.154.149 and SSH.

Unless someone extracted all disks at the same time without creating logs, I believe this to be a hardware controller issue.
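The reasoning above rests on all twelve disks being reported as inserted within about one second. A minimal sketch of that check, assuming the Lifecycle Log entries are available as (timestamp, code, message) tuples; the data below is rebuilt from the log above (disks 0-5 at 08:08:48, disks 6-11 at 08:08:49), and `insertion_burst` is a hypothetical helper name:

```python
import re
from datetime import datetime

def insertion_burst(entries):
    """Return (disk_numbers, span) for PDR8 'is inserted' events.

    span is the time between the first and last insertion event,
    or None when there are no such events.
    """
    hits = []
    for ts, code, msg in entries:
        m = re.match(r"Disk (\d+) in .* is inserted\.", msg)
        if code == "PDR8" and m:
            when = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z")
            hits.append((when, int(m.group(1))))
    if not hits:
        return [], None
    times = [t for t, _ in hits]
    return sorted(d for _, d in hits), max(times) - min(times)

# Entries rebuilt from the es2014 log in this comment.
entries = [
    (
        f"2016-10-17T08:08:{48 if d < 6 else 49}-0500",
        "PDR8",
        f"Disk {d} in Backplane 1 of Integrated RAID Controller 1 is inserted.",
    )
    for d in range(12)
]
disks, span = insertion_burst(entries)
print(f"{len(disks)} disks reported inserted within {span}")
```

Twelve insertions spread over a single second are hard to explain by someone physically handling disks, which is consistent with suspecting the controller or backplane rather than the individual drives.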

jcrespo raised the priority of this task from Medium to High. Oct 17 2016, 10:01 AM

@jcrespo is this es2014 or es2015? If it is es2014, can you please make a separate task for that?

Thanks.

It is es2014. I will, but I wanted to flag it here: it is part of the same batch that is creating issues.

@Papaul please swap cpu1 to cpu2 and clear syslog (racadm clrsel).

We will need to wait and see if it presents itself again. Unfortunately, the Dell tech you spoke to is right. He can't just dispatch parts, and neither can I, until we've done some troubleshooting.

Syslog as of 10/17 (adding this for history)


Record: 2
Date/Time: 10/10/2016 03:52:20
Source: system
Severity: Critical

Description: CPU 1 has an internal error (IERR).

/admin1->

Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:40:20Z] <marostegui> Shutting down es2015 for hardware maintenance - T147769

Below are the steps taken to troubleshoot this issue.

1 - Swapped CPU 1 with CPU 2
2 - Updated BIOS from 2.1.7 to 2.2.5
3 - Updated iDRAC firmware from 2.21 to
4 - Cleared syslog

Leaving this task open for now.

MySQL is back up and replicating
Thanks @Papaul

Thanks @Papaul, let me know if the error returns and where.

Papaul lowered the priority of this task from High to Medium. Oct 20 2016, 6:12 PM

Maybe this server still needs a reboot, as it has been having the icinga warning about not being able to read sysctl parameters for a day now.
However, sysctl -a works fine.

I believe this happened as well to db1082 and a reboot fixed it.

It has been a month now, and this system hasn't reported the same error since the CPU swap. I am resolving this task.