backup2001 crashed 2019-12-08
Closed, Resolved · Public

Assigned To

Authored By

	• Marostegui
	Dec 9 2019, 6:41 AM

Description

Times in UTC

23:59:01 <+icinga-wm> PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100%
00:00:21 <+icinga-wm> RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms

root@backup2001:~# w
 06:40:17 up  6:40,  1 user,  load average: 0.08, 0.04, 0.00
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    2620:0:860:2:208 06:40    1.00s  0.05s  0.02s sshd: marostegui [priv]

Lifecycle log:

2019-12-08 23:57:49	SYS1003	System CPU Resetting.
Log Sequence Number: 1529
Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
Recommended Action: No response action is required.

2019-12-08 23:57:39	SYS1000	System is turning on.
Log Sequence Number: 1528
Detailed Description: System is turning on.
Recommended Action: No response action is required.

2019-12-08 23:57:29	SYS1001	System is turning off.
Log Sequence Number: 1527
Detailed Description: System is turning off.
Recommended Action: No response action is required.

2019-12-08 23:57:29	SYS1003	System CPU Resetting.
Log Sequence Number: 1526
Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
Recommended Action: No response action is required.

2019-12-08 23:57:12	RAC0703	Requested system hardreset.
Log Sequence Number: 1525
Detailed Description: Requested system hardreset.
Recommended Action: No response action is required.

2019-12-08 23:57:08	CPU0000	Internal error has occurred check for additional logs.
Log Sequence Number: 1524
Detailed Description: System event log and OS logs may indicate the source of the error.
Recommended Action: Review System Event Log and Operating System Logs. These logs can help the user identify the possible issue that is producing the problem.

No kernel or other logs before reboot:

Dec  8 23:55:01 backup2001 CRON[27600]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec  8 23:55:01 backup2001 CRON[27602]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Dec  8 23:56:01 backup2001 CRON[27705]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Dec  8 23:59:56 backup2001 systemd-modules
-load[656]: Inserted module 'nf_conntrack'
Dec  8 23:59:56 backup2001 systemd-modules-load[656]: Inserted module 'ipmi_devintf'
Dec  8 23:59:56 backup2001 systemd[1]: Mounted Huge Pages File System.
Dec  8 23:59:56 backup2001 systemd[1]: Started Availability of block devices.
Dec  8 23:59:56 backup2001 systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Dec  8 23:59:56 backup2001 systemd[1]: Started Create list of required static device nodes for the current kernel.
Dec  8 23:59:56 backup2001 systemd[1]: Started Remount Root and Kernel File Systems.
Dec  8 23:59:56 backup2001 systemd[1]: Mounted Kernel Debug File System.
Dec  8 23:59:56 backup2001 systemd[1]: Mounted POSIX Message Queue File System.
Dec  8 23:59:56 backup2001 systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
Dec  8 23:59:56 backup2001 systemd[1]: Starting Create System Users...
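
The syslog excerpt above shows the usual signature of an unclean restart: the last entry before the crash, a run of NUL bytes (shown as ^@), then the first boot messages. A minimal Python sketch for locating that gap in a log file follows. The function names are hypothetical, the year has to be supplied by the caller because classic syslog timestamps omit it, and it assumes the NULs appear either raw or in caret notation as pasted here.

```python
import re
from datetime import datetime

# Classic syslog timestamp, e.g. "Dec  8 23:59:56" (no year in the log itself).
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}")
# A run of NUL bytes, either raw or in caret notation ("^@") as a pager shows them.
NUL_RUN = re.compile(r"(?:\x00|\^@)+")


def parse_ts(stamp, year):
    """Parse a syslog timestamp; the year is supplied by the caller."""
    normalized = " ".join(stamp.split())  # collapse "Dec  8" to "Dec 8"
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def find_logging_gap(text, year):
    """Return (last_before, first_after, gap_seconds) around the first NUL run.

    Returns None if there is no NUL run or no timestamp on either side of it.
    """
    run = NUL_RUN.search(text)
    if run is None:
        return None
    before = TIMESTAMP.findall(text, 0, run.start())
    after = TIMESTAMP.search(text, run.end())
    if not before or after is None:
        return None
    last_before = parse_ts(before[-1], year)
    first_after = parse_ts(after.group(0), year)
    return last_before, first_after, (first_after - last_before).total_seconds()


# Two lines taken from the excerpt above, with a shortened NUL run.
sample = (
    "Dec  8 23:56:01 backup2001 CRON[27705]: (prometheus) CMD "
    "(/usr/local/bin/prometheus-puppet-agent-stats --outfile "
    "/var/lib/prometheus/node.d/puppet_agent.prom)\n"
    "^@^@^@^@Dec  8 23:59:56 backup2001 systemd[1]: Mounted Huge Pages File System.\n"
)
last_before, first_after, gap = find_logging_gap(sample, 2019)
print(last_before, first_after, gap)
```

For the two sample lines the gap is 235 seconds (23:56:01 to 23:59:56), which brackets the 23:57:08 CPU0000 event in the lifecycle log.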

Last occurrence: T237730

Related Objects

Mentioned In: T260764: backup2001 RAID controller failure, unable to post 2020-08-19
T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems
T238305: Servers freezing across the caching cluster
Mentioned Here: T225713: CPU scaling governor audit
T238305: Servers freezing across the caching cluster
T237730: backup2001 crashed with no logs on 2019-11-08 14:22

Event Timeline

• Marostegui created this task. Dec 9 2019, 6:41 AM

Restricted Application added a subscriber: Aklapper. Dec 9 2019, 6:41 AM

• Marostegui added a subscriber: akosiaris. Dec 9 2019, 6:41 AM

• Marostegui moved this task from Triage to In progress on the DBA board. Dec 10 2019, 6:43 AM

jcrespo claimed this task. Dec 10 2019, 11:16 AM

jcrespo triaged this task as Medium priority.

This is not the first time this has happened: see T237730. The firmware was updated at that time.

@Papaul Could you file a support issue with the vendor, given that this is the second time this has happened? What information do you need?

jcrespo mentioned this in T238305: Servers freezing across the caching cluster. Dec 11 2019, 9:23 AM

jcrespo renamed this task from "backup2001 rebooted itself" to "backup2001 crashed 2019-12-08". Dec 11 2019, 9:27 AM

jcrespo updated the task description. Dec 11 2019, 9:30 AM

jcrespo updated the task description. Dec 11 2019, 9:41 AM

jcrespo updated the task description. Dec 11 2019, 9:47 AM

Log and boot:
https://drive.google.com/file/d/1YL-j3M9fMFGq9EkHxxOL6kVtOf4uyM-e/view?usp=sharing
https://drive.google.com/file/d/1E-5dZ_fitSE5TW0RmrrYRZsGt4DTitFn/view?usp=sharing

Moritz points at T238305#5731421 that maybe it is the same issue as: https://www.dell.com/community/PowerEdge-OS-Forum/Random-Reboot-R740/td-p/5169703/page/3

root@backup2001:~$ cat /sys/devices/system/cpu/*/cpufreq/scaling_governor | sort | uniq -c
     32 powersave
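
For checking many hosts, the same count can be scripted. This is a minimal Python sketch of the one-liner above; the function name count_governors is hypothetical, and it assumes the standard cpufreq sysfs layout (one scaling_governor file per cpuN directory).

```python
from collections import Counter
from pathlib import Path


def count_governors(cpu_root="/sys/devices/system/cpu"):
    """Count CPUs per scaling governor, like `sort | uniq -c` on the sysfs files."""
    counts = Counter()
    # One file per logical CPU: cpu0/cpufreq/scaling_governor, cpu1/..., etc.
    for path in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[path.read_text().strip()] += 1
    return counts
```

Calling count_governors() on this host at the time of the paste would be expected to report all 32 CPUs under powersave, matching the output above. The cpu_root parameter exists only so the function can be pointed at a copied or fake tree.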

In T240177#5731486, @jcrespo wrote:
Moritz points at T238305#5731421 that maybe it is the same issue as: https://www.dell.com/community/PowerEdge-OS-Forum/Random-Reboot-R740/td-p/5169703/page/3
root@backup2001:~$ cat /sys/devices/system/cpu/*/cpufreq/scaling_governor | sort | uniq -c
     32 powersave

You've changed it to powersave or it is originally set to powersave?

You've changed it to powersave or it is originally set to powersave?

I didn't change anything; I pasted it as it is now. Most servers, including the non-crashing backup1001, seem to be in that mode.

We're setting the governor via the cpufrequtils class, and cp/lvs hosts are already configured to use "performance", so I'd suggest testing that setting on one of the affected cp* hosts to gain additional data.

If we're convinced that it fixes the crashes we've seen, we can consider what to do with other hosts. Maybe there's also an option to disable the C / C1E settings without changing to the performance profile?
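
As a starting point for the C-state question, the idle states the kernel exposes can be listed from sysfs. This is only a sketch under assumptions: list_idle_states is a hypothetical helper, it reads the Linux cpuidle sysfs files (name and disable under cpuidle/stateN), and it shows what the kernel's idle driver offers, not the C / C1E settings in the BIOS, which are configured in firmware.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return [(state_dir, name, disabled)] for one CPU's cpuidle states."""
    state_dirs = sorted(
        Path(cpu_dir).glob("cpuidle/state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric order: state2 < state10
    )
    states = []
    for state in state_dirs:
        name = (state / "name").read_text().strip()
        # The "disable" file holds "1" when the state is disabled for this CPU.
        disabled = (state / "disable").read_text().strip() == "1"
        states.append((state.name, name, disabled))
    return states
```

Comparing this listing between a crashing host and a non-crashing one (for example backup2001 and backup1001) might give Dell support something concrete to look at.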

disable C / C1E settings without changing to performance profile?

Do you know by any chance if performance governor sets that automatically (only needs to be changed) or it is a (potential) requirement. Sorry, I am not familiar with those settings.

are already configured to use "performance"

Also do you have handy the task/gerrit/puppet where that was set up?

I can look up both answers myself; I'm asking in case you already know them and they are trivial for you to answer.

jcrespo claimed this task. Dec 11 2019, 2:51 PM

In T240177#5732180, @jcrespo wrote:

disable C / C1E settings without changing to performance profile?

Do you know by any chance if performance governor sets that automatically (only needs to be changed) or it is a (potential) requirement. Sorry, I am not familiar with those settings.

I don't think setting the performance governor in Linux will change that, but per the comment in the thread I would guess that the firmware disables these states (as someone wrote "v1.3.7 disables both C and C1E states" for a different Dell model). Ultimately, though, the firmware is a black box. Maybe we should simply refer Dell support to the thread and ask whether that's a known problem and which firmware and settings they recommend to mitigate it?

Also do you have handy the task/gerrit/puppet where that was set up?

We have https://phabricator.wikimedia.org/T225713 which touches on this in general (as we noticed problems with some servers being slow which were configured to "ondemand").

For LVS/cp in specific it seems to have been around since way before Phabricator, per git dating back to 2013.

Regarding the performance governor: between T225713 and this ticket, it is unclear what the best option is.

Looks like this host crashed again?

[11:23:13]  <+icinga-wm>	PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:23:39]  <+icinga-wm>	RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
root@backup2001:~# uptime
 11:27:53 up 4 min,  2 users,  load average: 0.02, 0.15, 0.08

There's nothing which indicates a cause of crash in SEL, syslog or kernel logs.

I think this bug is just another case of T238305

@Papaul could you proceed with T240177#5727654, as this is the 3rd crash and the second since the firmware upgrade?

wiki_willy subscribed. Jan 7 2020, 11:11 PM

@jcrespo the last time the system crashed we only upgraded the iDRAC and not the BIOS. The system BIOS version is at 1.3.7 right now and the new BIOS version for the server is 2.4.8. According to the Dell support website, I quote:

"Fixes

Fixed a continuous reboot issue and Out of Resource error with PCIe IO resource allocation which was observed in the 2.4.7 version." See the link below:

https://www.dell.com/support/home/us/en/19/drivers/driversdetails?driverid=wgm2r&oscode=wst14&productcode=poweredge-r440

So I think we should do the BIOS upgrade on this system. Let me know when we can take it down for the upgrade.

Thanks

Thanks for clarifying, Papaul. Jaime is off and will be back online on the 9th of January.

@Marostegui thanks, I will wait until tomorrow, the 9th, so he can take the server down for the FW upgrade.

Mentioned in SAL (#wikimedia-operations) [2020-01-09T12:25:41Z] <jynus> shutting down backup2001 T240177

backup2001 is now down and ready for maintenance (no need to ask again). @Papaul, when done, please just boot it back up and ping here. Thanks.

Before

BIOS Version	1.3.7
iDRAC Firmware Version	3.34.34.34

After

BIOS Version	2.4.8
iDRAC Firmware Version	4.00.00.00

FW upgrade complete. We can resolve this task, and if the crash happens again, re-open it and open a support ticket with Dell.

Server is back up.

Thanks

wiki_willy mentioned this in T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems. Jan 21 2020, 4:13 PM

jcrespo mentioned this in T260764: backup2001 RAID controller failure, unable to post 2020-08-19. Aug 19 2020, 10:59 AM
