mw1041 has hardware issues and has shut down twice today. Inspecting the logs from before the last crash:
Nov 19 13:38:13 mw1041 kernel: [597479.852774] CMCI storm detected: switching to poll mode
Nov 19 13:38:43 mw1041 kernel: [597509.855137] CMCI storm subsided: switching to interrupt mode
Nov 19 13:39:15 mw1041 kernel: [597541.540792] mce_notify_irq: 20 callbacks suppressed
Nov 19 13:39:15 mw1041 kernel: [597541.540797] mce: [Hardware Error]: Machine check events logged
Nov 19 13:39:19 mw1041 kernel: [597546.235803] mce: [Hardware Error]: Machine check events logged
These messages had been going on for days.
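To check how long this has been going on, something like the following could count the "Machine check events logged" messages per day. It is only a rough sketch: the function name `mce_events_per_day` is made up for illustration, and it assumes the usual syslog line layout seen in the excerpt above.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Nov 19 13:39:15 mw1041 kernel: [597541.540797] mce: [Hardware Error]: Machine check events logged
# Assumption: the log uses the classic "Mon DD HH:MM:SS host kernel:" prefix.
MCE_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:.*"
    r"mce: \[Hardware Error\]: Machine check events logged"
)


def mce_events_per_day(lines):
    """Count 'Machine check events logged' kernel messages per (month, day)."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.match(line)
        if match:
            counts[(match.group("month"), int(match.group("day")))] += 1
    return counts


# The five lines quoted in this ticket, used as sample input.
sample = [
    "Nov 19 13:38:13 mw1041 kernel: [597479.852774] CMCI storm detected: switching to poll mode",
    "Nov 19 13:38:43 mw1041 kernel: [597509.855137] CMCI storm subsided: switching to interrupt mode",
    "Nov 19 13:39:15 mw1041 kernel: [597541.540792] mce_notify_irq: 20 callbacks suppressed",
    "Nov 19 13:39:15 mw1041 kernel: [597541.540797] mce: [Hardware Error]: Machine check events logged",
    "Nov 19 13:39:19 mw1041 kernel: [597546.235803] mce: [Hardware Error]: Machine check events logged",
]

per_day = mce_events_per_day(sample)
for (month, day), count in sorted(per_day.items()):
    print(month, day, count)
```

On the real host the same function could be fed an open log file instead of the sample list; which file holds the kernel messages depends on the syslog setup.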
Looking inside mcelog we see:
TIME 1448029050 Fri Nov 20 14:17:30 2015
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCi_ADDR register valid
MCA: Instruction CACHE Level-0 Instruction-Fetch Error
STATUS d400010000040150 MCGSTATUS 0
MCGCAP 1c09 APICID 10 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
Hardware event. This is not a software error.
MCE 0
CPU 7 BANK 2
ADDR 304c5b0
The errors are always for CPU 7, bank 2, which makes me guess we have damaged RAM.
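For anyone who wants to double-check the "always CPU 7, bank 2" observation, here is a small sketch that tallies CPU/BANK pairs from mcelog text and decodes the STATUS word. The helper names (`tally_cpu_bank`, `decode_status`, `decode_cache_code`) are hypothetical, and the bit layout is my reading of the machine-check chapter of the Intel SDM, so treat it as an assumption to verify rather than a reference decoder.

```python
import re
from collections import Counter

CPU_BANK = re.compile(r"CPU (\d+) BANK (\d+)")

# Flag bits of the MCi_STATUS register (assumed from the Intel SDM).
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field encodings for cache hierarchy error codes (000F 0001 RRRR TTLL).
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "I", 1: "D", 2: "G"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def tally_cpu_bank(text):
    """Count how often each (CPU, BANK) pair appears in mcelog output."""
    return Counter((int(cpu), int(bank)) for cpu, bank in CPU_BANK.findall(text))


def decode_status(status_hex):
    """Split a STATUS value into its set flag names and the 16-bit MCA code."""
    value = int(status_hex, 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (value >> bit) & 1]
    return {"flags": flags, "mca_code": value & 0xFFFF}


def decode_cache_code(mca_code):
    """Return (request, transaction, level) for a cache hierarchy code, else None."""
    if mca_code & 0xEF00 != 0x0100:  # ignore the F bit (bit 12)
        return None
    return (REQUEST.get((mca_code >> 4) & 0xF, "?"),
            TRANSACTION.get((mca_code >> 2) & 0x3, "?"),
            LEVEL[mca_code & 0x3])


# The record quoted in this ticket, used as sample input.
sample = """Hardware event. This is not a software error.
MCE 0
CPU 7 BANK 2
ADDR 304c5b0
STATUS d400010000040150 MCGSTATUS 0
"""

print(tally_cpu_bank(sample))
status = decode_status("d400010000040150")
print(status["flags"], hex(status["mca_code"]))
print(decode_cache_code(status["mca_code"]))
```

Under those assumptions the STATUS value from the ticket decodes to the same flags mcelog prints (overflow, corrected, enabled, address valid), and the low 16 bits decode to an instruction fetch on the instruction cache at level 0, matching the MCA line.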
As suggested, this server is well out of warranty, so we might consider decommissioning it.