
cloudvirt1022 memory errors causing host to crash
Closed, Resolved | Public

Description

Cloudvirt1022 crashed today due to unrecoverable memory errors on DIMM A4.

Dell iDRAC logs
Thu Jan 23 2020 16:47:05	Multi-bit memory errors detected on a memory device at location(s) DIMM_A4.	
Thu Jan 23 2020 16:47:05	Multi-bit memory errors detected on a memory device at location(s) DIMM_A4.	
Thu Jan 23 2020 16:47:05	A problem was detected in Memory Reference Code (MRC).	
Tue Nov 19 2019 23:01:12	Correctable memory error rate exceeded for DIMM_A4.	
Tue Nov 19 2019 23:01:09	Correctable memory error rate exceeded for DIMM_A4.	
Sat Aug 18 2018 01:35:53	Correctable memory error rate exceeded for DIMM_A4.	
Sat Aug 18 2018 01:35:46	Correctable memory error rate exceeded for DIMM_A4.	
Thu May 17 2018 22:50:26	Correctable memory error rate exceeded for DIMM_A4.	
Thu May 17 2018 22:50:10	Correctable memory error rate exceeded for DIMM_A4.	
Tue May 01 2018 14:22:05	Correctable memory error rate exceeded for DIMM_A4.	
Tue May 01 2018 14:21:59	Correctable memory error rate exceeded for DIMM_A4.	
Thu Mar 29 2018 11:50:48	Correctable memory error rate exceeded for DIMM_A4.	
Thu Mar 29 2018 11:50:48	Correctable memory error rate exceeded for DIMM_A4.
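
The iDRAC entries above show repeated correctable-error warnings on DIMM_A4 before the multi-bit failure. As a rough sketch (not a tool used in this task), the entries could be parsed and counted per year and severity as below. The function names and the "correctable"/"uncorrectable" labels are assumptions made for illustration, and the sample holds only a few lines copied from the log.

```python
from collections import Counter
from datetime import datetime

# A few lines copied from the iDRAC log (tab-separated: timestamp, message).
LOG = (
    "Thu Jan 23 2020 16:47:05\tMulti-bit memory errors detected on a memory device at location(s) DIMM_A4.\n"
    "Tue Nov 19 2019 23:01:12\tCorrectable memory error rate exceeded for DIMM_A4.\n"
    "Sat Aug 18 2018 01:35:53\tCorrectable memory error rate exceeded for DIMM_A4.\n"
    "Thu May 17 2018 22:50:26\tCorrectable memory error rate exceeded for DIMM_A4.\n"
)


def parse_idrac_line(line):
    """Return (datetime, message) for one log line, or None if it does not match."""
    stamp, sep, message = line.strip().partition("\t")
    if not sep:
        return None
    try:
        when = datetime.strptime(stamp, "%a %b %d %Y %H:%M:%S")
    except ValueError:
        return None
    return when, message.strip()


def classify(message):
    """Hypothetical severity label derived from the message text."""
    if message.startswith("Multi-bit memory errors"):
        return "uncorrectable"
    if message.startswith("Correctable memory error"):
        return "correctable"
    return "other"


def summarize(log_text):
    """Count events per (year, severity)."""
    counts = Counter()
    for line in log_text.splitlines():
        parsed = parse_idrac_line(line)
        if parsed is None:
            continue
        when, message = parsed
        counts[(when.year, classify(message))] += 1
    return counts


for (year, kind), n in sorted(summarize(LOG).items()):
    print(year, kind, n)
```
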
/var/log/mcelog
Hardware event. This is not a software error.
MCE 0
not finished?
CPU 8 BANK 3 TSC 62c107f0da4e3c 
RIP !INEXACT! 10:ffffffff9d21c2f0
TIME 1579797908 Thu Jan 23 16:45:08 2020
MCG status:RIPV MCIP 
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Generic CACHE Level-1 Snoop Error
STATUS f200000000300189 MCGSTATUS 5
MCGCAP 7000816 APICID 8 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 79
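
For reference, the STATUS and MCGSTATUS values above can be decoded by hand. The sketch below follows the architectural bit layout of IA32_MCi_STATUS and IA32_MCG_STATUS as I understand it from the Intel SDM; it is not how mcelog itself is implemented, the function names are made up for illustration, and the model-specific bits are extracted but not interpreted.

```python
# Architectural flag bits of IA32_MCi_STATUS.
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}
# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = {2: "MCIP", 1: "EIPV", 0: "RIPV"}

# Fields of the "cache hierarchy" compound error code: 0000 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def decode_flags(value, table):
    """List the names of the set bits, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_cache_error(mca_code):
    """Decode a cache-hierarchy MCA error code, or return None for other formats."""
    if (mca_code & 0xFF00) != 0x0100:
        return None
    request = REQUEST.get((mca_code >> 4) & 0xF, "unknown")
    transaction = TRANSACTION.get((mca_code >> 2) & 0x3, "unknown")
    level = LEVEL[mca_code & 0x3]
    return transaction, level, request


status = 0xF200000000300189   # STATUS from the mcelog entry
mcg_status = 0x5              # MCGSTATUS from the mcelog entry

mca_code = status & 0xFFFF               # bits 15:0, architectural error code
model_code = (status >> 16) & 0xFFFF     # bits 31:16, model-specific error code

print(decode_flags(status, STATUS_FLAGS))
print(decode_flags(mcg_status, MCG_FLAGS))
print(hex(mca_code), decode_cache_error(mca_code), hex(model_code))
```

The set flags (OVER, UC, EN, PCC) and the decoded error code (Generic, Level-1, SNOOP) line up with the "MCi status" and "MCA:" lines that mcelog printed.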

Event Timeline

Confirmed: Service Request 1011922914 was successfully submitted.

This server has running workloads that need to be drained prior to maintenance. I'll schedule a maintenance window and get it ready.

Change 572072 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: depool cloudvirt1022 and cloudvirt1014

https://gerrit.wikimedia.org/r/572072

Change 572072 merged by Andrew Bogott:
[operations/puppet@production] nova: depool cloudvirt1022 and cloudvirt1014

https://gerrit.wikimedia.org/r/572072

This server is now drained and ready for whatever.

Mentioned in SAL (#wikimedia-operations) [2020-02-18T21:07:38Z] <jeh> power down and set icinga downtime on cloudvirt1022 T243536

Thanks, @Jclark-ctr. I've confirmed the new DIMM is seen by the OS and the memory count is correct.
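
The task does not say how the memory count was checked. One possible approach, sketched here under assumptions, is to compare MemTotal from /proc/meminfo against the installed size. The sample text, the 512 GiB figure, and the 3% tolerance are all hypothetical; MemTotal is normally somewhat below the installed size because the firmware and kernel reserve memory, so the tolerance should stay smaller than one DIMM's share of the total.

```python
import re


def mem_total_kib(meminfo_text):
    """Extract MemTotal (in kB as reported by the kernel) from /proc/meminfo text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1))


def memory_looks_complete(meminfo_text, expected_gib, tolerance=0.03):
    """True if MemTotal is within `tolerance` below the expected installed size."""
    total_gib = mem_total_kib(meminfo_text) / (1024 * 1024)
    return total_gib >= expected_gib * (1 - tolerance)


# Hypothetical sample; on a real host, read the text from /proc/meminfo instead.
SAMPLE = "MemTotal:       528279148 kB\nMemFree:        401234567 kB\n"

print(mem_total_kib(SAMPLE), memory_looks_complete(SAMPLE, expected_gib=512))
```
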

Change 572990 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] nova: add cloudvirt1022 to scheduler pool

https://gerrit.wikimedia.org/r/572990

Change 572990 merged by Jhedden:
[operations/puppet@production] nova: add cloudvirt1022 to scheduler pool

https://gerrit.wikimedia.org/r/572990