Broken memory on elastic1029
Closed, Resolved (Public)

Description

EDAC flagged a memory error this morning (a corrected error on one DIMM, per the log below). The host is out of warranty, but maybe we have a spare DIMM from a decommissioned server?

Aug 15 03:20:03 elastic1029 kernel: [6193865.567508] mce: [Hardware Error]: Machine check events logged
Aug 15 03:20:03 elastic1029 kernel: [6193865.567587] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 15 03:20:03 elastic1029 kernel: [6193865.567589] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010091
Aug 15 03:20:03 elastic1029 kernel: [6193865.567589] EDAC sbridge MC0: TSC 0
Aug 15 03:20:03 elastic1029 kernel: [6193865.567590] EDAC sbridge MC0: ADDR 7bb011440
Aug 15 03:20:03 elastic1029 kernel: [6193865.567591] EDAC sbridge MC0: MISC 140408400
Aug 15 03:20:03 elastic1029 kernel: [6193865.567592] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1534303203 SOCKET 0 APIC 0
Aug 15 03:20:03 elastic1029 kernel: [6193865.567607] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#1 (channel:1 slot:1 page:0x7bb011 offset:0x440 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0091 socket:0 ha:0 channel_mask:2 rank:4)
Aug 15 03:20:03 elastic1029 kernel: [6193865.567608] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug 15 03:20:03 elastic1029 kernel: [6193865.567609] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8800004500800091
Aug 15 03:20:03 elastic1029 kernel: [6193865.567610] EDAC sbridge MC0: TSC 0
Aug 15 03:20:03 elastic1029 kernel: [6193865.567610] EDAC sbridge MC0: ADDR 0
Aug 15 03:20:03 elastic1029 kernel: [6193865.567611] EDAC sbridge MC0: MISC 5221004000400a8c
Aug 15 03:20:03 elastic1029 kernel: [6193865.567612] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1534303203 SOCKET 0 APIC 0
Aug 15 06:25:04 elastic1029 kernel: [6204966.963167] Process accounting resumed
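For reading the two `Machine Check Event` status words in the log above, here is a minimal sketch of a decoder. It assumes the values follow the IA32_MCi_STATUS layout from the Intel SDM and that the corrected-error count field (bits 52:38) is implemented on this CPU. The function name `decode_mci_status` and the dictionary keys are made up for illustration; they are not part of any kernel or EDAC tool.

```python
# Hypothetical helper: decode an IA32_MCi_STATUS value as printed by the
# kernel ("Machine Check Event: 0 Bank 7: 8c00004000010091").
# Bit positions are taken from the Intel SDM's description of the register;
# treat the field list as an assumption, not an exhaustive decode.

MEMORY_REQUEST_TYPES = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    mca_code = status & 0xFFFF
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        # Bits 52:38; only meaningful if the CPU implements the counter.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors use the pattern 000F 0000 1MMM CCCC,
    # where F (bit 12) is a filtering flag that we ignore here.
    if mca_code & 0xEF80 == 0x0080:
        info["request"] = MEMORY_REQUEST_TYPES.get((mca_code >> 4) & 0x7, "reserved")
        info["channel"] = mca_code & 0xF  # 0xF means "channel not specified"
    return info


if __name__ == "__main__":
    for raw in ("8c00004000010091", "8800004500800091"):
        print(raw, decode_mci_status(int(raw, 16)))
```

Under these assumptions, the Bank 7 value decodes to a corrected (not uncorrected) memory read error on channel 1 with a valid address, which is consistent with the `1 CE memory read error ... channel:1` line that EDAC printed.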

Event Timeline

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2018-08-16T09:46:34Z] <gehel> all elasticsearch nodes reimaged (except elastic1029, waiting on memory issue) - T198391 / T193649 / T201991

Mentioned in SAL (#wikimedia-operations) [2018-08-16T16:12:32Z] <gehel> banning, depooling and shutting down elastic1029 for memory replacement - T201991

I reseated the DIMM and moved all DIMMs from side A to side B. Powered the server back on and it came up normally.

Gehel claimed this task.

Looking good!
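One way to back up a check like this is to read the EDAC error counters from sysfs after the host has been up for a while. The sketch below assumes the standard EDAC sysfs layout (`/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count`); the function name `read_edac_counts` is hypothetical, and this is not necessarily how the host was actually verified.

```python
# Hypothetical helper: collect corrected (CE) and uncorrected (UE) error
# counts per memory controller from the EDAC sysfs tree.
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for mc, entry in read_edac_counts().items():
        print(mc, entry)
```

These counters start from zero when the EDAC driver loads, so after the power cycle a `ce_count` that stays at 0 over time would suggest the reseat helped. A non-zero value would point back at the DIMM or slot.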

Mathew.onipe subscribed.