cp1053 possible hardware issues
Closed, Declined · Public

Description

Machine is depooled from service.

CPU temp trips and MCE errors have been logging on cp1053 for at least a week, e.g.:

May 14 06:26:21 cp1053 kernel: [1536133.112970] CPU7: Core temperature above threshold, cpu clock throttled (total events = 4508187)
May 14 06:26:21 cp1053 kernel: [1536133.112971] CPU23: Core temperature above threshold, cpu clock throttled (total events = 4508763)
May 14 06:26:21 cp1053 kernel: [1536133.112984] mce_notify_irq: 1 callbacks suppressed
May 14 06:26:21 cp1053 kernel: [1536133.112984] mce: [Hardware Error]: Machine check events logged

As of today, we've had some small spikes of user-facing 503s localized to this Varnish backend, which are almost certainly related.

Meta-point (perhaps a separate task): why aren't we catching things like CPU temp trips and MCEs in Icinga alerting?
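
As a rough illustration of the meta-point, a check along the following lines could flag the two message types quoted above. This is only a sketch under stated assumptions: the function name `check_kernel_log`, the severity mapping (any MCE line is critical, throttling alone is a warning), and the idea of feeding it kernel log lines are all hypothetical and not an existing Icinga check. The regular expressions are copied from the log excerpt in this task and may not match other kernel versions.

```python
import re

# Patterns copied from the kernel messages quoted in this task; other kernel
# versions may word these messages differently.
THERMAL_RE = re.compile(
    r"CPU(\d+): Core temperature above threshold, cpu clock throttled "
    r"\(total events = (\d+)\)"
)
MCE_RE = re.compile(r"mce: \[Hardware Error\]: Machine check events logged")

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_kernel_log(lines):
    """Return (status, message) for an iterable of kernel log lines."""
    throttle_events = {}  # CPU number -> highest "total events" value seen
    mce_lines = 0
    for line in lines:
        match = THERMAL_RE.search(line)
        if match:
            cpu, total = int(match.group(1)), int(match.group(2))
            throttle_events[cpu] = max(total, throttle_events.get(cpu, 0))
        elif MCE_RE.search(line):
            mce_lines += 1

    cpus = ", ".join(f"CPU{c}" for c in sorted(throttle_events))
    if mce_lines:
        # Assumption: any machine check line is treated as critical.
        return CRITICAL, f"CRITICAL: {mce_lines} MCE line(s); throttled: {cpus or 'none'}"
    if throttle_events:
        # Assumption: throttling without MCEs is only a warning.
        return WARNING, f"WARNING: thermal throttling on {cpus}"
    return OK, "OK: no thermal throttle or MCE lines found"
```

A wrapper script could pass in recent kernel log lines, print the message, and exit with the returned status, which is the usual plugin convention. A real check would also need to limit itself to recent messages so that an old event does not alert forever.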

Event Timeline

@BBlack The server is out of warranty, but we could try redoing the thermal paste.

Apparently this machine is back in service (I'm not sure since when, but I think it's been a while). It's still showing temp alerts in dmesg.

Interestingly, the IPMI sensors check in Icinga is showing this machine as being fine. I wonder why that disagrees with the MCEs and dmesg?

fgiunchedi raised the priority of this task from Medium to High. (Jun 13 2018, 8:43 AM)
fgiunchedi subscribed.

There have been EDAC correctable memory errors reported for this host; raising priority to High since the CPU temp alerts also persist:

Jun 13 04:49:32 cp1053 kernel: [6552887.159258] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159259] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159259] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159260] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159261] EDAC sbridge MC0: MISC 90840080008188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159262] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159268] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159270] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159271] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159271] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159272] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159272] EDAC sbridge MC0: MISC 90840040004188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159274] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159280] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159280] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159281] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159282] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159282] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159283] EDAC sbridge MC0: MISC 90840240024188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159284] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159302] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159303] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159304] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159304] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159305] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159305] EDAC sbridge MC0: MISC 90840080008188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159307] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159313] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159316] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159318] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159322] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159324] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159328] EDAC sbridge MC0: MISC 90840380038188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159331] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159339] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:33 cp1053 kernel: [6552888.187280] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:33 cp1053 kernel: [6552888.187284] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: cc00014c000800c0
Jun 13 04:49:33 cp1053 kernel: [6552888.187286] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:33 cp1053 kernel: [6552888.187287] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:33 cp1053 kernel: [6552888.187288] EDAC sbridge MC0: MISC 90840200020188c 
Jun 13 04:49:33 cp1053 kernel: [6552888.187290] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865373 SOCKET 0 APIC 0
Jun 13 04:49:33 cp1053 kernel: [6552888.187303] EDAC MC0: 5 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:50:13 cp1053 kernel: [6552928.089437] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:50:13 cp1053 kernel: [6552928.089440] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: cc00180c000800c0
Jun 13 04:50:13 cp1053 kernel: [6552928.089440] EDAC sbridge MC0: TSC 0 
Jun 13 04:50:13 cp1053 kernel: [6552928.089442] EDAC sbridge MC0: ADDR 120ea0b000 
Jun 13 04:50:13 cp1053 kernel: [6552928.089442] EDAC sbridge MC0: MISC 90840100014188c 
Jun 13 04:50:13 cp1053 kernel: [6552928.089444] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865413 SOCKET 0 APIC 0
Jun 13 04:50:13 cp1053 kernel: [6552928.089458] EDAC MC0: 96 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x120ea0b offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
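
To make a log like the one above easier to read at a glance, the `EDAC MC0: N CE ...` summary lines can be totalled per DIMM label. The following is a minimal sketch, assuming the line format shown in this excerpt; `summarize_ce` is a hypothetical helper, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches the per-event summary lines, e.g.
# "EDAC MC0: 5 CE memory scrubbing error on <label> (channel:0 ... rank:1)".
# The "EDAC sbridge MC0: ..." detail lines are deliberately not matched.
CE_RE = re.compile(r"EDAC MC(\d+): (\d+) CE .*? on (\S+) \((.*)\)")


def summarize_ce(lines):
    """Sum corrected-error counts per (memory controller, DIMM label).

    Returns (counts, overflowed), where overflowed is the set of keys that
    had at least one line flagged OVERFLOW.
    """
    counts = Counter()
    overflowed = set()
    for line in lines:
        match = CE_RE.search(line)
        if not match:
            continue
        key = (int(match.group(1)), match.group(3))
        counts[key] += int(match.group(2))
        if "OVERFLOW" in match.group(4):
            overflowed.add(key)
    return counts, overflowed
```

For the seven `EDAC MC0` lines quoted above, the per-line counts add up to 5 × 1 + 5 + 96 = 106 corrected errors, all on `CPU_SrcID#0_Ha#0_Chan#0_DIMM#0`. Since the last two lines carry the OVERFLOW flag, that total is probably best read as a lower bound.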

This host is to be decommissioned in the next couple of weeks, so there's no point in pursuing this.