cp1053 possible hardware issues
Closed, Declined (Public)

Assigned To

None

Authored By

	BBlack
	May 15 2017, 1:28 AM

Description

Machine is depooled from service.

CPU temp trips and MCE errors have been logging on cp1053 for at least a week, e.g.:

May 14 06:26:21 cp1053 kernel: [1536133.112970] CPU7: Core temperature above threshold, cpu clock throttled (total events = 4508187)
May 14 06:26:21 cp1053 kernel: [1536133.112971] CPU23: Core temperature above threshold, cpu clock throttled (total events = 4508763)
May 14 06:26:21 cp1053 kernel: [1536133.112984] mce_notify_irq: 1 callbacks suppressed
May 14 06:26:21 cp1053 kernel: [1536133.112984] mce: [Hardware Error]: Machine check events logged

As of today, we've had some small spikes of user-facing 503s localized to this varnish backend, which are almost certainly related.

Meta-point (perhaps a separate task): why aren't we catching things like CPU temp trips and MCEs in icinga alerting?
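As a rough illustration of the meta-point, a Nagios/Icinga-style check could scan kernel log lines for the two message types quoted above. This is only a sketch under stated assumptions: the function name `check_kernel_log`, the thresholds, and the choice of WARNING for thermal events and CRITICAL for MCE notices are hypothetical and not taken from any existing check; only the message patterns come from the log excerpt in this task.

```python
import re

# Patterns copied from the kernel messages quoted in this task.
THERMAL_RE = re.compile(r"CPU(\d+): Core temperature above threshold")
MCE_RE = re.compile(r"mce: \[Hardware Error\]: Machine check events logged")

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_kernel_log(lines, thermal_warn=1, mce_crit=1):
    """Return (exit_code, message) for an iterable of kernel log lines.

    thermal_warn and mce_crit are hypothetical thresholds chosen for
    illustration; real values would need tuning per fleet.
    """
    throttled_cpus = set()
    thermal_events = 0
    mce_notices = 0
    for line in lines:
        match = THERMAL_RE.search(line)
        if match:
            thermal_events += 1
            throttled_cpus.add(int(match.group(1)))
        elif MCE_RE.search(line):
            mce_notices += 1

    if mce_notices >= mce_crit:
        code, label = CRITICAL, "CRITICAL"
    elif thermal_events >= thermal_warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"

    cpus = ",".join(str(c) for c in sorted(throttled_cpus)) or "none"
    message = (
        f"{label}: {thermal_events} thermal throttle events "
        f"(CPUs: {cpus}), {mce_notices} MCE notices"
    )
    return code, message
```

A wrapper script could feed this function the output of `dmesg` or a recent slice of the kernel log, print the message, and exit with the returned code. Fed the four log lines quoted above, it would report CRITICAL with CPUs 7 and 23 listed as throttled.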

Event Timeline

BBlack created this task. May 15 2017, 1:28 AM

Restricted Application added a project: SRE. May 15 2017, 1:28 AM

Restricted Application added a subscriber: Aklapper.

• MZMcBride subscribed. May 15 2017, 12:20 PM

• ema moved this task from Backlog to Caching on the Traffic board. May 15 2017, 1:39 PM

• Cmjohnson moved this task from Backlog to Lower Priority Items on the ops-eqiad board. Jul 20 2017, 3:25 PM

@BBlack The server is out of warranty but we could try and re-do the thermal paste.

BBlack moved this task from Caching to Hardware on the Traffic board. Oct 23 2017, 2:55 PM

Apparently this machine is back in service (since when, I'm not sure, but it's been a while, I think). It's still showing temp alerts in dmesg.

Interestingly, the IPMI sensors check in icinga is showing this machine as being fine. I wonder what the discrepancy is between that and the MCEs and dmesg?

There have been EDAC correctable memory errors reported for this host. Raising priority to high, since the CPU temp alerts also persist:

Jun 13 04:49:32 cp1053 kernel: [6552887.159258] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159259] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159259] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159260] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159261] EDAC sbridge MC0: MISC 90840080008188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159262] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159268] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159270] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159271] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159271] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159272] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159272] EDAC sbridge MC0: MISC 90840040004188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159274] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159280] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159280] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159281] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159282] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159282] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159283] EDAC sbridge MC0: MISC 90840240024188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159284] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159302] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159303] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159304] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159304] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159305] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159305] EDAC sbridge MC0: MISC 90840080008188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159307] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159313] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:32 cp1053 kernel: [6552887.159316] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:32 cp1053 kernel: [6552887.159318] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004c000800c0
Jun 13 04:49:32 cp1053 kernel: [6552887.159322] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:32 cp1053 kernel: [6552887.159324] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:32 cp1053 kernel: [6552887.159328] EDAC sbridge MC0: MISC 90840380038188c 
Jun 13 04:49:32 cp1053 kernel: [6552887.159331] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865372 SOCKET 0 APIC 0
Jun 13 04:49:32 cp1053 kernel: [6552887.159339] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:49:33 cp1053 kernel: [6552888.187280] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:49:33 cp1053 kernel: [6552888.187284] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: cc00014c000800c0
Jun 13 04:49:33 cp1053 kernel: [6552888.187286] EDAC sbridge MC0: TSC 0 
Jun 13 04:49:33 cp1053 kernel: [6552888.187287] EDAC sbridge MC0: ADDR 110ea0a000 
Jun 13 04:49:33 cp1053 kernel: [6552888.187288] EDAC sbridge MC0: MISC 90840200020188c 
Jun 13 04:49:33 cp1053 kernel: [6552888.187290] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865373 SOCKET 0 APIC 0
Jun 13 04:49:33 cp1053 kernel: [6552888.187303] EDAC MC0: 5 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x110ea0a offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
Jun 13 04:50:13 cp1053 kernel: [6552928.089437] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jun 13 04:50:13 cp1053 kernel: [6552928.089440] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: cc00180c000800c0
Jun 13 04:50:13 cp1053 kernel: [6552928.089440] EDAC sbridge MC0: TSC 0 
Jun 13 04:50:13 cp1053 kernel: [6552928.089442] EDAC sbridge MC0: ADDR 120ea0b000 
Jun 13 04:50:13 cp1053 kernel: [6552928.089442] EDAC sbridge MC0: MISC 90840100014188c 
Jun 13 04:50:13 cp1053 kernel: [6552928.089444] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1528865413 SOCKET 0 APIC 0
Jun 13 04:50:13 cp1053 kernel: [6552928.089458] EDAC MC0: 96 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x120ea0b offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:1)
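For readers who want to interpret the `Bank 8` status words in the log above, the following sketch decodes a few fields of the value. It assumes the Intel IA32_MCi_STATUS bit layout (VAL in bit 63, OVER in bit 62, UC in bit 61, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0); whether the corrected-count field is valid depends on CPU support, which is an assumption here. The function name `decode_mci_status` is hypothetical.

```python
def decode_mci_status(status_hex):
    """Decode selected fields of an Intel IA32_MCi_STATUS value.

    The bit positions follow the architectural layout; the corrected
    error count field is only meaningful on CPUs that implement it
    (assumed here).
    """
    status = int(status_hex, 16)
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


# The three distinct status words that appear in the log above.
for word in ("8c00004c000800c0", "cc00014c000800c0", "cc00180c000800c0"):
    fields = decode_mci_status(word)
    print(word, fields["corrected_count"], fields["overflow"])
# prints:
# 8c00004c000800c0 1 False
# cc00014c000800c0 5 True
# cc00180c000800c0 96 True
```

The decoded counts (1, 5, 96) and overflow flags line up with the `1 CE`, `5 CE ... OVERFLOW`, and `96 CE ... OVERFLOW` summaries that EDAC prints, and the low 32 bits (`0008` and `00c0`) match the `err_code:0008:00c0` field. None of the three words has the uncorrected bit set, consistent with these being correctable errors.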

Mentioned in SAL (#wikimedia-operations) [2018-06-13T08:46:16Z] <ema> depool cp1053 T165252

To be decommed in the next couple of weeks, so no point in fixing this!
