
Heating alerts / memory errors on mw1254
Closed, Resolved (Public)

Description

mw1254 has MCE memory errors logged. There are also plenty of warnings in syslog that various CPUs have been throttled due to overheating, so I'm wondering whether the memory errors might also be heat-related. There are no RAM errors logged in "racadm getsel", so could we reseat the RAM and see whether the problem persists? The server is depooled and can be taken down at any time.

Update 18 June 2019:

New errors in syslog:

[Tue Jun 18 20:21:38 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue Jun 18 20:21:38 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: c800008c00800092
[Tue Jun 18 20:21:38 2019] EDAC sbridge MC0: TSC 0
[Tue Jun 18 20:21:38 2019] EDAC sbridge MC0: ADDR 0
[Tue Jun 18 20:21:38 2019] EDAC sbridge MC0: MISC c908408000801800
[Tue Jun 18 20:21:38 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1560889237 SOCKET 0 APIC 0
[Tue Jun 18 20:21:39 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Tue Jun 18 20:21:39 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010092
[Tue Jun 18 20:21:39 2019] EDAC sbridge MC0: TSC 0
[Tue Jun 18 20:21:39 2019] EDAC sbridge MC0: ADDR 3011e4240
[Tue Jun 18 20:21:39 2019] EDAC sbridge MC0: MISC 425e9000
[Tue Jun 18 20:21:39 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1560889238 SOCKET 0 APIC 0
[Tue Jun 18 20:21:39 2019] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3011e4 offset:0x240 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
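For reference, the two status words in the log above can be decoded by hand. The following is a minimal sketch, not the kernel's decoder: it assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 to 57, corrected-error count in bits 52:38, MCA error code in bits 15:0) and the memory-controller error code format `0000 0000 1MMM CCCC`. The function name `decode_mci_status` is hypothetical, and the corrected-error count is only meaningful if the CPU advertises that field, which is assumed here.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed layout, per the Intel SDM).
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

# MMM field of a memory-controller MCA error code (0000 0000 1MMM CCCC).
MEMORY_OPS = {0: "generic", 1: "read", 2: "write",
              3: "address/command", 4: "scrub"}


def decode_mci_status(status_hex):
    """Decode the hex status word printed after 'Bank N:' in the EDAC log."""
    status = int(status_hex, 16)
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["mca_error_code"] = status & 0xFFFF
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF
    # Bits 52:38; assumed valid on this CPU.
    decoded["corrected_count"] = (status >> 38) & 0x7FFF

    code = decoded["mca_error_code"]
    if code & 0xFF80 == 0x0080:  # memory-controller error format
        decoded["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        decoded["channel"] = None if channel == 0xF else channel
    else:
        decoded["memory_op"] = None
        decoded["channel"] = None
    return decoded


if __name__ == "__main__":
    for bank, raw in (("Bank 11", "c800008c00800092"),
                      ("Bank 7", "8c00004000010092")):
        print(bank, decode_mci_status(raw))
```

Under these assumptions both events decode as corrected (UC clear) memory read errors on channel 2, which agrees with the "CE memory read error" and "Chan#2" in the final EDAC line. Only the Bank 7 event has ADDRV set, which is why the Bank 11 event reports ADDR 0.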

However, there are still no errors in "racadm getsel".
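Since the SEL is empty, the affected DIMM has to be identified from the EDAC line itself. The sketch below pulls the location fields out of the corrected-error line quoted above and rebuilds the physical address from `page` and `offset`, assuming 4 KiB pages. The helper name `parse_ce_line` is hypothetical, and the regular expressions are written against this one log line only, not against every format the EDAC driver can emit.

```python
import re

# The corrected-error line from the syslog excerpt, without the timestamp.
CE_LINE = (
    "EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 "
    "(channel:2 slot:0 page:0x3011e4 offset:0x240 grain:32 syndrome:0x0 -  "
    "area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)"
)


def parse_ce_line(line):
    """Extract the DIMM label and key:value details from an EDAC CE line."""
    label = re.search(r"error on (\S+)", line).group(1)
    details = re.search(r"\((.*)\)", line).group(1)
    fields = dict(re.findall(r"(\w+):(\S+)", details))
    page = int(fields["page"], 16)
    offset = int(fields["offset"], 16)
    return {
        "label": label,
        "socket": int(fields["socket"]),
        "channel": int(fields["channel"]),
        "slot": int(fields["slot"]),
        "rank": int(fields["rank"]),
        # Assumes 4 KiB pages, i.e. a 12-bit page offset.
        "address": (page << 12) | offset,
    }


if __name__ == "__main__":
    info = parse_ce_line(CE_LINE)
    print(info["label"], hex(info["address"]))
```

The reconstructed address, 0x3011e4240, matches the ADDR value reported for the Bank 7 event, so both entries describe the same error on socket 0, channel 2, slot 0.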

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2018-09-19T17:20:25Z] <cmjohnson1> powering off mw1254 to reseat DIMM T204491

@MoritzMuehlenhoff I ended up just swapping the DIMM between sides A and B. Leaving this open to see if it helps.

Ok, I've repooled the server for now.

MoritzMuehlenhoff claimed this task.

This error hasn't resurfaced, so I'm closing the task.

jijiki triaged this task as Medium priority.
jijiki added a project: serviceops.
jijiki updated the task description.