
(OoW) MCE errors on mw2181 / temperature warnings
Closed, Resolved (Public)

Description

A number of MCE errors have been logged, e.g. the one below. There are also a lot of temperature warnings in mcelog (with the CPUs throttled as a result); I'm wondering if the memory error is a result of overheating.

Sep 21 11:08:19 mw2181 kernel: [1563977.997564] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Sep 21 11:08:19 mw2181 kernel: [1563977.997569] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004d000800c1
Sep 21 11:08:19 mw2181 kernel: [1563977.997570] EDAC sbridge MC0: TSC 0
Sep 21 11:08:19 mw2181 kernel: [1563977.997571] EDAC sbridge MC0: ADDR 4a38ec000
Sep 21 11:08:19 mw2181 kernel: [1563977.997572] EDAC sbridge MC0: MISC 908500010001a8c
Sep 21 11:08:19 mw2181 kernel: [1563977.997574] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1537528099 SOCKET 0 APIC 0
Sep 21 11:08:19 mw2181 kernel: [1563977.997592] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4a38ec offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:1)
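For reference, the status word on the "Bank 10" line can be unpacked by hand. The sketch below assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM (and that the CPU reports a corrected error count in bits 52:38); the function name `decode_mci_status` and the dictionary keys are made up for illustration and are not part of mcelog or the kernel.

```python
# Memory-controller error codes have the form 0000 0000 1MMM CCCC,
# where MMM is the operation type and CCCC the channel.
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into the fields seen in the log above."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF  # bits 15:0, architectural MCA error code
    fields = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),  # False means the error was corrected (CE)
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38 (assumed supported)
        "mscod": (status >> 16) & 0xFFFF,  # bits 31:16, model-specific code
        "mca_code": mca_code,
    }
    # Only interpret MMM/CCCC when the code matches the memory-controller pattern.
    if (mca_code & 0xFF80) == 0x0080:
        fields["mem_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        fields["channel"] = mca_code & 0xF  # 0xF would mean "channel not specified"
    return fields


if __name__ == "__main__":
    for key, value in decode_mci_status(0x8C00004D000800C1).items():
        print(f"{key}: {value}")
```

Under those assumptions, 8c00004d000800c1 decodes to a valid, corrected (not uncorrected) error with a count of 1, model-specific code 0008, MCA code 00c1, operation "memory scrubbing error" and channel 1, which agrees with the EDAC summary line (1 CE, err_code:0008:00c1, channel:1).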

Event Timeline

wiki_willy renamed this task from MCE errors on mw2181 / temperature warnings to (OoW) MCE errors on mw2181 / temperature warnings. Jul 15 2019, 8:54 PM
wiki_willy assigned this task to Papaul.

I checked the system log: no memory errors or temperature warnings, but I found that the server firmware is very old. If the server can be depooled, I can upgrade the firmware.

Papaul added subscribers: jijiki, Papaul.

Upgrading the iDRAC was a very long process: the server had 1.5, so I couldn't upgrade straight to 2.6 and had to upgrade to 1.6 first, then to 2.6.
Before

Screenshot from 2019-07-17 11-23-43.png (373×565 px, 35 KB)

After
Screenshot from 2019-07-17 13-06-50.png (371×566 px, 41 KB)

@jijiki the server can be repooled at any time now.

Running 'scap pull' on this host (to sync mw code before repooling) fails with "sudo: /usr/local/bin/mwscript: command not found".

Made a separate task for the scap pull issue.

Repooled the server anyway.

Dzahn claimed this task.

mcelog has not been written to since Oct 10 2018, and there are no new thermal events after that. So I'm not sure that tells us much about whether the firmware upgrade is related or not.
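A quick way to repeat the "last written" check is to look at the log file's modification time. This is only a sketch: the helper name `days_since_last_write` is made up, and the path `/var/log/mcelog` in the usage line is an assumption about where mcelog writes on this host.

```python
import os
import time
from typing import Optional


def days_since_last_write(path: str, now: Optional[float] = None) -> float:
    """Return how many days ago the file at `path` was last modified."""
    if now is None:
        now = time.time()
    mtime = os.stat(path).st_mtime  # last modification time, seconds since epoch
    return (now - mtime) / 86400.0


if __name__ == "__main__":
    log_path = "/var/log/mcelog"  # assumed location, adjust for the host
    if os.path.exists(log_path):
        print(f"{log_path} last written {days_since_last_write(log_path):.0f} days ago")
    else:
        print(f"{log_path} not found")
```

A stale modification time only shows that nothing was logged; it does not by itself say whether the firmware upgrade made a difference.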

Though it looks like we can close this.

Thanks @Papaul