
(OoW) MCE errors on mw2181 / temperature warnings
Closed, Resolved · Public

Description

A number of MCE errors have been logged, e.g. the one below. There are also a lot of temperature warnings in mcelog (with the CPUs throttled as a result), so I'm wondering if the memory error is a result of overheating.

Sep 21 11:08:19 mw2181 kernel: [1563977.997564] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Sep 21 11:08:19 mw2181 kernel: [1563977.997569] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004d000800c1
Sep 21 11:08:19 mw2181 kernel: [1563977.997570] EDAC sbridge MC0: TSC 0
Sep 21 11:08:19 mw2181 kernel: [1563977.997571] EDAC sbridge MC0: ADDR 4a38ec000
Sep 21 11:08:19 mw2181 kernel: [1563977.997572] EDAC sbridge MC0: MISC 908500010001a8c
Sep 21 11:08:19 mw2181 kernel: [1563977.997574] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1537528099 SOCKET 0 APIC 0
Sep 21 11:08:19 mw2181 kernel: [1563977.997592] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x4a38ec offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:1)
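
For reference, the raw values in the log above can be decoded by hand. The following is a minimal Python sketch, not what the kernel's EDAC driver actually runs; the function name `decode_mci_status` and the table `MEMORY_OPS` are made up for illustration. It assumes the IA32_MCi_STATUS bit layout as documented in the Intel SDM (VAL bit 63, OVER 62, UC 61, MISCV 59, ADDRV 58, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0) and 4 KiB pages for the address split.

```python
# Memory-controller MCA error codes have the form 0000 0000 1MMM CCCC,
# where MMM is the operation type and CCCC the channel (assumed per Intel SDM).
MEMORY_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check status value into its main fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Only interpret MMM/CCCC when the code matches the memory-controller pattern.
    if (mca_code & 0xFF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        fields["channel"] = mca_code & 0xF
    return fields


if __name__ == "__main__":
    status = 0x8C00004D000800C1  # "Bank 10" value from the log
    addr = 0x4A38EC000           # "ADDR" value from the log
    for name, value in decode_mci_status(status).items():
        print(f"{name}: {value}")
    # Page frame and in-page offset, assuming 4 KiB pages.
    print(f"page: {addr >> 12:#x} offset: {addr & 0xFFF:#x}")
```

Under those assumptions the decoded fields agree with the EDAC summary line: UC is clear and the corrected count is 1 (a single corrected error), the error code splits into 0008:00c1, the operation is a memory scrubbing error on channel 1, and the address maps to page 0x4a38ec at offset 0x0.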

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 24 2018, 6:41 AM
MoritzMuehlenhoff triaged this task as Normal priority. Sep 24 2018, 10:07 AM
wiki_willy renamed this task from MCE errors on mw2181 / temperature warnings to (OoW) MCE errors on mw2181 / temperature warnings. Jul 15 2019, 8:54 PM
wiki_willy assigned this task to Papaul.

I checked the system log and found no memory errors or temperature warnings, but the server firmware is very old. If possible, we can depool the server so that I can upgrade the firmware.

Mentioned in SAL (#wikimedia-operations) [2019-07-17T16:19:58Z] <jijiki> Depool mw2181 - T205240

Papaul added subscribers: jijiki, Papaul.

Upgrading the iDRAC was a very long process: the server was on 1.5, so I couldn't upgrade directly to 2.6 and had to upgrade to 1.6 first, then to 2.6.
Before


After

@jijiki the server can be repooled at any time now.

Dzahn removed MoritzMuehlenhoff as the assignee of this task. Jul 17 2019, 6:12 PM
Dzahn added a project: serviceops.

Mentioned in SAL (#wikimedia-operations) [2019-07-17T18:14:50Z] <mutante> mw2181 - scap pull (T205240)

Dzahn added a subscriber: Dzahn. Jul 17 2019, 6:24 PM

Running 'scap pull' on this host (to sync mw code before repooling) fails with "sudo: /usr/local/bin/mwscript: command not found".

Dzahn added a comment. Jul 17 2019, 7:07 PM

Made a separate task for the scap pull issue.

Repooled the server anyways.

Dzahn closed this task as Resolved. Jul 17 2019, 7:10 PM
Dzahn claimed this task.

mcelog has not been written to since Oct 10 2018, and there have been no new thermal events after that. So I'm not sure that tells us much about whether the firmware upgrade is related or not.

It looks like we can close this, though.

Thanks @Papaul