mw1239: memory scrubbing error
Closed, Resolved, Public

Description

EDAC-reported correctable error, likely a bad DIMM

Oct 17 07:44:31 mw1239 kernel: [849391.191197] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
Oct 17 09:10:43 mw1239 kernel: [854563.800975] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
Oct 17 10:36:54 mw1239 kernel: [859734.331344] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
Oct 17 12:03:02 mw1239 kernel: [864902.930145] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
Oct 17 13:29:06 mw1239 kernel: [870067.673046] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
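
To check that every entry points at the same DIMM and page, and to see how regularly the error recurs, the quoted lines can be parsed. This is a minimal sketch, assuming the syslog layout shown in this task; the regex and the helper names `parse_edac_line` and `intervals_minutes` are hypothetical and not part of any EDAC tooling. Other kernels or EDAC drivers may format the message differently.

```python
import re

# Two of the syslog lines quoted in this task, copied verbatim.
SAMPLE = [
    "Oct 17 07:44:31 mw1239 kernel: [849391.191197] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)",
    "Oct 17 09:10:43 mw1239 kernel: [854563.800975] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)",
]

# Assumed layout: "[uptime] EDAC MCn: <count> CE|UE <message> on <label> (<key:value ...>)".
EDAC_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+EDAC\s+(?P<mc>MC\d+):\s+"
    r"(?P<count>\d+)\s+(?P<kind>CE|UE)\s+(?P<msg>.+?)\s+on\s+(?P<label>\S+)\s+"
    r"\((?P<details>.*)\)"
)


def parse_edac_line(line):
    """Return a dict describing one EDAC line, or None if it does not match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event["uptime"] = float(event["uptime"])
    event["count"] = int(event["count"])
    # Split the parenthesised part into key:value pairs (channel, slot, page, ...).
    event["details"] = dict(re.findall(r"(\w+):(\S+)", event["details"]))
    return event


def intervals_minutes(events):
    """Minutes between consecutive events, based on the kernel uptime stamp."""
    times = [e["uptime"] for e in events]
    return [(later - earlier) / 60 for earlier, later in zip(times, times[1:])]


events = [e for e in map(parse_edac_line, SAMPLE) if e is not None]
for e in events:
    print(e["kind"], e["label"], e["details"]["page"])
print(intervals_minutes(events))
```

In the five entries above, the label and page are identical every time and the entries are about 86 minutes apart. The sketch relies on the numeric `[uptime]` stamp, so it would not match output that uses human-readable timestamps in the brackets without a changed pattern.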

Event Timeline

elukey triaged this task as Medium priority. Oct 18 2016, 1:01 PM
elukey added subscribers: Cmjohnson, elukey.

@Cmjohnson Hi! How should we proceed?

@elukey, the h/w log is not reporting any DIMM error at the moment. Please depool this server so I can do some testing, but yes, it is most likely the DIMM.

Thanks

elukey@puppetmaster1001:~$ sudo -i confctl --quiet --find --action set/pooled=inactive mw1239.eqiad.wmnet
mw1239.eqiad.wmnet: pooled changed yes => inactive

elukey@puppetmaster1001:~$ sudo -i confctl --quiet --find --action get mw1239.eqiad.wmnet
{"mw1239.eqiad.wmnet": {"pooled": "inactive", "weight": 20}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}

@Cmjohnson done!

Mentioned in SAL (#wikimedia-operations) [2016-10-19T17:15:52Z] <elukey> depooled mw1239.eqiad.wmnet to allow hw investigation (T148421) (was done today but wasn't logged properly)

For the record, after a reboot the server is working correctly with no such errors in kern.log or dmesg. I will repool the server now.

@Cmjohnson I can still see errors in the dmesg :(

[Mon Dec  5 10:13:57 2016] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
[Mon Dec  5 11:40:19 2016] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
[Mon Dec  5 13:06:33 2016] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)

@elukey please depool. I will need to reseat the DIMM. I also see an error in the h/w log:


Record: 2
Date/Time: 11/29/2016 09:03:47
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Mentioned in SAL (#wikimedia-operations) [2016-12-05T14:08:11Z] <elukey> depooling mw1239 for maintenance (T148421)

elukey@puppetmaster1001:~$ sudo -i confctl --quiet select 'name=mw1239.eqiad.wmnet' get
{"mw1239.eqiad.wmnet": {"pooled": "no", "weight": 20}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}

@Cmjohnson done!
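
The pooled state pasted above can also be checked programmatically before hardware work starts. This is a minimal sketch, assuming the `get` output is one JSON object per line exactly as pasted in this task; the helper name `pooled_states` is hypothetical, and the real confctl output format may differ between versions.

```python
import json

# The two "get" outputs pasted in this task, copied verbatim.
OCT_OUTPUT = '{"mw1239.eqiad.wmnet": {"pooled": "inactive", "weight": 20}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}'
DEC_OUTPUT = '{"mw1239.eqiad.wmnet": {"pooled": "no", "weight": 20}, "tags": "dc=eqiad,cluster=appserver,service=apache2"}'


def pooled_states(confctl_output):
    """Map each host name to its 'pooled' value, skipping the 'tags' key."""
    states = {}
    for line in confctl_output.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ignore anything that is not a JSON object line
        record = json.loads(line)
        for key, value in record.items():
            if key == "tags":
                continue
            states[key] = value["pooled"]
    return states


for output in (OCT_OUTPUT, DEC_OUTPUT):
    for host, state in pooled_states(output).items():
        # Per this thread, "yes" was the value before the host was depooled.
        print(host, state, "safe to work on" if state != "yes" else "still pooled")
```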

@elukey DIMM A1 swapped with B1. Let's see what happens.

Mentioned in SAL (#wikimedia-operations) [2017-01-11T22:26:34Z] <elukey> added mw1239.eqiad.wmnet back to service - T148421

The error has not returned...resolving this task.