
cp3009: memory scrubbing error
Closed, Declined (Public)

Description

EDAC-reported correctable error, likely a bad DIMM

Oct 17 13:32:44 cp3009 kernel: [952045.299201] EDAC MC1: 6 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad98 offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299209] EDAC MC1: 5 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad98 offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299215] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad98 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299221] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad98 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299227] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad98 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299232] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad98 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299238] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299244] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299249] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299255] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299260] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299266] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299271] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299277] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:44 cp3009 kernel: [952045.299282] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad99 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
Oct 17 13:32:45 cp3009 kernel: [952046.299147] EDAC MC1: 97 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0xefad9f offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:5)
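
For triage, the lines above can be summarized per DIMM rather than read one by one. The following is a minimal sketch, assuming the field layout seen in these sample lines only; the function names `parse_edac_line` and `tally_by_dimm` are hypothetical and not part of any existing tool. Lines flagged OVERFLOW may indicate that some events were not individually logged, so the sums are best treated as approximate.

```python
import re
from collections import Counter

# Regex for the EDAC correctable-error (CE) lines quoted in this task.
# The layout is inferred from these samples and may differ on other
# kernels or EDAC drivers.
EDAC_CE = re.compile(
    r"EDAC (?P<mc>MC\d+): (?P<count>\d+) CE (?P<msg>.+?) on (?P<label>\S+) "
    r"\((?P<details>.*)\)"
)


def parse_edac_line(line):
    """Return a dict describing one EDAC CE line, or None if it does not match."""
    m = EDAC_CE.search(line)
    if m is None:
        return None
    details = m.group("details")
    # key:value pairs inside the parentheses, e.g. page:0xefad98, rank:5
    fields = dict(re.findall(r"(\w+):(\S+)", details))
    return {
        "mc": m.group("mc"),
        "count": int(m.group("count")),
        "label": m.group("label"),
        "page": fields.get("page"),
        "rank": fields.get("rank"),
        "overflow": "OVERFLOW" in details,
    }


def tally_by_dimm(lines):
    """Sum reported CE counts per (memory controller, DIMM label)."""
    totals = Counter()
    for line in lines:
        rec = parse_edac_line(line)
        if rec is not None:
            totals[(rec["mc"], rec["label"])] += rec["count"]
    return totals
```

Feeding the output of something like `journalctl -k` or a syslog file to `tally_by_dimm` gives one total per DIMM label, which makes it easier to see whether the errors are concentrated on a single module, as they appear to be here.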

Related Objects

Duplicates Merged Here
T148722: cp3009 hw issues

Event Timeline

elukey triaged this task as Medium priority. Oct 18 2016, 1:02 PM
elukey added subscribers: mark, elukey.

@mark Hi! How should we proceed?

@elukey: we're decommissioning a whole bunch of machines from that same batch, so we can probably swap memory next time I'm there.

BBlack added a project: Traffic.
BBlack subscribed.

It's depooled from service as of yesterday as well (didn't see this ticket!).

Change 381988 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cp3009: remove from cluster

https://gerrit.wikimedia.org/r/381988

Change 381988 merged by BBlack:
[operations/puppet@production] cp3009: remove from cluster

https://gerrit.wikimedia.org/r/381988

mark raised the priority of this task from Medium to High. Jul 3 2018, 12:17 PM

After consultation with Ema, and considering how long this server has been broken, that it is 1 out of 4 misc varnish servers, and that the misc cluster is being folded into text anyway, we decided it's not worth repairing this server.

Change 443827 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_misc: decommission cp3009

https://gerrit.wikimedia.org/r/443827

Change 443827 merged by Ema:
[operations/puppet@production] cache_misc: decommission cp3009

https://gerrit.wikimedia.org/r/443827

Change 443897 had a related patch set uploaded (by Ema; owner: Ema):
[operations/dns@master] Remove prod and mgmt entries for cp3009

https://gerrit.wikimedia.org/r/443897

Change 443897 merged by Ema:
[operations/dns@master] Remove prod and mgmt entries for cp3009

https://gerrit.wikimedia.org/r/443897