
mw1239 memory errors
Closed, Declined, Public

Description

mw1239 is experiencing memory errors. Should we run a memory test?

[Fri Jul 12 10:37:31 2019] mce: [Hardware Error]: Machine check events logged
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c000048000800c1
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: TSC 0
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: ADDR 70148000
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: MISC 90008000800108c
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1562927847 SOCKET 0 APIC 0
[Fri Jul 12 10:37:31 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
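The summary line at the end of the log above packs the DIMM location and the faulting page into a single string. As a minimal sketch (the regular expression is inferred from the lines in this task, not from a documented kernel format, so other kernel versions may word it differently), the fields can be pulled out like this:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern for the one-line corrected-error (CE) summary printed by the EDAC core.
# Inferred from the log lines in this task; it is an assumption, not a stable format.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE (?P<kind>.+?) error on "
    r"(?P<label>\S+) \(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+).*?"
    r"socket:(?P<socket>\d+)"
)


@dataclass
class CorrectedError:
    mc: int        # memory controller index (MC0, MC1, ...)
    count: int     # number of errors reported by this line
    kind: str      # e.g. "memory scrubbing" or "memory read"
    label: str     # EDAC location label, e.g. CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
    socket: int
    channel: int
    slot: int
    page: int      # page frame number
    offset: int    # offset within the page

    @property
    def address(self) -> int:
        # Physical address = page frame number * 4 KiB + offset (4 KiB pages assumed).
        return self.page * 4096 + self.offset


def parse_ce_line(line: str) -> Optional[CorrectedError]:
    """Return the parsed CE summary, or None if the line is not one."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    return CorrectedError(
        mc=int(m["mc"]), count=int(m["count"]), kind=m["kind"], label=m["label"],
        socket=int(m["socket"]), channel=int(m["channel"]), slot=int(m["slot"]),
        page=int(m["page"], 16), offset=int(m["offset"], 16),
    )
```

For the line above this yields socket 0, channel 1, slot 0 and a physical address of 0x70148000, which matches the separate `ADDR 70148000` line. Translating the `CPU_SrcID#0_Ha#0_Chan#1_DIMM#0` label into the chassis slot name used in the hardware log (such as DIMM_A1) is vendor-specific and not attempted here.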

Event Timeline

jijiki triaged this task as Medium priority. Jul 12 2019, 11:07 AM
jijiki created this task.
jijiki updated the task description.
Cmjohnson added a subscriber: Cmjohnson. Jul 12 2019, 6:43 PM

This server is out of warranty. I can reseat the DIMM, but the server will need to be powered down. If the error persists, the server will need to be decommissioned.

wiki_willy assigned this task to jijiki. Jul 15 2019, 6:51 PM
wiki_willy added a subscriber: wiki_willy.

Assigning to @jijiki for now. Hi Effie, let us know when it would be OK to take this server down to reseat the DIMM, and then assign the task back to @Cmjohnson when ready.

Thanks,
Willy

Mentioned in SAL (#wikimedia-operations) [2019-07-15T21:55:58Z] <jijiki> Depool mw1239 for maintenance - T227867

jijiki reassigned this task from jijiki to Cmjohnson. Jul 15 2019, 9:56 PM

Thank you!

Last log paste before clearing the log:

Record: 4
Date/Time: 11/08/2018 00:18:01
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 12/11/2018 12:56:19
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

I swapped all the DIMMs from side A to side B, cleared the log, and powered the server back up. Please put the server back in service and let's see if the reseating worked.

Cmjohnson closed this task as Resolved. Jul 16 2019, 3:27 PM

I am resolving this ticket, please re-open and ping me if the problem returns.

Mentioned in SAL (#wikimedia-operations) [2019-07-17T08:17:30Z] <jijiki> Pool mw1239 - T227867

jijiki reopened this task as Open. Sep 25 2019, 8:30 PM

Host is alerting again; I will take a look tomorrow.

Dzahn added a subscriber: Dzahn. Oct 1 2019, 12:26 AM

self-healing??

<+icinga-wm> RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1

Not self-healing, no, in the sense that the errors can come back at any time. We now monitor the rate of errors over the last 4 days; once that window passes with no further spikes in errors, the alert recovers.
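The 4-day window can be illustrated with a small, hypothetical reimplementation. The real monitoring check is not shown in this task: the window length comes from the comment above, while the warning (2) and critical (4) thresholds are read off the `(C)4 ge (W)2 ge 1` recovery message, interpreting `ge` as "greater than or equal" and the trailing 1 as the current value. Both readings are assumptions.

```python
from typing import Iterable

WINDOW_SECONDS = 4 * 24 * 3600  # 4-day window, as described in the comment above
WARNING = 2    # assumed from "(W)2" in the Icinga output
CRITICAL = 4   # assumed from "(C)4" in the Icinga output


def errors_in_window(event_times: Iterable[int], now: int,
                     window: int = WINDOW_SECONDS) -> int:
    """Count corrected-error events (Unix timestamps) in the interval (now - window, now]."""
    return sum(1 for t in event_times if now - window < t <= now)


def check_state(count: int, warning: int = WARNING, critical: int = CRITICAL) -> str:
    """Map a windowed error count to an Icinga-style state (hypothetical logic)."""
    # "ge" semantics: a threshold trips once the value is greater than or equal to it.
    if count >= critical:
        return "CRITICAL"
    if count >= warning:
        return "WARNING"
    return "OK"
```

Fed only the six `TIME` values logged on Sep 26 in the `dmesg` paste further down (1569527822 through 1569527848), the window count is 6, which would be CRITICAL. Once older errors age out of the window and a single new one arrives, the count drops to 1 and the check reports OK, as in the recovery message. The alert clearing therefore says nothing about whether the DIMM has recovered.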

@jijiki - just following up to see if this is still an issue or if we can resolve this. Thanks, Willy

Dzahn added a comment. Oct 31 2019, 9:17 PM

Still a problem, I think:

root@mw1239:~# dmesg | grep EDAC
[ 21.288289] EDAC MC: Ver: 3.0.0
[ 21.359281] EDAC sbridge: Seeking for: PCI ID 8086:0ea0
[ 21.359295] EDAC sbridge: Seeking for: PCI ID 8086:0ea0
[ 21.359304] EDAC sbridge: Seeking for: PCI ID 8086:0ea0
[ 21.359307] EDAC sbridge: Seeking for: PCI ID 8086:0ea8
[ 21.359315] EDAC sbridge: Seeking for: PCI ID 8086:0ea8
[ 21.359337] EDAC sbridge: Seeking for: PCI ID 8086:0ea8
[ 21.359341] EDAC sbridge: Seeking for: PCI ID 8086:0e71
[ 21.359349] EDAC sbridge: Seeking for: PCI ID 8086:0e71
[ 21.359357] EDAC sbridge: Seeking for: PCI ID 8086:0e71
[ 21.359360] EDAC sbridge: Seeking for: PCI ID 8086:0eaa
[ 21.359369] EDAC sbridge: Seeking for: PCI ID 8086:0eaa
[ 21.359377] EDAC sbridge: Seeking for: PCI ID 8086:0eaa
[ 21.359379] EDAC sbridge: Seeking for: PCI ID 8086:0eab
[ 21.359387] EDAC sbridge: Seeking for: PCI ID 8086:0eab
[ 21.359395] EDAC sbridge: Seeking for: PCI ID 8086:0eab
[ 21.359398] EDAC sbridge: Seeking for: PCI ID 8086:0eac
[ 21.359405] EDAC sbridge: Seeking for: PCI ID 8086:0eac
[ 21.359413] EDAC sbridge: Seeking for: PCI ID 8086:0eac
[ 21.359416] EDAC sbridge: Seeking for: PCI ID 8086:0ead
[ 21.359424] EDAC sbridge: Seeking for: PCI ID 8086:0ead
[ 21.359432] EDAC sbridge: Seeking for: PCI ID 8086:0ead
[ 21.359435] EDAC sbridge: Seeking for: PCI ID 8086:0ec8
[ 21.359444] EDAC sbridge: Seeking for: PCI ID 8086:0ec8
[ 21.359451] EDAC sbridge: Seeking for: PCI ID 8086:0ec8
[ 21.359452] EDAC sbridge: Seeking for: PCI ID 8086:0ec9
[ 21.359462] EDAC sbridge: Seeking for: PCI ID 8086:0ec9
[ 21.359469] EDAC sbridge: Seeking for: PCI ID 8086:0ec9
[ 21.359470] EDAC sbridge: Seeking for: PCI ID 8086:0eca
[ 21.359479] EDAC sbridge: Seeking for: PCI ID 8086:0eca
[ 21.359487] EDAC sbridge: Seeking for: PCI ID 8086:0eca
[ 21.359488] EDAC sbridge: Seeking for: PCI ID 8086:0e60
[ 21.359500] EDAC sbridge: Seeking for: PCI ID 8086:0e6a
[ 21.359512] EDAC sbridge: Seeking for: PCI ID 8086:0e6b
[ 21.359524] EDAC sbridge: Seeking for: PCI ID 8086:0e6c
[ 21.359536] EDAC sbridge: Seeking for: PCI ID 8086:0e6d
[ 21.359548] EDAC sbridge: Seeking for: PCI ID 8086:0eb8
[ 21.359560] EDAC sbridge: Seeking for: PCI ID 8086:0ebc
[ 21.359773] EDAC MC0: Giving out device to module sbridge_edac.c controller Ivy Bridge Socket#0: DEV 0000:3f:0e.0 (INTERRUPT)
[ 21.359961] EDAC MC1: Giving out device to module sbridge_edac.c controller Ivy Bridge Socket#1: DEV 0000:7f:0e.0 (INTERRUPT)
[ 21.359962] EDAC sbridge: Ver: 1.1.1
[6034477.865195] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6034477.865198] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6034477.865199] EDAC sbridge MC1: TSC 0
[6034477.865201] EDAC sbridge MC1: ADDR 8a0148980
[6034477.865204] EDAC sbridge MC1: MISC 152748600
[6034477.865208] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569325013 SOCKET 1 APIC 20
[6034477.865227] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6034477.865229] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6034477.865230] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6034477.865231] EDAC sbridge MC1: TSC 0
[6034477.865234] EDAC sbridge MC1: ADDR 0
[6034477.865235] EDAC sbridge MC1: MISC 4900080008001000
[6034477.865237] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569325013 SOCKET 1 APIC 20
[6150270.298008] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150270.298010] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6150270.298011] EDAC sbridge MC1: TSC 0
[6150270.298012] EDAC sbridge MC1: ADDR 8a0148980
[6150270.298013] EDAC sbridge MC1: MISC 20760400
[6150270.298014] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569440802 SOCKET 1 APIC 20
[6150270.298035] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6150270.298036] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150270.298038] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6150270.298038] EDAC sbridge MC1: TSC 0
[6150270.298039] EDAC sbridge MC1: ADDR 0
[6150270.298040] EDAC sbridge MC1: MISC 4900080008001000
[6150270.298041] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569440802 SOCKET 1 APIC 20
[6150270.897987] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150270.897989] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6150270.897990] EDAC sbridge MC1: TSC 0
[6150270.897991] EDAC sbridge MC1: ADDR 8a0148980
[6150270.897991] EDAC sbridge MC1: MISC 40761a00
[6150270.897993] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569440803 SOCKET 1 APIC 20
[6150270.898013] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6150270.898014] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150270.898016] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6150270.898016] EDAC sbridge MC1: TSC 0
[6150270.898017] EDAC sbridge MC1: ADDR 0
[6150270.898017] EDAC sbridge MC1: MISC 4900080008001000
[6150270.898019] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569440803 SOCKET 1 APIC 20
[6150872.553454] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150872.553458] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6150872.553460] EDAC sbridge MC1: TSC 0
[6150872.553462] EDAC sbridge MC1: ADDR 8a0148980
[6150872.553463] EDAC sbridge MC1: MISC 40742c00
[6150872.553466] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569441404 SOCKET 1 APIC 20
[6150872.553492] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6150872.553494] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150872.553497] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6150872.553498] EDAC sbridge MC1: TSC 0
[6150872.553499] EDAC sbridge MC1: ADDR 0
[6150872.553500] EDAC sbridge MC1: MISC 4900080008001000
[6150872.553503] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569441404 SOCKET 1 APIC 20
[6150876.712633] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150876.712635] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6150876.712636] EDAC sbridge MC1: TSC 0
[6150876.712637] EDAC sbridge MC1: ADDR 8a0148980
[6150876.712637] EDAC sbridge MC1: MISC 30761800
[6150876.712639] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569441409 SOCKET 1 APIC 20
[6150876.712657] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6150876.712658] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150876.712659] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6150876.712660] EDAC sbridge MC1: TSC 0
[6150876.712661] EDAC sbridge MC1: ADDR 0
[6150876.712661] EDAC sbridge MC1: MISC 4900080008001000
[6150876.712662] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569441409 SOCKET 1 APIC 20
[6150895.186194] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150895.186198] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6150895.186200] EDAC sbridge MC1: TSC 0
[6150895.186201] EDAC sbridge MC1: ADDR 8a0148980
[6150895.186202] EDAC sbridge MC1: MISC 30742c00
[6150895.186204] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569441427 SOCKET 1 APIC 20
[6150895.186228] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6150895.186229] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6150895.186232] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6150895.186233] EDAC sbridge MC1: TSC 0
[6150895.186234] EDAC sbridge MC1: ADDR 0
[6150895.186235] EDAC sbridge MC1: MISC 4900080008001000
[6150895.186237] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569441427 SOCKET 1 APIC 20
[6237292.404028] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237292.404030] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6237292.404030] EDAC sbridge MC1: TSC 0
[6237292.404032] EDAC sbridge MC1: ADDR 8a0148980
[6237292.404033] EDAC sbridge MC1: MISC 3076b200
[6237292.404034] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527822 SOCKET 1 APIC 20
[6237292.404050] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6237292.404051] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237292.404052] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6237292.404052] EDAC sbridge MC1: TSC 0
[6237292.404053] EDAC sbridge MC1: ADDR 0
[6237292.404054] EDAC sbridge MC1: MISC 4900080008001000
[6237292.404055] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527822 SOCKET 1 APIC 20
[6237293.796632] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237293.796635] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6237293.796635] EDAC sbridge MC1: TSC 0
[6237293.796636] EDAC sbridge MC1: ADDR 8a0148980
[6237293.796637] EDAC sbridge MC1: MISC 4274e000
[6237293.796638] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527823 SOCKET 1 APIC 20
[6237293.796655] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6237293.796656] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237293.796657] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6237293.796658] EDAC sbridge MC1: TSC 0
[6237293.796658] EDAC sbridge MC1: ADDR 0
[6237293.796659] EDAC sbridge MC1: MISC 4900080008001000
[6237293.796660] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527823 SOCKET 1 APIC 20
[6237298.415035] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237298.415038] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6237298.415039] EDAC sbridge MC1: TSC 0
[6237298.415040] EDAC sbridge MC1: ADDR 8a0148980
[6237298.415040] EDAC sbridge MC1: MISC 2076aa00
[6237298.415042] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527828 SOCKET 1 APIC 20
[6237298.415057] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6237298.415058] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237298.415059] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6237298.415060] EDAC sbridge MC1: TSC 0
[6237298.415061] EDAC sbridge MC1: ADDR 0
[6237298.415061] EDAC sbridge MC1: MISC 4900080008001000
[6237298.415062] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527828 SOCKET 1 APIC 20
[6237302.366425] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237302.366429] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6237302.366430] EDAC sbridge MC1: TSC 0
[6237302.366431] EDAC sbridge MC1: ADDR 8a0148980
[6237302.366432] EDAC sbridge MC1: MISC 20728e00
[6237302.366433] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527832 SOCKET 1 APIC 20
[6237302.366450] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6237302.366451] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237302.366452] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6237302.366453] EDAC sbridge MC1: TSC 0
[6237302.366453] EDAC sbridge MC1: ADDR 0
[6237302.366454] EDAC sbridge MC1: MISC 4900080008001000
[6237302.366455] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527832 SOCKET 1 APIC 20
[6237305.555443] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237305.555445] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6237305.555446] EDAC sbridge MC1: TSC 0
[6237305.555447] EDAC sbridge MC1: ADDR 8a0148980
[6237305.555448] EDAC sbridge MC1: MISC 3076c400
[6237305.555449] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527835 SOCKET 1 APIC 20
[6237305.555465] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6237305.555466] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237305.555467] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6237305.555467] EDAC sbridge MC1: TSC 0
[6237305.555468] EDAC sbridge MC1: ADDR 0
[6237305.555468] EDAC sbridge MC1: MISC 4900080008001000
[6237305.555470] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527835 SOCKET 1 APIC 20
[6237319.178913] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237319.178916] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6237319.178917] EDAC sbridge MC1: TSC 0
[6237319.178918] EDAC sbridge MC1: ADDR 8a0148980
[6237319.178919] EDAC sbridge MC1: MISC 40769200
[6237319.178920] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527848 SOCKET 1 APIC 20
[6237319.178937] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6237319.178938] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6237319.178940] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6237319.178940] EDAC sbridge MC1: TSC 0
[6237319.178941] EDAC sbridge MC1: ADDR 0
[6237319.178941] EDAC sbridge MC1: MISC 4900080008001000
[6237319.178943] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569527848 SOCKET 1 APIC 20
[6243094.903829] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6243094.903831] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6243094.903834] EDAC sbridge MC1: TSC 0
[6243094.903836] EDAC sbridge MC1: ADDR 8a0148980
[6243094.903836] EDAC sbridge MC1: MISC 5072b800
[6243094.903837] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569533624 SOCKET 1 APIC 20
[6243094.903855] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6243094.903856] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6243094.903857] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6243094.903857] EDAC sbridge MC1: TSC 0
[6243094.903858] EDAC sbridge MC1: ADDR 0
[6243094.903859] EDAC sbridge MC1: MISC 4900080008001000
[6243094.903860] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569533624 SOCKET 1 APIC 20
[6243095.777756] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6243095.777757] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6243095.777758] EDAC sbridge MC1: TSC 0
[6243095.777759] EDAC sbridge MC1: ADDR 8a0148980
[6243095.777760] EDAC sbridge MC1: MISC 14076a800
[6243095.777762] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569533625 SOCKET 1 APIC 20
[6243095.777777] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6243095.777779] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6243095.777780] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6243095.777780] EDAC sbridge MC1: TSC 0
[6243095.777781] EDAC sbridge MC1: ADDR 0
[6243095.777782] EDAC sbridge MC1: MISC 4900080008001000
[6243095.777783] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569533625 SOCKET 1 APIC 20
[6244923.086758] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6244923.086765] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000048000800c1
[6244923.086766] EDAC sbridge MC1: TSC 0
[6244923.086769] EDAC sbridge MC1: ADDR 8a0148000
[6244923.086770] EDAC sbridge MC1: MISC 90008000800108c
[6244923.086773] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569535452 SOCKET 1 APIC 20
[6244923.086798] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:0)
[6253244.733486] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6253244.733488] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010091
[6253244.733491] EDAC sbridge MC1: TSC 0
[6253244.733492] EDAC sbridge MC1: ADDR 8a0148980
[6253244.733493] EDAC sbridge MC1: MISC 40765c00
[6253244.733494] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569543774 SOCKET 1 APIC 20
[6253244.733509] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x980 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
[6253244.733510] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6253244.733512] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8800004800800091
[6253244.733512] EDAC sbridge MC1: TSC 0
[6253244.733513] EDAC sbridge MC1: ADDR 0
[6253244.733513] EDAC sbridge MC1: MISC 490008000800108c
[6253244.733514] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569543774 SOCKET 1 APIC 20
[6265568.680435] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[6265568.680439] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000048000800c1
[6265568.680440] EDAC sbridge MC1: TSC 0
[6265568.680442] EDAC sbridge MC1: ADDR 8a0148000
[6265568.680443] EDAC sbridge MC1: MISC 90008000800108c
[6265568.680446] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1569556097 SOCKET 1 APIC 20
[6265568.680471] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:0)
[8478767.304267] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[8478767.304269] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000048000800c1
[8478767.304270] EDAC sbridge MC1: TSC 0
[8478767.304272] EDAC sbridge MC1: ADDR 8a0148000
[8478767.304273] EDAC sbridge MC1: MISC 90008000800108c
[8478767.304274] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1571769231 SOCKET 1 APIC 20
[8478767.304291] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:0)
[8762549.014353] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[8762549.014355] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000048000800c1
[8762549.014357] EDAC sbridge MC1: TSC 0
[8762549.014358] EDAC sbridge MC1: ADDR 8a0148000
[8762549.014359] EDAC sbridge MC1: MISC 90008000800108c
[8762549.014361] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1572053004 SOCKET 1 APIC 20
[8762549.014382] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:0)
[9190830.514528] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[9190830.514532] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000048000800c1
[9190830.514533] EDAC sbridge MC1: TSC 0
[9190830.514534] EDAC sbridge MC1: ADDR 8a0148000
[9190830.514535] EDAC sbridge MC1: MISC 90008000800108c
[9190830.514537] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1572481273 SOCKET 1 APIC 20
[9190830.514554] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8a0148 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:0)

mw1239 is well out of warranty and over 5 years old. Historically we decommission hosts at this stage of their life. We also have several new MW servers waiting to be racked that will be online soon. If you want to replace the DIMM, we will need to purchase a new stick.

@Cmjohnson mw1239 will be decommissioned soon via https://phabricator.wikimedia.org/T239054, so we can close this task.

Dzahn closed this task as Declined. Nov 25 2019, 11:40 PM