(OoW) wtp2011 memory correctable errors
Closed, Resolved (Public)

Description

Jul 29 22:18:21 wtp2011 kernel: [12998594.544360] mce: [Hardware Error]: Machine check events logged
Jul 29 22:18:21 wtp2011 kernel: [12998594.544379] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 29 22:18:21 wtp2011 kernel: [12998594.544388] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 29 22:18:21 wtp2011 kernel: [12998594.544393] EDAC sbridge MC1: TSC 0 
Jul 29 22:18:21 wtp2011 kernel: [12998594.544395] EDAC sbridge MC1: ADDR e37c9e000 
Jul 29 22:18:21 wtp2011 kernel: [12998594.544396] EDAC sbridge MC1: MISC 90860002000028c 
Jul 29 22:18:21 wtp2011 kernel: [12998594.544398] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532902701 SOCKET 1 APIC 20
Jul 29 22:18:21 wtp2011 kernel: [12998594.544419] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 29 23:46:29 wtp2011 kernel: [13003881.930300] mce: [Hardware Error]: Machine check events logged
Jul 29 23:46:29 wtp2011 kernel: [13003881.930317] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 29 23:46:29 wtp2011 kernel: [13003881.930320] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 29 23:46:29 wtp2011 kernel: [13003881.930321] EDAC sbridge MC1: TSC 0 
Jul 29 23:46:29 wtp2011 kernel: [13003881.930322] EDAC sbridge MC1: ADDR e37c9e000 
Jul 29 23:46:29 wtp2011 kernel: [13003881.930323] EDAC sbridge MC1: MISC 90860002000028c 
Jul 29 23:46:29 wtp2011 kernel: [13003881.930324] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532907989 SOCKET 1 APIC 20
Jul 29 23:46:29 wtp2011 kernel: [13003881.930339] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 01:17:31 wtp2011 kernel: [13009344.432079] mce: [Hardware Error]: Machine check events logged
Jul 30 01:17:31 wtp2011 kernel: [13009344.432097] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 01:17:31 wtp2011 kernel: [13009344.432101] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 01:17:31 wtp2011 kernel: [13009344.432103] EDAC sbridge MC1: TSC 0 
Jul 30 01:17:31 wtp2011 kernel: [13009344.432104] EDAC sbridge MC1: ADDR e37c9e000 
Jul 30 01:17:31 wtp2011 kernel: [13009344.432105] EDAC sbridge MC1: MISC 90860002000028c 
Jul 30 01:17:31 wtp2011 kernel: [13009344.432106] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532913451 SOCKET 1 APIC 20
Jul 30 01:17:31 wtp2011 kernel: [13009344.432124] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 02:45:36 wtp2011 kernel: [13014629.951836] mce: [Hardware Error]: Machine check events logged
Jul 30 02:45:36 wtp2011 kernel: [13014629.951854] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 02:45:36 wtp2011 kernel: [13014629.951855] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 02:45:36 wtp2011 kernel: [13014629.951859] EDAC sbridge MC1: TSC 0 
Jul 30 02:45:36 wtp2011 kernel: [13014629.951860] EDAC sbridge MC1: ADDR e37c9e000 
Jul 30 02:45:36 wtp2011 kernel: [13014629.951860] EDAC sbridge MC1: MISC 90860002000028c 
Jul 30 02:45:36 wtp2011 kernel: [13014629.951862] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532918736 SOCKET 1 APIC 20
Jul 30 02:45:36 wtp2011 kernel: [13014629.951877] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 04:14:23 wtp2011 kernel: [13019956.807746] mce: [Hardware Error]: Machine check events logged
Jul 30 04:14:23 wtp2011 kernel: [13019956.807763] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 04:14:23 wtp2011 kernel: [13019956.807770] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 04:14:23 wtp2011 kernel: [13019956.807773] EDAC sbridge MC1: TSC 0 
Jul 30 04:14:23 wtp2011 kernel: [13019956.807774] EDAC sbridge MC1: ADDR e37c9e000 
Jul 30 04:14:23 wtp2011 kernel: [13019956.807775] EDAC sbridge MC1: MISC 90860002000028c 
Jul 30 04:14:23 wtp2011 kernel: [13019956.807776] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532924063 SOCKET 1 APIC 20
Jul 30 04:14:23 wtp2011 kernel: [13019956.807794] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 06:25:04 wtp2011 kernel: [13027798.446741] Process accounting resumed
Jul 30 07:15:41 wtp2011 kernel: [13030835.501344] mce: [Hardware Error]: Machine check events logged
Jul 30 07:15:41 wtp2011 kernel: [13030835.501361] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 07:15:41 wtp2011 kernel: [13030835.501365] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 07:15:41 wtp2011 kernel: [13030835.501366] EDAC sbridge MC1: TSC 0 
Jul 30 07:15:41 wtp2011 kernel: [13030835.501367] EDAC sbridge MC1: ADDR e37c9e000 
Jul 30 07:15:41 wtp2011 kernel: [13030835.501368] EDAC sbridge MC1: MISC 90860002000028c 
Jul 30 07:15:41 wtp2011 kernel: [13030835.501369] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532934941 SOCKET 1 APIC 20
Jul 30 07:15:41 wtp2011 kernel: [13030835.501385] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 08:43:45 wtp2011 kernel: [13036119.512785] mce: [Hardware Error]: Machine check events logged
Jul 30 08:43:45 wtp2011 kernel: [13036119.512819] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 08:43:45 wtp2011 kernel: [13036119.512820] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 08:43:45 wtp2011 kernel: [13036119.512821] EDAC sbridge MC1: TSC 0 
Jul 30 08:43:45 wtp2011 kernel: [13036119.512822] EDAC sbridge MC1: ADDR e37c9e000 
Jul 30 08:43:45 wtp2011 kernel: [13036119.512822] EDAC sbridge MC1: MISC 90860002000028c 
Jul 30 08:43:45 wtp2011 kernel: [13036119.512824] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532940225 SOCKET 1 APIC 20
Jul 30 08:43:45 wtp2011 kernel: [13036119.512838] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 10:11:10 wtp2011 kernel: [13041364.886885] mce: [Hardware Error]: Machine check events logged
Jul 30 10:11:10 wtp2011 kernel: [13041364.886915] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 10:11:10 wtp2011 kernel: [13041364.886917] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 10:11:10 wtp2011 kernel: [13041364.886920] EDAC sbridge MC1: TSC 0 
Jul 30 10:11:10 wtp2011 kernel: [13041364.886921] EDAC sbridge MC1: ADDR e37c9e000 
Jul 30 10:11:10 wtp2011 kernel: [13041364.886921] EDAC sbridge MC1: MISC 90860002000028c 
Jul 30 10:11:10 wtp2011 kernel: [13041364.886923] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532945470 SOCKET 1 APIC 20
Jul 30 10:11:10 wtp2011 kernel: [13041364.886937] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
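
The log above repeats the same correctable-error (CE) event several times. A minimal Python sketch for tallying such events per DIMM and page, and for unpacking the machine-check status word, could look like the following. The function names `summarize_ce` and `decode_mci_status` are hypothetical, the regular expression is tuned only to the line format shown in this task, and the bit positions assume the IA32_MCi_STATUS layout from the Intel SDM (including that the CPU reports the corrected-error-count field).

```python
import re
from collections import Counter

# Matches the EDAC summary line format seen in this task, e.g.
# "EDAC MC1: 1 CE memory scrubbing error on <DIMM label> (... page:0x... ...)".
# The "EDAC sbridge MC1: ..." detail lines intentionally do not match.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<dimm>\S+) "
    r"\(.*?page:(?P<page>0x[0-9a-f]+)"
)


def summarize_ce(lines):
    """Sum correctable-error counts keyed by (DIMM label, page)."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            totals[(match["dimm"], match["page"])] += int(match["count"])
    return totals


def decode_mci_status(status):
    """Split a 64-bit machine-check status value into a few named fields.

    Assumed layout (Intel SDM, IA32_MCi_STATUS): bit 63 VAL, bit 61 UC,
    bit 59 MISCV, bit 58 ADDRV, bits 52:38 corrected error count,
    bits 31:16 model-specific code, bits 15:0 MCA error code.
    """
    return {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python3 ce_summary.py < kern.log
    for (dimm, page), count in summarize_ce(sys.stdin).items():
        print(f"{dimm} page {page}: {count} CE")
```

Under those assumptions, the status value 8c000041000800c1 decodes to a valid, corrected (not uncorrected) error with model-specific code 0008 and MCA code 00c1, which agrees with the err_code:0008:00c1 field that EDAC prints. Every event in the log points at the same DIMM label and page.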

Event Timeline

herron triaged this task as High priority. Jul 31 2018, 8:02 PM

Mentioned in SAL (#wikimedia-operations) [2018-09-14T21:03:15Z] <mutante> ACKed memory error alert on wtp2011 - existing ticket but fresh alert popped up 9h ago (T200678)

wiki_willy renamed this task from wtp2011 memory correctable errors to (OoW) wtp2011 memory correctable errors. Jul 15 2019, 8:55 PM
wiki_willy assigned this task to Papaul.
Papaul lowered the priority of this task from High to Medium. Jul 17 2019, 8:23 PM

No memory errors are showing in the log on this system. Upgrade IDRAC from 1.5 to 2.6. A new BIOS version is available; we need to depool the server for the upgrade.

I ran SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service wtp2011 'depool service=parsoid' from deployment.eqiad.wmnet, and it's now showing {"wtp2011.codfw.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
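
The status printed by the depool command above is plain JSON, so the pooled state can be checked in a script rather than by eye. This is a minimal sketch: the helper name `is_pooled` is hypothetical, and it assumes the output keeps the shape quoted in this comment (a top-level key per host with a "pooled" field of "yes" or "no").

```python
import json


def is_pooled(status_json, host):
    """Return True if the given host is marked as pooled in the status JSON.

    Assumes the structure shown in this task:
    {"<host>": {"weight": ..., "pooled": "yes" | "no"}, "tags": "..."}
    """
    state = json.loads(status_json)
    return state[host]["pooled"] == "yes"


if __name__ == "__main__":
    # Sample taken verbatim from the comment above.
    sample = (
        '{"wtp2011.codfw.wmnet": {"weight": 10, "pooled": "no"}, '
        '"tags": "dc=codfw,cluster=parsoid,service=parsoid"}'
    )
    print(is_pooled(sample, "wtp2011.codfw.wmnet"))
```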

Tailing /srv/log/parsoid/main.log on the host and checking the graphs both seem to confirm it:
https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-instance=All

After the upgrade, I confirmed the host was still running the currently deployed version,

arlolra@wtp2011:~$ curl localhost:8000/version
{"name":"parsoid","version":"0.10.0+git","sha":"7232dfff04a305db11f6c6de33cecfae0b1a7801"}

and then repooled it,

Pooling parsoid on wtp2011.codfw.wmnet
codfw/parsoid/parsoid/wtp2011.codfw.wmnet: pooled changed no => yes

BIOS and IDRAC are now up to date on the server, and there are no memory errors in the log. Closing this task for now; we can reopen it if we see any errors again.
Thanks, @Arlolra.