Jul 29 22:18:21 wtp2011 kernel: [12998594.544360] mce: [Hardware Error]: Machine check events logged
Jul 29 22:18:21 wtp2011 kernel: [12998594.544379] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 29 22:18:21 wtp2011 kernel: [12998594.544388] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 29 22:18:21 wtp2011 kernel: [12998594.544393] EDAC sbridge MC1: TSC 0
Jul 29 22:18:21 wtp2011 kernel: [12998594.544395] EDAC sbridge MC1: ADDR e37c9e000
Jul 29 22:18:21 wtp2011 kernel: [12998594.544396] EDAC sbridge MC1: MISC 90860002000028c
Jul 29 22:18:21 wtp2011 kernel: [12998594.544398] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532902701 SOCKET 1 APIC 20
Jul 29 22:18:21 wtp2011 kernel: [12998594.544419] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 29 23:46:29 wtp2011 kernel: [13003881.930300] mce: [Hardware Error]: Machine check events logged
Jul 29 23:46:29 wtp2011 kernel: [13003881.930317] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 29 23:46:29 wtp2011 kernel: [13003881.930320] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 29 23:46:29 wtp2011 kernel: [13003881.930321] EDAC sbridge MC1: TSC 0
Jul 29 23:46:29 wtp2011 kernel: [13003881.930322] EDAC sbridge MC1: ADDR e37c9e000
Jul 29 23:46:29 wtp2011 kernel: [13003881.930323] EDAC sbridge MC1: MISC 90860002000028c
Jul 29 23:46:29 wtp2011 kernel: [13003881.930324] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532907989 SOCKET 1 APIC 20
Jul 29 23:46:29 wtp2011 kernel: [13003881.930339] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 01:17:31 wtp2011 kernel: [13009344.432079] mce: [Hardware Error]: Machine check events logged
Jul 30 01:17:31 wtp2011 kernel: [13009344.432097] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 01:17:31 wtp2011 kernel: [13009344.432101] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 01:17:31 wtp2011 kernel: [13009344.432103] EDAC sbridge MC1: TSC 0
Jul 30 01:17:31 wtp2011 kernel: [13009344.432104] EDAC sbridge MC1: ADDR e37c9e000
Jul 30 01:17:31 wtp2011 kernel: [13009344.432105] EDAC sbridge MC1: MISC 90860002000028c
Jul 30 01:17:31 wtp2011 kernel: [13009344.432106] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532913451 SOCKET 1 APIC 20
Jul 30 01:17:31 wtp2011 kernel: [13009344.432124] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 02:45:36 wtp2011 kernel: [13014629.951836] mce: [Hardware Error]: Machine check events logged
Jul 30 02:45:36 wtp2011 kernel: [13014629.951854] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 02:45:36 wtp2011 kernel: [13014629.951855] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 02:45:36 wtp2011 kernel: [13014629.951859] EDAC sbridge MC1: TSC 0
Jul 30 02:45:36 wtp2011 kernel: [13014629.951860] EDAC sbridge MC1: ADDR e37c9e000
Jul 30 02:45:36 wtp2011 kernel: [13014629.951860] EDAC sbridge MC1: MISC 90860002000028c
Jul 30 02:45:36 wtp2011 kernel: [13014629.951862] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532918736 SOCKET 1 APIC 20
Jul 30 02:45:36 wtp2011 kernel: [13014629.951877] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 04:14:23 wtp2011 kernel: [13019956.807746] mce: [Hardware Error]: Machine check events logged
Jul 30 04:14:23 wtp2011 kernel: [13019956.807763] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 04:14:23 wtp2011 kernel: [13019956.807770] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 04:14:23 wtp2011 kernel: [13019956.807773] EDAC sbridge MC1: TSC 0
Jul 30 04:14:23 wtp2011 kernel: [13019956.807774] EDAC sbridge MC1: ADDR e37c9e000
Jul 30 04:14:23 wtp2011 kernel: [13019956.807775] EDAC sbridge MC1: MISC 90860002000028c
Jul 30 04:14:23 wtp2011 kernel: [13019956.807776] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532924063 SOCKET 1 APIC 20
Jul 30 04:14:23 wtp2011 kernel: [13019956.807794] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 06:25:04 wtp2011 kernel: [13027798.446741] Process accounting resumed
Jul 30 07:15:41 wtp2011 kernel: [13030835.501344] mce: [Hardware Error]: Machine check events logged
Jul 30 07:15:41 wtp2011 kernel: [13030835.501361] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 07:15:41 wtp2011 kernel: [13030835.501365] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 07:15:41 wtp2011 kernel: [13030835.501366] EDAC sbridge MC1: TSC 0
Jul 30 07:15:41 wtp2011 kernel: [13030835.501367] EDAC sbridge MC1: ADDR e37c9e000
Jul 30 07:15:41 wtp2011 kernel: [13030835.501368] EDAC sbridge MC1: MISC 90860002000028c
Jul 30 07:15:41 wtp2011 kernel: [13030835.501369] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532934941 SOCKET 1 APIC 20
Jul 30 07:15:41 wtp2011 kernel: [13030835.501385] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 08:43:45 wtp2011 kernel: [13036119.512785] mce: [Hardware Error]: Machine check events logged
Jul 30 08:43:45 wtp2011 kernel: [13036119.512819] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 08:43:45 wtp2011 kernel: [13036119.512820] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 08:43:45 wtp2011 kernel: [13036119.512821] EDAC sbridge MC1: TSC 0
Jul 30 08:43:45 wtp2011 kernel: [13036119.512822] EDAC sbridge MC1: ADDR e37c9e000
Jul 30 08:43:45 wtp2011 kernel: [13036119.512822] EDAC sbridge MC1: MISC 90860002000028c
Jul 30 08:43:45 wtp2011 kernel: [13036119.512824] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532940225 SOCKET 1 APIC 20
Jul 30 08:43:45 wtp2011 kernel: [13036119.512838] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Jul 30 10:11:10 wtp2011 kernel: [13041364.886885] mce: [Hardware Error]: Machine check events logged
Jul 30 10:11:10 wtp2011 kernel: [13041364.886915] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 30 10:11:10 wtp2011 kernel: [13041364.886917] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000041000800c1
Jul 30 10:11:10 wtp2011 kernel: [13041364.886920] EDAC sbridge MC1: TSC 0
Jul 30 10:11:10 wtp2011 kernel: [13041364.886921] EDAC sbridge MC1: ADDR e37c9e000
Jul 30 10:11:10 wtp2011 kernel: [13041364.886921] EDAC sbridge MC1: MISC 90860002000028c
Jul 30 10:11:10 wtp2011 kernel: [13041364.886923] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1532945470 SOCKET 1 APIC 20
Jul 30 10:11:10 wtp2011 kernel: [13041364.886937] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
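Every corrected-error (CE) summary line in the log above names the same DIMM label and the same page. A minimal Python sketch for tallying such lines per DIMM and page might look like the following. The regular expression is written against the line format seen in this log only, and the names `CE_PATTERN` and `summarize_ce` are hypothetical helpers, not part of any EDAC tooling.

```python
import re
from collections import Counter

# Matches the "EDAC MCn: <count> CE <message> on <label> (channel:.. slot:.. page:..)"
# summary lines as they appear in this log. The per-field "EDAC sbridge MCn:" lines
# do not match because of the extra "sbridge" token.
CE_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE (?P<msg>.+?) on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) page:(?P<page>0x[0-9a-f]+)"
)


def summarize_ce(log_text):
    """Return a Counter of corrected-error counts keyed by (DIMM label, page)."""
    totals = Counter()
    for match in CE_PATTERN.finditer(log_text):
        key = (match.group("label"), match.group("page"))
        totals[key] += int(match.group("count"))
    return totals


# Two CE summary lines copied from the log above.
sample = (
    "Jul 29 22:18:21 wtp2011 kernel: [12998594.544419] EDAC MC1: 1 CE memory "
    "scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 "
    "page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM "
    "err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)\n"
    "Jul 29 23:46:29 wtp2011 kernel: [13003881.930339] EDAC MC1: 1 CE memory "
    "scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 "
    "page:0xe37c9e offset:0x0 grain:32 syndrome:0x0 - area:DRAM "
    "err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)\n"
)

for (label, page), count in summarize_ce(sample).items():
    print(f"{label} page {page}: {count} corrected error(s)")
```

Because the pattern uses `finditer` over the whole text, it should also work on a log whose line breaks were lost in copying.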
Description
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2018-09-14T21:03:15Z] <mutante> ACKed memory error alert on wtp2011 - existing ticket but fresh alert popped up 9h ago (T200678)
No memory errors are showing in the log on this system. Upgrade iDRAC from 1.5 to 2.6. A new BIOS version is also available; we need to depool the server for the upgrade.
I ran
SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service wtp2011 'depool service=parsoid'
from deployment.eqiad.wmnet and it's now showing
{"wtp2011.codfw.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=codfw,cluster=parsoid,service=parsoid"}
Tailing /srv/log/parsoid/main.log on the host and checking the graphs both seem to confirm it,
https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=parsoid&var-instance=All
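The pooled status printed after the depool command is a single JSON object, so the state can also be checked programmatically rather than by eye. A small Python sketch, assuming the status line has exactly the shape quoted in this comment; `pooled_state` and `parse_tags` are hypothetical helper names, not part of the pooling tooling.

```python
import json


def pooled_state(status_line, host):
    """Return the 'pooled' value ("yes"/"no") reported for the given host."""
    status = json.loads(status_line)
    return status[host]["pooled"]


def parse_tags(status_line):
    """Split the comma-separated 'tags' string into a key/value dict."""
    tags = json.loads(status_line)["tags"]
    return dict(item.split("=", 1) for item in tags.split(","))


# Status line as quoted in this comment.
line = (
    '{"wtp2011.codfw.wmnet": {"weight": 10, "pooled": "no"}, '
    '"tags": "dc=codfw,cluster=parsoid,service=parsoid"}'
)

print(pooled_state(line, "wtp2011.codfw.wmnet"))
print(parse_tags(line))
```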
After the upgrade, I confirmed the host was still running the currently deployed version,
arlolra@wtp2011:~$ curl localhost:8000/version
{"name":"parsoid","version":"0.10.0+git","sha":"7232dfff04a305db11f6c6de33cecfae0b1a7801"}
and then repooled it,
Pooling parsoid on wtp2011.codfw.wmnet
codfw/parsoid/parsoid/wtp2011.codfw.wmnet: pooled changed no => yes
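The version check before repooling compares the `sha` field of the `/version` response against whatever commit is currently deployed. A hedged Python sketch of that comparison follows. The expected sha is passed in as a parameter because this task does not say where the deployed sha is recorded, and `running_sha` and `matches_deployed` are hypothetical names.

```python
import json


def running_sha(version_body):
    """Extract the 'sha' field from a /version response body."""
    return json.loads(version_body)["sha"]


def matches_deployed(version_body, expected_sha):
    """True if the running sha starts with the expected (possibly abbreviated) sha."""
    sha = running_sha(version_body)
    return bool(expected_sha) and sha.startswith(expected_sha)


# Response body as quoted in this comment.
body = (
    '{"name":"parsoid","version":"0.10.0+git",'
    '"sha":"7232dfff04a305db11f6c6de33cecfae0b1a7801"}'
)

print(running_sha(body))
print(matches_deployed(body, "7232dfff"))
```

Accepting a prefix lets the caller pass an abbreviated commit hash; an empty expected value is rejected so that a missing input cannot pass the check by accident.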
BIOS and iDRAC are now up to date on the server, and there are no memory errors in the log. Closing this task for now; we can reopen it if we see any errors again.
Thanks, @Arlolra.