Change Details

In T214516 @RobH happened to notice correctable memory errors on `cp4026`. But nothing had fired in Icinga. Investigation so far is below.

One possible option for now is to convert the Icinga rule to use the `mtail`-generated stats, although that seems suboptimal in the long run because more moving parts are involved.

What we found so far: (disclaimer: I don't actually know the EDAC subsystem at all and all of the below is from a cursory reading of the docs)

Icinga had not reported these because it uses the metrics from node-exporter, which themselves are backed by the sysfs files in `/sys/devices`. All of them report 0 on that machine:

```
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat $(find -name '*ce*_count*' ) | grep -v '^0$'
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc %
```

The issue doesn't seem to be something having touched `reset_counters`, as `seconds_since_reset` gives a value of something like 119 days (which as it turns out is the uptime of the machine):

```
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat */seconds_since_reset
10276896
10276896
```

However, the errors did make their way to the kernel and software stack on the machine.

```
cdanis@cp4026.ulsfo.wmnet /var/log % head mcelog
Hardware event. This is not a software error.
MCE 0 CPU 0 Undecoded extended event ff TSC 3a231039a1fd59 ADDR 45f92081c0 TIME 1545453235 Sat Dec 22 04:33:55 2018 MCG status: MCi status: Corrected error Error enabled MCi_ADDR register valid
```

Although they've been rotated out of disk in `/var/log`, they're still in the `dmesg` buffer:

```
cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | head -n3
[7447051.961511] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x45f9208 offset:0x1c0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447056.470250] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x4678bc8 offset:0x80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447057.201331] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x486a1c8 offset:0xc0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | wc -l
18
```

Full log at https://phabricator.wikimedia.org/P8031

They also made their way into `rsyslog` where they were picked up by our `mtail` rules: https://grafana.wikimedia.org/d/nApOnklmk/xxx-cdanis-edac-events?panelId=2&fullscreen&orgId=1&from=1545282909855&to=1545711454290
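
For anyone wanting to repeat this comparison on other hosts, here is a rough sketch of the same cross-check in Python. It is not how node-exporter or our `mtail` rules work. The function names are made up for illustration, the file glob is copied from the `find` command above, and the regex is an assumption based only on the three `dmesg` lines quoted in this task, so other EDAC drivers may well log a different format.

```python
import fnmatch
import os
import re
from collections import Counter

# Default location of the EDAC memory-controller counters, as used above.
EDAC_ROOT = "/sys/devices/system/edac/mc"

# Assumed log format, inferred from lines such as
# "EDAC MC1: 0 CE memory read error on ..." quoted in this task.
CE_LINE = re.compile(r"EDAC MC(\d+): \d+ CE ")


def nonzero_sysfs_ce_counts(root=EDAC_ROOT):
    """Return {path: value} for every correctable-error counter that is not 0.

    Mirrors: cat $(find -name '*ce*_count*') | grep -v '^0$'
    """
    nonzero = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in fnmatch.filter(files, "*ce*_count*"):
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    value = int(f.read().strip())
            except (OSError, ValueError):
                continue  # unreadable or non-numeric file: skip it
            if value != 0:
                nonzero[path] = value
    return nonzero


def count_dmesg_ce_events(dmesg_text):
    """Count logged correctable-error events per memory controller index."""
    return Counter(int(m.group(1)) for m in CE_LINE.finditer(dmesg_text))


def counters_disagree(root, dmesg_text):
    """True when dmesg shows CE events but every sysfs CE counter reads 0."""
    logged = sum(count_dmesg_ce_events(dmesg_text).values())
    return logged > 0 and not nonzero_sysfs_ce_counts(root)
```

To use it, capture the output of `sudo dmesg` into a string and pass it to `counters_disagree(EDAC_ROOT, text)`. A `True` result corresponds to the situation seen on `cp4026`: events in the kernel log, zeros in sysfs.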