In T214516 @RobH happened to notice correctable memory errors on cp4026. But nothing had fired in Icinga. Investigation so far is below.
One possible option for now is to convert the Icinga rule to use the mtail-generated stats, although that seems suboptimal in the long run because more moving parts are involved.
What we found so far:
(disclaimer: I don't actually know the EDAC subsystem at all and all of the below is from a cursory reading of the docs)
Icinga had not reported these because it uses the metrics from node-exporter, which are themselves backed by the sysfs files in /sys/devices. All of the correctable-error counters report 0 on that machine:
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat $(find -name '*ce*_count*' ) | grep -v '^0$'
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc %
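For repeating this check on other hosts, here is a rough Python equivalent of the `find | grep -v '^0$'` one-liner above. It is only a sketch: the function name `nonzero_ce_counters` is made up for illustration, and it assumes every matching file holds a single integer, as the counters on cp4026 do.

```python
from pathlib import Path


def nonzero_ce_counters(root="/sys/devices/system/edac/mc"):
    """Return {relative path: count} for every CE counter file that is not 0.

    Hypothetical helper: it mirrors `find -name '*ce*_count*'` and then
    drops the zero values, like `grep -v '^0$'`.
    """
    base = Path(root)
    result = {}
    for path in sorted(base.rglob("*ce*_count*")):
        if not path.is_file():
            continue
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or non-numeric file: skip it rather than guess.
            continue
        if value != 0:
            result[str(path.relative_to(base))] = value
    return result


if __name__ == "__main__":
    counters = nonzero_ce_counters()
    print(counters if counters else "all CE counters are 0")
```

An empty result corresponds to the empty output seen on cp4026.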
The issue doesn't seem to be something having touched reset_counters, as seconds_since_reset gives a value of something like 119 days (which as it turns out is the uptime of the machine):
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat */seconds_since_reset
10276896
10276896
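As a sanity check on the "119 days" figure, the conversion is just a division by 86400 seconds per day. The variable names here are only illustrative:

```python
from datetime import timedelta

# Value read from */seconds_since_reset on cp4026 (see the output above).
seconds_since_reset = 10276896

elapsed = timedelta(seconds=seconds_since_reset)
days = seconds_since_reset / 86400  # 86400 seconds in a day

print(f"{days:.1f} days ({elapsed})")  # 118.9 days (118 days, 22:41:36)
```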
However, the errors did make their way to the kernel and software stack on the machine.
cdanis@cp4026.ulsfo.wmnet /var/log % head mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 Undecoded extended event ff TSC 3a231039a1fd59
ADDR 45f92081c0
TIME 1545453235 Sat Dec 22 04:33:55 2018
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
Although the kernel messages have since been rotated off disk in /var/log, they're still in the dmesg buffer:
cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | head -n3
[7447051.961511] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x45f9208 offset:0x1c0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447056.470250] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x4678bc8 offset:0x80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447057.201331] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x486a1c8 offset:0xc0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | wc -l
18
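To show what counting these log lines per DIMM could look like, here is a small Python sketch. It is not our mtail rule: the regex and the name `count_ce_by_dimm` are assumptions based only on the three sample lines above, and it makes no attempt to interpret the number printed before "CE".

```python
import re
from collections import Counter

# Assumed pattern, derived from the sample dmesg lines only; other EDAC
# drivers or kernel versions may format these messages differently.
EDAC_CE = re.compile(
    r"EDAC (?P<mc>MC\d+): \d+ CE (?P<msg>.+?) on (?P<label>\S+) \("
)


def count_ce_by_dimm(lines):
    """Count correctable-error log lines per (memory controller, DIMM label)."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            counts[(match.group("mc"), match.group("label"))] += 1
    return counts


sample = [
    "[7447051.961511] EDAC MC1: 0 CE memory read error on "
    "CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x45f9208 "
    "offset:0x1c0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f "
    "socket:1 ha:1 channel_mask:1 rank:1)",
]

for (mc, label), n in count_ce_by_dimm(sample).items():
    print(mc, label, n)
```

Feeding it the output of `sudo dmesg` line by line would give a per-DIMM tally that can be compared against the sysfs counters.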
Full log at https://phabricator.wikimedia.org/P8031
They also made their way into rsyslog where they were picked up by our mtail rules:
https://grafana.wikimedia.org/d/nApOnklmk/xxx-cdanis-edac-events?panelId=2&fullscreen&orgId=1&from=1545282909855&to=1545711454290