In T214516, @RobH happened to notice correctable memory errors on `cp4026`, but nothing had fired in Icinga. The investigation so far is below.
One possible option for now is to convert the Icinga rule to use the `mtail`-generated stats, although that seems suboptimal in the long run because more moving parts are involved.
What we found so far:
(disclaimer: I don't actually know the EDAC subsystem at all and all of the below is from a cursory reading of the docs)
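Icinga did not report these errors because its check uses the metrics from node-exporter, which are themselves backed by the EDAC sysfs files under `/sys/devices/system/edac/mc`. Every correctable-error counter there reports 0 on that machine, so filtering out the zero values leaves no output: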
```
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat $(find -name '*ce*_count*' ) | grep -v '^0$'
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc %
```
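The same check can be sketched in Python, which makes it easier to see which counter file a nonzero value comes from. This is a minimal sketch, not what node-exporter actually does: the function name `nonzero_ce_counters` is hypothetical, and the glob pattern simply mirrors the `find` expression above.

```python
from pathlib import Path

# Directory used in the shell session above.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def nonzero_ce_counters(root=EDAC_ROOT):
    """Return {relative path: count} for each CE counter file that is not 0.

    Hypothetical helper: it matches the same '*ce*_count*' pattern as the
    find command above and skips files that cannot be read or parsed.
    """
    root = Path(root)
    found = {}
    if not root.is_dir():
        return found
    for path in sorted(root.rglob("*ce*_count*")):
        if not path.is_file():
            continue
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue
        if value != 0:
            found[str(path.relative_to(root))] = value
    return found


if __name__ == "__main__":
    for name, value in nonzero_ce_counters().items():
        print(f"{name}: {value}")
```

On a host in the state shown above this would print nothing, matching the empty `grep -v` output.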
The cause doesn't seem to be that something wrote to `reset_counters`, as `seconds_since_reset` reports roughly 119 days:
```
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat */seconds_since_reset
10276896
10276896
```
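For reference, the conversion from the raw value to days is plain arithmetic. A short sketch, where `seconds_to_days` is just an illustrative helper name:

```python
from datetime import timedelta

SECONDS_PER_DAY = 86400


def seconds_to_days(seconds):
    """Convert a seconds_since_reset value to fractional days."""
    return seconds / SECONDS_PER_DAY


since_reset = 10276896  # value reported by seconds_since_reset above
print(f"{seconds_to_days(since_reset):.2f} days")  # prints "118.95 days"
print(timedelta(seconds=since_reset))  # prints "118 days, 22:41:36"
```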
However, the errors did make their way to the kernel and software stack on the machine, as `mcelog` shows:
```
cdanis@cp4026.ulsfo.wmnet /var/log % head mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 Undecoded extended event ff TSC 3a231039a1fd59
ADDR 45f92081c0
TIME 1545453235 Sat Dec 22 04:33:55 2018
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
```
Although they've since been rotated off disk in `/var/log`, they're still in the `dmesg` buffer:
```
cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | head -n3
[7447051.961511] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x45f9208 offset:0x1c0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447056.470250] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x4678bc8 offset:0x80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447057.201331] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x486a1c8 offset:0xc0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | wc -l
18
```
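To see which DIMM the messages point at without reading each line, the `dmesg` output can be tallied per memory controller and DIMM label. The sketch below is an assumption-laden illustration: the regular expression is written against the three sample lines above only, EDAC message formats may differ between drivers and kernel versions, and `count_ce_by_dimm` is a hypothetical name.

```python
import re
from collections import Counter

# Matches lines such as:
#   EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 ...
# Assumption: derived only from the sample output above.
EDAC_CE_RE = re.compile(r"EDAC (MC\d+): \d+ CE .+? on (\S+) \(")


def count_ce_by_dimm(lines):
    """Count correctable-error messages per (controller, DIMM label)."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Calling `count_ce_by_dimm(sys.stdin)` in a small script (after `import sys`) and piping `sudo dmesg` into it would give one total per DIMM label; for the three lines shown above, every match falls on `MC1` and `CPU_SrcID#1_Ha#1_Chan#0_DIMM#0`.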
They also made their way into `rsyslog` where they were picked up by our `mtail` rules:
https://grafana.wikimedia.org/d/nApOnklmk/xxx-cdanis-edac-events?panelId=2&fullscreen&orgId=1&from=1545282909855&to=1545711454290