
EDAC events not being reported by node-exporter?
Closed, Resolved · Public

Description

In T214516 @RobH happened to notice correctable memory errors on cp4026. But nothing had fired in Icinga. Investigation so far is below.

One possible option for now is to convert the Icinga rule to use the mtail-generated stats, although that seems suboptimal in the long run because more moving parts are involved.

What we found so far:

(disclaimer: I don't actually know the EDAC subsystem at all and all of the below is from a cursory reading of the docs)

Icinga had not reported these because it uses the metrics from node-exporter, which themselves are backed by the sysfs files in /sys/devices. All of them report 0 on that machine:

cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat $(find -name '*ce*_count*' ) | grep -v '^0$'
cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc %
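
(For reference, those sysfs counters are what node-exporter's edac collector turns into metrics, so the same zeros can be seen at the metrics endpoint the Icinga check ultimately consumes. Port 9100 is as used elsewhere in this task; the node_edac_* name filter is based on node-exporter's edac collector.)

# Inspect the EDAC metrics node-exporter exposes on this host; HELP/TYPE
# comment lines are excluded by anchoring on the metric name prefix.
curl -s localhost:9100/metrics | grep '^node_edac_'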

The issue doesn't seem to be that something touched reset_counters, as seconds_since_reset gives a value of roughly 119 days (which turns out to be the uptime of the machine):

cdanis@cp4026.ulsfo.wmnet /sys/devices/system/edac/mc % cat */seconds_since_reset                       
10276896
10276896
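
(For reference, the arithmetic behind the "roughly 119 days" figure; a quick sanity check, not part of the original investigation:)

# 10276896 seconds ≈ 119 days, matching the host's uptime
echo $((10276896 / 86400))   # -> 118 (days, rounded down)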

However, the errors did make their way to the kernel and software stack on the machine.

cdanis@cp4026.ulsfo.wmnet /var/log % head mcelog  
Hardware event. This is not a software error.
MCE 0
CPU 0 Undecoded extended event ff TSC 3a231039a1fd59 
ADDR 45f92081c0 
TIME 1545453235 Sat Dec 22 04:33:55 2018
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid

Although they've been rotated out of the on-disk logs in /var/log, they're still in the dmesg buffer:

cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | head -n3
[7447051.961511] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x45f9208 offset:0x1c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447056.470250] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x4678bc8 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
[7447057.201331] EDAC MC1: 0 CE memory read error on CPU_SrcID#1_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x486a1c8 offset:0xc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:1 ha:1 channel_mask:1 rank:1)
cdanis@cp4026.ulsfo.wmnet /var/log % sudo dmesg | grep EDAC | grep error | wc -l   
18

Full log at https://phabricator.wikimedia.org/P8031

They also made their way into rsyslog where they were picked up by our mtail rules:
https://grafana.wikimedia.org/d/nApOnklmk/xxx-cdanis-edac-events?panelId=2&fullscreen&orgId=1&from=1545282909855&to=1545711454290
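
(A rough shell equivalent of what the mtail rule counts -- kernel EDAC corrected-error lines reaching syslog -- on a host where those lines are still on disk. The regex is illustrative; the production mtail program may match differently.)

# Count syslogged kernel EDAC corrected-error lines, similar to what the
# mtail-derived metric tracks (pattern is illustrative).
sudo grep -cE 'EDAC MC[0-9]+: [0-9]+ CE ' /var/log/syslog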

Event Timeline

One more observation:

17:07:28 <cdanis> FTR, db1068, mw2213, and thumbor1004 have all seen multiple EDAC errors/day in the last week
17:08:11 <cdanis> and elastic1029 had one earlier this week

Mentioned in SAL (#wikimedia-operations) [2019-01-24T18:49:13Z] <cdanis> cp4026: T214529: apt-get install'ing edac-utils with new deps libedac1 libsysfs2

cdanis@cp4026.ulsfo.wmnet ~ % edac-util -v  
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#1_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#1_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#1_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#1_Chan#1_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#1_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#1_Chan#1_DIMM#0: 0 Corrected Errors
edac-util: No errors to report.

Looking at strace output, it turns out that edac-util reads exactly the same /sys files that node-exporter does.
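
(Roughly how that was observed -- the exact invocation below is illustrative, not a transcript:)

# Trace which files edac-util opens; it walks the same
# /sys/devices/system/edac/mc hierarchy that node-exporter reads.
sudo strace -f -e trace=open,openat edac-util -v 2>&1 | grep '/sys/devices/system/edac'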

Mentioned in SAL (#wikimedia-operations) [2019-01-24T19:02:37Z] <cdanis> T214529: cdanis@cp4026.ulsfo.wmnet ~ % sudo apt-get --purge remove edac-utils libsysfs2 libedac1

I did a cumin run across the whole fleet to find hosts that have memory errors in their dmesg buffers but haven't incremented their EDAC counters beyond 0.

cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p99  '*' 'sudo dmesg | grep -q "MCE MEMORY ERROR" && (test -n "$(curl -s localhost:9100/metrics | grep edac | egrep -v \ 0$ | grep -v ^\# | grep -v collector)" && echo memory errors occurred, counters working properly || echo memory errors occurred, counters were not incremented) || echo no memory errors found within dmesg buffers'
===== NODE GROUP =====                                                                                                                                                 
(17) cp[1049,1053,1066].eqiad.wmnet,db[1068-1069].eqiad.wmnet,elastic1029.eqiad.wmnet,kafka1023.eqiad.wmnet,mw[2137,2170,2189,2213].codfw.wmnet,mw[1239,1256].eqiad.wmnet,thumbor1004.eqiad.wmnet,wtp[2011,2013,2020].codfw.wmnet                                                                                                             
----- OUTPUT of 'sudo dmesg | gre...in dmesg buffers' -----                                                                                                            
memory errors occurred, counters working properly                                                                                                                      
===== NODE GROUP =====                                                                                                                                                 
(1) ms-be2047.codfw.wmnet                                                                                                                                              
----- OUTPUT of 'sudo dmesg | gre...in dmesg buffers' -----                                                                                                            
ssh: connect to host ms-be2047.codfw.wmnet port 22: Connection timed out                                                                                               
===== NODE GROUP =====                                                                                                                                                 
(18) analytics1050.eqiad.wmnet,cp[2011,2017].codfw.wmnet,cp[5006-5007].eqsin.wmnet,cp[3034,3042].esams.wmnet,cp4026.ulsfo.wmnet,db2087.codfw.wmnet,dbstore1002.eqiad.wmnet,kafka-jumbo1005.eqiad.wmnet,mw[2268,2283].codfw.wmnet,mw[1264,1300,1302,1319].eqiad.wmnet,wtp1045.eqiad.wmnet                                                      
----- OUTPUT of 'sudo dmesg | gre...in dmesg buffers' -----                                                                                                            
memory errors occurred, counters were not incremented

The remaining hosts did not have any MCE MEMORY ERROR messages in their dmesg buffers, which is sadly only a heuristic for whether memory errors have occurred recently -- for instance, kafka1004 is quite troubled, and such messages from a week ago have been pushed out of its buffer by an upsetting number of thermal-throttling messages.

Now to try to figure out what is in common between the hosts that have counters still at 0...

Change 486507 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Add an Icinga alert for syslog-reported EDAC events, as a workaround for whatever is causing T214529.

https://gerrit.wikimedia.org/r/486507

Change 486507 merged by CDanis:
[operations/puppet@production] icinga alert for syslogged EDACs

https://gerrit.wikimedia.org/r/486507
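
(The data source behind that alert can be eyeballed directly on a host. 3903 is mtail's default metrics port, and the grep filter is illustrative rather than the exact production metric name.)

# mtail serves the counters it derives from syslog over HTTP; whatever
# EDAC counter the production program defines should show up here.
curl -s localhost:3903/metrics | grep -i edac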

jijiki triaged this task as Medium priority. Feb 7 2019, 1:00 PM

Talked some with @BBlack today, who observed that there are in fact a variety of drivers backing this stuff in the kernel, and that it's very possible we're getting all events delivered to us, but that the counters can be backed by firmware, which makes its own decision as to whether or not to increment said counters.
I did a bunch of digging and failed to find any obvious correlation between this behavior and kernel version. There is some correlation with the approximate 'age' of the platform -- newer BIOS version dates, newer server model numbers, more feature flags enabled in the CPUs -- but nothing definitive.
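
(For reference, the kind of per-host facts compared; the cumin target expression here is just a placeholder:)

# Collect kernel version, BIOS version/date, and server model per host,
# to look for a pattern among the hosts whose counters stay at 0.
sudo cumin 'HOSTS' 'uname -r; dmidecode -s bios-version; dmidecode -s bios-release-date; dmidecode -s system-product-name'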

BBlack also linked me to https://patchwork.kernel.org/patch/9846929/ which has a bunch of discussion, including this tidbit from a poster who hacks on Linux for HP:

Corrected errors are normal and expected to occur on healthy
hardware. They do not need user's attention until they repeatedly
occurred at a same place.

Apparently, you haven't been on enough maintanance calls, trying to
calm down the customer about the hardware error he sees in his
logs...

Actually, that's why. Reporting all corrected errors make users
worried, call support, and asking to replace healthy hardware...

So it seems possible that nothing is actually 'wrong' here? But I have very little confidence in anything at this point.

Right, this is kind of a tricky part of this subject. Sometimes random CEs happen and never recur, and it's a waste for everyone involved to be alerting on them and/or requesting hardware replacements for them. Sometimes they repeat and they're a sign of slightly-faulty hardware which is now more likely to fully fail eventually and should be preemptively replaced.

With CEs faithfully reported, it's up to the consumer to decide about reasonable thresholds and/or argue with support about when replacements should be made. If it's a critical server you can't afford unplanned downtimes on, you probably care more about preempting such problems with planned maintenance. If it's a redundant cluster of many servers you might opt to just not care about CEs until something really fails.

I could buy the argument that a reasonable hardware vendor, plagued with support requests asking for new DIMMs because of logged one-shot CEs, might make a firmware change to simply not report one-shot CEs and wait to notify the operating system / user until multiple errors have been logged on the same module, or to hide CEs altogether perhaps, on lower-end boxes at least?

However, in the case that prompted looking at this, only the sysfs counts in Linux are under-reporting the CEs. They were causing a front panel alert (which is how Rob noticed), they were present in Dell's BMC log stuff, and they're apparently sent to the kernel in some fashion because they are getting logged; it's just the counts that aren't updating. You'd think that if there were an intentional change on the vendor's part to suppress pointless CEs, they'd suppress them harder and we'd never hear about it through any channel. So I still suspect this is more of a bug-level issue somewhere (BIOS, firmware, hardware, kernel, drivers?).

Should we worry about the bug, though, or just make sure we like our UE policy (e.g. panic) and forgo caring about bugs related to CEs? That one's a hardware question. I'd probably be OK with ignoring CEs on the cache clusters, on the balance of all things about impact and cost/benefit, but for a primary DB box we probably want to be more proactive and have a non-buggy way of tracking CEs, so that we'd notice recurrent CEs indicating an increasing likelihood of a UE panic/crash.
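
(If the sysfs counters did increment reliably, a recurrence-threshold check along these lines would capture that distinction -- this is purely a sketch against the csrow sysfs layout seen earlier in this task, with an arbitrary threshold:)

#!/bin/bash
# Sketch: only complain when a single controller/csrow has accumulated
# several corrected errors, instead of alerting on every one-shot CE.
THRESHOLD=4
for f in /sys/devices/system/edac/mc/mc*/csrow*/ce_count; do
  [ -e "$f" ] || continue            # skip if the glob didn't match anything
  count=$(cat "$f")
  if [ "$count" -gt "$THRESHOLD" ]; then
    echo "WARNING: $count corrected errors recorded at $f"
  fi
done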

Thanks for the deep investigation, truly fascinating! Regarding what @BBlack said about the host displaying errors on its LCD, I'm wondering whether alerting on SEL errors (i.e. T197084) would yield a better signal-to-noise ratio instead. I haven't delved deeply into what that would mean, however, or even whether it's practical; my two cents!

Thanks @fgiunchedi, that's a good thought! However I couldn't find anything in the SEL for a selection of servers that are currently reporting / have recently reported memory issues:

cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'mw2206*,thumbor2004*,cp1053*,db1068*,mw1243*,elastic1029*' 'ipmi-sel --output-event-state -vv'
6 hosts will be targeted:
cp1053.eqiad.wmnet,db1068.eqiad.wmnet,elastic1029.eqiad.wmnet,mw2206.codfw.wmnet,mw1243.eqiad.wmnet,thumbor2004.codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                 
(1) cp1053.eqiad.wmnet                                                                                                                                                 
----- OUTPUT of 'ipmi-sel --output-event-state -vv' -----                                                                                                              
ID  | Date        | Time     | Name             | Type                     | State    | Event Direction   | Event                                                      
1   | Jun-06-2013 | 02:24:05 | SEL              | Event Logging Disabled   | Nominal  | Assertion Event   | Log Area Reset/Cleared                                     
===== NODE GROUP =====                                                                                                                                                 
(1) thumbor2004.codfw.wmnet                                                                                                                                            
----- OUTPUT of 'ipmi-sel --output-event-state -vv' -----                                                                                                              
ID  | Date        | Time     | Name             | Type                     | State    | Event Direction   | Event                                                      
1   | Dec-10-2016 | 04:12:33 | SEL              | Event Logging Disabled   | Nominal  | Assertion Event   | Log Area Reset/Cleared                                     
===== NODE GROUP =====                                                                                                                                                 
(1) elastic1029.eqiad.wmnet                                                                                                                                            
----- OUTPUT of 'ipmi-sel --output-event-state -vv' -----                                                                                                              
ID  | Date        | Time     | Name             | Type                     | State    | Event Direction   | Event                                                      
1   | Oct-06-2014 | 10:01:16 | SEL              | Event Logging Disabled   | Nominal  | Assertion Event   | Log Area Reset/Cleared                                     
2   | Oct-23-2014 | 13:56:59 | Intrusion        | Physical Security        | Critical | Assertion Event   | General Chassis Intrusion ; OEM Event Data2 code = 02h
3   | Oct-23-2014 | 13:57:04 | Intrusion        | Physical Security        | Critical | Deassertion Event | General Chassis Intrusion ; OEM Event Data2 code = 02h
4   | Aug-03-2017 | 14:22:24 | Intrusion        | Physical Security        | Critical | Assertion Event   | General Chassis Intrusion ; OEM Event Data2 code = 02h
5   | Aug-03-2017 | 14:22:30 | Intrusion        | Physical Security        | Critical | Deassertion Event | General Chassis Intrusion ; OEM Event Data2 code = 02h
6   | Aug-16-2018 | 16:19:51 | Intrusion        | Physical Security        | Critical | Assertion Event   | General Chassis Intrusion ; OEM Event Data2 code = 02h
7   | Aug-16-2018 | 16:19:57 | Intrusion        | Physical Security        | Critical | Deassertion Event | General Chassis Intrusion ; OEM Event Data2 code = 02h
===== NODE GROUP =====                                                                                                                                                 
(1) mw1243.eqiad.wmnet                                                                                                                                                 
----- OUTPUT of 'ipmi-sel --output-event-state -vv' -----                                                                                                              
ID  | Date        | Time     | Name             | Type                     | State    | Event Direction   | Event                                                      
1   | Nov-12-2014 | 09:26:14 | SEL              | Event Logging Disabled   | Nominal  | Assertion Event   | Log Area Reset/Cleared                                     
===== NODE GROUP =====                                                                                                                                                 
(1) db1068.eqiad.wmnet                                                                                                                                                 
----- OUTPUT of 'ipmi-sel --output-event-state -vv' -----                                                                                                              
ID  | Date        | Time     | Name             | Type                     | State    | Event Direction   | Event                                                      
1   | Jan-25-2019 | 20:02:18 | SEL              | Event Logging Disabled   | Nominal  | Assertion Event   | Log Area Reset/Cleared                                     
===== NODE GROUP =====                                                                                                                                                 
(1) mw2206.codfw.wmnet                                                                                                                                                 
----- OUTPUT of 'ipmi-sel --output-event-state -vv' -----                                                                                                              
ID  | Date        | Time     | Name             | Type                     | State    | Event Direction   | Event                                                      
1   | Jan-15-2015 | 23:20:17 | SEL              | Event Logging Disabled   | Nominal  | Assertion Event   | Log Area Reset/Cleared                                     
================

Maybe I'm holding it wrong?

Interesting! In my previous comment I assumed that errors severe enough to be reported on the host's LCD or LEDs would be in the SEL too; maybe that's not the case, though? Anyway, it seems something like that (i.e. "the host's little yellow/red LEDs went off, or the LCD is complaining") might yield a higher signal-to-noise ratio; I'm not sure how complicated that would be to achieve from inside Linux, though.

Change 490856 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] mw2206: install rasdaemon

https://gerrit.wikimedia.org/r/490856
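
(Once installed, rasdaemon listens to the kernel's RAS tracepoints and records events in a local database; its counts can be inspected with ras-mc-ctl, e.g.:)

# Query the events rasdaemon has recorded.
sudo ras-mc-ctl --status    # is rasdaemon's event collection running?
sudo ras-mc-ctl --summary   # per-label error counts
sudo ras-mc-ctl --errors    # individual recorded events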

Calling this resolved for now -- the mtail-based events are also being monitored by Icinga, and would have caught all the previous instances missed by the node_exporter/kernel counter stats.

Digging into why the latter isn't working as we expect doesn't seem like a great use of time.