
Alert on ECC warnings in SEL
Open, Medium, Public

Description

The s4 master paged because of faulty memory (https://phabricator.wikimedia.org/T253808)

Looking at the SEL, we had a window of ~18 hours where we could have noticed this in advance and, e.g., done a failover to a different server:

4   | May-27-2020 | 01:33:44 | Mem ECC Warning  | Memory                      | transition to Critical from less severe
5   | May-27-2020 | 20:20:26 | ECC Uncorr Err   | Memory                      | Uncorrectable memory error
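
The excerpt above appears to be freeipmi ipmi-sel output; for reference, it can be pulled out-of-band from the host's management interface (the credentials file name here mirrors the PoC further down and is an assumption):

ipmi-sel --config-file=freeipmi_config -h db1138.mgmt.eqiad.wmnet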

We should explore options to monitor/alert on this (at least for critical systems like DB masters).

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 27 2020, 8:53 PM

Definitely +1 to alerting on something like this; it is still unclear to me whether we can alert in a platform-independent way. Anyway, see also "prior art" in T214516: cp4026 correctable dimm error, where these events were clearly not reported to EDAC and thus not surfaced with existing metrics :(

Definitely +1 to alerting on something like this; it is still unclear to me whether we can alert in a platform-independent way.

It doesn't need to be entirely platform-independent; after all, we only have two vendors :-)

Anyway, see also "prior art" in T214516: cp4026 correctable dimm error, where these events were clearly not reported to EDAC and thus not surfaced with existing metrics :(

It's the same as for db1138; the existing EDAC check is also entirely green ATM in Icinga:

Memory correctable errors -EDAC-  : (C)4 ge (W)2 ge 0
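
For context (and assuming that check reads the kernel's EDAC counters, directly or via an exporter), those counters are exposed under sysfs and can be inspected on the host itself:

cat /sys/devices/system/edac/mc/mc*/ce_count   # corrected (CE) errors per memory controller
cat /sys/devices/system/edac/mc/mc*/ue_count   # uncorrected (UE) errors per memory controller
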
CDanis added a subscriber: CDanis. · May 29 2020, 8:13 PM

I've PoC'd this with check_ipmi_sensor, which supports checking the SEL; for example:

/usr/local/lib/nagios/plugins/check_ipmi_sensor -f freeipmi_config -H db1138.mgmt.eqiad.wmnet  -T Temperature -T Power_Supply -T Memory --noentityabsent  --nothresholds -ST Memory,Processor
Sensor Type(s) Temperature, Power_Supply, Memory Status: Critical [System Firmware Mem ECC Warning = Critical, System Firmware ECC Uncorr Err = Critical] | 'Temp'=55.00 'Temp'=57.00 'Inlet Temp'=21.00 'Exhaust Temp'=54.00

vs. a healthy host:

/usr/local/lib/nagios/plugins/check_ipmi_sensor -f freeipmi_config -H prometheus1003.mgmt.eqiad.wmnet  -T Temperature -T Power_Supply -T Memory --noentityabsent  --nothresholds -ST Memory,Processor
Sensor Type(s) Temperature, Power_Supply, Memory Status: OK | 'Inlet Temp'=21.00 'Temp'=63.00 'Temp'=61.00
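
If we go this route, the Icinga side is just the standard Nagios plugin exit-code contract; a minimal shell sketch (not the eventual puppetization; command and flags copied from the PoC above):

/usr/local/lib/nagios/plugins/check_ipmi_sensor -f freeipmi_config -H db1138.mgmt.eqiad.wmnet -T Temperature -T Power_Supply -T Memory --noentityabsent --nothresholds -ST Memory,Processor
# Map the standard Nagios exit codes to a human-readable summary.
case $? in
  0) echo "OK: no memory/processor events in the SEL" ;;
  1) echo "WARNING: non-critical sensor/SEL events" ;;
  2) echo "CRITICAL: e.g. ECC warnings/errors recorded in the SEL" ;;
  *) echo "UNKNOWN: could not query the BMC" ;;
esac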

The downside of this approach is that old SEL entries will surface on first deployment and will have to be cleared. Going forward, the SEL will also need clearing after such errors to let the Icinga alert actually recover. Since deploying this means routinely clearing the SEL on errors, I think it is important to log its entries elsewhere too; for that we can deploy freeipmi-ipmiseld, which polls the SEL and logs it to syslog.
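
In practice, the flow after such an alert would look roughly like this (ipmi-sel flags are freeipmi's; the ipmiseld syslog identifier and log path are assumptions):

# 1) Review the SEL entries that triggered the alert.
ipmi-sel --config-file=freeipmi_config -h db1138.mgmt.eqiad.wmnet
# 2) Confirm freeipmi-ipmiseld has already copied them to syslog before clearing.
grep ipmiseld /var/log/syslog
# 3) Clear the SEL so the Icinga check can recover.
ipmi-sel --config-file=freeipmi_config -h db1138.mgmt.eqiad.wmnet --clear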

Dzahn triaged this task as Medium priority. · Jun 4 2020, 9:20 AM
CDanis added a comment. · Jun 4 2020, 4:18 PM

I hesitate to ask, but did anyone check if mcelog recorded anything for these events?

Yeah, MCE caught it; similar logs to the parent task, but from centrallog:

May 27 20:19:45 db1138 kernel: [33134188.608450] mce: [Hardware Error]: Machine check events logged
May 27 20:19:45 db1138 kernel: [33134188.615864] {1}Hardware error detected on CPU2
May 27 20:19:45 db1138 kernel: [33134188.615874] {1}event severity: recoverable
May 27 20:19:45 db1138 kernel: [33134188.615875] {1} Error 0, type: recoverable
May 27 20:19:45 db1138 kernel: [33134188.615876] {1} fru_text: B4
May 27 20:19:45 db1138 kernel: [33134188.615876] {1}  section_type: memory error
May 27 20:19:45 db1138 kernel: [33134188.615877] {1}  error_status: 0x0000000000000400
May 27 20:19:45 db1138 kernel: [33134188.615878] {1}  physical_address: 0x0000007d3c38f580
May 27 20:19:45 db1138 kernel: [33134188.615880] {1}  node: 3 card: 0 module: 0 rank: 0 bank: 1 row: 55982 column: 1016 
May 27 20:19:45 db1138 kernel: [33134188.615882] {1}  DIMM location: not present. DMI handle: 0x0000 
May 27 20:19:45 db1138 mysqld[163407]: 200527 20:19:45 [ERROR] mysqld got signal 7 ;
May 27 20:19:45 db1138 mysqld[163407]: This could be because you hit a bug. It is also possible that this binary
May 27 20:19:45 db1138 mysqld[163407]: or one of the libraries it was linked against is corrupt, improperly built,
May 27 20:19:45 db1138 mysqld[163407]: or misconfigured. This error can also be caused by malfunctioning hardware.
May 27 20:19:45 db1138 mysqld[163407]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
May 27 20:19:45 db1138 mysqld[163407]: We will try our best to scrape up some info that will hopefully help
May 27 20:19:45 db1138 mysqld[163407]: diagnose the problem, but since we have already crashed,
May 27 20:19:45 db1138 mysqld[163407]: something is definitely wrong and this may fail.
May 27 20:19:45 db1138 mysqld[163407]: Server version: 10.1.43-MariaDB
May 27 20:19:45 db1138 mysqld[163407]: key_buffer_size=134217728
May 27 20:19:45 db1138 mysqld[163407]: read_buffer_size=131072
May 27 20:19:45 db1138 mysqld[163407]: max_used_connections=841
May 27 20:19:45 db1138 mysqld[163407]: max_threads=2010
May 27 20:19:45 db1138 mysqld[163407]: thread_count=83
May 27 20:19:45 db1138 mysqld[163407]: It is possible that mysqld could use up to
May 27 20:19:45 db1138 mysqld[163407]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4712774 K  bytes of memory
May 27 20:19:45 db1138 mysqld[163407]: Hope that's ok; if not, decrease some variables in the equation.
May 27 20:19:53 db1138 mysqld[163407]: Thread pointer: 0x0
May 27 20:19:53 db1138 mysqld[163407]: Attempting backtrace. You can use the following information to find out
May 27 20:19:53 db1138 mysqld[163407]: where mysqld died. If you see no messages after this, something went
May 27 20:19:53 db1138 mysqld[163407]: terribly wrong...
May 27 20:19:53 db1138 mysqld[163407]: stack_bottom = 0x0 thread_stack 0x30000
May 27 20:19:53 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(my_print_stacktrace+0x2e)[0x55a4fc53710e]
May 27 20:19:53 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(handle_fatal_signal+0x3bd)[0x55a4fc077b8d]
May 27 20:20:10 db1138 mysqld[163407]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7ff6314420e0]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x995c5b)[0x55a4fc458c5b]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x999cb8)[0x55a4fc45ccb8]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x999dd4)[0x55a4fc45cdd4]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99a2d6)[0x55a4fc45d2d6]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99afa4)[0x55a4fc45dfa4]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99b591)[0x55a4fc45e591]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99dd06)[0x55a4fc460d06]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x9a0881)[0x55a4fc463881]
May 27 20:20:30 db1138 sshd[61000]: Connection from 208.80.153.74 port 60086 on 10.64.48.124 port 22
May 27 20:20:30 db1138 sshd[61000]: Connection closed by 208.80.153.74 port 60086 [preauth]
May 27 20:20:43 db1138 mysqld[163407]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7ff6314384a4]
May 27 20:21:00 db1138 mysqld[163407]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ff62fbd6d0f]
May 27 20:21:00 db1138 mysqld[163407]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
May 27 20:21:00 db1138 mysqld[163407]: information that should help you find out what is causing the crash.
CDanis added a comment. · Jun 5 2020, 4:43 PM

That was for the fatal error; I was wondering about the earlier correctable errors. But I just checked on centrallog1001 and didn't find any evidence that the correctable errors were recorded by mcelog.
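
For reference, the sort of search this amounts to; the aggregated-log path on centrallog is an assumption, and on the host itself the kernel log can be checked directly:

# On centrallog, grep the host's aggregated syslog (path is hypothetical).
zgrep -iE 'mce:|EDAC|Hardware Error' /srv/syslog/db1138/syslog*
# On the host, check the kernel journal for MCE/EDAC lines.
journalctl -k | grep -iE 'mce|EDAC|Hardware Error'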