
Alert on ECC warnings in SEL
Open, MediumPublic

Description

The s4 master paged because of faulty memory (https://phabricator.wikimedia.org/T253808)

Looking at the SEL, we had a window of ~18 hours where we could have noticed this in advance and, e.g., done a failover to a different server:

4   | May-27-2020 | 01:33:44 | Mem ECC Warning  | Memory                      | transition to Critical from less severe
5   | May-27-2020 | 20:20:26 | ECC Uncorr Err   | Memory                      | Uncorrectable memory error

We should add options to monitor/alert on this (at least for critical systems like DB masters).

Event Timeline

Definitely +1 to alerting on something like this; it is yet unclear to me if we can alert in a platform-independent way. Anyways, see also "prior art" in T214516: cp4026 correctable dimm error, where clearly these events are not reported to EDAC and thus not surfaced by existing metrics :(

Definitely +1 to alerting on something like this; it is yet unclear to me if we can alert in a platform-independent way.

It doesn't need to be entirely platform independent, after all we only have two vendors :-)

Anyways, see also "prior art" in T214516: cp4026 correctable dimm error, where clearly these events are not reported to EDAC and thus not surfaced by existing metrics :(

It's the same as for db1138; the existing EDAC check is also entirely green ATM in Icinga:

Memory correctable errors -EDAC-  : (C)4 ge (W)2 ge 0

I've PoC'd this with check_ipmi_sensor, which supports checking the SEL, for example:

/usr/local/lib/nagios/plugins/check_ipmi_sensor -f freeipmi_config -H db1138.mgmt.eqiad.wmnet  -T Temperature -T Power_Supply -T Memory --noentityabsent  --nothresholds -ST Memory,Processor
Sensor Type(s) Temperature, Power_Supply, Memory Status: Critical [System Firmware Mem ECC Warning = Critical, System Firmware ECC Uncorr Err = Critical] | 'Temp'=55.00 'Temp'=57.00 'Inlet Temp'=21.00 'Exhaust Temp'=54.00

vs. a healthy host:

/usr/local/lib/nagios/plugins/check_ipmi_sensor -f freeipmi_config -H prometheus1003.mgmt.eqiad.wmnet  -T Temperature -T Power_Supply -T Memory --noentityabsent  --nothresholds -ST Memory,Processor
Sensor Type(s) Temperature, Power_Supply, Memory Status: OK | 'Inlet Temp'=21.00 'Temp'=63.00 'Temp'=61.00
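Note that check_ipmi_sensor follows the standard Nagios plugin convention (exit status 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN), which is what Icinga keys on rather than the text output. A tiny helper like this (hypothetical, just for wrapping ad-hoc runs in scripts) makes that explicit:

```shell
# Translate a Nagios-convention plugin exit status into its state name.
# Usage sketch (hypothetical wrapper, not a deployed check):
#   /usr/local/lib/nagios/plugins/check_ipmi_sensor ... ; nagios_status "$?"
nagios_status() {
  case "$1" in
    0) echo OK ;;
    1) echo WARNING ;;
    2) echo CRITICAL ;;
    *) echo UNKNOWN ;;
  esac
}
```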

The downside of this approach is that potentially old SEL entries will be surfaced on first deployment and will have to be cleared. Going forward, the SEL will need clearing after such errors to let the Icinga alert actually recover. Since deploying this means we'll be routinely clearing the SEL on errors, I think it is important to log its entries elsewhere too; for that we can deploy freeipmi-ipmiseld, which polls the SEL and logs to syslog.
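A possible shape for that clear-but-keep workflow: filter a SEL dump down to the memory-related rows before archiving and clearing. This is only a sketch; the event-type strings are assumptions taken from the SEL excerpt in the task description, and the ipmi-sel invocation below is illustrative, not a deployed script.

```shell
# Keep only the memory-related SEL rows (the ones worth alerting on).
# Patterns are assumed from the "Mem ECC Warning" / "ECC Uncorr Err"
# entries quoted in the task description.
sel_memory_events() {
  grep -E 'Mem ECC Warning|ECC (Corr|Uncorr) Err'
}

# Hypothetical usage with FreeIPMI (archive first, then clear so the
# Icinga alert can recover):
#   ipmi-sel --config-file freeipmi_config -h db1138.mgmt.eqiad.wmnet \
#     | sel_memory_events >> /var/log/sel/db1138.log
#   ipmi-sel --config-file freeipmi_config -h db1138.mgmt.eqiad.wmnet --clear
```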

Dzahn triaged this task as Medium priority.Jun 4 2020, 9:20 AM

I hesitate to ask, but did anyone check if mcelog recorded anything for these events?

Yeah, MCE got it; similar logs to the parent task, but from centrallog:

May 27 20:19:45 db1138 kernel: [33134188.608450] mce: [Hardware Error]: Machine check events logged
May 27 20:19:45 db1138 kernel: [33134188.615864] {1}Hardware error detected on CPU2
May 27 20:19:45 db1138 kernel: [33134188.615874] {1}event severity: recoverable
May 27 20:19:45 db1138 kernel: [33134188.615875] {1} Error 0, type: recoverable
May 27 20:19:45 db1138 kernel: [33134188.615876] {1} fru_text: B4
May 27 20:19:45 db1138 kernel: [33134188.615876] {1}  section_type: memory error
May 27 20:19:45 db1138 kernel: [33134188.615877] {1}  error_status: 0x0000000000000400
May 27 20:19:45 db1138 kernel: [33134188.615878] {1}  physical_address: 0x0000007d3c38f580
May 27 20:19:45 db1138 kernel: [33134188.615880] {1}  node: 3 card: 0 module: 0 rank: 0 bank: 1 row: 55982 column: 1016 
May 27 20:19:45 db1138 kernel: [33134188.615882] {1}  DIMM location: not present. DMI handle: 0x0000 
May 27 20:19:45 db1138 mysqld[163407]: 200527 20:19:45 [ERROR] mysqld got signal 7 ;
May 27 20:19:45 db1138 mysqld[163407]: This could be because you hit a bug. It is also possible that this binary
May 27 20:19:45 db1138 mysqld[163407]: or one of the libraries it was linked against is corrupt, improperly built,
May 27 20:19:45 db1138 mysqld[163407]: or misconfigured. This error can also be caused by malfunctioning hardware.
May 27 20:19:45 db1138 mysqld[163407]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
May 27 20:19:45 db1138 mysqld[163407]: We will try our best to scrape up some info that will hopefully help
May 27 20:19:45 db1138 mysqld[163407]: diagnose the problem, but since we have already crashed,
May 27 20:19:45 db1138 mysqld[163407]: something is definitely wrong and this may fail.
May 27 20:19:45 db1138 mysqld[163407]: Server version: 10.1.43-MariaDB
May 27 20:19:45 db1138 mysqld[163407]: key_buffer_size=134217728
May 27 20:19:45 db1138 mysqld[163407]: read_buffer_size=131072
May 27 20:19:45 db1138 mysqld[163407]: max_used_connections=841
May 27 20:19:45 db1138 mysqld[163407]: max_threads=2010
May 27 20:19:45 db1138 mysqld[163407]: thread_count=83
May 27 20:19:45 db1138 mysqld[163407]: It is possible that mysqld could use up to
May 27 20:19:45 db1138 mysqld[163407]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4712774 K  bytes of memory
May 27 20:19:45 db1138 mysqld[163407]: Hope that's ok; if not, decrease some variables in the equation.
May 27 20:19:53 db1138 mysqld[163407]: Thread pointer: 0x0
May 27 20:19:53 db1138 mysqld[163407]: Attempting backtrace. You can use the following information to find out
May 27 20:19:53 db1138 mysqld[163407]: where mysqld died. If you see no messages after this, something went
May 27 20:19:53 db1138 mysqld[163407]: terribly wrong...
May 27 20:19:53 db1138 mysqld[163407]: stack_bottom = 0x0 thread_stack 0x30000
May 27 20:19:53 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(my_print_stacktrace+0x2e)[0x55a4fc53710e]
May 27 20:19:53 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(handle_fatal_signal+0x3bd)[0x55a4fc077b8d]
May 27 20:20:10 db1138 mysqld[163407]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7ff6314420e0]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x995c5b)[0x55a4fc458c5b]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x999cb8)[0x55a4fc45ccb8]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x999dd4)[0x55a4fc45cdd4]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99a2d6)[0x55a4fc45d2d6]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99afa4)[0x55a4fc45dfa4]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99b591)[0x55a4fc45e591]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x99dd06)[0x55a4fc460d06]
May 27 20:20:27 db1138 mysqld[163407]: /opt/wmf-mariadb101/bin/mysqld(+0x9a0881)[0x55a4fc463881]
May 27 20:20:30 db1138 sshd[61000]: Connection from 208.80.153.74 port 60086 on 10.64.48.124 port 22
May 27 20:20:30 db1138 sshd[61000]: Connection closed by 208.80.153.74 port 60086 [preauth]
May 27 20:20:43 db1138 mysqld[163407]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7ff6314384a4]
May 27 20:21:00 db1138 mysqld[163407]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ff62fbd6d0f]
May 27 20:21:00 db1138 mysqld[163407]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
May 27 20:21:00 db1138 mysqld[163407]: information that should help you find out what is causing the crash.

That was for the fatal error; I was wondering about the earlier correctable errors. But I just checked on centrallog1001 and I didn't find any evidence that the correctable errors were recorded by mcelog.

https://phabricator.wikimedia.org/T286763 is another instance where monitoring would have prevented a DB server from rebooting itself.

I've PoC'd this with check_ipmi_sensor, which supports checking the SEL
...
The downside of this approach is that potentially old SEL entries will be surfaced on first deployment and will have to be cleared. Going forward, the SEL will need clearing after such errors to let the Icinga alert actually recover. Since deploying this means we'll be routinely clearing the SEL on errors, I think it is important to log its entries elsewhere too; for that we can deploy freeipmi-ipmiseld, which polls the SEL and logs to syslog.

Another option that may work around the need to clear host SELs is feeding log based ECC metrics from ipmiseld, and alerting based on those.
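One minimal way to sketch that: count ECC-related lines from the ipmiseld syslog output and emit a counter in node_exporter textfile-collector format. The metric name and the message patterns here are assumptions (taken from the SEL entries quoted above), not an existing metric.

```shell
# Derive a crude ECC-event counter from an ipmiseld syslog file and print
# it in node_exporter textfile-collector format. Hypothetical sketch:
# metric name and log patterns are assumptions, not deployed config.
ecc_events_metric() {
  count="$(grep -cE 'Mem ECC Warning|ECC (Corr|Uncorr) Err' "$1" || true)"
  printf 'ipmi_sel_ecc_events_total %s\n' "$count"
}

# Hypothetical usage, writing where the textfile collector picks it up:
#   ecc_events_metric /var/log/ipmiseld.log \
#     > /var/lib/prometheus/node.d/ipmi_sel_ecc.prom
```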

Getting ipmiseld in place would be a good next step IMO.

akosiaris added a subscriber: akosiaris.

Removing SRE; this has already been triaged to a more specific SRE subteam.

Change 901610 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] ipmi_exporter: add managed config file

https://gerrit.wikimedia.org/r/901610

Change 901610 merged by Jbond:

[operations/puppet@production] ipmi_exporter: add managed config file

https://gerrit.wikimedia.org/r/901610

Change 901674 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] ipmi: add configuration for ipmiseld

https://gerrit.wikimedia.org/r/901674

Change 901674 merged by Volans:

[operations/puppet@production] ipmi: add configuration for ipmiseld

https://gerrit.wikimedia.org/r/901674

Summary of today's work:

  1. The ipmi_exporter run command was missing the configuration file to use, and hence the SEL metrics. This has been addressed.
  2. ipmiseld was not using some tuning configuration that interprets the OEM messages and makes them more readable. This has been addressed.
  3. ipmiseld logs to syslog, but the logs are not making it to centrallog or to logstash.
  4. At this time it is not clear if an alert based only on the available SEL Prometheus metrics (ipmi_sel_logs_count and ipmi_sel_free_space_bytes) would be useful, or would just generate false positives and/or still require too much manual debugging after the alert. One alternative option could be a Redfish approach: query the BMC for the logs and use those to populate a task, triggered either manually or automatically (but the mgmt password is required).
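
The Redfish option mentioned in point 4 could look roughly like this: fetch the SEL entries from the BMC over HTTPS and pull out the human-readable messages. The iDRAC URL path and the JSON field name are assumptions (Redfish log paths are vendor-specific), and as noted the mgmt password is required.

```shell
# Extract the "Message" fields from a Redfish SEL-entries JSON response.
# JSON shape is an assumption based on the Redfish LogEntry schema.
redfish_sel_messages() {
  grep -o '"Message"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*:[[:space:]]*"\(.*\)"/\1/'
}

# Hypothetical usage against a Dell BMC (path is vendor-specific):
#   curl -sk -u "root:${MGMT_PASSWORD}" \
#     "https://db1138.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Sel/Entries" \
#     | redfish_sel_messages
```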
Volans removed Volans as the assignee of this task.
Volans moved this task from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.
Volans added a subscriber: Volans.

Change 901680 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] prometheus::ipmi_exporter: update config to use inbuilt sudo option

https://gerrit.wikimedia.org/r/901680

Change 901680 merged by Jbond:

[operations/puppet@production] prometheus::ipmi_exporter: update config to use inbuilt sudo option

https://gerrit.wikimedia.org/r/901680

Change 902103 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/hardware: Add alert for sel events

https://gerrit.wikimedia.org/r/902103

Change 902103 merged by jenkins-bot:

[operations/alerts@master] team-sre/hardware: Add alert for sel events

https://gerrit.wikimedia.org/r/902103