
memory errors not showing in icinga
Closed, Resolved · Public

Description

While doing unrelated work onsite at ulsfo, I noticed that cp4032 had a memory error on the LCD. I've filed T183176 to fix that particular server's hardware.

However, these errors should show up in icinga; we shouldn't have to rely on onsite visits or system crashes to notice the issue.

At the time of the memory error on cp4032, icinga showed that system as all green. The system is still online, but with one of its DIMM slots disabled, which should show up as a non-optimal status in icinga.

Event Timeline

RobH triaged this task as High priority. Dec 18 2017, 7:48 PM
RobH created this task.

Outcome from today's monitoring meeting: needs more investigation wrt whether we can get the hardware error status from e.g. IPMI or Linux directly. Another option is looking at MCE logs, assuming the same type/quantity of errors is reported.

@akosiaris also suggested edac-utils, and that reminded me we're exporting EDAC metrics from node-exporter:

tin:~$ curl -s localhost:9100/metrics | grep -i ^node_edac
node_edac_correctable_errors_total{controller="0"} 0
node_edac_csrow_correctable_errors_total{controller="0",csrow="0"} 0
node_edac_csrow_correctable_errors_total{controller="0",csrow="1"} 0
node_edac_csrow_correctable_errors_total{controller="0",csrow="unknown"} 0
node_edac_csrow_uncorrectable_errors_total{controller="0",csrow="0"} 0
node_edac_csrow_uncorrectable_errors_total{controller="0",csrow="1"} 0
node_edac_csrow_uncorrectable_errors_total{controller="0",csrow="unknown"} 0
node_edac_uncorrectable_errors_total{controller="0"} 0
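
For illustration only, an alert on these counters boils down to a simple query against the Prometheus API; the hostname and the exact expression below are placeholders rather than what the deployed check necessarily uses:

# Hypothetical check: has the uncorrectable-error counter increased over the last hour?
# (prometheus.example is a placeholder for the relevant Prometheus host.)
curl -sG 'http://prometheus.example/api/v1/query' \
  --data-urlencode 'query=increase(node_edac_uncorrectable_errors_total[1h]) > 0'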

Found a couple more via reboots: T190540. It's not good that we're having uncorrected memory errors go unreported and unalerted...

Change 422110 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: alert on edac correctable errors

https://gerrit.wikimedia.org/r/422110

Change 422115 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb backups: Skip x1 and misc hosts this week

https://gerrit.wikimedia.org/r/422115

Change 422115 merged by Jcrespo:
[operations/puppet@production] mariadb backups: Skip x1 and misc hosts this week

https://gerrit.wikimedia.org/r/422115

See updates in T190540: quite a few codfw hosts have SEL entries for uncorrectable ECC errors that went unnoticed (though we tend to notice them on reboots).

I think there are a few things we need to think about for the general case:

  1. Uncorrectable errors (UE) are events, not states. A UE happens, and then life moves on. Other than persistent SEL logs or syslogs, we don't expect an isolated transient event to persist (it's not like the DIMM itself stores some kind of SMART-like data on past failures of itself or whatever). It's technically possible for an error to be truly-transient and never come back (e.g. "cosmic rays" or whatever). But a pattern of UE (or really, even a significant pattern of CE) is a sign that a module needs replacing.
  2. When a UE hits memory that matters (corrupts memory actually in-use for data/code), the kernel should panic, as it's the only reasonable recourse at that point. Clearly, that's not currently happening via kernel or userspace tools/settings.
  3. Either via the kernel interfaces directly, or via userspace edac tools, *something* should be logging UEs (well, if they don't panic) and CEs to syslog. I think Prometheus reads sysfs directly.
  4. There were, in times past, sysfs settings controlling panic_on_ue, log_ue, and log_ce, but these all seem to be missing from present kernels on cp*. Likely this changed since I last looked; maybe it's considered a userspace responsibility at this point?
  5. We don't currently install edac-utils
  6. We need something persisting this information in a useful way, so that we know it's happening and the information isn't lost to the wind on reboots.
  7. Worst case, perhaps we need to poll the SEL directly for this stuff, as currently it's the only seemingly-reliable way to know.

I've taken a first stab at reporting uncorrectable errors, as reported by the kernel, in https://gerrit.wikimedia.org/r/c/422110/, so at least there will be icinga alerts for those. The counter will clear on reboot, though the CE/UE errors from the kernel are also in syslog, and thus we keep them for the standard 90 days on the central syslog hosts.
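
For reference, the counters in question live in the standard EDAC sysfs layout (assuming the usual paths; node-exporter reads the same files) and go back to zero on reboot or when the EDAC modules are reloaded:

# Kernel-side EDAC counters, per memory controller; reset on reboot or module reload.
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count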

@BBlack what do you think? IMO there's value in the alert at least as a starting point; the next step would be (patterns of) correctable errors, for example, and addressing the points you listed.

Change 422110 merged by Filippo Giunchedi:
[operations/puppet@production] base: alert on EDAC correctable errors

https://gerrit.wikimedia.org/r/422110

Change 431750 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: sum EDAC correctable errors

https://gerrit.wikimedia.org/r/431750

Change 431750 merged by Filippo Giunchedi:
[operations/puppet@production] base: sum EDAC correctable errors

https://gerrit.wikimedia.org/r/431750

Mentioned in SAL (#wikimedia-operations) [2018-05-08T15:26:36Z] <godog> (un)load edac kernel modules on thumbor1004 to test resetting counters - T183177

The correctable errors check has been deployed and it is yielding some results already. @herron and I took a look at the list of hosts, and there seem to be a few different "classes" or "states":

  1. high count of CEs and recent kernel messages
  2. low count of CEs and no recent kernel messages

The course of action is to file tasks for class #1 to diagnose the memory, and to reset the EDAC counters (i.e. reload the edac kernel modules) for class #2 to probe for recurrences.
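
For the record, resetting the counters amounts to reloading the platform EDAC driver; the module name varies by hardware generation, so sb_edac below is only an example:

# Unload and reload the platform EDAC driver to zero the counters
# (sb_edac is an example; the actual module depends on the CPU generation).
modprobe -r sb_edac && modprobe sb_edac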

Mentioned in SAL (#wikimedia-operations) [2018-05-08T16:22:07Z] <herron> cleared low count edac counters on hosts mw2205 dbstore1002 db1051 elastic1029 T183177

There were, in times past, sysfs settings controlling panic_on_ue, log_ue, and log_ce, but these all seem to be missing from present kernels on cp*. Likely this changed since I last looked; maybe it's considered a userspace responsibility at this point?

At the monitoring meeting yesterday I raised this issue, and @akosiaris pointed out that these are now edac_core module parameters:

# modinfo edac_core
filename:       /lib/modules/4.9.0-0.bpo.6-amd64/kernel/drivers/edac/edac_core.ko
description:    Core library routines for EDAC reporting
author:         Doug Thompson www.softwarebitmaker.com, et al
license:        GPL
depends:        
retpoline:      Y
intree:         Y
vermagic:       4.9.0-0.bpo.6-amd64 SMP mod_unload modversions 
parm:           check_pci_errors:Check for PCI bus parity errors: 0=off 1=on (int)
parm:           edac_pci_panic_on_pe:Panic on PCI Bus Parity error: 0=off 1=on (int)
parm:           edac_mc_panic_on_ue:Panic on uncorrected error: 0=off 1=on (int)
parm:           edac_mc_log_ue:Log uncorrectable error to console: 0=off 1=on (int)
parm:           edac_mc_log_ce:Log correctable error to console: 0=off 1=on (int)
parm:           edac_mc_poll_msec:Polling period in milliseconds
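
For reference, the current values of these parameters can be inspected at runtime under /sys/module/, and set persistently with a modprobe.d snippet; a minimal sketch, assuming we wanted console logging of both CE and UE (the file name below is arbitrary):

# Inspect the current edac_core parameter values exposed via sysfs.
grep -H . /sys/module/edac_core/parameters/edac_mc_log_ce /sys/module/edac_core/parameters/edac_mc_log_ue
# Hypothetical persistent setting, applied on the next module load:
echo 'options edac_core edac_mc_log_ce=1 edac_mc_log_ue=1' > /etc/modprobe.d/edac.conf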

Worst case, perhaps we need to poll the SEL directly for this stuff, as currently it's the only seemingly-reliable way to know.

Sounds to me like periodically polling the SEL for relevant entries is worthwhile!
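
As a rough sketch of what that polling could look like (assuming ipmitool is installed and the BMC is reachable in-band; the grep pattern is only illustrative):

# Dump the SEL and pick out memory/ECC-related entries.
ipmitool sel elist | grep -iE 'memory|ecc'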

Change 433143 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: alert on correctable errors over a period of time

https://gerrit.wikimedia.org/r/433143

Change 433143 merged by Filippo Giunchedi:
[operations/puppet@production] base: alert on correctable errors over a period of time

https://gerrit.wikimedia.org/r/433143

I researched the "panic on uncorrectable errors" behaviour a bit, and it turns out that it's not EDAC but the machine check (MCE) framework that already takes care of panicking (or SIGBUS'ing the process) when uncorrectable errors are reported.

Details below; it looks like the kernel already does the right thing for UEs. On our side we don't currently monitor process exits for SIGBUS though; those processes would usually get restarted by systemd.

As per https://www.kernel.org/doc/html/latest/admin-guide/ras.html#module-parameters

edac_mc_panic_on_ue - Panic on UE control file

An uncorrectable error will cause a machine panic. This is usually desirable. It is a bad idea to continue when an uncorrectable error occurs - it is indeterminate what was uncorrected and the operating system context might be so mangled that continuing will lead to further corruption. If the kernel has MCE configured, then EDAC will never notice the UE.

So I looked at the MCE configuration for x86-64, specifically the mce kernel boot parameter, from https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt:

mce=tolerancelevel[,monarchtimeout] (number,number)
		tolerance levels:
		0: always panic on uncorrected errors, log corrected errors
		1: panic or SIGBUS on uncorrected errors, log corrected errors
		2: SIGBUS or log uncorrected errors, log corrected errors
		3: never panic or SIGBUS, log all errors (for testing only)
		Default is 1
		Can be also set using sysfs which is preferable.
		monarchtimeout:
		Sets the time in us to wait for other CPUs on machine checks. 0
		to disable.

The respective sysfs file can be checked and it is already 1 by default:

grep -H . /sys/devices/system/machinecheck/machinecheck*/tolerant
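
And since the docs above note that the sysfs route is preferable, the level could also be changed at runtime; this is only a sketch, we'd presumably keep the default of 1:

# Example only: switch all CPUs to tolerance level 0 (always panic on uncorrected errors).
echo 0 | tee /sys/devices/system/machinecheck/machinecheck*/tolerant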

I researched the "panic on uncorrectable errors" behaviour a bit, and it turns out that it's not EDAC but the machine check (MCE) framework that already takes care of panicking (or SIGBUS'ing the process) when uncorrectable errors are reported.

Details below; it looks like the kernel already does the right thing for UEs.

This conflicts with @BBlack's comment above though, T183177#4088202

When a UE hits memory that matters (corrupts memory actually in-use for data/code), the kernel should panic, as it's the only reasonable recourse at that point. Clearly, that's not currently happening via kernel or userspace tools/settings.

But

On our side we don't currently monitor process exits for SIGBUS though; those processes would usually get restarted by systemd.

Maybe that's the key here?

It's not clear to me when the kernel panics and when it sends a SIGBUS to the process with mce tolerance level 1 (which is the default).

fgiunchedi claimed this task.

I'm resolving this task since we're now alerting on uncorrectable memory errors found by EDAC. Uncorrectable errors result in either a kernel panic or a SIGBUS to the affected process. See T197084: Report problems found in server's IPMI SEL and, more importantly, T197086: Report problems found by mcelog for followups.

It's not clear to me when the kernel panics and when it sends a SIGBUS to the process with mce tolerance level 1 (which is the default).

AFAICT the action taken depends on the "severity" of the error and the context in which it happened (e.g. whether the CPU was in kernel or user mode), at least from a cursory look at https://elixir.bootlin.com/linux/v4.9.82/source/arch/x86/kernel/cpu/mcheck/mce-severity.c. There are also more gory details at http://www.halobates.de/mce.pdf, which I only skimmed though.

I opened T197084: Report problems found in server's IPMI SEL to get SEL info somehow into alerting!

T214516 was a case of a memory error that Icinga did not detect? See T214516#4903917.

@Dzahn's investigation into the mystery of cp4026 is ongoing in T214529.

Change 509365 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] monitoring: add notes url for memory errors

https://gerrit.wikimedia.org/r/509365

FYI I created a wikitech page from the notes above and a change to add it to the alert notes_url.

Change 509365 merged by Jbond:
[operations/puppet@production] monitoring: add notes url for memory errors

https://gerrit.wikimedia.org/r/509365

Change 520654 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/elasticsearch: fix notes_link->notes_url parameter name

https://gerrit.wikimedia.org/r/520654

Change 520654 merged by Dzahn:
[operations/puppet@production] icinga/elasticsearch: fix notes_link->notes_url parameter name

https://gerrit.wikimedia.org/r/520654

Change 520656 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/elasticsearch: fix notes_url->dashboard_links param name

https://gerrit.wikimedia.org/r/520656

Change 520656 merged by Dzahn:
[operations/puppet@production] icinga/elasticsearch: remove notes_url param where it does not belong

https://gerrit.wikimedia.org/r/520656