Evaluate/integrate rasdaemon as a replacement for mcelog
Closed, Resolved · Public

Description

mcelog depends on the legacy /dev/mcelog interface, which was deprecated/disabled in Debian in Linux 4.12. The mcelog package also eventually got removed from Debian and is only available in stretch: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=889741

rasdaemon should replace it, but this needs to be studied further and our hardware failure monitoring needs to be migrated to it. We can do this for >= stretch, so that it works out of the box for buster.

https://packages.qa.debian.org/r/rasdaemon.html

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 25 2018, 9:33 AM
MoritzMuehlenhoff triaged this task as Medium priority. · Sep 25 2018, 9:41 AM
MoritzMuehlenhoff updated the task description.
jbond claimed this task. · Feb 6 2019, 4:03 PM
jbond added a subscriber: jbond.

Change 489635 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Only install mcelog on jessie and stretch

https://gerrit.wikimedia.org/r/489635

Change 489635 merged by Muehlenhoff:
[operations/puppet@production] Only install mcelog on jessie and stretch

https://gerrit.wikimedia.org/r/489635

CDanis added a subscriber: CDanis. · Feb 11 2019, 2:01 PM

Change 490042 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] Add rasdaemon service to systems which support it.

https://gerrit.wikimedia.org/r/490042

It appears that how to make Prometheus node_exporter play nice with rasdaemon is an unresolved issue:
https://github.com/prometheus/node_exporter/issues/986

One of the easier options would be to write a textfile exporter for whatever rasdaemon stuff we do care about.
(Although I am still unsure as to how rasdaemon and the EDAC subsystem interact for the metrics we care about.)

jbond added a comment. · Feb 12 2019, 2:13 PM

rasdaemon writes data to a sqlite3 file located at /var/lib/rasdaemon/ras-mc_event.db. I'm not sure of the format other than the schema below, but perhaps we could use that:

 .schema
CREATE TABLE mc_event (id INTEGER PRIMARY KEY, timestamp TEXT, err_count INTEGER, err_type TEXT, err_msg TEXT, label TEXT, mc INTEGER, top_layer INTEGER, middle_layer INTEGER, lower_layer INTEGER, address INTEGER, grain INTEGER, syndrome INTEGER, driver_detail TEXT);
CREATE TABLE aer_event (id INTEGER PRIMARY KEY, timestamp TEXT, err_type TEXT, err_msg TEXT);
CREATE TABLE extlog_event (id INTEGER PRIMARY KEY, timestamp TEXT, etype INTEGER, error_count INTEGER, severity INTEGER, address INTEGER, fru_id BLOB, fru_text TEXT, cper_data BLOB);
CREATE TABLE mce_record (id INTEGER PRIMARY KEY, timestamp TEXT, mcgcap INTEGER, mcgstatus INTEGER, status INTEGER, addr INTEGER, misc INTEGER, ip INTEGER, tsc INTEGER, walltime INTEGER, cpu INTEGER, cpuid INTEGER, apicid INTEGER, socketid INTEGER, cs INTEGER, bank INTEGER, cpuvendor INTEGER, bank_name TEXT, error_msg TEXT, mcgstatus_msg TEXT, mcistatus_msg TEXT, mcastatus_msg TEXT, user_action TEXT, mc_location TEXT);
CREATE TABLE arm_event (id INTEGER PRIMARY KEY, timestamp TEXT, error_count INTEGER, affinity INTEGER, mpidr INTEGER, running_state INTEGER, psci_state INTEGER);
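
The textfile-exporter idea floated earlier could be sketched against this schema roughly as follows. This is a minimal sketch, not an existing exporter: the metric name `rasdaemon_mc_events_total` and the function names are made up for illustration, and it assumes the `mc_event` table looks as in the `.schema` output above. The `err_type` and `label` values are passed through as whatever rasdaemon stores.

```python
import sqlite3

# Path as reported in the comment above; an assumption for other installs.
DB_PATH = "/var/lib/rasdaemon/ras-mc_event.db"


def collect_mc_event_counts(conn):
    """Sum err_count per (label, err_type) from the mc_event table."""
    rows = conn.execute(
        "SELECT label, err_type, SUM(err_count) "
        "FROM mc_event GROUP BY label, err_type ORDER BY label, err_type"
    )
    return [
        (label or "", err_type or "", int(total or 0))
        for label, err_type, total in rows
    ]


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_textfile(counts, metric="rasdaemon_mc_events_total"):
    """Render the counts as text exposition lines (hypothetical metric name)."""
    lines = [
        f"# HELP {metric} Memory controller errors recorded by rasdaemon.",
        f"# TYPE {metric} counter",
    ]
    for label, err_type, total in counts:
        lines.append(
            f'{metric}{{label="{escape_label(label)}",'
            f'err_type="{escape_label(err_type)}"}} {total}'
        )
    return "\n".join(lines) + "\n"


def export(db_path=DB_PATH):
    """Read the database read-only and return the rendered metrics text."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return render_textfile(collect_mc_event_counts(conn))
    finally:
        conn.close()
```

A cron job or systemd timer could write the result of `export()` to a `.prom` file in the directory that node_exporter's textfile collector reads. Opening the database read-only is a deliberate choice so the exporter cannot interfere with rasdaemon's own writes.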

Change 490042 merged by Jbond:
[operations/puppet@production] Add rasdaemon service to systems which support it.

https://gerrit.wikimedia.org/r/490042

jbond closed this task as Resolved. · Feb 14 2019, 11:59 AM

rasdaemon is now part of the buster policy

CDanis reopened this task as Open. · Feb 14 2019, 10:24 PM
CDanis claimed this task.

@jbond kindly backported the buster version of rasdaemon to stretch. I'm going to attempt installing it on a few stretch hosts that are consistently reporting memory issues

Change 490787 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] rasdaemon: add hiera overrides for testing on stretch

https://gerrit.wikimedia.org/r/490787

Change 490787 merged by CDanis:
[operations/puppet@production] rasdaemon: add hiera overrides for testing on stretch

https://gerrit.wikimedia.org/r/490787

Change 490855 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] thumbor1004: can't install rasdaemon if there's no stretch

https://gerrit.wikimedia.org/r/490855

Change 490855 merged by CDanis:
[operations/puppet@production] thumbor1004: can't install rasdaemon if there's no stretch

https://gerrit.wikimedia.org/r/490855

Change 490856 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] mw2206: install rasdaemon

https://gerrit.wikimedia.org/r/490856

Change 490856 merged by CDanis:
[operations/puppet@production] mw2206: install rasdaemon

https://gerrit.wikimedia.org/r/490856

we got one:

Feb 15 15:09:52 mw2206 kernel: [14254431.027746] mce: [Hardware Error]: Machine check events logged
Feb 15 15:09:52 mw2206 kernel: [14254431.027780] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 15 15:09:52 mw2206 kernel: [14254431.027793] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000051000800c2
Feb 15 15:09:52 mw2206 kernel: [14254431.027798] EDAC sbridge MC0: TSC 0
Feb 15 15:09:52 mw2206 kernel: [14254431.027800] EDAC sbridge MC0: ADDR 2bb8b1000
Feb 15 15:09:52 mw2206 kernel: [14254431.027801] EDAC sbridge MC0: MISC 90000080008228c
Feb 15 15:09:52 mw2206 kernel: [14254431.027804] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1550243392 SOCKET 0 APIC 0
Feb 15 15:09:52 mw2206 kernel: [14254431.027835] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (987) ras:mc_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (986) ras:aer_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (96) mce:mce_record with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (988) ras:extlog_mem_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: Calling ras_mc_event_opendb()
Feb 15 15:09:52 mw2206 rasdaemon[26701]: cpu 00:rasdaemon: mce_record store: 0x558d58e06f28
Feb 15 15:09:52 mw2206 mcelog: warning: 16 bytes ignored in each record
Feb 15 15:09:52 mw2206 mcelog: consider an update
Feb 15 15:09:52 mw2206 rasdaemon[26701]: rasdaemon: register inserted at db
Feb 15 15:09:52 mw2206 rasdaemon[26701]: <idle>-0 [2003682560] 1.425447: mce_record: 2019-02-15 15:09:52 +0000 bank=b, status= 8c000051000800c2, MEMORY CONTROLLER MS_CHANNEL2_ERR Transaction: Memory scrubbing error Corrected patrol scrub error, mci=Corrected_error, n_errors=1 memory_channel=2 ranks=-1 and -1, cpu_type= Ivy Bridge EP/EX, cpu= 0, socketid= 0, misc= 90000080008228c, addr= 2bb8b1000, mcgstatus=0, mcgcap= 1000c1b, apicid= 0
Feb 15 15:09:52 mw2206 rasdaemon[26701]: cpu 00:rasdaemon: mc_event store: 0x558d58e029f8
Feb 15 15:09:52 mw2206 rasdaemon[26701]: rasdaemon: register inserted at db
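
For anyone wanting to pull the DIMM label and error count out of the kernel's `EDAC MC0: ...` summary line, here is a rough sketch. The regular expression is only fitted to the line format seen in this log, not to a documented format, and `parse_edac_line` is a name invented for this example; other EDAC drivers or kernel versions may print the line differently.

```python
import re

# Fitted to the "EDAC MC<n>: <count> CE|UE <message> on <label> (<details>)"
# line shown in the log above; an assumption, not a documented format.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<type>CE|UE) "
    r"(?P<msg>.+?) on (?P<label>\S+) \((?P<detail>[^)]*)\)"
)


def parse_edac_line(line):
    """Return a dict describing an EDAC summary line, or None if no match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    detail = {}
    # The parenthesised part is a run of space-separated key:value tokens;
    # tokens without a colon (such as the lone "-") are skipped.
    for token in m["detail"].split():
        key, sep, value = token.partition(":")
        if sep:
            detail[key] = value
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "type": m["type"],  # "CE" (corrected) or "UE" (uncorrected)
        "msg": m["msg"],
        "label": m["label"],
        "detail": detail,
    }
```

Lines such as `EDAC sbridge MC0: HANDLING MCE MEMORY ERROR` do not match and yield `None`, so the function can be run over a whole syslog stream.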

Change 490890 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] thumbor1004: install rasdaemon post-jessie-backport

https://gerrit.wikimedia.org/r/490890

Change 490890 merged by CDanis:
[operations/puppet@production] thumbor1004: install rasdaemon post-jessie-backport

https://gerrit.wikimedia.org/r/490890

in under 10 minutes after installing rasdaemon on thumbor1004 we also saw one there. that machine is such a consistent performer:

Feb 15 17:37:00 thumbor1004 kernel: [340944.806495] mce: [Hardware Error]: Machine check events logged
Feb 15 17:37:00 thumbor1004 kernel: [340944.806517] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Feb 15 17:37:00 thumbor1004 kernel: [340944.806523] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
Feb 15 17:37:00 thumbor1004 kernel: [340944.806524] EDAC sbridge MC1: TSC 0
Feb 15 17:37:00 thumbor1004 kernel: [340944.806525] EDAC sbridge MC1: ADDR cc68f4000
Feb 15 17:37:00 thumbor1004 kernel: [340944.806526] EDAC sbridge MC1: MISC 90840800080208c
Feb 15 17:37:00 thumbor1004 kernel: [340944.806527] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1550252220 SOCKET 1 APIC 20
Feb 15 17:37:00 thumbor1004 kernel: [340944.806548] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (987) ras:mc_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (986) ras:aer_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (96) mce:mce_record with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (988) ras:extlog_mem_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: Calling ras_mc_event_opendb()
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: cpu 01:rasdaemon: mce_record store: 0x12794e8
Feb 15 17:37:00 thumbor1004 mcelog: warning: 16 bytes ignored in each record
Feb 15 17:37:00 thumbor1004 mcelog: consider an update
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: rasdaemon: register inserted at db
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: <idle>-0 [4281920] 0.034092: mce_record: 2019-02-15 17:37:00 +0000 bank=a, status= 8c000050000800c1, MEMORY CONTROLLER MS_CHANNEL1_ERR Transaction: Memory scrubbing error Corrected patrol scrub error, mci=Corrected_error, n_errors=1 memory_channel=1 ranks=-1 and -1, cpu_type= Ivy Bridge EP/EX, cpu= 1, socketid= 1, misc= 90840800080208c, addr= cc68f4000, mcgstatus=0, mcgcap= 1000c19, apicid= 20
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: cpu 01:rasdaemon: mc_event store: 0x1273a88
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: rasdaemon: register inserted at db
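
The `status=` values in these records can be cross-checked by hand. The sketch below decodes the architectural fields of an MCi_STATUS value using the bit layout as I understand it from the Intel SDM (flags in bits 63 to 57, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0). Treat that layout and the memory-controller code pattern as assumptions to verify against the manual; `decode_mci_status` is a name made up for this example.

```python
# Memory-controller MCA codes are assumed to follow the pattern
# 0000 0000 1MMM CCCC (MMM = transaction type, CCCC = channel).
MEMORY_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its architectural fields."""

    def bit(n):
        return bool((status >> n) & 1)

    decoded = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }
    mca = decoded["mca_code"]
    if (mca & 0xFF80) == 0x0080:  # matches the memory-controller pattern
        decoded["memory_transaction"] = MEMORY_TRANSACTIONS.get(
            (mca >> 4) & 0x7, "reserved"
        )
        decoded["memory_channel"] = mca & 0xF
    return decoded
```

Applied to `0x8c000050000800c1` from thumbor1004, this gives a valid, corrected error with a count of 1, a memory scrubbing transaction on channel 1, and codes `0008:00c1`. That agrees with `n_errors=1`, `MS_CHANNEL1_ERR` and `err_code:0008:00c1` in the log lines above.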

Change 494220 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove mcelog on systems which were upgraded from stretch

https://gerrit.wikimedia.org/r/494220

Change 494220 merged by Muehlenhoff:
[operations/puppet@production] Remove mcelog on systems which were upgraded from stretch

https://gerrit.wikimedia.org/r/494220

fgiunchedi moved this task from Inbox to Radar on the observability board. · Jul 20 2020, 1:17 PM
CDanis closed this task as Resolved. · Jul 20 2020, 3:26 PM

Change 623027 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base: remove override and conditionals for rasdaemon install

https://gerrit.wikimedia.org/r/623027

@CDanis while reviewing the PS from Dzahn I noticed that the backport has the wrong version number, i.e. deb8u1 vs deb9u1. This is not a problem, but if we still plan to install this on all stretch servers it would be good to fix it. So I wondered if this is still something you want to push to the stretch machines. If not, I'll just delete the package from stretch-wikimedia and remove it from thumbor1004 (the only stretch box that currently has it).

Change 623027 abandoned by Dzahn:
[operations/puppet@production] base: remove override and conditionals for rasdaemon install

Reason:

https://gerrit.wikimedia.org/r/623027

It's been a while since I've had context here but I think it's fine to just let this happen with the buster migration.

Change 623760 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] rasdaemon: only install rasdaemon on buster systems

https://gerrit.wikimedia.org/r/623760

Change 623760 merged by Jbond:
[operations/puppet@production] rasdaemon: only install rasdaemon on buster systems

https://gerrit.wikimedia.org/r/623760

jbond added a comment. · Wed, Sep 2, 9:53 AM

ack, I have cleaned up puppet so this is only installed on buster, and removed it from thumbor1004.

We would like to roll out Kernel 4.19 on some stretch hosts and if I got this right we will need the backported rasdaemon version rather than the one available in stretch, right?
@jbond can we maybe get it back to stretch-wikimedia?

I'd say if 0.5.8-1 from standard Stretch works fine, let's just stick with it, but otherwise a backport of 0.6.0-1.2 from buster won't hurt either.

Okay. I was under the assumption that the backport was done because we had issues with the stretch version. Fine then!