Maniphest T197086

Report problems found by mcelog
Open, Medium, Public
Assigned To

None

Authored By

	fgiunchedi
	Jun 13 2018, 10:22 AM

Description

Some important hardware events (most notably thermal throttling in our case) are logged and processed by mcelog, which by default logs to /var/log/mcelog. We should surface those events as metrics and alert on them; the signal-to-noise ratio seems to be quite high.

In terms of implementation it would be nice to have error counts exported as Prometheus metrics. There seem to be at least two ways to implement events -> metrics in this case:

1. Parse /var/log/mcelog with mtail.
   Pros: simple, already tested.
   Cons: requires an additional daemon (mtail) running on each bare-metal host; parsing strings from a log file is fragile, e.g. when mcelog changes its messages.
2. Run a custom mcelog trigger. This would increment an error-specific counter on the machine and exit. A separate cron script dumps all counters in the form of metrics to a plain text file for node_exporter to pick up.
   Pros: no additional daemons required; the triggers can be generally useful to other people in the same situation too; more robust, since the error details are in environment variables.
   Cons: development time required.
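
The second option (trigger plus cron dump) could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not an existing implementation: the function names, the one-file-per-error-type counter layout, the metric name mcelog_errors_total and its "type" label are all hypothetical. The only external behaviour it relies on is that node_exporter's textfile collector reads *.prom files from its configured directory.

```python
import os
import tempfile
from pathlib import Path


def increment_counter(counter_dir: Path, error_type: str) -> int:
    """Increment the on-disk counter for one error type; return the new value.

    Hypothetical layout: one small file per error type under counter_dir.
    Not safe against concurrent triggers (no locking in this sketch).
    """
    counter_dir.mkdir(parents=True, exist_ok=True)
    path = counter_dir / error_type
    try:
        current = int(path.read_text().strip() or 0)
    except FileNotFoundError:
        current = 0
    path.write_text(f"{current + 1}\n")
    return current + 1


def render_metrics(counter_dir: Path, metric: str = "mcelog_errors_total") -> str:
    """Render all counters in the Prometheus text exposition format."""
    lines = [
        f"# HELP {metric} Hardware errors counted by mcelog triggers.",
        f"# TYPE {metric} counter",
    ]
    if counter_dir.is_dir():
        for path in sorted(counter_dir.iterdir()):
            if path.is_file():
                value = int(path.read_text().strip() or 0)
                lines.append(f'{metric}{{type="{path.name}"}} {value}')
    return "\n".join(lines) + "\n"


def dump_metrics(counter_dir: Path, prom_file: Path) -> None:
    """Write the metrics file atomically so a partial file is never read."""
    fd, tmp = tempfile.mkstemp(dir=prom_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metrics(counter_dir))
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
    os.replace(tmp, prom_file)
```

A trigger would call increment_counter() once per event, taking the error type from whatever mcelog passes to it, and the cron job would call dump_metrics() with a path inside the textfile collector directory. Concurrent triggers could race on the read-modify-write, so a real version would likely need file locking.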

Related Objects

Mentioned In: T183177: memory errors not showing in icinga

Event Timeline

fgiunchedi created this task. Jun 13 2018, 10:22 AM

Restricted Application added a subscriber: Aklapper. Jun 13 2018, 10:22 AM

fgiunchedi mentioned this in T183177: memory errors not showing in icinga. Jun 13 2018, 10:24 AM

Joe triaged this task as Medium priority. Jun 18 2018, 2:52 PM

Joe subscribed.

• Vvjjkkii renamed this task from Report problems found by mcelog to p4aaaaaaaa. Jul 1 2018, 1:04 AM

• Vvjjkkii raised the priority of this task from Medium to High.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed a subscriber: Aklapper.

CommunityTechBot renamed this task from p4aaaaaaaa to Report problems found by mcelog. Jul 2 2018, 12:31 PM

CommunityTechBot lowered the priority of this task from High to Medium.

CommunityTechBot updated the task description.

CommunityTechBot removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

CommunityTechBot added a subscriber: Aklapper.

I think this work has mostly already happened? We have some mtail rules for mce events.
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/mtail/files/programs/kernel.mtail
There's also a separate counter just for thermal throttling.

Since this mtail program runs on the rsyslog hosts I believe the only thing left to do is to make sure we're plotting things as we like? Certainly the data is making its way to Prometheus. The only thing I can think of is we might want some rewrite rules to add a cluster label based on hostname.

In T197086#4865980, @CDanis wrote:

I think this work has mostly already happened? We have some mtail rules for mce events.
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/mtail/files/programs/kernel.mtail
There's also a separate counter just for thermal throttling.

Since this mtail program runs on the rsyslog hosts I believe the only thing left to do is to make sure we're plotting things as we like? Certainly the data is making its way to Prometheus. The only thing I can think of is we might want some rewrite rules to add a cluster label based on hostname.

Agreed this is mostly done, modulo plotting/dashboarding. Not sure if it'd be easy and/or worth it to have hostname -> cluster rewrite rules available, though we can evaluate that later too.

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 10:05 PM

fgiunchedi moved this task from Inbox to Radar on the observability board. Jul 20 2020, 1:17 PM
