
Report problems found by mcelog
Open, Medium, Public

Description

Some important hardware events (most notably thermal throttling in our case) are logged and processed by mcelog, which by default logs to /var/log/mcelog. We should surface those events as metrics and alert on them, since the signal-to-noise ratio seems to be quite high.

In terms of implementation it would be nice to have error counts exported as Prometheus metrics. There seem to be at least two ways to turn events into metrics in this case:

  1. Parse /var/log/mcelog with mtail.
    1. Pros: simple, already tested.
    2. Cons: requires an additional daemon (mtail) running on each baremetal host; parsing strings from a log file is fragile, e.g. when mcelog changes its messages.
  2. Run a custom mcelog trigger. This would increment an error-specific counter on the machine and exit. A separate cron script dumps all counters in the form of metrics to a plain text file for node_exporter to pick up.
    1. Pros: no additional daemons required; the triggers can be generally useful to other people in the same situation too; more robust, since the error details are in environment variables.
    2. Cons: development time required
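
A rough sketch of what option 2 could look like, to make the moving parts concrete. Everything named here is an assumption for illustration: the state directory layout, the metric name `mcelog_errors_total`, the `type` label, and the helper names. The trigger side would call `increment_counter` with an error type derived from whatever environment variables mcelog passes to that trigger; the cron side would call `write_textfile` to produce a `.prom` file in the directory watched by node_exporter's textfile collector.

```python
import os
import tempfile
from pathlib import Path


def increment_counter(state_dir: Path, error_type: str) -> int:
    """Trigger side: bump the counter file for one error type, return the new value."""
    state_dir.mkdir(parents=True, exist_ok=True)
    counter_file = state_dir / error_type
    try:
        current = int(counter_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Missing or corrupt counter file: start again from zero.
        current = 0
    counter_file.write_text(f"{current + 1}\n")
    return current + 1


def render_metrics(state_dir: Path) -> str:
    """Cron side: render all counters in the Prometheus text exposition format."""
    lines = [
        "# HELP mcelog_errors_total Hardware errors reported by mcelog triggers.",
        "# TYPE mcelog_errors_total counter",
    ]
    for counter_file in sorted(state_dir.glob("*")):
        value = int(counter_file.read_text().strip())
        lines.append(f'mcelog_errors_total{{type="{counter_file.name}"}} {value}')
    return "\n".join(lines) + "\n"


def write_textfile(state_dir: Path, output: Path) -> None:
    """Write via a temp file and rename so a partial file is never visible."""
    fd, tmp = tempfile.mkstemp(dir=output.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metrics(state_dir))
    # mkstemp creates the file as 0600; make it readable by node_exporter.
    os.chmod(tmp, 0o644)
    os.replace(tmp, output)
```

Note that this sketch does no locking, so two triggers firing at the same instant could lose an increment; a real implementation would probably want a lock around the read-modify-write.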

Event Timeline

Joe triaged this task as Medium priority. Jun 18 2018, 2:52 PM
Joe subscribed.
Vvjjkkii renamed this task from Report problems found by mcelog to p4aaaaaaaa. Jul 1 2018, 1:04 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.

I think this work has mostly already happened? We have some mtail rules for mce events.
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/mtail/files/programs/kernel.mtail
There's also a separate counter just for thermal throttling.

Since this mtail program runs on the rsyslog hosts I believe the only thing left to do is to make sure we're plotting things as we like? Certainly the data is making its way to Prometheus. The only thing I can think of is we might want some rewrite rules to add a cluster label based on hostname.
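
To illustrate the hostname-to-cluster idea, here is a minimal sketch of the mapping logic. The hostname patterns, cluster names, and the `hostname` label key below are made up for the example and are not our real naming scheme; in practice this would more likely live in Prometheus relabeling configuration than in Python.

```python
import re

# Hypothetical rules: first matching pattern wins.
CLUSTER_RULES = [
    (re.compile(r"^db\d+$"), "mysql"),
    (re.compile(r"^cp\d+$"), "cache"),
]


def cluster_for(hostname: str, default: str = "misc") -> str:
    """Map a hostname (short or fully qualified) to a cluster name."""
    short = hostname.split(".", 1)[0]
    for pattern, cluster in CLUSTER_RULES:
        if pattern.match(short):
            return cluster
    return default


def add_cluster_label(labels: dict) -> dict:
    """Return a copy of the label set with a derived 'cluster' label added."""
    return {**labels, "cluster": cluster_for(labels.get("hostname", ""))}
```

For example, `add_cluster_label({"hostname": "db1001.example.org"})` would come back with `cluster` set to `mysql` under the made-up rules above, and any unmatched host falls into the default bucket.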

Agreed this is mostly done, modulo plotting/dashboarding. Not sure if it'd be easy and/or worth it to have hostname -> cluster rewrite rules available, though we can evaluate that later too.