Some important hardware events (most notably thermal throttling in our case) are logged and processed by mcelog, by default logging to /var/log/mcelog. We should surface those events as metrics and alert on them, the signal to noise ratio seems to be quite high.
In terms of implementation it would be nice to have error counts exported as Prometheus metrics. There seem at least two ways to implement events -> metrics in this case:
- Parse /var/log/mcelog with mtail.
- Pros: simple, already tested.
- Cons: requires an additional daemon (mtail) running on each baremetal host, parsing strings from a log file in fragile e.g when mcelog changes its messages.
- Run a custom mcelog trigger. This would be increment an error-specific counter on the machine and exit. A separate cron script dumps all counters in the form of metrics to a plain text file for node_exporter to pick up.
- Pros: no additional daemons required, the triggers can be generally useful to other people in the same situation too, more robust since the error details are in environment variables
- Cons: development time required