There are some important SMART attributes that can help identify failing disks, e.g. the number of offline or pending sectors, the result of the online self-test and so on. We should keep track of those and monitor/alert accordingly.
An implementation plan was discussed with @fgiunchedi and @Volans at FOSDEM 2017, to recap:
[x] Install `smartmontools` on the machines we're checking SMART for (and eventually fleetwide on baremetal)
[x] By default `smartmontools` starts `smartd`, which checks all removable devices and sends email on failure; it might be desirable to compare its results with our own checks. Either way, decide whether or not `smartd` should be running. **NB** by default `smartd` will happily keep emailing daily about problems it has found, so if we keep the daemon running we'll need to avoid duplicate emails.
[x] Gather a list of the physical disks on the machine and the parameters needed to access each of them via `smartctl`.
[x] For each physical disk gather its SMART attributes, ultimately calling `smartctl` and parsing its output (with [[ https://github.com/smartxworks/pySMART | pySMART]] or manually)
[x] Export the list of attributes and their values as Prometheus metrics into a text file for `node_exporter` to pick up.
[] On the alerting side, in Icinga either use `check_prometheus` with a query to the Prometheus server, or another check that connects to `node_exporter` directly and checks the SMART attributes we're interested in.
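
For illustration, the gather/parse/export steps above could look roughly like the sketch below. It is not the deployed script, just a minimal example under a few assumptions: the attribute table is parsed manually from `smartctl -A` rather than via pySMART, devices come from `smartctl --scan`, and the metric names (`smart_attribute_*`) and the output path are hypothetical placeholders that would need to match whatever the `node_exporter` textfile collector directory is on our hosts.

```python
import os
import subprocess
import tempfile

# Hypothetical metric names and output path, chosen for this sketch only.
METRICS = [
    ("smart_attribute_value", "value"),
    ("smart_attribute_threshold", "threshold"),
    ("smart_attribute_raw_value", "raw"),
]
OUTPUT = "/var/lib/prometheus/node.d/smart.prom"  # assumed textfile directory


def parse_attributes(smartctl_output):
    """Parse the attribute table printed by `smartctl -A`."""
    attrs = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            attrs.append({
                "id": int(fields[0]),
                "name": fields[1].lower(),
                "value": int(fields[3]),
                "threshold": int(fields[5]),
                # Only the first token of RAW_VALUE is kept, e.g. "31 (Min/Max ...)".
                "raw": int(fields[9]),
            })
        except ValueError:
            continue  # skip rows whose columns are not plain integers
    return attrs


def render_metrics(per_device):
    """Render {device: [attrs]} in the Prometheus text format, grouped by metric."""
    lines = []
    for metric, key in METRICS:
        lines.append("# TYPE %s gauge" % metric)
        for device, attrs in sorted(per_device.items()):
            for a in attrs:
                lines.append('%s{device="%s",id="%d",name="%s"} %d'
                             % (metric, device, a["id"], a["name"], a[key]))
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write atomically so node_exporter never reads a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter may run as another user
    os.replace(tmp, path)


def list_devices():
    """Return smartctl argument lists, e.g. ["/dev/sda", "-d", "scsi"]."""
    scan = subprocess.run(["smartctl", "--scan"], capture_output=True, text=True)
    specs = (line.split("#", 1)[0].split() for line in scan.stdout.splitlines())
    return [spec for spec in specs if spec]


def main():
    per_device = {}
    for spec in list_devices():
        # smartctl's exit status is a bitmask, so a non-zero code is not fatal here.
        res = subprocess.run(["smartctl", "-A"] + spec, capture_output=True, text=True)
        per_device[spec[0]] = parse_attributes(res.stdout)
    write_textfile(OUTPUT, render_metrics(per_device))
```

Running `main()` periodically (e.g. from cron, as root) would refresh the file. Disks behind a hardware RAID controller generally need an explicit `-d` argument that `smartctl --scan` may not report, so the device list would likely have to come from somewhere else on those hosts.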