Change Details

There are some important SMART attributes that can help identify failing disks, e.g. the number of offline-uncorrectable or pending sectors, the result of the online self-test, and so on. We should keep track of those and monitor/alert accordingly.

An implementation plan was discussed with @fgiunchedi and @Volans at FOSDEM 2017. To recap:

[x] Install `smartmontools` on the machines we're checking SMART for (and eventually fleet-wide on bare metal).
[x] By default `smartmontools` starts `smartd`, which checks all removable devices and sends email on failure; it might be desirable to compare its results with our own checking. Either way, decide whether or not `smartd` should be running. **NB** by default `smartd` will happily keep emailing daily about problems it has found, so if we keep the daemon running we'll need to avoid duplicate emails.
[x] Gather a list of the physical disks on the machine and the parameters needed to access each of them via `smartctl`.
[x] For each physical disk, gather its SMART attributes, ultimately calling `smartctl` and parsing its output (with [[ https://github.com/smartxworks/pySMART | pySMART]] or manually).
[x] Export the list of attributes and their values as Prometheus metrics into a text file for `node_exporter` to pick up.
[x] On the alerting side, in Icinga, either use `check_prometheus` with a query against the Prometheus server, or use another check that connects to `node_exporter` and checks the SMART attributes we're interested in.
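As a rough illustration of the "gather attributes and export them" steps, here is a minimal sketch that parses the attribute table printed by `smartctl -A` by hand and renders it in the Prometheus text format for the `node_exporter` textfile collector. The metric name `smart_attribute_raw_value`, the set of watched attributes, the device list and the output path are all assumptions made for the example, not something decided in this task. Disks behind a RAID controller would also need extra `smartctl` parameters, which are not handled here.

```python
import os
import subprocess
import tempfile

# Assumption: the attributes worth alerting on; adjust to taste.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def read_device(device):
    """Return the raw text of `smartctl -A` for one device (needs root)."""
    # smartctl's exit status is a bitmask, so a non-zero status is not
    # necessarily a failure to read; we only look at stdout here.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout


def parse_attributes(smartctl_output):
    """Extract {attribute_name: raw_value} from the ATA attribute table."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Skip raw values that are not plain integers.
        if name in WATCHED and raw.isdigit():
            attrs[name] = int(raw)
    return attrs


def format_metrics(per_device):
    """Render {device: {attribute: value}} in the Prometheus text format."""
    # Hypothetical metric name, chosen only for this sketch.
    lines = ["# TYPE smart_attribute_raw_value gauge"]
    for device in sorted(per_device):
        for name in sorted(per_device[device]):
            lines.append(
                'smart_attribute_raw_value{device="%s",attribute="%s"} %d'
                % (device, name.lower(), per_device[device][name])
            )
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write atomically so node_exporter never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    with tempfile.NamedTemporaryFile("w", dir=directory, suffix=".tmp",
                                     delete=False) as tmp:
        tmp.write(content)
    os.chmod(tmp.name, 0o644)  # temp files default to 0600
    os.replace(tmp.name, path)


def main():
    # Assumptions: device names and textfile directory are placeholders.
    devices = ["/dev/sda", "/dev/sdb"]
    per_device = {os.path.basename(d): parse_attributes(read_device(d))
                  for d in devices}
    write_textfile("/var/lib/prometheus/node.d/smart.prom",
                   format_metrics(per_device))
```

Calling `main()` periodically (e.g. from cron) would refresh the file. The output directory has to match whatever `node_exporter` is given in `--collector.textfile.directory`, and the file name must end in `.prom` to be picked up. An alert could then be expressed as a query on the raw values, for instance pending sectors greater than zero.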