There are some important SMART attributes that can help identify failing disks, e.g. the number of offline uncorrectable or pending sectors, self-test results and so on. We should keep track of those and monitor/alert accordingly.
- Install smartmontools on the machines we're checking SMART for (and eventually fleetwide on baremetal)
- By default smartmontools will start smartd, check all removable devices and send email on failure; it might be desirable to compare its results with our own checking. Either way, decide whether or not smartd should be running. NB: by default smartd will happily keep emailing daily about problems it has found, so if we keep the daemon running we'll need to avoid duplicate emails.
- Gather a list of the physical disks on the machine and the parameters needed to access each of them via smartctl.
- For each physical disk gather its SMART attributes, ultimately calling smartctl and parsing its output (with pySMART or manually).
- Export the list of attributes and their values as Prometheus metrics into a text file for node_exporter to pick up.
- On the alerting side, in Icinga either use check_prometheus with a query to the Prometheus server, or another check that connects to node_exporter directly and checks the SMART attributes we're interested in.
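
The gather/parse/export steps above could be sketched roughly as follows. This is only a minimal illustration, not a proposed implementation: it parses the ATA attribute table printed by `smartctl -A` by hand (pySMART would be the alternative), and the metric name `smart_attribute_raw_value`, its labels, the disk names and the output path are all hypothetical. Disks behind RAID controllers, SAS and NVMe devices produce different output and would need extra handling.

```python
import os
import re
import subprocess
import tempfile


def parse_attributes(smartctl_output):
    """Parse the ATA attribute table printed by `smartctl -A`.

    Returns a list of (id, name, raw_value) tuples. Rows whose raw value
    does not start with an integer are skipped.
    """
    attributes = []
    for line in smartctl_output.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # Raw values can carry extra text, e.g. "35 (Min/Max 20/45)".
        match = re.match(r"\d+", fields[9])
        if match:
            attributes.append((int(fields[0]), fields[1], int(match.group())))
    return attributes


def format_metrics(disk, attributes):
    """Render attributes in the Prometheus text format.

    The metric name and labels are hypothetical, chosen for illustration.
    """
    return "".join(
        'smart_attribute_raw_value{disk="%s",id="%d",name="%s"} %d\n'
        % (disk, attr_id, name, raw)
        for attr_id, name, raw in attributes
    )


def write_textfile(path, content):
    """Write via a temporary file and rename, so a partial file is never read."""
    directory = os.path.dirname(path) or "."
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write(content)
    # Temporary files are created 0600; make the result world-readable.
    os.chmod(tmp.name, 0o644)
    os.replace(tmp.name, path)


def collect(disks, path):
    """Run smartctl for each disk name (e.g. "sda") and export the result."""
    content = ""
    for disk in disks:
        # smartctl's exit status is a bitmask, so a non-zero status is not
        # treated as a fatal error here.
        result = subprocess.run(
            ["smartctl", "-A", "/dev/" + disk],
            capture_output=True,
            text=True,
        )
        content += format_metrics(disk, parse_attributes(result.stdout))
    write_textfile(path, content)
```

With metrics named like this, the check_prometheus query could be something along the lines of `smart_attribute_raw_value{name="Current_Pending_Sector"} > 0`; the actual attributes and thresholds are still to be decided.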