
smart-data-dump should fail loudly when it can't gather metrics
Open, Medium, Public

Description

Noticed this while investigating something else: smart-data-dump on e.g. ms-be HP hosts isn't gathering any data (due to a facter raid name change I think) but it is still happy with the result:

root@ms-be1056:~# /usr/local/sbin/smart-data-dump --debug
DEBUG:__main__:Fact raid discovered: ['md', 'ssacli']
DEBUG:__main__:Gathering SMART data from physical disks: []
# HELP device_smart_available_reservd_space SMART attribute available_reservd_space
# TYPE device_smart_available_reservd_space gauge
# HELP device_smart_offline_uncorrectable SMART attribute offline_uncorrectable
# TYPE device_smart_offline_uncorrectable gauge
# HELP device_smart_command_timeout SMART attribute command_timeout
...
root@ms-be1056:~# echo $?
0

Event Timeline

A cursory look shows two standing problems that are related and possibly blocking:

  1. it cannot handle a configuration where a host has a mix of RAID and standalone disks (Note: mdraid is considered standalone disks)
  2. it cannot emit logs that aren't high noise as they go directly to cronspam

Given the size of both of these problems, they likely merit their own tasks.

A short-term solution occurs to me: what if on each run it exported the number of disks it has detected? It seems reasonable to assume that a host indicating it has no disks is misbehaving in some way.

I think it is fair to say that if no disks are detected then that's always an error condition (?) In that case I think a simple(r) solution would be to exit non-zero if no disks are detected so the systemd service/timer fails loudly.
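A minimal sketch of the exit-non-zero idea, assuming the discovery step yields a list of disk names. The helper names `exit_code_for` and `report` are hypothetical and not taken from smart-data-dump itself:

```python
import sys


def exit_code_for(disks):
    """Return 0 when at least one disk was discovered, 1 otherwise."""
    return 0 if disks else 1


def report(disks):
    """Log to stderr when nothing was found and return the exit code.

    `disks` stands in for whatever the discovery step returns (assumption).
    The caller would pass the result to sys.exit() so the systemd unit fails.
    """
    code = exit_code_for(disks)
    if code != 0:
        print("no physical disks detected", file=sys.stderr)
    return code
```

With an empty list, as in the ms-be1056 output above, `report([])` returns 1 instead of the current 0.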

The related problem to this one is also that the hp raid controller name changed when we changed the facter invocation, and ssacli isn't recognized as such by smart_data_dump

Upon further investigation this doesn't seem to be correct: I'm getting ssacli as the raid fact with either invocation. The underlying issue remains, though :|

root@ms-be1056:~# facter --puppet --json -l error raid
{
  "raid": [
    "md",
    "ssacli"
  ]
}
root@ms-be1056:~# /usr/bin/ruby /var/lib/puppet/lib/facter/raid.rb
{"raid":["md","ssacli"]}
jijiki triaged this task as Medium priority. Nov 10 2020, 4:10 PM

Per discussion on IRC, we know two things:

  1. The proposed change to exit non-zero when no disks are detected is very likely to be extremely noisy given we do not know the extent of the problem. The noise generated is on the order of (affected hosts) x (check frequency).
  2. The fix is non-trivial given that it requires adding a long-requested feature to enable collection of smart metrics for hosts containing both standalone disks and raid. This necessitates some form of deduplication subroutine to detect which disks are handled by the raid controller, which are not, and query them as appropriate for the way the disks are attached to the host. It also necessitates onboarding at least one new raid controller type (ssacli).
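One way the deduplication step in point 2 might look, as a rough sketch only. It assumes two inputs that the task does not specify: a list of all disk names seen on the host and a list of the names the raid controller reports as its own. The function `partition_disks` is hypothetical:

```python
def partition_disks(all_disks, raid_managed):
    """Split disk names into (behind_controller, standalone), dropping repeats.

    all_disks:     every disk name discovered on the host (assumed input)
    raid_managed:  names reported by the raid controller (assumed input)
    """
    managed = set(raid_managed)
    seen = set()
    behind_controller, standalone = [], []
    for disk in all_disks:
        if disk in seen:
            # Same disk reported twice, e.g. by two discovery paths.
            continue
        seen.add(disk)
        if disk in managed:
            behind_controller.append(disk)
        else:
            standalone.append(disk)
    return behind_controller, standalone
```

Each group could then be queried in the way appropriate to how those disks are attached.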

Given this situation, we will:

  1. Add a metric which counts the number of disks detected by smart_data_dump
  2. Improve smart_data_dump in response to the hosts currently affected
  3. Add this feature so that future hosts indicate the problem early
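A sketch of what the disk-count metric from step 1 could look like in the Prometheus text format that smart-data-dump already prints. The metric name `device_smart_device_count` is the one queried later in this task; the HELP text and the `render_device_count` helper are assumptions for illustration, not the merged patch:

```python
def render_device_count(count):
    """Render a gauge with the number of detected disks in text exposition format."""
    name = "device_smart_device_count"
    return (
        f"# HELP {name} Number of disks detected\n"  # HELP wording is assumed
        f"# TYPE {name} gauge\n"
        f"{name} {count}\n"
    )
```

A host in the state shown in the description would export a value of 0, which an expression such as `device_smart_device_count < 1` can then match.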

Change 640473 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: add metric to track number of devices detected

https://gerrit.wikimedia.org/r/640473

Change 640473 merged by Cwhite:
[operations/puppet@production] smart: add metric to track number of devices detected

https://gerrit.wikimedia.org/r/640473

25 hosts affected

device_smart_device_count < 1