smart-data-dump should fail loudly when it can't gather metrics
Open, Medium, Public

Assigned To

None

Authored By

	fgiunchedi
	Nov 3 2020, 3:13 PM

Description

Noticed this while investigating something else: smart-data-dump on e.g. ms-be HP hosts isn't gathering any data (due to a facter raid name change I think) but it is still happy with the result:

root@ms-be1056:~# /usr/local/sbin/smart-data-dump --debug
DEBUG:__main__:Fact raid discovered: ['md', 'ssacli']
DEBUG:__main__:Gathering SMART data from physical disks: []
# HELP device_smart_available_reservd_space SMART attribute available_reservd_space
# TYPE device_smart_available_reservd_space gauge
# HELP device_smart_offline_uncorrectable SMART attribute offline_uncorrectable
# TYPE device_smart_offline_uncorrectable gauge
# HELP device_smart_command_timeout SMART attribute command_timeout
...
root@ms-be1056:~# echo $?
0

Details

	Subject	Repo	Branch	Lines +/-
	smart: add metric to track number of devices detected	operations/puppet	production	+3 -0


Related Objects

Status	Assigned	Task
Open	None	T86552 Monitor and alarm on SMART attributes [tracking]
Open	None	T267135 smart-data-dump should fail loudly when it can't gather metrics
Open	None	T267664 Enhance smart_data_dump to support gathering metrics from both raid and standalone disks
Open	None	T267660 Add ssacli support to smart_data_dump
Open	SLyngshede-WMF	T355461 Add perccli support to smart_data_dump

Event Timeline

fgiunchedi created this task. Nov 3 2020, 3:13 PM

Restricted Application added a subscriber: Aklapper. Nov 3 2020, 3:13 PM

fgiunchedi added a project: SRE. Nov 3 2020, 3:15 PM

A cursory look shows two standing problems that are related and possibly blocking:

it cannot handle a host that has both RAID and standalone disks (note: mdraid devices are treated as standalone disks)
it cannot emit logs without being noisy, because its log output goes straight to cronspam

Given the size of both of these problems, they likely merit their own tasks.

A short-term solution occurs to me: what if each run exported the number of disks it detected? It seems reasonable to assume that a host reporting no disks is misbehaving in some way.

colewhite moved this task from Inbox to Backlog on the observability board. Nov 6 2020, 8:13 PM

I think it is fair to say that if no disks are detected then that's always an error condition (?). In that case, I think a simple(r) solution would be to exit non-zero when no disks are detected, so the systemd service/timer fails loudly.
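For illustration, a minimal sketch of the exit-non-zero idea. The function name `exit_code_for` and the log message are hypothetical and not taken from smart-data-dump; the only assumption carried over from the debug output above is that discovery yields a list of physical disks that can come back empty.

```python
import logging

log = logging.getLogger(__name__)


def exit_code_for(disks):
    """Return 0 if at least one disk was detected, 1 otherwise.

    `disks` is assumed to be the list of physical disks the script
    discovered (empty on the affected host shown in the description).
    """
    if not disks:
        # An empty list means no SMART data can be gathered at all.
        log.error("No physical disks detected, nothing to gather")
        return 1
    return 0
```

The script's entry point could then end with `sys.exit(exit_code_for(disks))`, so that a systemd unit wrapping it is marked as failed when the list is empty.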

The related problem to this one is also that the hp raid controller name changed when we changed the facter invocation, and ssacli isn't recognized as such by smart_data_dump

In T267135#6615766, @fgiunchedi wrote:

The related problem to this one is also that the hp raid controller name changed when we changed the facter invocation, and ssacli isn't recognized as such by smart_data_dump

Upon further investigation this doesn't seem to be correct: I'm getting ssacli as the raid fact with either invocation. The underlying issue remains, though :|

root@ms-be1056:~# facter --puppet --json -l error raid
{
  "raid": [
    "md",
    "ssacli"
  ]
}
root@ms-be1056:~# /usr/bin/ruby /var/lib/puppet/lib/facter/raid.rb
{"raid":["md","ssacli"]}

jijiki triaged this task as Medium priority. Nov 10 2020, 4:10 PM

Per discussion on IRC, we know two things:

The proposed change to exit non-zero when no disks are detected is very likely to be extremely noisy, given that we do not know the extent of the problem. The noise generated is on the order of (affected hosts) x (check frequency).
The fix is non-trivial because it requires adding a long-requested feature: collecting SMART metrics on hosts that contain both standalone disks and RAID. This necessitates some form of deduplication subroutine to detect which disks are handled by the RAID controller and which are not, and to query each disk as appropriate for the way it is attached to the host. It also necessitates onboarding at least one new RAID controller type (ssacli).
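To make the deduplication step concrete, here is a rough sketch under stated assumptions. Both inputs are hypothetical: `block_devices` stands for the devices the OS exposes, and `raid_logical_volumes` stands for the subset that a controller CLI reports as logical volumes it owns. How either list would actually be obtained is not covered here and is not something smart_data_dump is known to do.

```python
def partition_disks(block_devices, raid_logical_volumes):
    """Split OS-visible devices by how they should be queried.

    Returns (standalone, via_controller):
      standalone      devices not owned by a RAID controller, to be
                      queried directly
      via_controller  devices backed by a RAID controller, whose
                      physical disks must be queried through it
    Input order is preserved and each device lands in exactly one list.
    """
    raid_backed = set(raid_logical_volumes)
    standalone = [d for d in block_devices if d not in raid_backed]
    via_controller = [d for d in block_devices if d in raid_backed]
    return standalone, via_controller
```

Since mdraid is treated as standalone disks, its member devices would simply not appear in `raid_logical_volumes` in this sketch.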

Given this situation, we will:

Add a metric which counts the number of disks detected by smart_data_dump
Improve smart_data_dump in response to the hosts currently affected
Add this feature so that future hosts indicate the problem early
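A minimal sketch of the first item, written with the standard library only. The metric name `device_smart_device_count` is the one queried later in this task; the HELP text and the function name `render_device_count` are assumptions, and the merged patch may well build the metric differently (for example through a Prometheus client library).

```python
def render_device_count(disks):
    """Render the number of detected disks in Prometheus text format."""
    name = "device_smart_device_count"
    return (
        f"# HELP {name} Number of devices detected\n"  # HELP text is assumed
        f"# TYPE {name} gauge\n"
        f"{name} {len(disks)}\n"
    )


# An empty discovery result still produces a sample, with value 0.
print(render_device_count([]), end="")
```

Because the sample is emitted even when the list is empty, affected hosts can be found by querying for a count below 1 rather than by the absence of a series.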

colewhite added a subtask: T267664: Enhance smart_data_dump to support gathering metrics from both raid and standalone disks. Nov 10 2020, 5:10 PM

colewhite added a subtask: T267660: Add ssacli support to smart_data_dump.

Change 640473 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: add metric to track number of devices detected

https://gerrit.wikimedia.org/r/640473

gerritbot added a project: Patch-For-Review. Nov 10 2020, 5:27 PM

Change 640473 merged by Cwhite:
[operations/puppet@production] smart: add metric to track number of devices detected

https://gerrit.wikimedia.org/r/640473

Maintenance_bot removed a project: Patch-For-Review. Nov 16 2020, 9:10 PM

25 hosts affected

device_smart_device_count < 1

lmata edited projects, added SRE Observability; removed observability. Jul 12 2021, 2:21 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. Jul 15 2021, 4:08 AM

colewhite added a subtask: T355461: Add perccli support to smart_data_dump. Jan 19 2024, 11:25 PM

colewhite added a parent task: T86552: Monitor and alarm on SMART attributes [tracking].
