SMART data dump healthy metric can contain None
Open, Needs Triage, Public

Description

While investigating T267748 I noticed that for inaccessible devices we're setting None as the device name. I believe we can set it to the correct device name found during discovery instead:

# HELP device_smart_healthy SMART health
# TYPE device_smart_healthy gauge
device_smart_healthy{device="2I:2:1"} 1.0
device_smart_healthy{device="1I:1:6"} 1.0
device_smart_healthy{device="2I:4:2"} 1.0
device_smart_healthy{device="2I:2:3"} 1.0
device_smart_healthy{device="1I:1:5"} 1.0
device_smart_healthy{device="1I:1:8"} 1.0
device_smart_healthy{device="2I:2:2"} 1.0
device_smart_healthy{device="2I:2:4"} 1.0
device_smart_healthy{device="1I:1:2"} 1.0
device_smart_healthy{device="1I:1:3"} 1.0
device_smart_healthy{device="None"} 0.0
device_smart_healthy{device="1I:1:4"} 1.0
device_smart_healthy{device="1I:1:1"} 1.0
device_smart_healthy{device="1I:1:7"} 1.0

Event Timeline

It appears that smartctl fails when querying disks that the (HP) RAID controller has marked as failed:

Smartctl open device: /dev/sg0 [cciss_disk_13] [SCSI/SAT] failed: INQUIRY [SAT]: No such device or address

None shows up because model, firmware, and serial are expected to come from smartctl.
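As a rough illustration of how that could happen (this is not the actual script; the `smart_info` dictionary and its `device` key are assumptions made up for the example), a missing value formatted into a label comes out as the literal string None:

```python
# Hypothetical record for a disk that smartctl could not open:
# nothing was parsed, so looking up the name yields None.
smart_info = {}
device = smart_info.get("device")  # None

# Formatting None into the label value produces the text "None".
line = 'device_smart_healthy{device="%s"} %.1f' % (device, 0.0)
print(line)  # device_smart_healthy{device="None"} 0.0
```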

Since hpssacli indicates the disk failure, we should probably use data from this output when available.

Indeed, the underlying disk had failed and was marked as such by the controller. To be clear, in this case IIRC we already have the disk name from hpssacli, and I think we should use it when reporting at least device_smart_healthy, without requiring smartctl to return valid output.
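A minimal sketch of that idea, assuming the name found during discovery is passed in separately from whatever smartctl returned. The function name `health_sample`, the `healthy` key, and the bay name `1I:1:9` are all hypothetical and only serve the example; the real script may be structured quite differently:

```python
from typing import Optional


def health_sample(discovered_name: str, smart_info: Optional[dict]) -> str:
    """Render one device_smart_healthy sample line.

    discovered_name: name found during discovery (e.g. from hpssacli).
    smart_info: parsed smartctl output, or None if smartctl failed.
    """
    # Assumption: a disk that smartctl cannot open is reported as unhealthy.
    healthy = bool(smart_info and smart_info.get("healthy"))
    value = 1.0 if healthy else 0.0
    # The label always comes from discovery, never from smartctl output.
    return 'device_smart_healthy{device="%s"} %.1f' % (discovered_name, value)


# smartctl succeeded for this disk.
print(health_sample("1I:1:4", {"healthy": True}))
# device_smart_healthy{device="1I:1:4"} 1.0

# smartctl failed; "1I:1:9" is a made-up bay name for illustration.
print(health_sample("1I:1:9", None))
# device_smart_healthy{device="1I:1:9"} 0.0
```

With this shape the failed disk still shows up as 0.0, but under its own name rather than None, so an alert on the metric can point at the right bay.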