
Smart alert on labstore1006 and labstore1007
Open, Normal, Public

Description

This task tracks activity from the Icinga alert raised by the SMART checker:

Service: Device not healthy -SMART-

cluster=misc device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1006:9100 job=node site=eqiad

Event Timeline

Bstorm created this task. Jul 10 2018, 4:46 PM
Restricted Application added a subscriber: Aklapper. Jul 10 2018, 4:46 PM

This appears to be a problem with the monitoring check rather than with the array.

chasemp triaged this task as Normal priority. Jul 10 2018, 4:51 PM

Here's sample output from one of the disks the check flags as unhealthy:

root@labstore1006:~# smartctl -i -H -d cciss,19 /dev/sg3
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-0.bpo.6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB6000JVYYV
Revision:             HPD2
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5009564109f
Serial number:        ZA1AXBB0
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Jul 10 16:51:14 2018 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
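
To repeat that check across every disk named in the alert (cciss,14 through cciss,23), something like the following Python sketch might help. It is only an illustration: the helper names `parse_health` and `check_disks` are made up here, and it assumes all ten disks sit behind the same /dev/sg3 controller device as cciss,19 above, which this task does not confirm.

```python
import re
import subprocess

# Matches the SCSI/SAS health line, e.g. "SMART Health Status: OK".
HEALTH_RE = re.compile(r"^SMART Health Status:\s*(.+?)\s*$", re.MULTILINE)


def parse_health(output):
    """Return the reported health string, or None if the line is missing."""
    match = HEALTH_RE.search(output)
    return match.group(1) if match else None


def check_disks(disk_numbers, device="/dev/sg3"):
    """Run `smartctl -H` for each cciss disk number and collect the status.

    Assumption: every disk is reachable through the same controller device.
    """
    results = {}
    for n in disk_numbers:
        proc = subprocess.run(
            ["smartctl", "-H", "-d", f"cciss,{n}", device],
            capture_output=True,
            text=True,
            check=False,  # smartctl uses non-zero exit codes as status bits
        )
        results[n] = parse_health(proc.stdout)
    return results
```

Run as root, `check_disks(range(14, 24))` would return a mapping from disk number to the reported health string. If every entry comes back as OK, that would support the view that the alert comes from the monitor rather than from the disks.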

The same alert shows on labstore1007, which is no surprise since it is configured the same way.

Bstorm renamed this task from Smart alert on labstore1006 to Smart alert on labstore1006 and labstore1007. Aug 24 2018, 6:10 PM

Moving to the watching column, because this ticket exists only in case there is anything we need to do about this false alarm.