
Handle SMART for multiple shelves and controllers
Closed, Resolved, Public

Description

labstore1006's controller now has multiple shelves attached to it, but smart-data-dump isn't smart enough (hah!) to handle this case. The host also has multiple controllers, and we should support that too.

root@labstore1006:~# smart-data-dump --debug
DEBUG:__main__:Fact 'raid' discovered: ['hpsa']
DEBUG:__main__:Gathering SMART data from physical disks: ['cciss,0', 'cciss,1', 'cciss,2', 'cciss,3', 'cciss,4', 'cciss,5', 'cciss,6', 'cciss,7', 'cciss,8', 'cciss,9', 'cciss,10', 'cciss,11', 'cciss,12', 'cciss,13', 'cciss,14', 'cciss,15', 'cciss,16', 'cciss,17', 'cciss,18', 'cciss,19', 'cciss,20', 'cciss,21', 'cciss,22', 'cciss,23', 'cciss,0', 'cciss,1', 'cciss,2', 'cciss,3', 'cciss,4', 'cciss,5', 'cciss,6', 'cciss,7', 'cciss,8', 'cciss,9', 'cciss,10', 'cciss,11', 'cciss,12', 'cciss,13']

Starting with cciss,14, however, the smartctl invocations begin to fail and the disks are reported as unhealthy:

DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --info --health -d cciss,14 /dev/sda
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --attributes -d cciss,14 /dev/sda
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --info --health -d cciss,15 /dev/sda
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --attributes -d cciss,15 /dev/sda
DEBUG:__main__:Running: /usr/bin/timeout 60 /usr/sbin/smartctl --info --health -d cciss,16 /dev/sda
...
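The debug output above shows every disk being queried through /dev/sda, with the cciss indices from both enumerations merged into one flat list. A minimal sketch of one way to keep the indices scoped per controller follows. It is an illustration only, not the change that was eventually merged: the function name `build_smartctl_commands`, the layout mapping, and the device nodes in the usage example are all hypothetical, and how disks on the shelves are actually addressed on this host is an assumption.

```python
from typing import Dict, List

# Paths and options copied from the debug output in the task description.
TIMEOUT = ["/usr/bin/timeout", "60"]
SMARTCTL = "/usr/sbin/smartctl"


def build_smartctl_commands(disks_per_controller: Dict[str, int]) -> List[List[str]]:
    """Build smartctl invocations with cciss indices scoped to each controller.

    disks_per_controller maps a controller's device node to the number of
    physical disks behind it (hypothetical input, e.g. derived from lsscsi).
    """
    commands = []
    for device, disk_count in disks_per_controller.items():
        # Indices restart at 0 for every controller instead of sharing one list.
        for index in range(disk_count):
            driver = "cciss,{}".format(index)
            commands.append(TIMEOUT + [SMARTCTL, "--info", "--health", "-d", driver, device])
            commands.append(TIMEOUT + [SMARTCTL, "--attributes", "-d", driver, device])
    return commands


if __name__ == "__main__":
    # Hypothetical layout: device nodes and disk counts are assumptions.
    layout = {"/dev/sg0": 24, "/dev/sg1": 14}
    for command in build_smartctl_commands(layout):
        print(" ".join(command))
```

Keeping the device node and the index together in each command means a duplicate index such as cciss,0 is no longer ambiguous, because it is always paired with the controller it belongs to.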

Event Timeline

fgiunchedi triaged this task as Medium priority. Jul 10 2018, 3:33 PM
fgiunchedi created this task.
fgiunchedi renamed this task from "Handle SMART for multiple shelves attached to a single smartarray controller" to "Handle SMART for multiple shelves and controllers". Jul 11 2018, 10:23 AM
fgiunchedi updated the task description.

Change 587370 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] raid: add lsscsi to required packages for hpsa raid

https://gerrit.wikimedia.org/r/587370

Change 587370 merged by Cwhite:
[operations/puppet@production] raid: add lsscsi to required packages for hpsa raid

https://gerrit.wikimedia.org/r/587370

RobH removed a subscriber: RobH. Apr 8 2020, 6:12 PM

Change 587795 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: make smart_data_dump importable for adding tests

https://gerrit.wikimedia.org/r/587795

Change 587811 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: add _check_output wrapper method and tests

https://gerrit.wikimedia.org/r/587811

Change 587816 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] abstract parsing from data gathering and add tests

https://gerrit.wikimedia.org/r/587816

Change 587877 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: add tests for _parse_smart_info and _parse_smart_attributes

https://gerrit.wikimedia.org/r/587877

Change 588515 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: simplify PD

https://gerrit.wikimedia.org/r/588515

Change 587795 merged by Cwhite:
[operations/puppet@production] smart: make smart_data_dump importable for adding tests

https://gerrit.wikimedia.org/r/587795

Change 588759 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: move metrics registry and metrics init to global

https://gerrit.wikimedia.org/r/588759

Change 588769 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: add multiple hpsa controller support

https://gerrit.wikimedia.org/r/588769

Change 587811 merged by Cwhite:
[operations/puppet@production] smart: add _check_output wrapper method and tests

https://gerrit.wikimedia.org/r/587811

Change 592755 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: disable timeout fetching facts

https://gerrit.wikimedia.org/r/592755

Change 592755 merged by Cwhite:
[operations/puppet@production] smart: disable timeout fetching facts

https://gerrit.wikimedia.org/r/592755

Dzahn removed a subscriber: Dzahn. Apr 28 2020, 7:33 AM

Change 592993 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: set facter timeout to three minutes

https://gerrit.wikimedia.org/r/592993

Change 592993 merged by Cwhite:
[operations/puppet@production] smart: set facter timeout to three minutes

https://gerrit.wikimedia.org/r/592993

Change 587816 merged by Cwhite:
[operations/puppet@production] smart: abstract parsing from data gathering and add tests

https://gerrit.wikimedia.org/r/587816

Change 587877 merged by Cwhite:
[operations/puppet@production] smart: add tests for _parse_smart_info and _parse_smart_attributes

https://gerrit.wikimedia.org/r/587877

Change 588515 merged by Cwhite:
[operations/puppet@production] smart: simplify PD

https://gerrit.wikimedia.org/r/588515

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. May 6 2020, 10:00 AM

Change 588759 merged by Cwhite:
[operations/puppet@production] smart: prepare collect_smart_metrics for handling devices of different types

https://gerrit.wikimedia.org/r/588759

Change 588769 merged by Cwhite:
[operations/puppet@production] smart: add multiple hpsa controller support

https://gerrit.wikimedia.org/r/588769

Change 594989 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] smart: add multiple hpsa controller support

https://gerrit.wikimedia.org/r/594989

Change 594989 merged by Cwhite:
[operations/puppet@production] smart: add multiple hpsa controller support

https://gerrit.wikimedia.org/r/594989

colewhite closed this task as Resolved. May 7 2020, 6:02 PM

Deployed multiple hpsa controller support, and things are looking good. I will continue to monitor over the next few days.