
Monitor and alarm on SMART attributes [tracking]
Open, Medium, Public

Description

There are some important SMART attributes that can help identify failing disks, e.g. the number of offline or pending sectors, the result of the online self-test, and so on. We should keep track of those and monitor/alarm accordingly.

An implementation plan was discussed with @fgiunchedi and @Volans at FOSDEM 2017, to recap:

  • Install smartmontools on the machines we're checking SMART for (and eventually fleetwide on baremetal)
  • By default smartmontools will start smartd, check all removable devices, and send email on failure; it might be desirable to compare its results with our checking. Either way, decide whether or not smartd should be running. NB: by default smartd will happily keep emailing daily about problems it has found, so if we have the daemon running we'll need to avoid duplicate emails.
  • Gather a list of physical disks on the machine and their parameters to be accessible via smartctl.
  • For each physical disk gather its SMART attributes, ultimately calling smartctl and parsing its output (with pySMART or manually)
  • Export the list of attributes and their values as Prometheus metrics into a text file for node_exporter to pick up.
  • On the alerting side, in Icinga either use check_prometheus with a query to the Prometheus server, or another check that would connect to node_exporter and check the relevant SMART attributes we're interested in.
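The collection and export steps above could be sketched roughly as follows. This is a minimal illustration, not the script that was eventually deployed: the sample text imitates the attribute table printed by `smartctl -A`, the `device` label and the function names are assumptions made for this example, and only attributes with a plain integer raw value are handled.

```python
import os
import re

# Sample text imitating the `smartctl -A` attribute table, hard-coded so the
# sketch runs without a real disk or root privileges.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def metric_name(attribute):
    """Turn a smartctl attribute name into a Prometheus-safe metric name."""
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", attribute).lower()
    return "device_smart_" + cleaned


def parse_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for rows of the attribute table."""
    attributes = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        # Rows whose raw value does not start with a plain integer are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attributes[fields[1]] = int(fields[9])
    return attributes


def render_metrics(device, attributes):
    """Render attributes in the Prometheus text format, one sample per line."""
    lines = []
    for name, value in sorted(attributes.items()):
        # The "device" label is an assumption made for this sketch.
        lines.append('%s{device="%s"} %d' % (metric_name(name), device, value))
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write via a temporary file and rename, so readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write(content)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    print(render_metrics("sda", parse_attributes(SAMPLE_OUTPUT)), end="")
```

In a real run the sample text would be replaced by the output of `smartctl -A` for each disk, and `write_textfile` would target a `.prom` file inside whichever directory node_exporter's textfile collector is configured to read.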

Details

Project            Branch      Lines +/-  Subject
operations/puppet  production  +14 -3
operations/puppet  production  +1 -1
operations/puppet  production  +13 -1
operations/puppet  production  +12 -0
operations/puppet  production  +2 -0
operations/puppet  production  +12 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +2 -48
operations/puppet  production  +4 -4
operations/puppet  production  +4 -4
operations/puppet  production  +4 -4
operations/puppet  production  +8 -4
operations/puppet  production  +1 -1
operations/puppet  production  +20 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +2 -0
operations/puppet  production  +4 -4
operations/puppet  production  +11 -4
operations/puppet  production  +4 -0
operations/puppet  production  +4 -0
operations/puppet  production  +4 -0
operations/puppet  production  +14 -4
operations/puppet  production  +1 -5
operations/puppet  production  +6 -5
operations/puppet  production  +7 -1
operations/puppet  production  +5 -9
operations/puppet  production  +9 -0
operations/puppet  production  +5 -0
operations/puppet  production  +340 -0
operations/puppet  production  +20 -1

Event Timeline

fgiunchedi moved this task from Inbox to In progress on the observability board. Aug 28 2017, 4:18 PM
fgiunchedi updated the task description. Aug 30 2017, 1:25 PM

Change 378039 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] [WIP] smart: new module

https://gerrit.wikimedia.org/r/378039

Change 378039 merged by Filippo Giunchedi:
[operations/puppet@production] smart: new module

https://gerrit.wikimedia.org/r/378039

Change 383528 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: add check_smart to selectively enable ::smart class

https://gerrit.wikimedia.org/r/383528

Change 383529 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: rollout check_smart on a subset of codfw hosts

https://gerrit.wikimedia.org/r/383529

Change 383528 merged by Filippo Giunchedi:
[operations/puppet@production] profile: add check_smart to selectively enable ::smart class

https://gerrit.wikimedia.org/r/383528

Change 383529 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: rollout check_smart on a subset of codfw hosts

https://gerrit.wikimedia.org/r/383529

Change 385187 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: install hourly cron

https://gerrit.wikimedia.org/r/385187

Change 385187 merged by Filippo Giunchedi:
[operations/puppet@production] smart: install hourly cron

https://gerrit.wikimedia.org/r/385187

Change 386603 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: expand SMART health check rollout in codfw

https://gerrit.wikimedia.org/r/386603

Change 386603 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: expand SMART health check rollout in codfw

https://gerrit.wikimedia.org/r/386603

Change 386630 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: exclude maps-test hosts from smart health check

https://gerrit.wikimedia.org/r/386630

Change 386630 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: exclude maps-test hosts from smart health check

https://gerrit.wikimedia.org/r/386630

Change 388056 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: rollout smart health check to codfw

https://gerrit.wikimedia.org/r/388056

Change 388057 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: add ensure metaparameter

https://gerrit.wikimedia.org/r/388057

Change 388056 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: rollout smart health check to codfw

https://gerrit.wikimedia.org/r/388056

Change 388067 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] labstore: use require_package

https://gerrit.wikimedia.org/r/388067

Change 388067 merged by Filippo Giunchedi:
[operations/puppet@production] labstore: use require_package

https://gerrit.wikimedia.org/r/388067

Change 389484 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: enable SMART health collection in esams

https://gerrit.wikimedia.org/r/389484

Change 389485 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: enable SMART health collection in ulsfo

https://gerrit.wikimedia.org/r/389485

Change 388057 merged by Filippo Giunchedi:
[operations/puppet@production] smart: add ensure metaparameter

https://gerrit.wikimedia.org/r/388057

Change 389484 merged by Filippo Giunchedi:
[operations/puppet@production] smart: enable SMART health collection in esams

https://gerrit.wikimedia.org/r/389484

RobH awarded a token. Nov 7 2017, 4:32 PM

Change 389485 merged by Filippo Giunchedi:
[operations/puppet@production] smart: enable SMART health collection in ulsfo

https://gerrit.wikimedia.org/r/389485

Change 401491 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: partial eqiad SMART metrics rollout

https://gerrit.wikimedia.org/r/401491

Change 401491 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: partial eqiad SMART metrics rollout

https://gerrit.wikimedia.org/r/401491

Change 401707 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: special case for backports on Ubuntu

https://gerrit.wikimedia.org/r/401707

Change 401707 merged by Filippo Giunchedi:
[operations/puppet@production] smart: pin smartmontools to backports on Debian

https://gerrit.wikimedia.org/r/401707

Change 402023 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: bump timeout to 60s

https://gerrit.wikimedia.org/r/402023

Change 402024 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: ignore drbd disks

https://gerrit.wikimedia.org/r/402024

Change 402023 merged by Filippo Giunchedi:
[operations/puppet@production] smart: bump timeout to 60s

https://gerrit.wikimedia.org/r/402023

Change 402024 merged by Filippo Giunchedi:
[operations/puppet@production] smart: ignore drbd disks

https://gerrit.wikimedia.org/r/402024

Change 402056 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: extend eqiad SMART checking deployment

https://gerrit.wikimedia.org/r/402056

Change 402056 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: extend eqiad SMART checking deployment

https://gerrit.wikimedia.org/r/402056

Change 403621 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: extend SMART eqiad deployment

https://gerrit.wikimedia.org/r/403621

Change 403621 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: extend SMART eqiad deployment

https://gerrit.wikimedia.org/r/403621

Change 408543 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: extend SMART eqiad deployment

https://gerrit.wikimedia.org/r/408543

Change 408543 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: extend SMART eqiad deployment

https://gerrit.wikimedia.org/r/408543

Change 410412 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable SMART for db in eqiad

https://gerrit.wikimedia.org/r/410412

Change 410413 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable SMART for lab/labtest

https://gerrit.wikimedia.org/r/410413

Change 410412 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable SMART for db in eqiad

https://gerrit.wikimedia.org/r/410412

Change 412656 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable SMART on cp* in eqiad

https://gerrit.wikimedia.org/r/412656

Change 410413 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable SMART for misc wikimedia.org

https://gerrit.wikimedia.org/r/410413

Change 412656 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable SMART on cp* in eqiad

https://gerrit.wikimedia.org/r/412656

Change 412715 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable SMART on bastions

https://gerrit.wikimedia.org/r/412715

Change 412716 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable SMART on authdns

https://gerrit.wikimedia.org/r/412716

Change 412717 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable SMART on recursor

https://gerrit.wikimedia.org/r/412717

Change 412715 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable SMART on bastions

https://gerrit.wikimedia.org/r/412715

Change 412716 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable SMART on authdns

https://gerrit.wikimedia.org/r/412716

Change 412717 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable SMART on recursor

https://gerrit.wikimedia.org/r/412717

Change 412860 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable SMART for lab(test)

https://gerrit.wikimedia.org/r/412860

Change 412860 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable SMART for lab(test)

https://gerrit.wikimedia.org/r/412860

Change 422112 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: enable exporting SMART metrics by default

https://gerrit.wikimedia.org/r/422112

Change 422112 merged by Filippo Giunchedi:
[operations/puppet@production] base: enable exporting SMART metrics by default

https://gerrit.wikimedia.org/r/422112

Change 423871 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: fix apt::pin package definition

https://gerrit.wikimedia.org/r/423871

Change 423871 merged by Filippo Giunchedi:
[operations/puppet@production] smart: fix apt::pin package definition

https://gerrit.wikimedia.org/r/423871

Change 423874 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: exclude lvs100[1-6] with mpt controller from smart checks

https://gerrit.wikimedia.org/r/423874

Change 423874 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: exclude lvs100[1-6] with mpt controller from smart checks

https://gerrit.wikimedia.org/r/423874

fgiunchedi updated the task description. Apr 4 2018, 1:48 PM

The data collection is now deployed on bare metal across the fleet!

Alerting-wise, there are several metrics:

  • device_smart_healthy: the result of the self-test from SMART itself, zero or one

And the usual suspects from https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

  • device_smart_reallocated_sector_ct
  • device_smart_spin_retry_count
  • device_smart_reported_uncorrect
  • device_smart_command_timeout
  • device_smart_current_pending_sector
  • device_smart_offline_uncorrectable
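As an illustration of how a check might evaluate these metrics, here is a minimal sketch of an Icinga-style check working on Prometheus text-format samples. The sample data, the `device` label, the thresholds, and the function names are all assumptions made for this example; the return codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import re

# Hypothetical samples in the Prometheus text format, as a textfile might
# contain them. The "device" label is an assumption for this sketch.
SAMPLE_METRICS = """\
device_smart_healthy{device="sda"} 1
device_smart_healthy{device="sdb"} 0
device_smart_current_pending_sector{device="sda"} 8
device_smart_current_pending_sector{device="sdb"} 0
"""

LINE_RE = re.compile(r'^(\w+)\{device="([^"]+)"\}\s+(\S+)$')

# Hypothetical thresholds: warn when the value is strictly above the number.
WARN_IF_ABOVE = {
    "device_smart_reallocated_sector_ct": 0,
    "device_smart_current_pending_sector": 0,
    "device_smart_offline_uncorrectable": 0,
}


def parse(text):
    """Return a list of (metric, device, value) tuples; other lines are ignored."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            samples.append((match.group(1), match.group(2), float(match.group(3))))
    return samples


def check(text):
    """Return (exit_code, message) following the Nagios plugin convention."""
    critical, warning = [], []
    for name, device, value in parse(text):
        if name == "device_smart_healthy" and value != 1:
            critical.append(device)
        elif name in WARN_IF_ABOVE and value > WARN_IF_ABOVE[name]:
            warning.append("%s %s=%d" % (device, name, value))
    if critical:
        return 2, "CRITICAL: SMART health failed on " + ", ".join(sorted(critical))
    if warning:
        return 1, "WARNING: " + "; ".join(warning)
    return 0, "OK: all devices healthy"


if __name__ == "__main__":
    code, message = check(SAMPLE_METRICS)
    print(message)
    raise SystemExit(code)
```

A failed health result is treated as critical here and takes precedence over attribute warnings; whether individual attributes should page at all is a policy decision that this sketch does not settle.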

Change 424289 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: expand list of reported attributes

https://gerrit.wikimedia.org/r/424289

Change 424289 merged by Filippo Giunchedi:
[operations/puppet@production] smart: expand list of reported attributes

https://gerrit.wikimedia.org/r/424289

Change 427089 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] smart: normalize smartctl metric names into Prometheus names

https://gerrit.wikimedia.org/r/427089

Change 427089 merged by Filippo Giunchedi:
[operations/puppet@production] smart: normalize smartctl metric names into Prometheus names

https://gerrit.wikimedia.org/r/427089

Change 427654 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: alert on SMART health failure

https://gerrit.wikimedia.org/r/427654

Change 427654 merged by Filippo Giunchedi:
[operations/puppet@production] base: alert on SMART health failure

https://gerrit.wikimedia.org/r/427654

fgiunchedi updated the task description. May 7 2018, 12:15 PM
fgiunchedi moved this task from In progress to Up next on the observability board. Jul 23 2018, 3:06 PM
fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board. Aug 8 2018, 7:41 AM
GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:11 PM

Change 507634 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] smart-data-dump: add '-l error' to facter command to suppress warnings

https://gerrit.wikimedia.org/r/507634

Change 507634 merged by Dzahn:
[operations/puppet@production] smart-data-dump: add '-l error' to facter command to suppress warnings

https://gerrit.wikimedia.org/r/507634

Dzahn added a comment. May 2 2019, 2:47 AM

Made a new ticket at T222326 describing our current issue with cron spam from this caused by a facter bug.

Change 507763 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] smart-data-dump: add '-l error' to facter command to suppress warnings

https://gerrit.wikimedia.org/r/507763

Change 507763 merged by Jbond:
[operations/puppet@production] smart-data-dump: add '-l error' to facter command to suppress warnings

https://gerrit.wikimedia.org/r/507763

fgiunchedi moved this task from Up next to Backlog on the User-fgiunchedi board. Oct 9 2019, 11:31 PM

@fgiunchedi: Hi, all related patches in Gerrit have been merged or abandoned. Is there more to do in this task? Asking as you are set as task assignee. Thanks in advance! (You can change the task status via Add Action...Change Status in the dropdown menu.)

Indeed, this task is used as tracking for its children. Is there anything to do to mark it as such?

Aklapper renamed this task from Monitor and alarm on SMART attributes to Monitor and alarm on SMART attributes [tracking]. Mar 2 2020, 9:40 AM
Aklapper added a project: Epic.

Ah, thanks. Let's call it an epic. :P

fgiunchedi moved this task from Up next to Backlog on the observability board. Mar 16 2020, 2:18 PM
Aklapper removed fgiunchedi as the assignee of this task. Jun 19 2020, 4:26 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)