Page MenuHomePhabricator

Migrate Foundations Prometheus alerts to AlertManager
Open, MediumPublic

Description

Followup from a chat with @Volans, there are a few Prometheus-based alerts in Puppet that should be straightforward to migrate to alerting rules (i.e. alerts.git or puppet if for some reason we want to set alerts "administratively" since alerts.git is self-service even for non-root / non-sre).

Specifically:

modules/profile/manifests/monitoring.pp:        monitoring::check_prometheus { 'smart_healthy':
modules/profile/manifests/monitoring.pp:    monitoring::check_prometheus { 'edac_correctable_errors':
modules/profile/manifests/monitoring.pp:    monitoring::check_prometheus { 'edac_syslog_events':
modules/profile/manifests/monitoring.pp:    monitoring::check_prometheus { 'filesystem_avail_bigger_than_size':

These could be either still valid (e.g. smart_healthy) or could be possibly discarded as not relevant anymore (e.g. filesystem_avail_bigger_than_size)

Event Timeline

lmata triaged this task as Medium priority.Nov 16 2021, 4:45 PM

Change 757489 had a related patch set uploaded (by Volans; author: Volans):

[operations/alerts@master] [WIP] team-sre: add hardware-related checks

https://gerrit.wikimedia.org/r/757489

Change 757489 merged by jenkins-bot:

[operations/alerts@master] Add SmartNotHealthy to monitor for disk smart alerts

https://gerrit.wikimedia.org/r/757489

Change 757489 merged by jenkins-bot:

[operations/alerts@master] Add SmartNotHealthy to monitor for disk smart alerts

https://gerrit.wikimedia.org/r/757489

@jhathaway I saw the alert firing today, looks like it is working as expected so that's great! I believe the old icinga alert can be removed now (i.e. both are firing e.g. for aqs1007 ATM)

@jhathaway I saw the alert firing today, looks like it is working as expected so that's great! I believe the old icinga alert can be removed now (i.e. both are firing e.g. for aqs1007 ATM)

That is good news, however it appears that the two alerts are false positives. I dug into it a bit and it appears both drives are virtual drives exposed by the idrac:

root@aqs1007:~# lsblk -S /dev/{sdm,sdh}
NAME HCTL       TYPE VENDOR   MODEL             REV TRAN
sdh  10:0:0:0   disk iDRAC    MAS001           0329 usb
sdm  11:0:0:0   disk iDRAC    MAS025           0329 usb

I then went digging in the Prometheus smartmon script, https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/master/smartmon.py, only to discover that we wrote our own.

However neither theirs nor ours seems to provide a way to skip a certain vendor or model of a drive.

I think we should augment the script to query the vendor via /sys/, i.e. cat /sys/block/sdm/device/vendor in addition to or instead of querying via smartctl, so that we can exclude idrac devices.

@jhathaway I saw the alert firing today, looks like it is working as expected so that's great! I believe the old icinga alert can be removed now (i.e. both are firing e.g. for aqs1007 ATM)

That is good news, however it appears that the two alerts are false positives. I dug into it a bit and it appears both drives are virtual drives exposed by the idrac:

root@aqs1007:~# lsblk -S /dev/{sdm,sdh}
NAME HCTL       TYPE VENDOR   MODEL             REV TRAN
sdh  10:0:0:0   disk iDRAC    MAS001           0329 usb
sdm  11:0:0:0   disk iDRAC    MAS025           0329 usb

I then went digging in the Prometheus smartmon script, https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/master/smartmon.py, only to discover that we wrote our own.

However neither theirs nor ours seems to provide a way to skip a certain vendor or model of a drive.

I think we should augment the script to query the vendor via /sys/, i.e. cat /sys/block/sdm/device/vendor in addition to or instead of querying via smartctl, so that we can exclude idrac devices.

+1 to exclude list, thank you for digging out the root cause. Just as an historical/contextual note I can't remember at the minute why we went with our implementation of smartmon.py (it is possible it wasn't available at the time though). Having said that, nowadays it might make sense to move to upstream' smartmon (100% out of scope for this task, but putting it out there)

The other thing to note is that aqs1007 and its idrac devices seem definitely a mis-setup of idrac / unexpected situation (e.g. aqs1008 doesn't have the idrac disks) and the host is pending decom (according to T304938, I can't find the decom task itself).

Change 780990 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] smart_data_dump: skip over iDRAC devices

https://gerrit.wikimedia.org/r/780990

+1 to exclude list, thank you for digging out the root cause. Just as an historical/contextual note I can't remember at the minute why we went with our implementation of smartmon.py (it is possible it wasn't available at the time though). Having said that, nowadays it might make sense to move to upstream' smartmon (100% out of scope for this task, but putting it out there)

I added a patch to exclude iDRAC devices. It appears our code predates upstream's code and it looks like it would take a fair bit of work to the upstream code in order to reach feature parity with ours.

The other thing to note is that aqs1007 and its idrac devices seem definitely a mis-setup of idrac / unexpected situation (e.g. aqs1008 doesn't have the idrac disks) and the host is pending decom (according to T304938, I can't find the decom task itself).

Good point, however I suspect we may see these devices pop up from time to time.

Change 780990 merged by JHathaway:

[operations/puppet@production] smart_data_dump: skip over iDRAC devices

https://gerrit.wikimedia.org/r/780990

Change 785921 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] icinga: remove SMART check

https://gerrit.wikimedia.org/r/785921

Change 785921 merged by JHathaway:

[operations/puppet@production] icinga: remove SMART check

https://gerrit.wikimedia.org/r/785921