Page MenuHomePhabricator

Migrate Foundations Prometheus alerts to AlertManager
Closed, ResolvedPublic

Description

Followup from a chat with @Volans, there are a few Prometheus-based alerts in Puppet that should be straightforward to migrate to alerting rules (i.e. alerts.git or puppet if for some reason we want to set alerts "administratively" since alerts.git is self-service even for non-root / non-sre).

Specifically:

  • modules/profile/manifests/monitoring.pp: monitoring::check_prometheus { 'smart_healthy':
  • modules/profile/manifests/monitoring.pp: monitoring::check_prometheus { 'edac_correctable_errors':
  • modules/profile/manifests/monitoring.pp: monitoring::check_prometheus { 'edac_syslog_events':
  • modules/profile/manifests/monitoring.pp: monitoring::check_prometheus { 'filesystem_avail_bigger_than_size':
  • modules/profile/manifests/prometheus/alerts.pp: monitoring::check_prometheus { 'widespread-puppet-agent-fail':
  • modules/profile/manifests/prometheus/alerts.pp: monitoring::check_prometheus { 'widespread-puppet-agent-no-resources':

The EDAC checks will be dropped for some ipmi based monitoring

These could be either still valid (e.g. smart_healthy) or could be possibly discarded as not relevant anymore (e.g. filesystem_avail_bigger_than_size)

Event Timeline

lmata triaged this task as Medium priority.Nov 16 2021, 4:45 PM

Change 757489 had a related patch set uploaded (by Volans; author: Volans):

[operations/alerts@master] [WIP] team-sre: add hardware-related checks

https://gerrit.wikimedia.org/r/757489

Change 757489 merged by jenkins-bot:

[operations/alerts@master] Add SmartNotHealthy to monitor for disk smart alerts

https://gerrit.wikimedia.org/r/757489

Change 757489 merged by jenkins-bot:

[operations/alerts@master] Add SmartNotHealthy to monitor for disk smart alerts

https://gerrit.wikimedia.org/r/757489

@jhathaway I saw the alert firing today, looks like it is working as expected so that's great! I believe the old icinga alert can be removed now (i.e. both are firing e.g. for aqs1007 ATM)

@jhathaway I saw the alert firing today, looks like it is working as expected so that's great! I believe the old icinga alert can be removed now (i.e. both are firing e.g. for aqs1007 ATM)

That is good news, however it appears that the two alerts are false positives. I dug into it a bit and it appears both drives are virtual drives exposed by the idrac:

root@aqs1007:~# lsblk -S /dev/{sdm,sdh}
NAME HCTL       TYPE VENDOR   MODEL             REV TRAN
sdh  10:0:0:0   disk iDRAC    MAS001           0329 usb
sdm  11:0:0:0   disk iDRAC    MAS025           0329 usb

I then went digging in the Prometheus smartmon script, https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/master/smartmon.py, only to discover that we wrote our own.

However neither theirs nor ours seems to provide a way to skip a certain vendor or model of a drive.

I think we should augment the script to query the vendor via /sys/, i.e. cat /sys/block/sdm/device/vendor in addition to or instead of querying via smartctl, so that we can exclude idrac devices.

@jhathaway I saw the alert firing today, looks like it is working as expected so that's great! I believe the old icinga alert can be removed now (i.e. both are firing e.g. for aqs1007 ATM)

That is good news, however it appears that the two alerts are false positives. I dug into it a bit and it appears both drives are virtual drives exposed by the idrac:

root@aqs1007:~# lsblk -S /dev/{sdm,sdh}
NAME HCTL       TYPE VENDOR   MODEL             REV TRAN
sdh  10:0:0:0   disk iDRAC    MAS001           0329 usb
sdm  11:0:0:0   disk iDRAC    MAS025           0329 usb

I then went digging in the Prometheus smartmon script, https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/blob/master/smartmon.py, only to discover that we wrote our own.

However neither theirs nor ours seems to provide a way to skip a certain vendor or model of a drive.

I think we should augment the script to query the vendor via /sys/, i.e. cat /sys/block/sdm/device/vendor in addition to or instead of querying via smartctl, so that we can exclude idrac devices.

+1 to exclude list, thank you for digging out the root cause. Just as an historical/contextual note I can't remember at the minute why we went with our implementation of smartmon.py (it is possible it wasn't available at the time though). Having said that, nowadays it might make sense to move to upstream' smartmon (100% out of scope for this task, but putting it out there)

The other thing to note is that aqs1007 and its idrac devices seem definitely a mis-setup of idrac / unexpected situation (e.g. aqs1008 doesn't have the idrac disks) and the host is pending decom (according to T304938, I can't find the decom task itself).

Change 780990 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] smart_data_dump: skip over iDRAC devices

https://gerrit.wikimedia.org/r/780990

+1 to exclude list, thank you for digging out the root cause. Just as an historical/contextual note I can't remember at the minute why we went with our implementation of smartmon.py (it is possible it wasn't available at the time though). Having said that, nowadays it might make sense to move to upstream' smartmon (100% out of scope for this task, but putting it out there)

I added a patch to exclude iDRAC devices. It appears our code predates upstream's code and it looks like it would take a fair bit of work to the upstream code in order to reach feature parity with ours.

The other thing to note is that aqs1007 and its idrac devices seem definitely a mis-setup of idrac / unexpected situation (e.g. aqs1008 doesn't have the idrac disks) and the host is pending decom (according to T304938, I can't find the decom task itself).

Good point, however I suspect we may see these devices pop up from time to time.

Change 780990 merged by JHathaway:

[operations/puppet@production] smart_data_dump: skip over iDRAC devices

https://gerrit.wikimedia.org/r/780990

Change 785921 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] icinga: remove SMART check

https://gerrit.wikimedia.org/r/785921

Change 785921 merged by JHathaway:

[operations/puppet@production] icinga: remove SMART check

https://gerrit.wikimedia.org/r/785921

I'm auditing puppet.git for candidate Prometheus-based alerts and realized the list in the task description was missing the puppet-related alerts (added now)

Change 901546 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:monitoring: add docs and fix liniting errors

https://gerrit.wikimedia.org/r/901546

Change 901546 merged by Jbond:

[operations/puppet@production] P:monitoring: add docs and fix liniting errors

https://gerrit.wikimedia.org/r/901546

Change 901557 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] hardware: add Memory correctable errors (EDAC) alert

https://gerrit.wikimedia.org/r/901557

Change 901557 abandoned by Jbond:

[operations/alerts@master] hardware: add Memory correctable errors (EDAC) alert

Reason:

will create a new test based on impi data T302639

https://gerrit.wikimedia.org/r/901557

Change 902110 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:monitoring: drop check for filesystem_avail_bigger_than_size

https://gerrit.wikimedia.org/r/902110

Change 902110 merged by Jbond:

[operations/puppet@production] P:monitoring: drop check for filesystem_avail_bigger_than_size

https://gerrit.wikimedia.org/r/902110

jbond changed the task status from Open to In Progress.Mar 23 2023, 10:52 AM
jbond claimed this task.

Change 902323 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/puppet-agent: Add widespread puppet failure alert

https://gerrit.wikimedia.org/r/902323

Change 902348 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/puppet-agent: Add widespread puppet failure (no resources) alert

https://gerrit.wikimedia.org/r/902348

changes pending for the two widespread-puppet checks

Change 902323 merged by jenkins-bot:

[operations/alerts@master] team-sre/puppet-agent: Add widespread puppet failure alert

https://gerrit.wikimedia.org/r/902323

Change 902348 merged by Jbond:

[operations/alerts@master] team-sre/puppet-agent: Add widespread puppet failure (no resources) alert

https://gerrit.wikimedia.org/r/902348

Change 903687 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: introduce cluster vs site wide puppet failures

https://gerrit.wikimedia.org/r/903687

Change 903687 merged by Jbond:

[operations/alerts@master] sre: introduce cluster vs site wide puppet failures

https://gerrit.wikimedia.org/r/903687

Change 904182 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove edac alerts

https://gerrit.wikimedia.org/r/904182

Change 904182 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove edac alerts

https://gerrit.wikimedia.org/r/904182