
Port base host checks from Icinga to Alertmanager
Open, Needs Triage, Public

Description

This task tracks the porting of "base" (i.e. common to all hosts) checks from Icinga to Alertmanager.

There are two basic strategies:

  • If the check's logic is simple, we can drop a Prometheus node-exporter metric file onto the filesystem and run the check periodically (see the sketch after this list)
  • If the check's logic is not that simple, we can consider things like https://github.com/canonical/nrpe_exporter
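A minimal sketch of the first strategy, assuming the check periodically writes a file into the node-exporter textfile collector directory (the path below is illustrative; dpkg_success is the metric name used by the check_dpkg changes further down):

  # e.g. written to /var/lib/prometheus/node.d/dpkg.prom by a periodic systemd timer or cron job
  # HELP dpkg_success Whether the last dpkg state check succeeded (1 = ok, 0 = failure)
  # TYPE dpkg_success gauge
  dpkg_success 1

node-exporter exposes the metric on its next scrape, and an Alertmanager rule can then fire when the value is 0 (or when the metric goes stale).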

These are the current checks in profile::monitoring:

disk_space:

dpkg

  • prometheus check
  • removed icinga check

puppet_checkpuppetrun

check_eth - T333007

  • prometheus check
  • removed icinga check

check_systemd_state

check_cpufreq - T163220#8725482

  • prometheus check
  • removed icinga check

edac - T302639

  • prometheus check
  • removed icinga check

ipmi::monitor

check_dhclient -

Event Timeline

Change 902019 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] monitoring: cosmetic-only changes to check_dpkg

https://gerrit.wikimedia.org/r/902019

Change 902020 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] monitoring: write node-exporter dpkg_success metric

https://gerrit.wikimedia.org/r/902020

Change 902021 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] monitoring: simplify check_dpkg

https://gerrit.wikimedia.org/r/902021

Change 902019 merged by Filippo Giunchedi:

[operations/puppet@production] monitoring: cosmetic-only changes to check_dpkg

https://gerrit.wikimedia.org/r/902019

Change 902020 merged by Filippo Giunchedi:

[operations/puppet@production] monitoring: write node-exporter dpkg_success metric

https://gerrit.wikimedia.org/r/902020

Change 902021 merged by Filippo Giunchedi:

[operations/puppet@production] monitoring: simplify check_dpkg

https://gerrit.wikimedia.org/r/902021

Change 902457 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/hardware: Add disk space

https://gerrit.wikimedia.org/r/902457

Change 902701 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/systemd: add Check systemd state rule

https://gerrit.wikimedia.org/r/902701

jbond updated the task description. (Show Details)

Change 902754 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor

https://gerrit.wikimedia.org/r/902754

Change 902763 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] base::standard_packages: remove isc-dhcp-client

https://gerrit.wikimedia.org/r/902763

Change 902764 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/puppet-agent: Add alertmanager based check for disabled puppet

https://gerrit.wikimedia.org/r/902764

@fgiunchedi I was trying to work out how to do the prediction of space left.

I believe, based on some tinkering, that the expression below will give us the expected free space 24 hours from now, based on the growth over the past week, expressed as a percentage:

expr: |
  (predict_linear
    (node_filesystem_avail_bytes{
      fstype!~"(tmpfs|rpc_pipefs|debugfs|tracefs|fuse|docker|kubelet)",
      mountpoint!~"/srv/(sd[a-b][1-3]|nvme[0-9]n[0-9]p[0-9])"
    }[1w], 24 * 3600) 
    / node_filesystem_size_bytes{
       fstype!~"(tmpfs|rpc_pipefs|debugfs|tracefs|fuse|docker|kubelet)",
       mountpoint!~"/srv/(sd[a-b][1-3]|nvme[0-9]n[0-9]p[0-9])"
      }) * 100 < 0.05

I didn't find many good examples of systems with steadily increasing disk usage, or with usage that varied enough to analyse deeply. For the varying stats I did look at, the prediction seemed to make sense. I saved the dashboard I was playing around with:

https://grafana-rw.wikimedia.org/d/_oYF7pB4k/cathal-disk-space-prediction
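For reference, a complete alerting rule wrapping an expression like the one above might look roughly as follows; the alert name, for duration, labels and annotation text are placeholders for illustration, not the rule as merged:

  - alert: DiskSpaceLowPrediction
    expr: |
      (predict_linear(node_filesystem_avail_bytes{fstype!~"(tmpfs|rpc_pipefs|debugfs|tracefs|fuse|docker|kubelet)"}[1w], 24 * 3600)
        / node_filesystem_size_bytes{fstype!~"(tmpfs|rpc_pipefs|debugfs|tracefs|fuse|docker|kubelet)"}) * 100 < 0.05
    for: 1h
    labels:
      team: sre
      severity: warning
    annotations:
      summary: 'Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to run low on space within 24h'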

Change 902457 merged by Jbond:

[operations/alerts@master] team-sre/resource: Add disk space

https://gerrit.wikimedia.org/r/902457


This looks great -- thank you for the investigation! Did you run into any obvious mis-prediction so far? I'm happy to deploy the alert as a warning or similar and see what kind of signal-to-noise ratio we get.

Change 902754 merged by jenkins-bot:

[operations/alerts@master] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor

https://gerrit.wikimedia.org/r/902754

Change 902701 merged by jenkins-bot:

[operations/alerts@master] team-sre/systemd: add Check systemd state rule

https://gerrit.wikimedia.org/r/902701

Change 903613 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: temp downgrade systemdunitfailed to warning, exclude wmcs

https://gerrit.wikimedia.org/r/903613

Change 903613 merged by Filippo Giunchedi:

[operations/alerts@master] sre: temp downgrade systemdunitfailed to warning, exclude wmcs

https://gerrit.wikimedia.org/r/903613

Did you run into any obvious mis-prediction so far?

Yes! I was toying with the dashboard to try to get a sense of what works.

The prediction works fine if you play with it in Grafana. If a disk is growing steadily at a given rate, the prediction is spot on: if the disk has grown by 1G in the past 4 hours, you set the 'lookback' range for the prediction to 4 hours, and ask it to predict the usage 4 hours from now, it'll predict roughly another 1G of growth.
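A minimal sketch of that kind of query (the instance label is a placeholder):

  # predicted available bytes 4 hours from now, extrapolated from the last 4 hours of data
  predict_linear(node_filesystem_avail_bytes{instance="example1001:9100"}[4h], 4 * 3600)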

The problem I had was finding hosts where we have that kind of linear growth. Some of the ones I found do, but they typically have a "sawtooth" pattern, whereby the disk grows steadily, hits some upper level and drops (think logrotate running or something similar). If the time period used to make the prediction extends over one of these big jumps up or down, the prediction is harder to interpret.

Many of our hosts have even less regular patterns than that, so I'm unsure of exactly what parameters to use when deploying it.

Some hazy ideas about what might work are:

  • Have the alert look back in time (the range value in the query) only as far as the usage counter has been increasing, i.e. if the usage dropped from 2G to 1G 3 days ago and has been rising since, we look back over those 3 days of growth.
    • The upper bound of how far into the future we predict should be limited to this too.
  • Suppress the alert if the total usage is not 5-10% higher than it has been at any other point in the last X weeks, to avoid alerting on a regular increase followed by garbage collection/reduction (a sketch of this idea follows after this comment).

I'm happy to discuss this if you wish, as it is an interesting problem. While the prediction works well, it is based on a linear growth pattern; where we don't have that, we need to add additional logic or it'll likely be off.
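A rough sketch of the second idea, assuming a 2-week comparison window, a 1-day offset and a 5% margin (all arbitrary choices for illustration); in practice it would be combined (e.g. with and) with the prediction expression rather than used on its own:

  # only worth alerting when current usage exceeds its recent peak by at least 5%
  expr: |
    (node_filesystem_size_bytes - node_filesystem_avail_bytes)
      > 1.05 * max_over_time(
          (node_filesystem_size_bytes - node_filesystem_avail_bytes)[2w:1h] offset 1d
        )

The first idea (a dynamic lookback) is harder to express in plain PromQL.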

Change 904675 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add check for inodes free

https://gerrit.wikimedia.org/r/904675


Thank you for diving deep into the problem/investigation. Good point re: the sawtooth pattern, I didn't think about that! I'm also happy to discuss together and see if we can come up with a good signal-to-noise ratio based on predictions. Timing-wise I think we (o11y) are likely to revisit this idea further in the next fiscal year.

Timing-wise I think we (o11y) are likely to revisit this idea further in the next fiscal year

I think we (data persistence) have this issue: backups sometimes take multiple days, and they are purged only every so often, leading to patterns like this.

Last time I tried to provide a good prediction I found that it is _really_ hard. For example, most dbs are provisioned from 0 to 50% capacity very quickly, leading to potential predictions that they will run out of space in a few hours; but we know that is not going to happen, because provisioning and normal operation are different modes, and there is no easy way to differentiate them. Hopefully you have better luck than I did when I started thinking about this issue.

I would start by creating dashboards rather than alerts just in case.

Change 904792 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: let check_dpkg write prometheus stats

https://gerrit.wikimedia.org/r/904792

Change 904795 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add some leeway in PowerSupply alert

https://gerrit.wikimedia.org/r/904795

Change 904795 merged by jenkins-bot:

[operations/alerts@master] sre: add some leeway in PowerSupply alert

https://gerrit.wikimedia.org/r/904795

Change 904792 merged by Filippo Giunchedi:

[operations/puppet@production] profile: let check_dpkg write prometheus stats

https://gerrit.wikimedia.org/r/904792

Change 902763 merged by Jbond:

[operations/puppet@production] base::standard_packages: remove isc-dhcp-client

https://gerrit.wikimedia.org/r/902763

Change 914726 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] ipmi: remove check_ipmi_sensor, moved to Prometheus

https://gerrit.wikimedia.org/r/914726

Change 914726 merged by Filippo Giunchedi:

[operations/puppet@production] ipmi: remove check_ipmi_sensor, moved to Prometheus

https://gerrit.wikimedia.org/r/914726

Change 902764 merged by Jbond:

[operations/alerts@master] team-sre/puppet-agent: Add alertmanager based check for disabled puppet

https://gerrit.wikimedia.org/r/902764

CDanis added a subscriber: CDanis.

I've crossed out a few of the ones that seem to have already been covered by an Alertmanager rule added a while back.

Per my comment at T163220#9120653 I suggest simply removing the cpufreq alerting entirely.