
Create alerts for bastion hosts - Usage and latency
Open, Normal, Public

Description

As we have latency troubles with tools-bastions (most commonly due to IO- or CPU-intensive tools being run directly on them), it seems like a good idea to set up alerts for IO, CPU and (maybe?) ssh latency for these hosts.

  • CPU alerts
  • IO alerts
  • SSH latency alerts?

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 5 2018, 7:21 PM
chasemp triaged this task as Normal priority. Feb 5 2018, 7:27 PM

Note we already have ssh alerts for bastion hosts, but they currently use a version check. Measuring the latency of the ssh connect might(?) allow us to better estimate when users are having a poor experience in the shell.
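To make the ssh latency idea concrete, here is a minimal sketch of what such a probe could measure: the time to open a TCP connection and receive the server's SSH identification line. This is not the existing check; the function names and the warning/critical thresholds are placeholders chosen for illustration. It also only covers connection setup, so it would not by itself reflect slowness inside the shell (for example NFS lag after login).

```python
import socket
import time


def ssh_banner_latency(host, port=22, timeout=5.0, max_lines=20):
    """Return (seconds, banner) for connecting and reading the SSH banner.

    A server may send other lines before its identification string, so we
    skip up to max_lines lines until one starts with "SSH-".
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rb") as reader:
            for _ in range(max_lines):
                line = reader.readline()
                if not line:  # peer closed the connection
                    break
                text = line.decode("ascii", errors="replace").strip()
                if text.startswith("SSH-"):
                    return time.monotonic() - start, text
    raise ConnectionError("no SSH identification string received")


def classify(elapsed, warn=1.0, crit=3.0):
    """Map a latency in seconds to a Nagios-style state.

    The default thresholds are placeholders, not values from this task.
    """
    if elapsed >= crit:
        return "CRITICAL"
    if elapsed >= warn:
        return "WARNING"
    return "OK"
```

A check would call ssh_banner_latency() against each bastion and pass the elapsed time to classify(); sensible thresholds would need to come from observed baselines.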

zhuyifei1999 added a subscriber: zhuyifei1999.

IO alerts

I think this might exist already? Not sure if it actually works though. When I notice a host having high NFS lag it's usually 'puppet staleness' errors instead.

I see we can use check_graphite_series_threshold to get the loadavg, as we already do with iowait (from https://graphite-labs.wikimedia.org/). Do we have a number of cores variable that we could use in the config? Apparently total_cpu is not a thing in graphite right now.

Do we have a number of cores variable that we could use in the config? Apparently total_cpu is not a thing in graphite right now.

Yes, for jessie, but somehow not for trusty...
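The reason the core count matters for a load alert is that loadavg only means something relative to the number of CPUs. A rough sketch of the arithmetic follows, assuming the core count has to be supplied by hand when graphite does not expose it. The per-core thresholds and function names are illustrative assumptions, not values from the patch.

```python
def absolute_thresholds(cores, warn_per_core=1.0, crit_per_core=2.0):
    """Turn per-core load thresholds into absolute loadavg thresholds.

    Useful when the check can only compare the raw loadavg series against
    fixed numbers. The per-core defaults are placeholders.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * warn_per_core, cores * crit_per_core


def load_state(loadavg, cores, warn_per_core=1.0, crit_per_core=2.0):
    """Classify a raw loadavg for a host with the given number of cores."""
    warn, crit = absolute_thresholds(cores, warn_per_core, crit_per_core)
    if loadavg >= crit:
        return "CRITICAL"
    if loadavg >= warn:
        return "WARNING"
    return "OK"
```

With these placeholder values, a host with 4 cores would warn at a loadavg of 4.0 and go critical at 8.0. Hard-coding thresholds per host works around the missing metric, but they have to be updated if an instance is resized.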

Change 413781 had a related patch set uploaded (by Chico Venancio; owner: Chico Venancio):
[operations/puppet@production] shinken: WMCS: add load alerts for tools-bastion-0[23]

https://gerrit.wikimedia.org/r/413781