
Create alerts for bastion hosts - Usage and latency
Open, Medium, Public

Description

As we have latency troubles with tools-bastions (most commonly due to IO- or CPU-intensive tools being run directly on them), it seems like a good idea to set up alerts for IO, CPU and (maybe?) ssh latency for these hosts.

  • CPU alerts
  • IO alerts
  • SSH latency alerts?

Event Timeline

chasemp triaged this task as Medium priority. Feb 5 2018, 7:27 PM

Note we already have ssh alerts for bastion hosts, but they currently use a version check. Measuring latency to ssh connect might(?) allow us to better estimate when users are having a poor experience in the shell.
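As a rough illustration of the ssh-connect latency idea above, here is a minimal Python sketch. It is not the existing check: it times how long a TCP connect plus the first banner line takes. The function names (ssh_banner_latency, latency_state) and the 1 s / 3 s thresholds are assumptions made up for this example, and it only approximates what a user feels, since it stops before key exchange and login.

```python
import socket
import time


def ssh_banner_latency(host, port=22, timeout=5.0):
    """Seconds from starting the TCP connect until the first banner line arrives.

    Sketch only: assumes the server sends its "SSH-" identification as the
    first line, which is the common case but not guaranteed by the protocol.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        data = b""
        # Read until the first newline (bounded so a misbehaving peer cannot stall us).
        while b"\n" not in data and len(data) < 1024:
            chunk = sock.recv(256)
            if not chunk:
                break
            data += chunk
        elapsed = time.monotonic() - start
    first_line = data.split(b"\n", 1)[0]
    if not first_line.startswith(b"SSH-"):
        raise RuntimeError("no SSH banner received from %s:%d" % (host, port))
    return elapsed


def latency_state(seconds, warn=1.0, crit=3.0):
    """Map a latency to a Nagios-style state; thresholds are hypothetical."""
    if seconds >= crit:
        return "CRITICAL"
    if seconds >= warn:
        return "WARNING"
    return "OK"
```

Something like latency_state(ssh_banner_latency("some-bastion-host")) would be run periodically from the monitoring host; the hostname here is a placeholder, and the thresholds would need tuning against real measurements.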

IO alerts

I think this might exist already? Not sure if it actually works though. When I notice a host having high NFS lag it's usually 'puppet staleness' errors instead.

I see we can use check_graphite_series_threshold to alert on loadavg, as we already do with iowait (data from https://graphite-labs.wikimedia.org/). Do we have a number of cores variable that we could use in the config? Apparently total_cpu is not a thing in graphite right now.

Do we have a number of cores variable that we could use in the config? Apparently total_cpu is not a thing in graphite right now.

Yes, for jessie, but somehow not for trusty...
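To show why the core count matters for a load alert, here is a small Python sketch of the arithmetic: divide the load average by the number of cores and compare the result against thresholds. This is a standalone illustration, not the shinken/graphite check itself; the function names and the 1.0 / 2.0 per-core thresholds are assumptions for the example.

```python
import os


def load_per_core(load1, cores):
    """Normalize a 1-minute load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load1 / cores


def load_state(load1, cores, warn=1.0, crit=2.0):
    """Nagios-style state for a per-core load; thresholds are hypothetical."""
    ratio = load_per_core(load1, cores)
    if ratio >= crit:
        return "CRITICAL"
    if ratio >= warn:
        return "WARNING"
    return "OK"


def local_load_state():
    """Evaluate the host this runs on (Unix only: uses os.getloadavg)."""
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return load_state(load1, cores)
```

With these example thresholds, a load of 8 is a WARNING on an 8-core host but would be CRITICAL on a 4-core one, which is why a fixed loadavg threshold without the core count is hard to share across hosts.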

Change 413781 had a related patch set uploaded (by Chico Venancio; owner: Chico Venancio):
[operations/puppet@production] shinken: WMCS: add load alerts for tools-bastion-0[23]

https://gerrit.wikimedia.org/r/413781

Change 413781 abandoned by Rush:
shinken: WMCS: add load alerts for tools-bastion-0[23]

Reason:
too old to be effective I believe

https://gerrit.wikimedia.org/r/413781

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

At this point, this will likely be fairly simple to fulfill in the service established by T266050: Build Prometheus service for use by all Cloud VPS projects and their instances.

I daresay some basics could be hacked in now, but it would likely break up the simplicity of the current class.