Create alerts for bastion hosts - Usage and latency
Open, Medium, Public

Description

As we have latency troubles with tools-bastions (most commonly due to IO- or CPU-intensive tools being run directly on them), it seems like a good idea to set up alerts for IO, CPU and (maybe?) ssh latency for these hosts.

  • CPU alerts
  • IO alerts
  • SSH latency alerts?

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 5 2018, 7:21 PM
chasemp triaged this task as Medium priority. Feb 5 2018, 7:27 PM

Note we already have ssh alerts for bastion hosts, but they currently use a version check. Measuring latency to ssh connect might(?) allow us to better estimate when users are having a poor experience in the shell.
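For illustration only, one way to measure "latency to ssh connect" is to time how long it takes to open a TCP connection and receive the server's SSH identification banner. This is a minimal sketch, not the existing check: the function names and the warning/critical thresholds are hypothetical, and it does not authenticate or start a shell, so it only approximates what a user experiences.

```python
import socket
import time


def ssh_banner_latency(host, port=22, timeout=5.0):
    """Return seconds from starting the TCP connect until the SSH banner arrives.

    Only the identification string (e.g. b"SSH-2.0-...") is read; no login
    is attempted, so this approximates connect latency, not shell latency.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        banner = sock.recv(256)
    elapsed = time.monotonic() - start
    if not banner.startswith(b"SSH-"):
        raise ValueError("unexpected banner: %r" % banner)
    return elapsed


def classify(latency, warn=1.0, crit=3.0):
    """Map a latency in seconds to a Nagios-style state.

    The default thresholds are placeholders (assumptions), not tuned values.
    """
    if latency >= crit:
        return "CRITICAL"
    if latency >= warn:
        return "WARNING"
    return "OK"
```

Calling `classify(ssh_banner_latency(host))` with the bastion's hostname would give a single state per probe; a real alert would probably want several samples before firing.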

zhuyifei1999 added a subscriber: zhuyifei1999.

IO alerts

I think this might exist already? Not sure if it actually works though. When I notice a host having high NFS lag it's usually 'puppet staleness' errors instead.

I see we can use check_graphite_series_threshold to get the loadavg, as we are already doing with iowait (from https://graphite-labs.wikimedia.org/). Do we have a number of cores variable that we could use in the config? Apparently total_cpu is not a thing in graphite right now.

Do we have a number of cores variable that we could use in the config? Apparently total_cpu is not a thing in graphite right now.

Yes, for jessie, but somehow not for trusty...
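The reason the core count matters is that a raw load average is only meaningful relative to the number of cores: a load of 4 is fine on an 8-core host and saturated on a 2-core one. A minimal sketch of that normalisation follows, assuming the core count is supplied from somewhere (for example a per-host config value). The function names and thresholds are hypothetical, and this is plain Python rather than Shinken or Graphite configuration.

```python
def load_per_core(loadavg, cores):
    """Normalise a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return loadavg / cores


def check_load(loadavg, cores, warn=1.0, crit=2.0):
    """Return a Nagios-style state for a load average.

    The thresholds are expressed per core and are placeholder
    values (assumptions), not tuned for any real host.
    """
    ratio = load_per_core(loadavg, cores)
    if ratio >= crit:
        return "CRITICAL"
    if ratio >= warn:
        return "WARNING"
    return "OK"
```

On the host itself, `os.getloadavg()[0]` and `os.cpu_count()` from the standard library could supply the two inputs; for a Graphite-based check the core count would have to come from a metric or from configuration.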

Change 413781 had a related patch set uploaded (by Chico Venancio; owner: Chico Venancio):
[operations/puppet@production] shinken: WMCS: add load alerts for tools-bastion-0[23]

https://gerrit.wikimedia.org/r/413781

Change 413781 abandoned by Rush:
shinken: WMCS: add load alerts for tools-bastion-0[23]

Reason:
too old to be effective I believe

https://gerrit.wikimedia.org/r/413781

Aklapper removed Chicocvenancio as the assignee of this task. Jun 19 2020, 4:19 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Bstorm added a subscriber: Bstorm. Oct 21 2020, 4:22 PM

At this point, this will likely be fairly simple to fulfill in the service established by T266050: Build Prometheus service for use by all Cloud VPS projects and their instances.

I daresay some basics could be hacked in now, but it would likely break up the simplicity of the current class.