Let's get some icinga alerts so we know when things are going sideways in Beta Cluster.
Version: unspecified
Severity: normal
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T53494 Use Beta cluster as a true canary for code deployments (epic)
Stalled | | None | T53497 Setup monitoring for Beta Cluster (tracking)
Resolved | | yuvipanda | T72141 Determine first pass list of icinga-alerting data from graphite.wmflabs
My first pass list (puppet fails on important vms):
(In reply to Yuvi Panda from comment #1)
- No puppet run for more than 1h
eek! (that's one month) we're better now though
I just realized that you can't hit Labs URLs from prod, and so we can't actually do this right now because of that :(
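Two options:
- File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
- Wait for labmon1001 to be setup.
Unsure if Ops would be ok with (1), and (2) is blocked on the network config.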
(In reply to Yuvi Panda from comment #4)
Two options:
- File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
- Wait for labmon1001 to be setup.
Unsure if Ops would be ok with (1), and (2) is blocked on the network config.
(In reply to Greg Grossmeier from comment #6)
(In reply to Yuvi Panda from comment #4)
Two options:
- File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
- Wait for labmon1001 to be setup.
Unsure if Ops would be ok with (1), and (2) is blocked on the network config.
That RT is now done (thanks mark!). So now just waiting on labmon1001 to be set up, I presume.
12:38 < YuviPanda> greg-g: labmon is setup - labmon.wmflabs.org :) Am sending metrics on to it now
12:38 < YuviPanda> I'll rename it to graphite.wmflabs.org soon
There is now monitoring for puppet failures and disk space (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon). The puppet failure check needs further tweaking, since it currently does not trigger when puppet fails with a syntax error or something like that.
Note that the alerts are for all the machines in betalabs, not just for the ones listed. I added more features to our check_graphite script to make this kind of monitoring easy / possible.
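For anyone wondering what a check like this boils down to, here is a minimal sketch of threshold evaluation over several hosts' series at once. It is not the actual check_graphite code. It assumes the input is the parsed JSON that Graphite's render API returns with format=json (a list of series, each with a "target" and a "datapoints" list of [value, timestamp] pairs). The metric names, the sample data and the 60/120 thresholds are made up for illustration.

```python
# Illustrative only: NOT the real check_graphite script.
# Input shape follows Graphite's render API with format=json.

# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Ranking used to pick the overall result (an assumption of this sketch).
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def latest_value(datapoints):
    """Return the most recent non-null value from [value, timestamp] pairs."""
    for value, _timestamp in reversed(list(datapoints)):
        if value is not None:
            return value
    return None


def check_series(series, warning, critical):
    """Check every series against the thresholds.

    Returns (status, messages), where status is the most severe state
    found and messages lists each series that is not OK.
    """
    if not series:
        return UNKNOWN, ["no series returned"]
    worst = OK
    messages = []
    for entry in series:
        value = latest_value(entry["datapoints"])
        if value is None:
            status = UNKNOWN
        elif value >= critical:
            status = CRITICAL
        elif value >= warning:
            status = WARNING
        else:
            status = OK
        if status != OK:
            messages.append("%s: %s" % (entry["target"], value))
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst, messages


# Hypothetical sample: minutes since the last puppet run on three hosts.
SAMPLE = [
    {"target": "host-a.puppet.minutes_since_last_run",
     "datapoints": [[12, 1410000000], [None, 1410000060]]},
    {"target": "host-b.puppet.minutes_since_last_run",
     "datapoints": [[75, 1410000000], [80, 1410000060]]},
    {"target": "host-c.puppet.minutes_since_last_run",
     "datapoints": [[300, 1410000000], [305, 1410000060]]},
]

if __name__ == "__main__":
    status, messages = check_series(SAMPLE, warning=60, critical=120)
    print(status, messages)
```

Evaluating a wildcard target this way is what lets one check cover every machine in the project rather than a hand-maintained list of hosts.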
Change 159694 had a related patch set uploaded by Yuvipanda:
labmon: Add low space check for / on betalabs
Change 159701 had a related patch set uploaded by Yuvipanda:
labmon: Add puppet freshness check for betalabs
Also, who is responsible for fixing the errors that pop up? There are puppet failures on videoscaler-01 now, and I've no idea how to fix those.
(On that note, I'd also remove myself from the alert groups once the initial setup has stabilized.)
Change 159701 merged by Andrew Bogott:
labmon: Add puppet freshness check for betalabs
Yuvi: Thanks for the first pass work! Once you remove yourself from the list of people who get the alerts, feel free to close this bug (the "first pass" of this is done).
(In reply to Greg Grossmeier from comment #17)
Yuvi: Thanks for the first pass work! Once you remove yourself from the list
of people who get the alerts, feel free to close this bug (the "first pass"
of this is done).
Done waiting, closing for housekeeping reasons :)