Determine first pass list of icinga-alerting data from graphite.wmflabs
Closed, Resolved (Public)

Assigned To

Authored By

	greg
	Aug 28 2014, 10:05 PM

Description

Let's get some icinga alerts so we know when things are going sideways in Beta Cluster.

Version: unspecified
Severity: normal

Details

Reference: bz70141

Related Objects

Status	Assigned	Task
Open	None	T53494 Use Beta cluster as a true canary for code deployments (epic)
Stalled	None	T53497 Setup monitoring for Beta Cluster (tracking)
Resolved	yuvipanda	T72141 Determine first pass list of icinga-alerting data from graphite.wmflabs

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 22 2014, 3:34 AM

• bzimport added a project: Beta-Cluster-Infrastructure.

• bzimport set Reference to bz70141.

greg created this task. Aug 28 2014, 10:05 PM

No puppet run for more than 1h
Presence of any puppet failures

What else?

My first pass list (puppet fails on important vms):

deployment-prep.deployment-bastion.puppetagent.failed_events.value > 0
deployment-prep.deployment-mediawiki01.puppetagent.failed_events.value > 0
deployment-prep.deployment-mediawiki02.puppetagent.failed_events.value > 0
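
A rough sketch of how one of these conditions could be evaluated, assuming the metric is fetched through Graphite's render API with `format=json` (which returns a list of series, each with `[value, timestamp]` datapoints). This is not the check_graphite script; `render_url`, `latest_value`, `puppet_failed`, and the 10-minute window are hypothetical choices made for illustration.

```python
from urllib.parse import urlencode

# Host taken from this task; the render endpoint is standard Graphite.
GRAPHITE_BASE = "http://graphite.wmflabs.org"


def render_url(target, window="-10min"):
    """Build a render API URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{GRAPHITE_BASE}/render/?{query}"


def latest_value(series):
    """Return the most recent non-null value in a series, or None."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def puppet_failed(series):
    """True when the latest failed_events value is above zero."""
    value = latest_value(series)
    return value is not None and value > 0


if __name__ == "__main__":
    target = "deployment-prep.deployment-bastion.puppetagent.failed_events.value"
    print(render_url(target))
    # Hand-made sample in the shape the render API returns, not real data.
    sample = {"target": target, "datapoints": [[0, 1409263800], [2, 1409263860], [None, 1409263920]]}
    print(puppet_failed(sample))
```

Skipping trailing nulls matters because the newest Graphite bucket is often still empty when the check runs.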

(In reply to Yuvi Panda from comment #1)

No puppet run for more than 1h

http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1409263881.662&target=deployment-prep.deployment-mediawiki01.puppetagent.time_since_last_run.value&yUnitSystem=si&from=00%3A00_20140731&until=23%3A59_20140828

eek! (that's one month) we're better now though

I just realized that you can't hit Labs URLs from prod, so we can't actually do this right now :(

Two options:

(1) File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
(2) Wait for labmon1001 to be set up.

Unsure if Ops would be ok with (1), and (2) is blocked on the network config.

deployment-prep.deployment-mediawiki01.diskspace.root.byte_free.value < 2 gigs
deployment-prep.deployment-mediawiki02.diskspace.root.byte_free.value < 2 gigs
deployment-prep.deployment-mediawiki01.diskspace._var.byte_free.value < 1 gig
deployment-prep.deployment-mediawiki02.diskspace._var.byte_free.value < 1 gig
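
The disk space thresholds above could be written down as data, roughly like this. It is only a sketch: it assumes "gig" means GiB (1024^3 bytes) and that the latest value for each metric has already been fetched from Graphite. `RULES`, `thresholds_by_target`, and `low_space` are hypothetical names, not part of any existing script.

```python
# Assumption: "gig" is read as GiB; the comments above do not say which unit is meant.
GIB = 1024 ** 3

HOSTS = ["deployment-mediawiki01", "deployment-mediawiki02"]

# Metric suffix -> minimum free bytes before alerting.
RULES = {
    "diskspace.root.byte_free.value": 2 * GIB,
    "diskspace._var.byte_free.value": 1 * GIB,
}


def thresholds_by_target():
    """Expand hosts x rules into full Graphite metric names."""
    return {
        f"deployment-prep.{host}.{suffix}": minimum
        for host in HOSTS
        for suffix, minimum in RULES.items()
    }


def low_space(latest_by_target):
    """Return the sorted metric names whose latest free bytes are below threshold."""
    thresholds = thresholds_by_target()
    return sorted(
        target
        for target, value in latest_by_target.items()
        if target in thresholds and value is not None and value < thresholds[target]
    )


if __name__ == "__main__":
    # Made-up values for illustration only.
    sample = {
        "deployment-prep.deployment-mediawiki01.diskspace.root.byte_free.value": 1 * GIB,
        "deployment-prep.deployment-mediawiki02.diskspace._var.byte_free.value": 5 * GIB,
    }
    print(low_space(sample))
```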

(In reply to Yuvi Panda from comment #4)

Two options:

(1) File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.

(2) Wait for labmon1001 to be set up.

Unsure if Ops would be ok with (1), and (2) is blocked on the network config.

(2) https://rt.wikimedia.org/Ticket/Display.html?id=8163

(In reply to Greg Grossmeier from comment #6)

(In reply to Yuvi Panda from comment #4)

Two options:

(1) File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.

(2) Wait for labmon1001 to be set up.

Unsure if Ops would be ok with (1), and (2) is blocked on the network config.

(2) https://rt.wikimedia.org/Ticket/Display.html?id=8163

That RT is now done (thanks mark!). So now just waiting on labmon1001 to be set up, I presume.

12:38 < YuviPanda> greg-g: labmon is setup - labmon.wmflabs.org :) Am sending metrics on to it now
12:38 < YuviPanda> I'll rename it to graphite.wmflabs.org soon

There now exists monitoring for puppet failures and disk space (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon). Puppet failures need to be tweaked further since they currently do not bail when puppet fails with a syntax error or something like that.

Note that the alerts are for all the machines in betalabs, not just for the ones listed. I added more features to our check_graphite script to make this kind of monitoring easy / possible.
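
For illustration, covering every machine with one query can be sketched with a Graphite wildcard target: a `*` in the host position makes the render API return one series per matching host. This is a guess at the general approach, not the actual check_graphite change; `WILDCARD_TARGET`, `host_of`, and `failing_hosts` are hypothetical names.

```python
# One target that matches the puppet failure metric on every host in the project.
WILDCARD_TARGET = "deployment-prep.*.puppetagent.failed_events.value"


def host_of(target):
    """Extract the host segment from 'deployment-prep.<host>.<metric...>'."""
    return target.split(".")[1]


def failing_hosts(response):
    """Return sorted hosts whose latest non-null value is above zero.

    `response` is the parsed JSON list from the render API (format=json),
    one entry per series matched by the wildcard.
    """
    failing = []
    for series in response:
        values = [v for v, _timestamp in series["datapoints"] if v is not None]
        if values and values[-1] > 0:
            failing.append(host_of(series["target"]))
    return sorted(failing)


if __name__ == "__main__":
    # Hand-made response for illustration; the values are not real data.
    sample = [
        {"target": "deployment-prep.deployment-videoscaler01.puppetagent.failed_events.value",
         "datapoints": [[1, 1410000000], [None, 1410000060]]},
        {"target": "deployment-prep.deployment-bastion.puppetagent.failed_events.value",
         "datapoints": [[0, 1410000000], [0, 1410000060]]},
    ]
    print(failing_hosts(sample))
```

New instances are picked up without touching the check definition, which is the main appeal of the wildcard form.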

Change 159694 had a related patch set uploaded by Yuvipanda:
labmon: Add low space check for / on betalabs

https://gerrit.wikimedia.org/r/159694

Change 159701 had a related patch set uploaded by Yuvipanda:
labmon: Add puppet freshness check for betalabs

https://gerrit.wikimedia.org/r/159701

Also, who is responsible for fixing the errors that pop up? There are puppet failures on videoscaler-01 now, and I've no idea how to fix those.

(On that note, I'll also remove myself from the alert groups once the initial setup has stabilized.)

Change 159694 merged by Andrew Bogott:
labmon: Add low space check for / on betalabs

https://gerrit.wikimedia.org/r/159694

Change 159701 merged by Andrew Bogott:
labmon: Add puppet freshness check for betalabs

https://gerrit.wikimedia.org/r/159701

Yuvi: Thanks for the first pass work! Once you remove yourself from the list of people who get the alerts, feel free to close this bug (the "first pass" of this is done).

(In reply to Greg Grossmeier from comment #17)

Yuvi: Thanks for the first pass work! Once you remove yourself from the list
of people who get the alerts, feel free to close this bug (the "first pass"
of this is done).

Done waiting, closing for housekeeping reasons :)

greg moved this task from To Triage to Done on the Beta-Cluster-Infrastructure board. Jan 8 2015, 5:34 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:58 PM
