This happened today (due to labservices issues) and icinga noticed but /I/ didn't notice until someone mentioned it on IRC.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
icinga: enable paging and set contact_group for grid engine checks | operations/puppet | production | +4 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Bstorm | T199271 Upgrade the tools gridengine system
Declined | | None | T177850 Page if the grid engine master is unreachable
Event Timeline
@Andrew What would be a command line to check if this is the case? Is it about a running process or actually connecting to it? Over which protocol?
Oh wait.. it's an existing Icinga check and this is only about changing the notification commands? That's easy enough.. taking it.
Change 427833 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: enable paging and set contact_group for grid engine checks
@Andrew Are the "Auth DNS TCP" and "Auth DNS UDP" checks the ones this is about?
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=grid
I think my patch above will result in "all ops get SMS and additionally wmcs people get emails (or whatever their notification method is)" but it would not (yet) mean "SMS are sent but only to wmcs" people. Is the latter what you wanted?
Change 427833 abandoned by Dzahn:
icinga: enable paging and set contact_group for grid engine checks
Reason:
probably not what was intended on ticket, but not sure
@Dzahn That's pdns and not the same thing. The service here would be the gridengine gridmaster services (unreachable would probably mean its network is shot?)
I'd say we should be checking that port 6444 is available on tools-grid-master.tools.eqiad.wmflabs if we wanted to verify if it was specifically reachable. Otherwise, we could monitor for the /usr/lib/gridengine/sge_qmaster process if the problem was the process dying. (I lack the full context here).
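A minimal sketch of what the port check could look like in the existing puppet monitoring style (the resource title, `check_tcp` command name, and `wmcs` contact group are assumptions based on this thread, not verified against the repo):

```puppet
# Hypothetical sketch: alert if the gridengine qmaster port stops answering.
# The check_command and contact_group values are assumed, not confirmed config.
monitoring::service { 'tools-grid-master qmaster port':
    description   => 'Grid engine master qmaster port',
    check_command => 'check_tcp!6444',
    contact_group => 'wmcs',
}
```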
@Bstorm Ok, thanks. I could add the check for port 6444 to Icinga. I note, though, that the original ticket says "icinga noticed", so I was trying to find which existing check that is. Then we just have to make that send SMS (instead of just email) to resolve this ticket.
Point! I can take a peek at that. I just jumped on this one randomly and was responding to comments. :)
Here's a question I have, how did Icinga notice? I don't think it monitors the VPS VMs. The grid master is a tools VM.
@Bstorm on Icinga we have a couple of checks named "Check for gridmaster host resolution {TCP,UDP}", could it be those ones?
```
modules/profile/manifests/openstack/base/pdns/auth/monitor/host_check.pp: description => 'Check for gridmaster host resolution UDP',
modules/profile/manifests/openstack/base/pdns/auth/monitor/host_check.pp: description => 'Check for gridmaster host resolution TCP',
```
```puppet
class profile::openstack::base::pdns::auth::monitor::host_check(
    $target_host = hiera('profile::openstack::base::pdns::host'),
    $target_fqdn = hiera('profile::openstack::base::pdns::monitor::target_fqdn'),
) {
    monitoring::service { "${target_host} Resolution":
        description   => 'Auth DNS',
        check_command => "check_dns!${target_host}",
    }
    monitoring::service { "${target_host} Auth DNS UDP":
        description   => 'Check for gridmaster host resolution UDP',
        check_command => "check_dig!${target_host}!${target_fqdn}",
    }
    monitoring::service { "${target_host} Auth DNS TCP":
        description   => 'Check for gridmaster host resolution TCP',
        check_command => "check_dig_tcp!${target_host}!${target_fqdn}",
    }
}
```
I'm not completely familiar with our network security policies but I believe icinga1001 can't resolve .wmflabs addresses and connections to 10.68.20.158:6444 aren't allowed either.
It seems this check would have to be done on... Shinken ("The Exorcist" soundtrack plays in the background).
Incidentally, we now have two grid masters :) So I suppose this should monitor tools-sgegrid-master as well.
@Andrew Are those the right things you wanted to alert on? It's just a DNS check, vs making sure the actual services are live, but the wording of the ticket suggests that's what this was. If so, we probably need to add the new grid master and then turn on the paging in general.
In theory there are tests that submit things to the grid via tools-checker that ensure the gridmaster itself is functioning, but previously we had issues where DNS was faulty and that cascaded down, IIRC, so we put in some service-level checking there.
@Bstorm yes, noticing and paging when the actual grid master VMs are down would be a good start. So probably that's just setting a flag and then copy/pasting for the additional grid.
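The "flag plus copy/paste" step above might look roughly like this in the monitoring::service style shown earlier (the `critical` flag as the paging switch, the host names, and the `wmcs` contact group are all assumptions from this thread, not verified config):

```puppet
# Hypothetical sketch: 'critical' is assumed to be the flag that makes a check
# page; the second resource is the copy/paste for the new grid master.
monitoring::service { 'tools-grid-master gridengine':
    description   => 'Grid engine master (old grid)',
    check_command => 'check_tcp!6444',
    contact_group => 'wmcs',
    critical      => true,
}
monitoring::service { 'tools-sgegrid-master gridengine':
    description   => 'Grid engine master (new grid)',
    check_command => 'check_tcp!6444',
    contact_group => 'wmcs',
    critical      => true,
}
```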
This is a strange ticket. It seems like we have accrued all the pieces needed to find out if the grid is down now, but this one is specific and perhaps too historical to be as clearly resolved now as it may have been when created.