Page if the grid engine master is unreachable
Closed, Declined (Public)

Description

This happened today (due to labservices issues) and icinga noticed but /I/ didn't notice until someone mentioned it on IRC.

Event Timeline

@Andrew What would be a command line to check if this is the case? Is it about a running process or actually connecting to it? Over which protocol?

Oh wait.. it's an existing Icinga check and this is only about changing the notification commands? That's easy enough.. taking it.

Who exactly should receive the pages please? I assume wmcs team.

Change 427833 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: enable paging and set contact_group for grid engine checks

https://gerrit.wikimedia.org/r/427833

@Andrew Are the "Auth DNS TCP" and "Auth DNS UDP" checks the ones this is about?

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=grid

I think my patch above will result in "all ops get SMS, and additionally wmcs people get emails (or whatever their notification method is)", but it would not (yet) mean "SMS are sent, but only to wmcs people". Is the latter what you wanted?

Dzahn removed Dzahn as the assignee of this task. (Apr 30 2018, 3:25 PM)
Dzahn moved this task from Up next to Externally blocked on the observability board.

Change 427833 abandoned by Dzahn:
icinga: enable paging and set contact_group for grid engine checks

Reason:
probably not what was intended on ticket, but not sure

https://gerrit.wikimedia.org/r/427833

@Dzahn That's pdns and not the same thing. The service here would be the gridengine gridmaster services (unreachable would probably mean its network is shot?)

I'd say we should be checking that port 6444 is available on tools-grid-master.tools.eqiad.wmflabs if we wanted to verify if it was specifically reachable. Otherwise, we could monitor for the /usr/lib/gridengine/sge_qmaster process if the problem was the process dying. (I lack the full context here).
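As a rough illustration of the two options above, here is a minimal sketch of both checks in the style of a Nagios/Icinga plugin. The function names `check_tcp` and `check_process` are hypothetical, the exit codes follow the usual plugin convention (0 = OK, 2 = CRITICAL), and matching on `argv[0]` is a simplifying assumption about how the qmaster is started. It is not an existing Icinga check.

```python
"""Sketch: Nagios-style checks for TCP reachability and process presence."""
import socket
from pathlib import Path

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_tcp(host, port, timeout=5.0):
    """Return (status, message) after attempting a TCP connection to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK: {host}:{port} accepts TCP connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return CRITICAL, f"CRITICAL: cannot connect to {host}:{port} ({exc})"


def check_process(executable, proc_root="/proc"):
    """Return (status, message) based on whether a process has argv[0] == executable.

    Linux-only: scans <proc_root>/<pid>/cmdline, whose arguments are NUL-separated.
    """
    for cmdline in Path(proc_root).glob("[0-9]*/cmdline"):
        try:
            argv0 = cmdline.read_bytes().split(b"\0", 1)[0].decode(errors="replace")
        except OSError:
            continue  # process exited, or we lack permission to read it
        if argv0 == executable:
            return OK, f"OK: {executable} is running (pid {cmdline.parent.name})"
    return CRITICAL, f"CRITICAL: no {executable} process found"
```

A wrapper would call `check_tcp("tools-grid-master.tools.eqiad.wmflabs", 6444)` from a host that can reach the VM, or `check_process("/usr/lib/gridengine/sge_qmaster")` on the grid master itself, then print the message and exit with the returned status.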

@Bstorm Ok, thanks. I could add the check for port 6444 to Icinga. I note though that the original ticket says "icinga noticed", so I was trying to find which existing check that is. Then we just have to make that send SMS (instead of just email) to resolve this ticket.

Point! I can take a peek at that. I just jumped on this one randomly and was responding to comments. :)

Here's a question I have: how did Icinga notice? I don't think it monitors the VPS VMs. The grid master is a tools VM.

Maybe that is what the ticket intended!

@Bstorm On Icinga we have a couple of checks named "Check for gridmaster host resolution {TCP,UDP}". Could it be those?

modules/profile/manifests/openstack/base/pdns/auth/monitor/host_check.pp: description => 'Check for gridmaster host resolution UDP',
modules/profile/manifests/openstack/base/pdns/auth/monitor/host_check.pp: description => 'Check for gridmaster host resolution TCP',

class profile::openstack::base::pdns::auth::monitor::host_check(
    $target_host = hiera('profile::openstack::base::pdns::host'),
    $target_fqdn = hiera('profile::openstack::base::pdns::monitor::target_fqdn'),
    ) {

    monitoring::service { "${target_host} Resolution":
        description   => 'Auth DNS',
        check_command => "check_dns!${target_host}",
    }

    monitoring::service { "${target_host} Auth DNS UDP":
        description   => 'Check for gridmaster host resolution UDP',
        check_command => "check_dig!${target_host}!${target_fqdn}",
    }

    monitoring::service { "${target_host} Auth DNS TCP":
        description   => 'Check for gridmaster host resolution TCP',
        check_command => "check_dig_tcp!${target_host}!${target_fqdn}",
    }
}

I'm not completely familiar with our network security policies but I believe icinga1001 can't resolve .wmflabs addresses and connections to 10.68.20.158:6444 aren't allowed either.

It seems this check would have to be done on... Shinken ("The Exorcist" soundtrack plays in the background).

@GTirloni The check in icinga works by using cloudcontrol1003 as the checking host.

@Andrew Are those the right things you wanted to alert on? It's just a DNS check, vs making sure the actual services are live, but the wording of the ticket suggests that's what this was. If so, we probably need to add the new grid master and then turn on the paging in general.

In theory there are tests that submit things to the grid via tools-checker that ensure the gridmaster itself is functioning, but previously we had issues where DNS was faulty and that cascaded down, IIRC, so we put in some service-level checking there.

@Bstorm Yes, noticing and paging when the actual grid master VMs are down would be a good start. So probably that's just setting a flag and then copy/pasting for the additional grid.

This is a strange ticket. It seems like we have accrued all the pieces needed to find out if the grid is down now, but this one is specific and perhaps too historical to be as clearly resolved now as it may have been when created.