Page if the grid engine master is unreachable
Closed, Declined

Assigned To

None

Authored By

	Andrew
	Oct 10 2017, 3:35 PM

Description

This happened today (due to labservices issues) and icinga noticed but /I/ didn't notice until someone mentioned it on IRC.

Details

	Subject	Repo	Branch	Lines +/-
	icinga: enable paging and set contact_group for grid engine checks	operations/puppet	production	+4 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Bstorm	T199271 Upgrade the tools gridengine system
		Declined		None	T177850 Page if the grid engine master is unreachable

Event Timeline

Andrew created this task. · Oct 10 2017, 3:35 PM

Restricted Application added a subscriber: Aklapper. · Oct 10 2017, 3:35 PM

bd808 edited projects, added cloud-services-team (Kanban), Toolforge, observability; removed cloud-services-team. · Oct 12 2017, 8:13 PM

• chasemp added a parent task: T178405: create a wmcs alerting group in icinga and review alerting. · Mar 13 2018, 1:45 PM

@Andrew What would be a command line to check if this is the case? Is it about a running process or actually connecting to it? Over which protocol?

Oh wait.. it's an existing Icinga check and this is only about changing the notification commands? That's easy enough.. taking it.

Who exactly should receive the pages please? I assume wmcs team.

Dzahn moved this task from Inbox to Up next on the observability board. · Apr 3 2018, 5:26 PM

Change 427833 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: enable paging and set contact_group for grid engine checks

https://gerrit.wikimedia.org/r/427833

gerritbot added a project: Patch-For-Review. · Apr 20 2018, 12:12 AM

@Andrew Are the "Auth DNS TCP" and "Auth DNS UDP" checks the ones this is about?

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=grid

I think my patch above will result in "all ops get SMS and additionally wmcs people get emails (or whatever their notification method is)", but it would not (yet) mean "SMS are sent, but only to wmcs people". Is the latter what you wanted?

• chasemp removed a parent task: T178405: create a wmcs alerting group in icinga and review alerting. · Apr 26 2018, 7:34 PM

Dzahn removed Dzahn as the assignee of this task. · Apr 30 2018, 3:25 PM

Dzahn moved this task from Up next to Externally blocked on the observability board.

Change 427833 abandoned by Dzahn:
icinga: enable paging and set contact_group for grid engine checks

Reason:
probably not what was intended on ticket, but not sure

https://gerrit.wikimedia.org/r/427833

@Dzahn That's pdns and not the same thing. The service here would be the gridengine gridmaster services (unreachable would probably mean its network is shot?)

I'd say we should be checking that port 6444 is available on tools-grid-master.tools.eqiad.wmflabs if we wanted to verify if it was specifically reachable. Otherwise, we could monitor for the /usr/lib/gridengine/sge_qmaster process if the problem was the process dying. (I lack the full context here).
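For illustration only, a reachability probe along those lines could look like the sketch below. It is not an existing Icinga check; the function names `port_reachable` and `check_status` are hypothetical, and the only details taken from this ticket are the host name and port 6444. The return codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import socket
from typing import Tuple


def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


def check_status(host: str, port: int, timeout: float = 5.0) -> Tuple[int, str]:
    """Map the probe result to a Nagios-style (exit code, message) pair."""
    if port_reachable(host, port, timeout):
        return 0, f"OK: {host}:{port} is reachable"
    return 2, f"CRITICAL: {host}:{port} is unreachable"
```

Calling `check_status("tools-grid-master.tools.eqiad.wmflabs", 6444)` from a host that can resolve and reach the VM would give the pair a wrapper script could print and exit with. Note that this only proves the port accepts connections, not that sge_qmaster is scheduling jobs.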

• Bstorm added a parent task: T199271: Upgrade the tools gridengine system. · Jul 11 2018, 4:51 PM

@Bstorm Ok, thanks. I could add the check for port 6444 to Icinga. I note, though, that the original ticket says "icinga noticed", so I was trying to find which existing check that is. Then we just have to make that send SMS (instead of just email) to resolve this ticket.

Point! I can take a peek at that. I just jumped on this one randomly and was responding to comments. :)

Here's a question I have: how did Icinga notice? I don't think it monitors the VPS VMs. The grid master is a tools VM.

Maybe that is what the ticket intended!

@Bstorm on Icinga we have a couple of checks named "Check for gridmaster host resolution {TCP,UDP}". Could it be those?

modules/profile/manifests/openstack/base/pdns/auth/monitor/host_check.pp:        description   => 'Check for gridmaster host resolution UDP',
modules/profile/manifests/openstack/base/pdns/auth/monitor/host_check.pp:        description   => 'Check for gridmaster host resolution TCP',

class profile::openstack::base::pdns::auth::monitor::host_check(
    $target_host = hiera('profile::openstack::base::pdns::host'),
    $target_fqdn = hiera('profile::openstack::base::pdns::monitor::target_fqdn'),
    ) {

    monitoring::service { "${target_host} Resolution":
        description   => 'Auth DNS',
        check_command => "check_dns!${target_host}",
    }

    monitoring::service { "${target_host} Auth DNS UDP":
        description   => 'Check for gridmaster host resolution UDP',
        check_command => "check_dig!${target_host}!${target_fqdn}",
    }

    monitoring::service { "${target_host} Auth DNS TCP":
        description   => 'Check for gridmaster host resolution TCP',
        check_command => "check_dig_tcp!${target_host}!${target_fqdn}",
    }
}

I'm not completely familiar with our network security policies but I believe icinga1001 can't resolve .wmflabs addresses and connections to 10.68.20.158:6444 aren't allowed either.

It seems this check would have to be done on... Shinken ("The Exorcist" soundtrack plays in the background).

• GTirloni unsubscribed. · Dec 20 2018, 6:50 PM

@GTirloni The check in icinga works by using cloudcontrol1003 as the checking host.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudservices1003&service=Check+for+gridmaster+host+resolution+TCP

Incidentally, we now have two grid masters :) So I suppose this should monitor tools-sgegrid-master as well.

@Andrew Are those the right things you wanted to alert on? It's just a DNS check, vs making sure the actual services are live, but the wording of the ticket suggests that's what this was. If so, we probably need to add the new grid master and then turn on the paging in general.

In theory there are tests that submit things to the grid via tools-checker
that ensure the gridmaster itself is functioning, but previously we had
issues where DNS was faulty and that cascaded down, IIRC, so we put in some
service-level checking there.

@Bstorm yes, noticing and paging when the actual grid master VMs are down would be a good start. So probably that's just setting a flag and then copy/pasting for the additional grid.

• GTirloni unsubscribed. · Mar 21 2019, 9:06 PM

Andrew moved this task from Inbox to Clinic Duty on the cloud-services-team (Kanban) board. · Sep 11 2019, 3:17 PM

Maintenance_bot removed a project: Patch-For-Review. · Sep 11 2019, 4:10 PM

This is a strange ticket. It seems like we have accrued all the pieces needed to find out if the grid is down now, but this one is specific and perhaps too historical to be as clearly resolved now as it may have been when created.
