
"Is maps service alive?" check
Closed, Resolved · Public

Description

@Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lacking is the higher-level "is this service alive?" check (which is usually just a simple request) pointed at http://kartotherian.svc.codfw.wmnet:6533 (and ditto for eqiad once it's alive), as well as the public-side checks on https://maps.wikimedia.org, both of which should alert/page on downtime.

Other services defined here

Details

Related Gerrit Patches:
operations/puppet : production · Team-interactive receives maps alerts
operations/puppet : production · Kartotherian: Set up LVS checks
operations/puppet : production · Remove kartotherian from monitor_services
operations/puppet : production · maps.wm.o monitoring
operations/puppet : production · kartotherian.svc.codfw monitoring

Event Timeline

Yurik created this task. Jun 14 2016, 10:32 PM
Restricted Application added a subscriber: Zppix. Jun 14 2016, 10:32 PM

There's also some (slightly different) monitoring usually configured in hieradata/common/lvs/configuration.yaml: the "icinga:" sub-keys you see on services there. The cache-level check for public-facing URLs is configured in the same file in the same basic way.

The spec.yaml uses https://maps.wikimedia.org/osm-intl/11/828/655.png for node testing. This is a very small water tile, but it is at high zoom (above 10), which means it is not actually stored in Cassandra, but generated on the fly. If we want a lower-execution-cost tile, we can use another water tile https://maps.wikimedia.org/osm-intl/6/23/24.png -- it is always stored as is in Cassandra, no need for calculations.
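To make the trade-off between the two tiles concrete, here is a minimal sketch that parses a tile URL and reports whether it would be rendered on the fly. The zoom cutoff of 10 is taken from the comment above, not from the Kartotherian configuration; the function names and the `MAX_STORED_ZOOM` constant are hypothetical.

```python
from urllib.parse import urlparse

# Assumption from the comment above: zoom levels above 10 are generated on
# the fly, lower zooms are served as stored. Not read from any real config.
MAX_STORED_ZOOM = 10


def parse_tile_url(url):
    """Return (style, z, x, y) for a URL shaped like /style/z/x/y.png."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 4 or not parts[3].endswith(".png"):
        raise ValueError(f"not a tile URL: {url}")
    style = parts[0]
    z = int(parts[1])
    x = int(parts[2])
    y = int(parts[3][: -len(".png")])
    # At zoom z a tile grid has 2**z columns and 2**z rows.
    if not (0 <= x < 2 ** z and 0 <= y < 2 ** z):
        raise ValueError(f"tile coordinates out of range for zoom {z}: {url}")
    return style, z, x, y


def is_generated_on_the_fly(url):
    """True if the tile's zoom is above the assumed storage cutoff."""
    _, z, _, _ = parse_tile_url(url)
    return z > MAX_STORED_ZOOM
```

With the two URLs mentioned above, the zoom 11 tile is classified as generated and the zoom 6 tile as stored, which is why the latter is the cheaper target for a frequent check.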

Change 294396 had a related patch set uploaded (by BBlack):
kartotherian.svc.codfw monitoring

https://gerrit.wikimedia.org/r/294396

Change 294397 had a related patch set uploaded (by BBlack):
maps.wm.o monitoring

https://gerrit.wikimedia.org/r/294397

Note that the first commit, 294396, is codfw-only since eqiad isn't set up yet. It needs identical eqiad configuration once eqiad exists.

Change 294396 merged by BBlack:
kartotherian.svc.codfw monitoring

https://gerrit.wikimedia.org/r/294396

Change 294397 merged by BBlack:
maps.wm.o monitoring

https://gerrit.wikimedia.org/r/294397

Basic checks are in place with the above merged. I'm not sure about the contact-groups stuff on the kartotherian.svc check: it uses 'services-team'. Is that appropriate for this?

Heh, relatedly, check_wmf_service probably isn't the right thing to use in general, as it tries to parse the response and doesn't like PNG :)
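For a tile endpoint, an "alive" check only needs to confirm that a PNG came back, without parsing the body as structured data. The following is a rough sketch of that idea and not how check_wmf_service or the Icinga checks are implemented; `tile_response_ok` and `check_tile` are hypothetical names, and the 5-second timeout is an arbitrary choice.

```python
import urllib.request

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def tile_response_ok(status, content_type, body):
    """Decide whether an HTTP response looks like a healthy PNG tile."""
    if status != 200:
        return False
    # Ignore any parameters after the media type, e.g. "; charset=...".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "image/png":
        return False
    return body.startswith(PNG_SIGNATURE)


def check_tile(url, timeout=5.0):
    """Fetch url and apply tile_response_ok; any network error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            content_type = resp.headers.get("Content-Type", "")
            return tile_response_ok(resp.status, content_type, resp.read())
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False
```

Keeping the decision logic in `tile_response_ok` separate from the fetch makes it possible to exercise the check against canned responses without touching the network.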

Change 294406 had a related patch set uploaded (by BBlack):
Remove kartotherian from monitor_services

https://gerrit.wikimedia.org/r/294406

Change 294406 merged by BBlack:
Remove kartotherian from monitor_services

https://gerrit.wikimedia.org/r/294406

Removed that, left in the actual LVS-level checks. Seems sane.

BBlack closed this task as Resolved. Jun 15 2016, 12:00 AM
BBlack claimed this task.
mobrovac reopened this task as Open. Jun 15 2016, 9:35 AM

Reopening for setting up full LVS checks.

Change 294454 had a related patch set uploaded (by Mobrovac):
Kartotherian: Set up LVS checks

https://gerrit.wikimedia.org/r/294454

Change 294454 merged by BBlack:
Kartotherian: Set up LVS checks

https://gerrit.wikimedia.org/r/294454

As noted in the commit message above, we should figure out the Icinga contact-group stuff for this, too. Which team should get the alerts?

Gehel added a subscriber: Joe. Jun 16 2016, 3:05 PM

We should add a service check for kartotherian using service_checker on the LVS IP, pretty much as we do for other services, see:
https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/lvs/manifests/monitor_services.pp
and we should probably make it page too.
As far as paging goes, I'd stick with having only public-facing services page us (so just kartotherian on the LVS).

I'll also add team-interactive to the LVS alerts as they might be able to act on those alerts.

Change 294723 had a related patch set uploaded (by Gehel):
Team-interactive receives maps alerts

https://gerrit.wikimedia.org/r/294723

Change 294723 merged by Gehel:
Team-interactive receives maps alerts

https://gerrit.wikimedia.org/r/294723

Gehel added a comment. Jun 16 2016, 8:44 PM

Check implemented, also alerting team-interactive.

Gehel closed this task as Resolved. Jun 16 2016, 8:44 PM