
"Is maps service alive?" check
Closed, Resolved, Public


@Yurik - T137617 does detailed service monitoring on each node (and probably shouldn't page people). What we're lacking is the higher-level "is this service alive?" check (which is usually just a simple request) pointed at http://kartotherian.svc.codfw.wmnet:6533 (and ditto for eqiad once it's alive), as well as the public-side checks on, both of which should alert/page on downtime.
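A "simple request" check of this kind can be sketched in a few lines of Python. This is only an illustration, not the check that was eventually deployed: the function name `check_alive`, the `opener` parameter (there so the request can be stubbed out), and the 5-second timeout are all assumptions. The return codes follow the usual Nagios/Icinga plugin convention (0 = OK, 2 = CRITICAL).

```python
import urllib.request

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_alive(url, timeout=5.0, opener=urllib.request.urlopen):
    """Issue one GET against `url` and return (exit_code, message).

    `opener` is a hypothetical hook so the network call can be replaced
    in tests; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except OSError as exc:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are all
        # subclasses of OSError, so any of them lands here.
        return CRITICAL, f"CRITICAL: request to {url} failed ({exc})"
    if 200 <= status < 300:
        return OK, f"OK: {url} answered HTTP {status}"
    return CRITICAL, f"CRITICAL: {url} answered HTTP {status}"
```

Something like `code, msg = check_alive("http://kartotherian.svc.codfw.wmnet:6533")` followed by `print(msg); sys.exit(code)` would turn this into a plugin-style script; whether the bare service root is the right path to request is not something this thread settles.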

Other services defined here

Event Timeline

There's also some (slightly different) monitoring usually configured in hieradata/common/lvs/configuration.yaml: the "icinga:" sub-keys you see on services there. The cache-level check for public-facing URLs is configured in the same file, in the same basic way.

The spec.yaml uses for node testing. This is a very small water tile, but it is at high zoom (above 10), which means it is not actually stored in Cassandra, but generated on the fly. If we want a lower-execution-cost tile, we can use another water tile -- it is always stored as is in Cassandra, no need for calculations.

Change 294396 had a related patch set uploaded (by BBlack):
kartotherian.svc.codfw monitoring

Change 294397 had a related patch set uploaded (by BBlack):
maps.wm.o monitoring

Note that the first commit, 294396, is codfw-only since eqiad isn't set up yet. It needs identical eqiad configuration once eqiad exists.

Change 294396 merged by BBlack:
kartotherian.svc.codfw monitoring

Basic checks are in place with the above merged. I'm not sure about the contact-groups setting on the kartotherian.svc check: it uses 'services-team'. Is that appropriate for this?

Heh, relatedly, check_wmf_service probably isn't the right thing to use in general, as it tries to parse the response and doesn't like PNG :)
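For a tile endpoint, one alternative to parsing the response body is to verify only that it looks like a PNG. The sketch below is a hypothetical illustration and says nothing about how check_wmf_service works internally; the function name `check_tile_response` and its arguments are assumptions. It relies on the fixed eight-byte signature that starts every PNG file.

```python
# First eight bytes of every PNG file, per the PNG specification.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def check_tile_response(status, content_type, body):
    """Return (exit_code, message) for an already-fetched tile response.

    Nothing is parsed: the check looks only at the status code, the
    media type, and the leading bytes of the body.
    """
    if status != 200:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    # Drop any parameters such as "; charset=..." before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "image/png":
        return CRITICAL, f"CRITICAL: unexpected Content-Type {content_type!r}"
    if not body.startswith(PNG_SIGNATURE):
        return CRITICAL, "CRITICAL: body does not start with the PNG signature"
    return OK, f"OK: PNG tile, {len(body)} bytes"
```

A signature test like this catches an HTML error page served with the wrong body, but it does not prove the image decodes or shows the expected tile.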

Change 294406 had a related patch set uploaded (by BBlack):
Remove kartotherian from monitor_services

Change 294406 merged by BBlack:
Remove kartotherian from monitor_services

Removed that, left in the actual LVS-level checks. Seems sane.

BBlack claimed this task.

Reopening for setting up full LVS checks.

Change 294454 had a related patch set uploaded (by Mobrovac):
Kartotherian: Set up LVS checks

Change 294454 merged by BBlack:
Kartotherian: Set up LVS checks

As noted in the commitmsg above, we should figure out icinga contactgroup stuff for this, too. Who is the correct team to get the alerts?

We should add a service check for kartotherian using service_checker on the LVS IP, pretty much as we do for other services, see:

and probably we should make it page too.

As far as paging goes, I'd stick with having public-facing services page us (so just kartotherian on the LVS).

I'll also add team-interactive to the LVS alerts as they might be able to act on those alerts.

Change 294723 had a related patch set uploaded (by Gehel):
Team-interactive receives maps alerts

Change 294723 merged by Gehel:
Team-interactive receives maps alerts

Check implemented, also alerting team-interactive.