Review alerting scheme for Maps
Closed, Resolved (Public)

Description

Before calling Maps production-ready, we need to review the alerting scheme to ensure that we page the right people when something goes wrong and that we don't wake anyone up when things are not that bad.

Summary of current alerts:

Critical services
These services are used directly to serve content to users.

| Service | Check description | Contact groups |
| ------- | ----------------- | -------------- |
| cassandra | TCP check on Cassandra port | admins,team-services |
| cassandra | Service check | admins |
| kartotherian | Service checker (Swagger based) | admins |
| maps | 5xx rate | admins |
| maps-lb | HTTP checks on HTTP, HTTPS, IPv4 and IPv6 for each DC | admins,sms,admins |
| varnish | standard varnish checks | admins |
| kartotherian LVS | LVS check | admins,sms,admins |
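
To make the "5xx rate" row concrete, here is a minimal sketch of what such a check computes. It is not the actual Maps check: the function name and the warning/critical thresholds are hypothetical, chosen only for illustration. The exit codes follow the standard Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_5xx_rate(total_requests: int, errors_5xx: int,
                   warn: float = 0.01, crit: float = 0.05) -> Tuple[int, str]:
    """Classify the share of 5xx responses among all requests.

    The default thresholds (1% warning, 5% critical) are assumptions
    for this sketch, not values taken from the Maps configuration.
    """
    if total_requests <= 0:
        # Without traffic the rate is undefined, so report UNKNOWN.
        return UNKNOWN, "UNKNOWN: no requests observed"
    rate = errors_5xx / total_requests
    if rate >= crit:
        return CRITICAL, f"CRITICAL: 5xx rate {rate:.2%}"
    if rate >= warn:
        return WARNING, f"WARNING: 5xx rate {rate:.2%}"
    return OK, f"OK: 5xx rate {rate:.2%}"
```

Whether a CRITICAL result wakes anyone up is decided by the contact groups attached to the check, not by the check itself.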

Non-critical services
These services are used for tile generation; they have no direct user impact.

| Service | Check description | Contact groups |
| ------- | ----------------- | -------------- |
| tilerator | HTTP check | admins |
| tileratorui | HTTP check | admins |
| postgres | no check yet | |

Event Timeline

We should probably create a maps team with @Yurik, @MaxSem for kartotherian and tilerator alerts.

We should add a service check for kartotherian using service_checker on the LVS IP, pretty much as we do for other services, see:

https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/lvs/manifests/monitor_services.pp

and probably we should make it page too.

As far as paging goes, I'd stick with having only public-facing services page us (so just kartotherian on the LVS).
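
The paging rule proposed above can be sketched as a small policy function. This is an illustration only: the `Check` class, the `public_facing` flag and the `team_group` default are hypothetical names, and it assumes every Maps check notifies both the ops group and the team group, which the ticket does not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    service: str
    description: str
    public_facing: bool  # hypothetical flag: does a failure hit users directly?


def contact_groups(check: Check, team_group: str = "team-interactive") -> List[str]:
    """Return contact groups under the 'only public-facing checks page' rule."""
    groups = ["admins", team_group]  # assumed baseline: notify ops and the team
    if check.public_facing:
        groups.append("sms")  # sms is the paging group
    return groups


lvs = Check("kartotherian LVS", "LVS check", public_facing=True)
tilerator = Check("tilerator", "HTTP check", public_facing=False)
```

Under this sketch `contact_groups(lvs)` includes `sms`, while `contact_groups(tilerator)` does not, so a tile-generation failure would notify without paging.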

Change 294498 had a related patch set uploaded (by Gehel):
Adding a "interactive-team" icinga group for alerting.

https://gerrit.wikimedia.org/r/294498

The LVS endpoints are checked. The check_http_lvs_on_port check does page.

Change 294503 had a related patch set uploaded (by Gehel):
Add interactive-team to default Icinga notification group for maps servers

https://gerrit.wikimedia.org/r/294503

Change 294507 had a related patch set uploaded (by Gehel):
Add the ability to configure contact group for check of services.

https://gerrit.wikimedia.org/r/294507

@Gehel, what does admins,sms,admins mean? I think the first 4 items are relevant to @MaxSem and myself: Cassandra not working, Kartotherian down, or too many 5xx.

@Yurik admins,sms,admins are the groups to which the alerts are sent (I have no idea why admins is there twice). admins is the ops group, sms is the pagers.
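
As a small illustration of how such a contact-group string can be read, the sketch below splits it, drops the repeated `admins` entry, and reports whether the check pages. The function names are hypothetical and this is not how Icinga itself processes the value; it only encodes the explanation above (sms is the paging group).

```python
from typing import List

PAGING_GROUP = "sms"  # per the comment above, sms is the pagers group


def parse_contact_groups(raw: str) -> List[str]:
    """Split a comma-separated group string, removing duplicates in order."""
    groups: List[str] = []
    for name in raw.split(","):
        name = name.strip()
        if name and name not in groups:
            groups.append(name)
    return groups


def pages(raw: str) -> bool:
    """Return True if alerts for this contact-group string reach the pagers."""
    return PAGING_GROUP in parse_contact_groups(raw)
```

For example, `parse_contact_groups("admins,sms,admins")` yields `["admins", "sms"]`, so the duplicate entry does not change who is notified.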

I'll check how to add you to those alerts.

Change 294498 merged by Gehel:
Adding a "interactive-team" icinga group for alerting.

https://gerrit.wikimedia.org/r/294498

Change 294676 had a related patch set uploaded (by Gehel):
Interactive team would like to be notified of issues with Maps.

https://gerrit.wikimedia.org/r/294676

Change 294676 merged by Gehel:
Interactive team would like to be notified of issues with Maps.

https://gerrit.wikimedia.org/r/294676

Change 294507 merged by Gehel:
Add the ability to configure contact group for check of services.

https://gerrit.wikimedia.org/r/294507

> We should add a service check for kartotherian using service_checker on the LVS IP, pretty much as we do for other services, see:
>
> https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/lvs/manifests/monitor_services.pp
>
> and probably we should make it page too.
>
> As far as paging goes, I'd stick with having only public-facing services page us (so just kartotherian on the LVS).

That suggestion looks good to me. I'll also add team-interactive to those alerts.

Change 294723 had a related patch set uploaded (by Gehel):
Team-interactive receives maps alerts

https://gerrit.wikimedia.org/r/294723

In terms of production support, we seem to be good to go once https://gerrit.wikimedia.org/r/#/c/294723/ is merged. The LVS check will page.

We can make further monitoring and alerting improvements (for example, better monitoring of non-critical services such as Postgres), but that is outside the scope of this ticket. See T135647 instead.

Change 294503 abandoned by Gehel:
Add interactive-team to default Icinga notification group for maps servers

Reason:
Replaced by https://gerrit.wikimedia.org/r/#/c/294723/, which gives finer-grained alerting to team-interactive.

https://gerrit.wikimedia.org/r/294503

Change 294723 merged by Gehel:
Team-interactive receives maps alerts

https://gerrit.wikimedia.org/r/294723