
Make primary DB masters page on HOST DOWN alert
Closed, Resolved, Public

Description

At the moment, a host going down does not page; it only sends an IRC alert.

While this might be OK for the rest of the infra, if a primary database master goes down, all the wikis on it automatically go read-only (in addition to replication breaking on the slaves).
In some cases, replication broken alerts can take up to 15 minutes to actually send an SMS, so we should page when a master goes down, as it needs immediate action.

We should also discuss whether we want ALL databases to page when they go down.

Related incident: https://wikitech.wikimedia.org/wiki/Incident_documentation/2019-09-23_s3_primary_db_master_crash

Event Timeline

Marostegui triaged this task as Medium priority. Sep 24 2019, 5:55 AM
Marostegui added a project: Wikimedia-Incident.
Marostegui moved this task from Triage to Backlog on the DBA board.

There is some interaction between this task and T252679 (although they are technically separate tickets). T252679 would solve this not by monitoring "Host X is down" but by changing the logic to "Section X is in read-only mode (probably because the primary server is down)", i.e. moving away from monitoring hosts and monitoring abstract services instead.

This ticket is a short-term solution; that one is a more long-term "model change". But I think it is useful to point to it here for architectural considerations.

I don't know how difficult that is, but from what we've seen it is not easy to page on HOST DOWN, whether for masters only or for all databases. Any help would be much appreciated.

Pretty sure I already made this possible at some point in the past. The define monitoring::host does have a $critical = false parameter as well, like services. Based on that the $real_contact_groups is set to either just $contact_group or, if critical=true, to "${contact_group},sms,admins".

This additional "sms" contact group does the trick. By social convention, members of the so-called sms contact group have set their notification options to send email to that special mail2SMS gateway. (This is speaking in Icinga terms, from before the VO and alertmanager world, but they build on top of this afaict.)
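To make the contact-group logic described above concrete, here is a minimal sketch in Python rather than Puppet. The function name `real_contact_groups` is borrowed from the variable mentioned above; the real define in the puppet repository may differ in details, so treat this as an illustration only.

```python
def real_contact_groups(contact_group: str, critical: bool = False) -> str:
    """Return the contact_groups string a host would end up with.

    Assumption: mirrors the behaviour described in this comment, not the
    actual Puppet code. Non-critical hosts keep their contact group as-is;
    critical hosts additionally get the "sms" and "admins" groups.
    """
    if critical:
        return f"{contact_group},sms,admins"
    return contact_group


# A non-critical host only notifies its own contact group.
print(real_contact_groups("admins"))
# A critical host also notifies "sms", which is what triggers the page.
print(real_contact_groups("admins", critical=True))
```

With a contact group of "admins" and critical=True this yields "admins,sms,admins", including the duplicated "admins" entry.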

Now where is this used? It moved just recently from the "base" module/profile to this place: modules/profile/manifests/monitoring.pp. This is included by everything. Here we also already pass through critical => $is_critical to the monitoring::host define.

# @param is_critical indicate this host is critical

Boolean $is_critical = lookup('profile::monitoring::is_critical'),

So all you should have to do is set profile::monitoring::is_critical to true in Hiera and be good.

Compare to the labweb (wikitech) example:

hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml:profile::monitoring::is_critical: true

You can actually check this by looking at alert1001:

[alert1001:/etc/icinga/objects] $ view puppet_hosts.cfg

    check_period                   24x7
    contact_groups                 admins,sms,admins
    host_name                      labweb1001

^ That host and a small number of others have the "sms" contact group, but almost all others do not.
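Rather than paging through the file by hand, the same check could be scripted. The following is a rough sketch that assumes the file uses the usual Icinga object syntax (`define host { ... }` blocks with one `key value` directive per line). The host `db1234` and the sample text are made up for illustration, and inline `;` comments are not handled.

```python
import re


def hosts_with_sms(config_text: str) -> list[str]:
    """List host_name values whose contact_groups include "sms"."""
    paging = []
    # Match only "define host { ... }" blocks, not hostgroups or services.
    for block in re.findall(r"define\s+host\s*\{(.*?)\}", config_text, flags=re.DOTALL):
        attrs = {}
        for line in block.splitlines():
            parts = line.strip().split(None, 1)  # directive name, then value
            if len(parts) == 2:
                attrs[parts[0]] = parts[1].strip()
        groups = [g.strip() for g in attrs.get("contact_groups", "").split(",")]
        if "sms" in groups and "host_name" in attrs:
            paging.append(attrs["host_name"])
    return paging


# Hypothetical sample; only the labweb1001 values come from the output above.
SAMPLE = """
define host {
    check_period                   24x7
    contact_groups                 admins,sms,admins
    host_name                      labweb1001
}
define host {
    check_period                   24x7
    contact_groups                 admins
    host_name                      db1234
}
"""

print(hosts_with_sms(SAMPLE))  # ['labweb1001']
```

On the real server one would read the text of puppet_hosts.cfg instead of `SAMPLE`.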

Change 735689 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] mariadb/icinga: page people if sanitarium master goes down

https://gerrit.wikimedia.org/r/735689

Change 735695 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] icinga: use display_name for a HOST to add '#page' string where applicable

https://gerrit.wikimedia.org/r/735695

Change 736415 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Set core db host monitoring to critical.

https://gerrit.wikimedia.org/r/736415

Marostegui moved this task from Refine to In progress on the DBA board.
Marostegui added a subscriber: Kormat.

I am going to assign this to @Kormat as she's spending time on this.

Change 735689 abandoned by Dzahn:

[operations/puppet@production] mariadb/icinga: page people if sanitarium master goes down

Reason:

replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/736415

https://gerrit.wikimedia.org/r/735689

Change 736415 merged by Kormat:

[operations/puppet@production] mariadb: Set important db host monitoring to critical.

https://gerrit.wikimedia.org/r/736415

Resolving this, as https://gerrit.wikimedia.org/r/736415 has been deployed and communicated.

Change 735695 abandoned by Dzahn:

[operations/puppet@production] icinga: use display_name for a HOST to add 'page' string where applicable

Reason:

https://phabricator.wikimedia.org/T236379#7517716

https://gerrit.wikimedia.org/r/735695