Review paging for WMCS systems
Closed, Resolved | Public | 0 Estimated Story Points

Description

Since we recently clarified the paging structure for prod icinga, we need to review our existing monitoring configurations to ensure they are set up to page the folks we intend to page.

This goes for basically every host we run.
Anything with critical => true pages the entire production team (regardless of the specified group), and that is effectively all it means. The default for all defined monitors is the "admin" group, which appears largely redundant with the "sms" group, which really should be named "sre" or "production-roots". If we do not intend to page that entire team on a given alert, it should not be "critical" and it should have contact_group => 'wmcs-team' set. This will generate an SMS page for our team only.
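
As a rough illustration of the two cases, here is a minimal sketch. It assumes checks are declared through a define along the lines of monitoring::service that accepts the critical and contact_group parameters named above. The resource titles, descriptions, and check commands are hypothetical placeholders, not real checks.

```puppet
# Hypothetical titles, descriptions, and check_command values; only the
# `critical` and `contact_group` parameters come from the task description.

# Case 1: pages the entire production team, whatever contact_group says.
monitoring::service { 'example-cloud-api':
    description   => 'Example cloud API availability',
    check_command => 'check_example_api',
    critical      => true,
}

# Case 2: not critical, so only the named contact group (our team) is paged.
monitoring::service { 'example-nfs-export':
    description   => 'Example NFS export health',
    check_command => 'check_example_nfs',
    critical      => false,
    contact_group => 'wmcs-team',
}
```

Under this reading, the review amounts to finding every check of the first kind on our hosts and deciding whether it should become the second kind.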

Since paging methods are determined entirely by the user contact definition, I would also propose we create a <username>-email contact for each team member that has only an email address, not a phone number/SMS gateway, for things we only want to stay informed about (such as most production alerts that aren't for cloud systems).

Event Timeline

Bstorm created this task.

The first action item is creating email-only users for the team, and I've volunteered to do that.
Then we are going to review everything that pages, remove "critical", and add our team.

Ok, each team member now has an email-only user configuration in icinga. The name is <icinganame>-email.

Change 528533 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] icinga: add wmcs-team-email for email-only alerts

https://gerrit.wikimedia.org/r/528533

Change 528535 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] icinga: switch up bstorm user a bit

https://gerrit.wikimedia.org/r/528535

Change 528535 merged by Bstorm:
[operations/puppet@production] icinga: switch up bstorm user a bit

https://gerrit.wikimedia.org/r/528535

Change 528533 merged by Bstorm:
[operations/puppet@production] icinga: add wmcs-team-email for email-only alerts

https://gerrit.wikimedia.org/r/528533

Change 528581 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] icinga: Set the WMCS host alerts to go only to WMCS

https://gerrit.wikimedia.org/r/528581

Change 528581 merged by Bstorm:
[operations/puppet@production] icinga: Set the WMCS host alerts to go only to WMCS

https://gerrit.wikimedia.org/r/528581

Change 528898 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] monitoring: change WMCS services to paging WMCS only

https://gerrit.wikimedia.org/r/528898

NOTE: We should pass a parameter through the puppetmaster class hierarchy so that the unmerged changes alert goes to the email-only group. Otherwise, that alert will always be annoying when someone is slow to merge.
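
One way the parameter passing could look is sketched below. The class names, the check title, and the check command are all hypothetical and do not reflect the actual puppetmaster classes. The 'admins' default and the wmcs-team-email group name are taken from elsewhere in this task, and it is an assumption that wmcs-team-email is usable as a contact group here.

```puppet
# Hypothetical class and check names; a sketch of the idea only.

# Outer class: accepts the contact group and hands it down.
class example_puppetmaster (
    String $contact_group = 'admins',
) {
    class { 'example_puppetmaster::unmerged_check':
        contact_group => $contact_group,
    }
}

# Inner class: the alert uses whatever group was passed in.
class example_puppetmaster::unmerged_check (
    String $contact_group = 'admins',
) {
    monitoring::service { 'example-unmerged-changes':
        description   => 'Unmerged changes on example puppetmaster',
        check_command => 'check_example_unmerged',
        contact_group => $contact_group,
    }
}

# A WMCS role would then override the default with the email-only group.
class { 'example_puppetmaster':
    contact_group => 'wmcs-team-email',
}
```

Keeping the default in place means production puppetmasters would see no change, while the WMCS ones would send this alert by email only.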

One thing to be mindful of here: admins = the operations channel feed. There may be checks where we do want that group and have not added it. Also, we could switch some hosts (like the puppetmasters) to wmcs-bots, which behaves exactly like the "non-critical" default (admins) except that it reports to our team feed instead of the production one.

Change 528905 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] monitoring: Switch a collection of WMCS alerts to email-only

https://gerrit.wikimedia.org/r/528905

Change 528898 merged by Bstorm:
[operations/puppet@production] monitoring: change WMCS services to paging WMCS only

https://gerrit.wikimedia.org/r/528898

Change 528905 merged by Bstorm:
[operations/puppet@production] monitoring: Switch a collection of WMCS alerts to email-only

https://gerrit.wikimedia.org/r/528905

Change 528965 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: restore original sense of the load alert with prometheus

https://gerrit.wikimedia.org/r/528965

Change 528965 merged by Bstorm:
[operations/puppet@production] labstore: restore original sense of the load alert with prometheus

https://gerrit.wikimedia.org/r/528965

Change 529970 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] monitoring: Change the showmount check from toolforge to be email-only

https://gerrit.wikimedia.org/r/529970

Change 529970 merged by Bstorm:
[operations/puppet@production] monitoring: Change the showmount check from toolforge to be email-only

https://gerrit.wikimedia.org/r/529970