
Revisit paging strategy for frack servers
Closed, Resolved (Public)

Description

The current status quo is that whenever an individual frack server dies, a page is sent to every opsen. What usually happens is that a bunch of people show up on IRC asking "what is alnilam?", eventually realize that it's a frack host, and then ignore it, as they can't do much about it anyway.

This paging strategy made sense years ago, but the org has since evolved in many ways: the TechOps team has grown to the point that most people don't know much about frack and don't have access to it, fr-tech is a vertical, and fr-tech-ops is now a team with some basic redundancy.

I think we should revisit our paging strategy for frack:

  • Should we page for individual server failures rather than service failures?
  • Should we page differently during different periods of the year (e.g. the busy end-of-year fundraising period)?
  • Who should we page? Are random opsens more valuable than e.g. fr-tech software engineers at this point?

My goal is to have something that makes sense and that leads people (fr-tech or not) to treat pages as serious, actionable events, rather than training them that pages are something they can ignore (i.e. alert fatigue).

Event Timeline

My suggestion:

  • stop sending frack host alerts to Tech Ops pagers
  • make a new contact group fr-tech-ops-sms to receive only critical alerts for frack hosts
  • stop sending non-fundraising host alerts to fr-tech-ops users
  • adjust fr-tech-ops pager duty to 24x7

Later:

  • make a new contact group fr-tech-sms to receive only critical alerts for frack
  • consider services to monitor for cross-team alerts
  • consider adding fr-tech to some of the fundraising host alerting, but nothing to start

relevant puppet code:

modules/monitoring/manifests/service.pp

# If a service is set to critical and
# paging is not disabled for this machine in hiera,
# then use the "sms" contact group which creates pages.
$do_paging = hiera('do_paging', true)

case $critical {
    true: {
        case $do_paging {
            true:    { $real_contact_groups = "${contact_group},sms,admins" }
            default: { $real_contact_groups = "${contact_group},admins" }
        }
    }
    default: { $real_contact_groups = $contact_group }
}

Adding the "sms" contact group means it becomes paging.

Not because there is a setting that directly says this group gets notified by SMS, though; "sms" is just a group which has (ops) people in it.

The name is actually a misnomer: each individual contact has settings for whether they get notified by email or by SMS, not the group.

The other group, "admins", is also badly named: it has just one member, the special "irc" user, which enables bot output on IRC.

The whole "critical => true" in monitor::service's actually just means "add this group of people as contacts" and not more. Then this group of people happens to have set their notification options as host-notify-by-email,host-notify-by-sms-gateway. That's a setting with each contact (user) in the private repo.

We would get paged for everything if we were contacts for everything, but we are only contacts for "critical" services.
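
To make that concrete, a contact group and a contact in the private repo look roughly like the following sketch (the member names, addresses, and the service notification command names are illustrative assumptions, not the real entries):

define contactgroup{
        contactgroup_name   sms
        alias               ops people who get paged       ; illustrative
        members             jdoe,rroe                      ; illustrative member list
        }

define contact{
        contact_name                    jdoe               ; illustrative
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u,r
        service_notification_options    c,r
        host_notification_commands      host-notify-by-email,host-notify-by-sms-gateway
        service_notification_commands   notify-by-email,notify-by-sms-gateway   ; command names assumed
        email                           jdoe@example.org
        pager                           jdoe@sms-gateway.example.org
        }

The group only decides who is a contact for a check; the notification commands and time periods on the contact decide how and when that person actually gets notified.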

So regarding the suggestions:

"stop sending frack host alerts to tech ops pagers" would mean: In modules/monitoring/manifests/service.pp (paste above) inside the "case $do_paging" there needs to be another case for "if on a fundraising host", and/or in base, in modules/profile/manifests/base.pp / modules/base/manifests/monitoring/host.pp (class base::monitoring::host) it needs logic to know whether it is on a frack host or not.

"make a new contact group fr-tech-ops-sms to receive only critical alerts for frack hosts" and "stop sending non-fundraising host alerts to fr-tech-ops users" would be the same switch.

"adjust fr-tech-ops pager duty to 24x7"" - this would also be setting of each individual contact (user) and groups are merely collections of users

Summary: changing the contact group is always just about _who_ gets notified; _how_ somebody gets notified is their own choice at the user level.

That's why I mentioned the option of having two users, with different notification options and different group memberships.

My suggestion would be:

  • (optional) rename group "sms" to "core-ops" (or maybe "core-ops-sms") since it specifies a list of people, not a notification method, or at the least a combination of people and notification method
  • (optional) rename group "admins" to "bots" (because that's what it is, the IRC bot, for less confusion)
  • create a new group "fr-tech-ops-sms" as suggested by Jeff and determine which people should be members; everyone in it needs individual notification options with SMS on their contact, and each contact can set their notification time period to 24x7 or something else. Somebody who is also in other non-FR groups but wants different settings for FR hosts vs. other hosts (for example "mail me about all core-ops things but only page me if it's frack") should have two users, one for each purpose, and join groups accordingly.
  • make puppet changes in base that let puppet differentiate between "I'm on a frack host" and not (FQDN / hiera / facter?); see the sketch after this list
    • based on that append group fr-tech-ops-sms to existing contact groups where icinga monitoring hosts are created
  • (later) make more puppet changes in specific roles where service checks are added and add "if FR then contactgroup =" to also do this for services and not just "host down"
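
A rough sketch of the base change, again assuming the hypothetical is_frack_host hiera key and assuming monitoring::host accepts a contact_group parameter (the parameter name is illustrative, not checked against the current module):

class base::monitoring::host (
    $contact_group = 'admins',
) {
    # hypothetical key; could equally be derived from the FQDN or a facter fact
    $is_frack_host = hiera('is_frack_host', false)

    $real_contact_groups = $is_frack_host ? {
        true    => "${contact_group},fr-tech-ops-sms",
        default => $contact_group,
    }

    monitoring::host { $::hostname:
        contact_group => $real_contact_groups,   # parameter name assumed
    }
}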

@Dzahn thanks for the many clarifications! I think I understand. So as of today if "sms" does not show up in contact_groups for a host or service, individual Ops don't get email or sms notification. If that's correct, we're much closer than I initially thought.

We already have fr-tech-ops as a contact_group, and I think we don't need a separate fr-tech-ops-sms group.

Frack hosts are configured with passive collection only, in nsca_frack.cfg, and hosts in that file all have:

contact_groups admins,sms,fr-tech-ops

We can change that to just "admins,fr-tech-ops", which keeps the IRC alerts and keeps Casey & Jeff getting both email and SMS as we do now.
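
Per host entry in nsca_frack.cfg, the change is then just the contact_groups line, roughly (the template and other directives here are illustrative, not the real file contents):

define host{
        use             generic-host            ; template name assumed
        host_name       alnilam
        contact_groups  admins,fr-tech-ops      ; was: admins,sms,fr-tech-ops
        }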

Then we can adjust our alert timeframes to 24x7 and remove me from "sms" so I don't get 24x7 main platform alerts.

I removed 'sms' from notification for frack hosts, and changed myself to 24x7.

...and removed myself from 'sms'

Another question: does it make sense to move IRC notifications out of #wikimedia-operations and into Wikimedia-Fundraising? I'm not sure of the mechanics of doing that; would we need another bot, or can the current bot handle multiple channels?

So as of today if "sms" does not show up in contact_groups for a host or service, individual Ops don't get email or sms notification. If that's correct, we're much closer than I initially thought.

Yep, that's correct. In the past the admins group may have also had root@ in it, so there would have been separate emails for everything, but that's not the case anymore.

move IRC notifications out of #wikimedia-operations and into Wikimedia-Fundraising? I'm not sure of the mechanics of doing that, would we need another bot or can the current bot handle multiple channels?

The current bot can handle multiple channels and already does. There is a custom notification command for each channel. I have done it for other channels before; I'll add it for you.

Change 349255 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] nagios_common: add notification command for fundraising irc

https://gerrit.wikimedia.org/r/349255

The way the custom IRC notifications work:

  • add a special notification command which writes to a new logfile (https://gerrit.wikimedia.org/r/349255); see the sketch after this list
  • add a special icinga contact (naming scheme: irc-$channel) and make it use the new notification command ([x] done - added contact "irc-fundraising" in the private repo, /srv/private/modules/secret/secrets/nagios/contacts.cfg). It uses "host_notification_commands notify-host-by-irc-fundraising" and "service_notification_commands notify-service-by-irc-fundraising", which were added above
  • add the special contact to an existing contactgroup, like fr-tech (https://gerrit.wikimedia.org/r/349259)
  • configure the IRC bot with which logfile it should watch and which channel to output into (https://gerrit.wikimedia.org/r/349259)
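
Put together, the pieces look roughly like this sketch (the command_line values and notification options are simplified assumptions, not the exact templates from the puppet module):

# simplified: append the alert text to the logfile the IRC bot tails
define command{
        command_name    notify-host-by-irc-fundraising
        command_line    /usr/bin/printf "%b\n" "$HOSTNAME$ is $HOSTSTATE$: $HOSTOUTPUT$" >> /var/log/icinga/irc-fundraising.log
        }

define command{
        command_name    notify-service-by-irc-fundraising
        command_line    /usr/bin/printf "%b\n" "$HOSTNAME$ $SERVICEDESC$ is $SERVICESTATE$: $SERVICEOUTPUT$" >> /var/log/icinga/irc-fundraising.log
        }

# the special contact only uses the IRC commands, so it never mails or pages anyone
define contact{
        contact_name                    irc-fundraising
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u,r
        service_notification_options    c,r
        host_notification_commands      notify-host-by-irc-fundraising
        service_notification_commands   notify-service-by-irc-fundraising
        }

Writing to a logfile keeps Icinga decoupled from IRC; the bot just tails the file and relays new lines into the channel.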

Change 349259 had a related patch set uploaded (by Dzahn):
[operations/puppet@production] nagios_common: add IRC notifications for Wikimedia-Fundraising

https://gerrit.wikimedia.org/r/349259

Change 349255 merged by Dzahn:
[operations/puppet@production] nagios_common: add notification command for fundraising irc

https://gerrit.wikimedia.org/r/349255

Change 349259 merged by Dzahn:
[operations/puppet@production] nagios_common: add IRC notifications for Wikimedia-Fundraising

https://gerrit.wikimedia.org/r/349259

12:09 -!- icinga-wm [~icinga-wm@tegmen.wikimedia.org] has joined #wikimedia-fundraising
12:11 < icinga-wm> test for T163368

The second line was created with:

root@tegmen:/var/log/icinga# echo "test for T163368" > irc-fundraising.log

Test via the actual Icinga web UI: select a random fundraising host and use "send custom notification" from the drop-down menu:

12:20 < icinga-wm> CUSTOM - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms

works

I removed 'admins' from contact_groups for frack hosts, so we should stop seeing frack host alerts in #wikimedia-operations; frack host alerts should no longer end up in any of the normal TechOps-monitored places. At the moment I don't see any icinga alerts that make sense to send to TechOps, but it's easy to adjust in nsca_frack.cfg later. Unless anyone has anything to tweak, it's fine with me to close this ticket.

Jgreen claimed this task.

I think we resolved the core issues of this task, so I'm closing it.