
Determine & implement near-term method for escalating network alerts
Open, Medium, Public

Description

We had a discussion today at the weekly SRE infra foundations meeting about promoting fastnetmon "Possible DDOS" notifications from email to something more visible (IRC? Paging?). This led to a discussion about escalating network alerts to improve visibility in general. However, consensus was not yet reached about how to best implement noisier alert notifications from something that isn't icinga.

Creating this tracking task to further explore options to improve visibility of important network alerts and implement something in the near-term, with the understanding that larger scale efforts are under way to optimize alert escalation/notification for the long-term, and that these alerts will eventually be migrated to that flow. In other words, focusing on the low hanging fruit.

Event Timeline

herron triaged this task as Medium priority. Nov 6 2019, 10:37 PM
herron created this task.
Restricted Application added a subscriber: Aklapper. Nov 6 2019, 10:37 PM

In terms of “what” should be escalated, so far we discussed

  • Fastnetmon “Potential DDOS”
  • Interface saturation

What else is in scope here?

In terms of “how” I can think of a few options for starters

  1. Send alerts to IRC with a hashtag in the alert text. Coordinate with SRE on adding this hashtag to everyone's mention notifications.
  2. Gather metrics relevant to network problems and build alerts using the existing Icinga Prometheus query and Grafana dashboard checks.
  3. Fire email alerts directly to a mail alias containing the email-to-SMS addresses for the team.
  4. Reuse a portion of the meta-monitoring to deliver email-to-SMS alerts honoring the schedule defined in Icinga.

Generally speaking I do question if we need to add new SMS pages just yet. Of course we should add SMS pages where necessary, but also we should have high confidence that SMS alerts will be actionable, that corresponding IRC notifications have been sent for context within the channel, that documentation has been written up for the alert, etc.

In my opinion the metrics and hashtag/mention approach (1+2 above) would strike a good balance, and would position us to expand from there as needed.

ayounsi added a subscriber: Jgreen. Nov 7 2019, 12:04 AM

Interface saturation

See also T224888

What else is in scope here?

That's everything I have in mind right now.

In terms of “how” I can think of a few options for starters

For Fastnetmon, I was previously thinking of:

  1. When FNM detects a potential DDoS, the fastnetmon_notify.py script writes (or touches) a file.
  2. In parallel, have an Icinga NRPE script that checks for that file and returns CRITICAL when it is present.
  3. Then it's regular Icinga for escalation etc...
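
The three steps above could be sketched as a small Icinga/NRPE plugin. This is only an illustration under stated assumptions: the flag file path, the 15-minute freshness window, and the function names are hypothetical and do not come from fastnetmon_notify.py or any existing check. Only the exit codes (0 for OK, 2 for CRITICAL) follow the standard Nagios plugin convention.

```python
#!/usr/bin/env python3
"""Hypothetical NRPE check: CRITICAL while a fastnetmon flag file is fresh."""
import os
import time

# Assumed values; neither the path nor the window is defined in this task.
FLAG_FILE = "/var/run/fastnetmon/ddos_detected"
MAX_AGE_SECONDS = 900

OK, CRITICAL = 0, 2  # standard Nagios/Icinga plugin exit codes


def check_flag(path=FLAG_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """Return (exit_code, message) based on the flag file's modification time."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
    except FileNotFoundError:
        return OK, "OK: no fastnetmon flag file present"
    if age <= max_age:
        return CRITICAL, f"CRITICAL: fastnetmon flagged a potential DDoS {int(age)}s ago"
    return OK, f"OK: last fastnetmon flag is stale ({int(age)}s old)"


def main():
    """Print the status line and return the plugin exit code."""
    code, message = check_flag()
    print(message)
    return code
```

An NRPE command would run this with `sys.exit(main())` as its entry point. Using the file's age rather than its mere existence means the alert recovers on its own once the notify script stops touching the file, so nothing has to delete it.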

I discussed passive checks with @Jgreen as they use them a lot in Fundraising, but they seem too complex for our use case.

The community version of FNM doesn't support Prometheus exports, only statsd iirc.

LibreNMS supports a bunch of transports. Basic IRC is already set up; the Icinga transport only works if LibreNMS runs on the same machine as Icinga (afaik).

I agree that all alerts need to be actionable, this might mean raising the FNM thresholds for example (or only paging on user impacting issues).

Volans added a subscriber: Volans. Nov 8 2019, 9:44 AM

I'd rather not do (3), seems a step back (not respecting awake hours and such).

Regarding (1), we already have a proposal from the last SRE summit, and some of us are already using it. It is orthogonal to (2), though, since (2) would already be a complete solution.

In case we go with (4) let me know and I can help with the setup of the code.

(2) seems the more natural solution if there is an easy way to export the data. The other approach would be to improve the monitoring and alerting on the LVSes so as to catch more or less the same things from there.

(2) to me seems the way to go as it would integrate best with our existing workflows. With an eye pointed at low hanging fruits though I'm wondering how much work the integration would be and if such work would be worth it.

(4) sounds attractive to me as a generic enough solution, "send an email to this address to page SRE". How much work do you think it is likely to be, @Volans? We'd lose the 1:1 mapping between pages and IRC notifications, but I think as long as alerts@ is (B)CC'd we're ok.

Friendly ping to @Volans about @fgiunchedi's question above

@herron @fgiunchedi I don't think that much. I guess you have to do the triggering part; I'm not super clear what you have in mind, a script to run from somewhere or what. I'd be careful with an email alias as it could be easily abused.

As for the rest it should boil down to:

  • Extend [1] to sync the contacts to another place (maybe localhost on the icinga hosts directly?) unless you plan to re-use the meta-monitoring host. In that case no need to change anything. That script calls [2] to parse and convert Icinga contact list into a more user friendly format that is consumed by the current meta monitoring. I don't think [2] requires any modifications in the first iteration (it collects only ops contacts).
  • Extract part of the code of the meta monitoring icinga check into a module so that it can be consumed by both the current meta-monitoring and this new tool. Most likely just moving the SmtpNotifier class in [3] might be enough.
  • Decide on the transport layer:
    • if the code will live in production that means using our current SMTP, easy but adds an internal dependency. It should be ok if the system will be monitored externally by the meta-monitoring itself, like we monitor Icinga, to make sure we don't fly blind.
    • if the code will live in the meta-monitoring host we can re-use the SMTP config setup there. The pro is being out of our infra, the con is being out of our infra 😉. We probably don't want a streamline alerting system to rely on the meta-monitoring but instead be part of our infra while monitored externally IMHO.

[1] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/icinga/files/sync_check_icinga_contacts.sh
[2] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/icinga/files/generate_check_icinga_contacts.py
[3] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/external-monitoring/+/master/icinga/check_icinga.py
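
To make option (4) more concrete, here is a rough sketch of the "page only the contacts whose schedule covers the current time" idea. Everything in it is an assumption for illustration: the Contact fields, the example.org addresses, and the build_page helper are made up and do not reflect the actual SmtpNotifier class in [3] or the contact format produced by [2].

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone
from email.message import EmailMessage


@dataclass
class Contact:
    """Hypothetical contact record; real Icinga contacts look different."""
    name: str
    sms_email: str  # email-to-SMS gateway address (placeholder values only)
    start: time     # start of the paging window, UTC
    end: time       # end of the paging window, UTC

    def is_awake(self, now: time) -> bool:
        if self.start <= self.end:
            return self.start <= now < self.end
        # Window wraps past midnight, e.g. 22:00-06:00.
        return now >= self.start or now < self.end


def build_page(contacts, subject, body, now=None, cc="alerts@example.org"):
    """Build a page for contacts inside their window, or None if there are none."""
    now = now or datetime.now(timezone.utc)
    recipients = [c.sms_email for c in contacts if c.is_awake(now.time())]
    if not recipients:
        return None
    msg = EmailMessage()
    msg["From"] = "paging@example.org"  # placeholder sender
    msg["To"] = ", ".join(recipients)
    msg["Cc"] = cc  # keeps a copy for context, as suggested for alerts@
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

Sending is left out on purpose, since the transport is still undecided above; with the standard library it would be `smtplib.SMTP(host).send_message(msg)` against whichever SMTP server is chosen.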

Thanks! I think we should go with (2) (i.e. investigate integration between icinga (or grafana alerts, and from there icinga checks) for fastnetmon and librenms) so we get all niceties like irc, silence/acknowledge, contact groups etc

fgiunchedi moved this task from Inbox to Up next on the observability board. Nov 25 2019, 1:53 PM

FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188

CDanis claimed this task. Thu, Dec 19, 3:12 PM