
Determine & implement near-term method for escalating network alerts
Closed, ResolvedPublic


We had a discussion today at the weekly SRE infra foundations meeting about promoting fastnetmon "Possible DDOS" notifications from email to something more visible (IRC? Paging?). This led to a discussion about escalating network alerts to improve visibility in general. However, consensus was not yet reached about how to best implement noisier alert notifications from something that isn't icinga.

Creating this tracking task to further explore options to improve visibility of important network alerts and implement something in the near-term, with the understanding that larger scale efforts are under way to optimize alert escalation/notification for the long-term, and that these alerts will eventually be migrated to that flow. In other words, focusing on the low hanging fruit.

Event Timeline

herron triaged this task as Medium priority. Nov 6 2019, 10:37 PM
herron created this task.

In terms of “what” should be escalated, so far we discussed

  • Fastnetmon “Potential DDOS”
  • Interface saturation

What else is in scope here?

In terms of “how” I can think of a few options for starters

  1. Send alerts to IRC with a hashtag in the alert text. Coordinate with SRE on adding this hashtag to everyone's mention notifications.
  2. Gather metrics relevant to network problems and build alerts using the existing Icinga Prometheus query and Grafana dashboard checks.
  3. Fire email alerts directly to a mail alias containing the email-to-SMS addresses for the team.
  4. Reuse a portion of meta-monitoring to deliver email-to-SMS alerts honoring the schedule defined in Icinga.
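As a rough illustration of options (3) and (4), here is a minimal sketch of delivering an email-to-SMS page while honoring an on-call schedule. All addresses, the schedule format, and the local SMTP relay are assumptions for illustration, not the actual Icinga contact configuration:

```python
import smtplib
from email.message import EmailMessage
from datetime import datetime

# Hypothetical on-call windows per email-to-SMS address, as (start_hour,
# end_hour) in UTC -- loosely mirroring per-contact notification periods
# as Icinga would define them.
ONCALL_SCHEDULE = {
    "alice@sms-gateway.example": (8, 20),   # 08:00-20:00
    "bob@sms-gateway.example": (20, 8),     # 20:00-08:00 (wraps midnight)
}

def is_on_call(window, now_hour):
    """True if now_hour falls inside the (start, end) window."""
    start, end = window
    if start <= end:
        return start <= now_hour < end
    return now_hour >= start or now_hour < end  # window wraps midnight

def page_recipients(schedule, now_hour):
    """Addresses whose on-call window covers the current hour."""
    return [addr for addr, win in schedule.items() if is_on_call(win, now_hour)]

def send_page(subject, body, now=None):
    """Email the currently on-call recipients via a local SMTP relay."""
    now = now or datetime.utcnow()
    recipients = page_recipients(ONCALL_SCHEDULE, now.hour)
    if not recipients:
        return []
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@example.org"  # hypothetical alias, kept in CC for context
    msg["To"] = ", ".join(recipients)
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local relay is available
        smtp.send_message(msg)
    return recipients
```

Option (3) would be the same without the schedule check; option (4) is essentially this plus syncing the schedule from Icinga rather than hardcoding it.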

Generally speaking, I question whether we need to add new SMS pages just yet. Of course we should add SMS pages where necessary, but we should also have high confidence that SMS alerts will be actionable, that corresponding IRC notifications have been sent for context within the channel, that documentation has been written up for the alert, and so on.

In my opinion the metrics and hashtag/mention approach (1+2 above) would strike a good balance, and would position us to expand from there as needed.

> Interface saturation

See also T224888

> What else is in scope here?

That's everything I have in mind right now.

> In terms of “how” I can think of a few options for starters

For Fastnetmon, I was previously thinking of:

  1. When FNM detects a potential DDoS, its script writes (or touches) a file.
  2. In parallel, have an Icinga NRPE script that checks for that file and returns CRITICAL when needed.
  3. Then regular Icinga handles escalation, etc.

I discussed passive checks with @Jgreen as they use them a lot in Fundraising, but they seem too complex for our use case.

The community version of FNM doesn't support Prometheus exports, only StatsD, IIRC.

LibreNMS supports a bunch of transports. Basic IRC is already set up, but the Icinga transport only works if LibreNMS runs on the same machine as Icinga (AFAIK).

I agree that all alerts need to be actionable; this might mean raising the FNM thresholds, for example (or only paging on user-impacting issues).

I'd rather not do (3); it seems a step back (not respecting awake hours and such).

Regarding (1), we already have a proposal from the last SRE summit, and some of us are already using it, but it's orthogonal if we do (2), as (2) would already be a complete solution on its own.

In case we go with (4), let me know and I can help with setting up the code.

(2) seems the most natural solution if there is an easy way to export the data. The other approach would be to improve the monitoring and alerting on the LVSes so as to catch more or less the same things from there.

(2) to me seems the way to go, as it would integrate best with our existing workflows. With an eye toward low-hanging fruit, though, I'm wondering how much work the integration would be and whether it would be worth it.

(4) sounds attractive to me as a generic enough solution ("send an email to this address to page SRE"); how much work do you think it's likely going to be, @Volans? We'd lose the 1:1 mapping between pages and IRC notifications, but I think we're OK as long as alerts@ is (B)CC'd.

@herron @fgiunchedi Not that much, I think, though you'd have to do the triggering part. I'm not entirely clear on what you have in mind: a script to run from somewhere, or something else? I'd be careful with an email alias, as it could easily be abused.

As for the rest, it should boil down to:

  • Extend [1] to sync the contacts to another place (maybe localhost on the Icinga hosts directly?), unless you plan to re-use the meta-monitoring host, in which case there is no need to change anything. That script calls [2] to parse and convert the Icinga contact list into a more user-friendly format that is consumed by the current meta-monitoring. I don't think [2] requires any modifications in the first iteration (it collects only ops contacts).
  • Extract part of the code of the meta-monitoring Icinga check into a module so that it can be consumed by both the current meta-monitoring and this new tool. Most likely just moving the SmtpNotifier class in [3] would be enough.
  • Decide on the transport layer:
    • if the code will live in production, that means using our current SMTP: easy, but it adds an internal dependency. That should be OK if the system is monitored externally by the meta-monitoring itself, like we monitor Icinga, to make sure we don't fly blind.
    • if the code will live on the meta-monitoring host, we can re-use the SMTP config set up there. The pro is being out of our infra; the con is being out of our infra 😉. We probably don't want a streamlined alerting system to rely on the meta-monitoring, but rather to be part of our infra while monitored externally, IMHO.

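The contact-sync step above could be sketched roughly as follows. The real [1]/[2] scripts aren't shown in this task, so the Icinga config excerpt, field names, and output format here are all assumptions for illustration:

```python
import re

# Hypothetical excerpt of an Icinga contacts.cfg
SAMPLE = """
define contact {
    contact_name    alice
    email           alice@example.org
    pager           15551234567@sms-gateway.example
}
define contact {
    contact_name    bob
    email           bob@example.org
}
"""

def parse_contacts(text):
    """Convert Icinga 'define contact' blocks into {name: {field: value}}."""
    contacts = {}
    for block in re.findall(r"define contact\s*\{(.*?)\}", text, re.S):
        fields = dict(
            line.split(None, 1)
            for line in (raw.strip() for raw in block.splitlines())
            if line and not line.startswith("#")
        )
        name = fields.pop("contact_name", None)
        if name:
            contacts[name] = fields
    return contacts

def sms_addresses(contacts):
    """Pager (email-to-SMS) addresses for contacts that define one."""
    return sorted(c["pager"] for c in contacts.values() if "pager" in c)
```

In the real setup this "more user-friendly format" would be whatever the current meta-monitoring already consumes; the point is only that the parsing/conversion step is small and reusable.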

Thanks! I think we should go with (2) (i.e. investigate integration between Icinga (or Grafana alerts, and from there Icinga checks) and fastnetmon/LibreNMS) so we get all the niceties like IRC, silence/acknowledge, contact groups, etc.

Change 567093 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] fastnetmon: set a very short ban_time

Change 567093 merged by CDanis:
[operations/puppet@production] fastnetmon: set a very short ban_time

Change 570509 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] fastnetmon: connect via NRPE to Icinga

Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:12:43Z] <cdanis> ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "disable-puppet 'rollout of I60692f0e8 T237587 cdanis'"

Change 570509 merged by CDanis:
[operations/puppet@production] fastnetmon: connect to Icinga via NRPE

Mentioned in SAL (#wikimedia-operations) [2020-02-06T19:23:31Z] <cdanis> manual puppet run on netflow1001 looked good; ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin A:netflow "run-puppet-agent --enable 'rollout of I60692f0e8 T237587 cdanis'"

Change 571341 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] fastnetmon NRPE: page on FNM alerts

Change 571341 merged by CDanis:
[operations/puppet@production] fastnetmon NRPE: page on FNM alerts & tweak name

I think we can call this closed? LibreNMS and Fastnetmon both send pages (via Icinga) quite well now.