
Send Grafana alerts to IRC #wikimedia-perf
Closed, Resolved · Public

Description

Also send alerts to the IRC performance channel.

Event Timeline

Krinkle triaged this task as Medium priority. Dec 15 2016, 10:05 PM

Proposing the following implementation plan:

  • Build a simple web service that implements the webhook interface and stores each incoming message in a Redis list (Node.js, perflogrepo, Tool Labs, Kubernetes); a rough sketch follows this list.
  • Re-work the existing perflogbot to also send messages from this list to the #wikimedia-perf-bots channel.
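
A very rough sketch of what that receiver could look like, assuming Express and ioredis (neither is decided, and the route name and port are placeholders); the Grafana payload is just stored as-is for now:

```
// Hypothetical sketch: accept Grafana's webhook notifications and queue
// the raw payloads in a Redis list for later processing.
const express = require('express');
const Redis = require('ioredis');

const app = express();
const redis = new Redis(); // assumes a local Redis instance

app.post('/grafana-webhook', express.json(), async (req, res) => {
  try {
    // Keep the payload untouched; formatting for IRC happens elsewhere.
    await redis.lpush('webhook-inbox', JSON.stringify(req.body));
    res.sendStatus(202);
  } catch (err) {
    res.sendStatus(500);
  }
});

app.listen(8080);
```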

Further elaboration later on (maybe?):

  • Move processing and formatting of webhook data from Grafana into a separate process that only produces messages into another Redis list (e.g. "irc-outgoing").
  • Also move current ResourceLoader-related polling into a separate process that also stores its messages in the "irc-outgoing" list.

This should help keep things more isolated, so that messages are preserved when the IRC bot needs to be rebooted (which happens from time to time due to net splits), or if for some other reason it is unable to join the channel for a while.

Resulting design:

  1. Webhooks receiver (producer for "webhook-inbox").
  2. ResourceLoader manifest poller (producer for "irc-outgoing").
  3. Grafana alerts formatter (consumes "webhook-inbox", producer for "irc-outgoing"; rough sketch below).
  4. Simple IRC bot (consumes "irc-outgoing").

These can run together in the same Kubernetes deployment and be kept online by it as separate processes.
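
For step 3 (the alerts formatter), a minimal sketch, again assuming ioredis; the payload fields used here (state, title, ruleUrl) are assumed from Grafana's webhook notification format and should be checked against real payloads:

```
// Hypothetical sketch: consume queued Grafana payloads from "webhook-inbox"
// and produce one-line IRC messages into "irc-outgoing".
const Redis = require('ioredis');
const redis = new Redis();

async function run() {
  for (;;) {
    // BRPOP blocks until a payload is available.
    const [, raw] = await redis.brpop('webhook-inbox', 0);
    const alert = JSON.parse(raw);
    // Field names assumed from Grafana's webhook payload; adjust as needed.
    const line = `[${alert.state}] ${alert.title} <${alert.ruleUrl}>`;
    await redis.lpush('irc-outgoing', line);
  }
}

run();
```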

A few links I read about using a Redis list as a reliable message queue: http://softwareengineering.stackexchange.com/a/339769/75240, http://big-elephants.com/2013-09/building-a-message-queue-using-redis-in-go/, http://big-elephants.com/2013-10/tuning-redismq-how-to-use-redis-in-go/, https://danielkokott.wordpress.com/2015/02/14/redis-reliable-queue-pattern/.
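
The pattern from those links amounts to moving each message into a per-consumer "processing" list with BRPOPLPUSH and only removing it after delivery succeeds, so nothing is lost if the bot dies mid-send. A sketch of the consuming side, with the actual IRC send left as a stub:

```
// Hypothetical sketch of the reliable-queue pattern on the IRC bot side.
const Redis = require('ioredis');
const redis = new Redis();

// Stub: the real implementation would call into perflogbot's IRC client.
async function sendToIrc(message) {
  throw new Error('not implemented');
}

async function run() {
  for (;;) {
    // Atomically move the next message into a "processing" list and return it.
    const msg = await redis.brpoplpush('irc-outgoing', 'irc-processing', 0);
    try {
      await sendToIrc(msg);
      // Only drop the message once it has actually been delivered.
      await redis.lrem('irc-processing', 1, msg);
    } catch (err) {
      // Leave it in "irc-processing" so it can be re-queued after a
      // restart or net split instead of being lost.
    }
  }
}

run();
```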

That seems like a lot of work, and Icinga provides that for free. I think Ops may take just as much offense at us working around Icinga that way as they would if we did the same for email.

I would be more keen to make something very simple based on webhooks for debugging purposes (like a rudimentary page that records all alert hits), and once we find values that work, convert them to Icinga, where we get the benefits of email, IRC, etc. for free. But even that might be overkill, since Grafana already lets us check whether an alert triggered or not. The information is there; that should be enough to design the alert metrics and thresholds.

What about using something like https://github.com/marvinpinto/irc-hooky to just make a POC? I just like getting notified. Check out the alert happening at 19:00: in the graph it isn't near the limit, so it needs some fine tuning and understanding.

If that's easy to deploy, yes, that seems like a better solution. https://irchooky.org/deployment.html looks easy enough. One config file to edit and one command to deploy it on AWS.

As for the alert you've taken a screenshot of, everything is shifted by an hour on the graph. The alert is only evaluated once per hour, which might be why.

Ah, I see what you mean. Even taking into account the one-hour shift, it looks like the 19:00 alert was based on an absolute value, when you've only set an upper bound. I wonder if it has to do with the fact that the metric is a percentage and that it's cast to an integer for the alert, i.e. "10%" becomes 10, but "-10%" also becomes 10.

Never mind, the more I stare at it, the less sense it makes. We indeed need the full contents of the alerts to have any hope of finding out what's going on here.

I noticed the setting "If execution error or timeout" is set to "alerting". Maybe turn that off?

Yes, let me do that, but I think it's something else: I see the same behaviour on my test server (I alert to Slack). We can also split things up and just run one alert per dashboard, to make it easier to see and understand in the beginning.

I can set up https://irchooky.org/ tomorrow, or do you want to try, @Gilles?

Go ahead; Félix is sick and I probably wouldn't have enough time tomorrow.

I tried it with the demo, sending a webhook, but it never reached the channel. It seems it's tied to GitHub/Atlas webhooks and isn't generic? Will look into it more.

Hmm yeah, it seems like it requires writing an adapter for the Grafana webhook.

I'll check if there's another out of the box solution.

@Gilles Yeah, given that we're going with Icinga, I suppose it's okay not to have IRC alerts for the drafted alerts not yet in Icinga. The web interface should suffice for debugging in the short term.

However, I agree with Peter that notifications would probably help the drafting process, as we're otherwise more likely to forget or miss some outliers in the data and subsequently not adapt the alert query as much as we should.

I think we'll have the answer to those questions about notifications once we figure out if the Grafana web UI is enough to create draft alerts or not. Peter ran into some strange things earlier in this task that I'm not sure we have an answer to yet.

I'll talk to one of the Grafana devs tomorrow and show what I've done so far; maybe he can help me with the case where the graph doesn't show a regression but we get an alert. Also, one thing we could do is a PR to send alerts to IRC, which would help us and the community. It doesn't seem like too much work (famous last words).

We have alerts now in #wikimedia-perf-bots for a few things (via Icinga/Nagios).