Also send alerts to the IRC performance channel.
|Open||None||T140942 Tracking: Monitoring and alerts for "business" metrics|
|Resolved||Gilles||T153166 Set up Grafana alerts for Web Performance metrics|
|Resolved||Peter||T153168 Send Grafana alerts to IRC #wikimedia-perf|
Proposing the following implementation plan:
- Build a simple web service that implements the webhook interface and stores each message in a Redis list (Node.js, perflogrepo, Tool Labs, Kubernetes).
- Re-work the existing perflogbot to also send messages from this list to the #wikimedia-perf-bots channel.
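To make the webhook step concrete, here is a sketch of the formatting a receiver like this would do before LPUSHing into the Redis list. It assumes Grafana's legacy webhook payload fields (`state`, `ruleName`, `title`, `message`, `ruleUrl`); the exact field names should be verified against a real POST body before relying on them.

```javascript
// Turn one Grafana alert webhook payload into a single IRC-safe line.
// Field names follow Grafana's legacy webhook payload and are assumptions.
function formatAlert(payload) {
  const state = (payload.state || 'unknown').toUpperCase();
  const parts = [`[${state}]`, payload.ruleName || payload.title || 'unnamed alert'];
  if (payload.message) parts.push('-', payload.message);
  if (payload.ruleUrl) parts.push(payload.ruleUrl);
  // IRC messages are single-line; collapse any embedded newlines.
  return parts.join(' ').replace(/\s*\n\s*/g, ' ');
}

// Example: the string the service would LPUSH into the Redis list.
console.log(formatAlert({
  state: 'alerting',
  ruleName: 'First paint regression',
  message: 'firstPaint p75 above threshold',
  ruleUrl: 'https://grafana.wikimedia.org/alerting/1/edit',
}));
```

Keeping the formatting on the producer side means the IRC bot only has to read ready-made lines off the list.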
Further elaboration later on (maybe?):
- Move processing and formatting of webhook data from Grafana into a separate process that only produces messages into another Redis list (e.g. "irc-outgoing").
- Also move current ResourceLoader-related polling into a separate process that also stores its messages in the "irc-outgoing" list.
This should help keep things more isolated, so that when the IRC bot needs to be rebooted (which happens from time to time due to net splits), or if for some other reason it is unable to join the channel for a while, messages are preserved.
- Webhooks receiver (producer for "webhook-inbox").
- ResourceLoader manifest poller (producer for "irc-outgoing").
- Grafana alerts formatter (consumes "webhook-inbox", producer for "irc-outgoing").
- Simple IRC bot (consumes "irc-outgoing").
These can run together in the same Kubernetes deployment and be kept online by it as separate processes.
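A hypothetical sketch of what that deployment could look like, with the four processes as containers in one pod; the image names are illustrative, and whether Tool Labs' Kubernetes setup permits multi-container pods would need checking.

```yaml
# Illustrative only: names and images are made up for this sketch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: perflog
spec:
  replicas: 1
  selector:
    matchLabels:
      app: perflog
  template:
    metadata:
      labels:
        app: perflog
    spec:
      containers:
        - name: webhook-receiver    # producer for "webhook-inbox"
          image: tools/perflog-webhook
        - name: rl-manifest-poller  # producer for "irc-outgoing"
          image: tools/perflog-poller
        - name: alerts-formatter    # "webhook-inbox" -> "irc-outgoing"
          image: tools/perflog-formatter
        - name: irc-bot             # consumes "irc-outgoing"
          image: tools/perflogbot
```

Kubernetes restarts crashed containers individually, which gives the "kept online" behaviour without the processes depending on each other.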
A few links I read about using a Redis list as a reliable message queue:
- http://softwareengineering.stackexchange.com/a/339769/75240
- http://big-elephants.com/2013-09/building-a-message-queue-using-redis-in-go/
- http://big-elephants.com/2013-10/tuning-redismq-how-to-use-redis-in-go/
- https://danielkokott.wordpress.com/2015/02/14/redis-reliable-queue-pattern/
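The pattern those articles describe can be sketched as follows, using in-memory arrays to stand in for the two Redis lists; with a real Redis client the three operations map to LPUSH, RPOPLPUSH and LREM, and the list names are the ones proposed above.

```javascript
// In-memory stand-in for two Redis lists, illustrating the reliable-queue
// (RPOPLPUSH) pattern. Real Redis would make each step atomic.
const queues = { 'irc-outgoing': [], 'irc-processing': [] };

// Producer side: LPUSH a formatted message onto the outgoing list.
function lpush(list, msg) {
  queues[list].unshift(msg);
}

// Consumer side: move the oldest message onto a processing list so it
// survives if the bot dies mid-send (RPOPLPUSH in real Redis).
function rpoplpush(src, dst) {
  const msg = queues[src].pop();
  if (msg !== undefined) queues[dst].unshift(msg);
  return msg;
}

// After a successful send to IRC, drop the message from the processing
// list (LREM in real Redis). On bot restart, anything still left in
// "irc-processing" gets requeued instead of being lost.
function ack(dst, msg) {
  const i = queues[dst].indexOf(msg);
  if (i !== -1) queues[dst].splice(i, 1);
}

// Example run: two messages queued, the oldest consumed and acknowledged.
lpush('irc-outgoing', 'alert: first paint regression');
lpush('irc-outgoing', 'alert: load time regression');
const msg = rpoplpush('irc-outgoing', 'irc-processing'); // oldest first
ack('irc-processing', msg); // delivered to #wikimedia-perf-bots
```

The point of the two-list dance is exactly the reboot scenario above: a net split can kill the bot between pop and send, and the processing list is what keeps that message recoverable.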
That seems like a lot of work, and Icinga provides that for free. I think Ops may take just as much offense at us working around Icinga that way as they would if we did the same for email.
I would be more keen to build something very simple based on webhooks for debugging purposes (like a rudimentary page that records all alert hits) and, once we find values that work, convert them to Icinga, where we get the benefits of email, IRC, etc. for free. But even that might be overkill, since Grafana already lets us check whether an alert triggered or not. The information is there; that should be enough to design the alert metrics and thresholds.
Ah, I see what you mean. Even taking the one-hour shift into account, it looks like the 19:00 alert was based on an absolute value, when you've only set an upper bound. I wonder if it has to do with the metric being a percentage that's cast to an integer for the alert, i.e. "10%" becomes 10, but "-10%" also becomes 10.
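Purely speculative, but the suspected cast would look something like this: if the percentage string were reduced to its digits before the threshold comparison, the sign is lost and "-10%" trips an upper bound set at 10.

```javascript
// Speculative illustration of the suspected bug; this is NOT Grafana's code.
// Stripping non-digits before casting loses the sign entirely.
const naiveCast = (s) => parseInt(s.replace(/[^0-9]/g, ''), 10);
const signedCast = (s) => parseFloat(s); // keeps the sign

console.log(naiveCast('10%'), naiveCast('-10%'));   // both 10
console.log(signedCast('10%'), signedCast('-10%')); // 10 and -10
```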
Yes, let me do that, but I think it's something else: I see the same behaviour on my test server (I alert to Slack). We could also split it up and run just one alert per dashboard, to make it easier in the beginning to see and understand what's happening.
@Gilles Yeah, given that we're going with Icinga, I suppose it's okay not to have IRC alerts for the drafted alerts that aren't yet in Icinga. The web interface should suffice for debugging in the short term.
However, I agree with Peter that notifications would probably help the drafting process, as we're otherwise more likely to forget or miss some outliers in the data and subsequently not adapt the alert query as much as we should.
I think we'll have the answer to those questions about notifications once we figure out if the Grafana web UI is enough to create draft alerts or not. Peter ran into some strange things earlier in this task that I'm not sure we have an answer to yet.
I'll talk to one of the Grafana devs tomorrow and show what I've done so far; maybe he can help me with the case where the graph doesn't show a regression but we get an alert. Also, one thing we could do is send a PR upstream to add IRC alert support; that would help us and the community. It doesn't seem like too much work (famous last words).