
Alerts using WebPageTest
Closed, ResolvedPublic

Description

Grafana 4.0 gives us Alerts, so let's set them up.

Let's create a new dashboard, WebPageTest alerts. Start with alerts for:

  • SpeedIndex median
  • Start render median

And have them per browser, logged-in status, desktop/mobile and first/second view. We need https://phabricator.wikimedia.org/T151197 to be merged before we go live for real, but we can start testing how to set up alerts before that.

Event Timeline

I've started to set up some graphs that can be used for alerting: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts

We still need to increase the frequency and add more alerts for other URLs.

I've updated the graphs to show the change in %. @Krinkle, have a look when you have time; I think we should agree on what we think is the best way of doing it before we move on to nav timing.

Been discussing this with my friends and got the feedback that doing a moving average over 24 h hides changes and flattens the curve. Check out the first two graphs on https://grafana-admin.wikimedia.org/dashboard/db/webpagetest-alerts?from=now-24h&to=now to compare.

Been testing out the graph image feature of alerts (you can hook up S3 or WebDAV), which gives you an image of the graph that's alerting:

With the current version there's no way to use templates or hook in another graph, so the use is limited. We can see the diff (the start paint is X % slower). What would be cool in the future is if we could attach info from other graphs (page size/requests etc.) to make it easier to see the reason for the regression.

I've changed the current ones to use an average of the last three hours; let's do it like that for a while. What would be nice is including the history of the actual metrics; maybe it's doable with one metric per dashboard (to avoid clutter). Let's test that.

We could do something like this with % on one axis and SpeedIndex on the other, though I'm not sure how readable it would be:

OK, I've been running it for a while on a server on Digital Ocean just to test. It seems to work, but yesterday I got a strange alert; I wonder if it could be how I collect the metrics?

The alerts should be triggered by the 10% rule, and the current value is 7.14, but I still got an alert (see how the graph is cut off in Slack):

The full graph looks like this:

The diff in the metric is 10.15 %, but it's not reflected in the graph. And the conditions for the query look like this:

Maybe I'm doing something wrong with the 1h -> now range?

I had the idea that we could use templates, templating the URL and browser to make it super easy, but that doesn't work:

Let me try having multiple URLs in the same graph.

We should test adding repeating alert dashboards with clickable links in the header to get to the drilldown page for that page. See the repeating dashboards in https://grafana.wikimedia.org/dashboard/db/performance-metrics as an example.

Been tuning and trying different approaches for a while and want to do a summary. First, a couple of gotchas I ran into when testing the alerts:

  • Grafana alerts don't support templating (you cannot use templates in the dashboards if you want alerts).
  • Grafana alerts don't support repeat panels. If you want an alert on multiple metrics, you need to use the star (*) on those metrics (or include the ones that you want).
  • When you change the queries in the dashboard, the alert history continues to live in the dashboard: you will have marks of alerts (that we hit the limit), but the graph will not be the same anymore, so it can look like we got an alert that the graph doesn't match.

You can find the alert graphs here: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts

The first two mean it will be a little more work to get the dashboards ready, but that is probably for the best: we want clean dashboards with only one metric/page per dashboard, because otherwise it's hard to see anything. Check this example:

Before Christmas I synced with @Krinkle and we both like the idea of showing changes in % instead of saying "SpeedIndex increased by 250", so I've continued to make the graphs use percentages. The goal is to be able to reliably catch a regression that has a 10% impact on SpeedIndex and start render. Can you spot the regressions in the graphs?

I've been testing setting up alerts on https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts and on my own instance https://dashboard.sitespeed.io/dashboard/db/alerts (to be able to receive the alerts), and one thing I've seen is that sometimes when I get an alert, the dashboard image doesn't match:

You can see that the alert to the right has happened (the red line), but the graph isn't showing the data. I've also seen the same in the graphs, but I've changed too much data right now so I don't have an example :(. Let me fix that.

All graphs I've done follow the same structure: take one metric, go back 1 day, summarize over 24 h (or take the median value), compare that with the average (or median) of the latest X hours, take the diff and express it as a percentage of the first value. How can we do this smarter/better? Going back 24 hours means that if we get a permanent regression (which we of course want to catch), the alert dashboard will not show the regression after that timespan (it only compares 24 h back in time), so it can be easy to think that the regression was not real.
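
In Graphite terms that structure boils down to something like this (a sketch only, using one of the Chrome SpeedIndex series as an example; the exact windows differ per graph):

  asPercent(
    diffSeries(
      movingAverage(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '6h'),
      movingAverage(timeShift(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '1d'), '24h')),
    movingAverage(timeShift(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '1d'), '24h'))

That is, the difference between the recent average and the day-old 24 h baseline, expressed as a percentage of that baseline.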

A couple of things: our metrics for Firefox are really unstable (see T155217), and I don't think it has anything to do with WebPageTest; I can see the same thing with sitespeed.io. In general our current tests with WebPageTest are a little flaky, and we could probably decrease the flakiness by adding more runs.

I've been testing calculating the values with averages vs medians and wanted to compare what it would look like for us. The idea is that using medians we could flatten out the highs/lows. Let's see how it works. Every time the graph hits the red line, an alert would have been sent. The max value should be under 10%, and we don't want to have too low values either.
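
In the queries, the only real difference between the two variants below is which Graphite smoothing function wraps the series before the percent diff is taken, roughly:

  movingAverage(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '6h')
  movingMedian(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '5h')

(and the same swap for the 24 h control series).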

This is what it looks like if we use moving averages (24h for control and latest 6h) for SpeedIndex (the last 7 days):

Looking at SpeedIndex with a moving median (24 h for control and latest 5 h) over the last 7 days:

The values from Firefox are pretty useless right now if we want to reach the 10% goal. It doesn't matter if we use medians or averages; they go up/down too much.

Let's look at start render (same moving average settings):

And moving median:

We get the same here: Firefox values are too unstable. For Chrome we get more stable values using averages than medians (the direct opposite of what I thought).

I also tested going back 7 d in time (as we will do with Navigation Timing data), resulting in the following graphs:

@Krinkle @aaron @Gilles it would be super if you could read through this before Thursday so we can discuss it as part of our meeting!

Found out today that you can clear the alerts when you change the query, with the "Clear history" button:

but I get permission denied when I try on our instance.

More things I learned today:

Also, when I talked to Carl (who built the alerts) I verified that what we've done so far is the right path: one metric in one dashboard, to make it as simple as possible.

I also think we can skip the images of the current alert dashboard for now. Updates are coming that will allow more text/Markdown in the alerts; then maybe we can have generated links to https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown (with the right params) instead.

I think I finally had a eureka moment. I've updated the dashboard (doing a remix of what you did before, @Krinkle), but now we have three graphs in the same panel: the Firefox panel has the three tested URLs, and the alert for that panel needs all three URLs to have an increase larger than 5% to fire an alert. I don't know why I didn't think of that before. We really want to find a regression that affects all pages, and this will "probably" work pretty well now.

The current version compares the average 24 hours back with the average of the last 6 hours. I also tried going back 7 days and comparing that with the latest 24 hours (both average/median) but haven't seen much difference; I could add those to the dashboard too so we can see how it looks over time.

I also added another view where we summarize the change for all three tested pages and take the average of that, but I don't think it will be as good as keeping alerts on all three of them, because we want to alert when all three are over a specific threshold, not when the average is over the threshold (even though the graphs are easier to read in the latest example).

There's one drawback with the current way of doing things: when we get an alert, the alert state will be set back to normal (not alerting) if the regression continues for longer than the configured timespan we go back in time. That is OK for alerting, but I think it is conceptually hard that you cannot look at the alert dashboard and know whether there's a regression or not; you need to look at absolute values.

I think the current setup will do as the first version of alerts for WPT if you all @Krinkle @aaron @Gilles agree, so have a look. There will be 8 alerts: 3 for SpeedIndex Desktop (Chrome, Firefox, IE), 3 for Start Render Desktop (Chrome, Firefox, IE) and 2 for Mobile (SpeedIndex and Start Render).

They all work like this: each panel includes three tested pages, and if all of them have a regression over 5 % (configurable per page) at the same time, an alert will be fired. The first 8 panels look back one day and compare the average for 24 h with the average of the last 6 h. I also added panels going 7 d back, comparing with the latest 24 h, so we have something to compare with. We can fine-tune/change whatever, as long as we agree that the first 8 seem like an OK start? https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts
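
For reference, the alert rule on each of those panels reads roughly like this in Grafana's condition editor, where A, B and C are the three per-page percent-change queries (a sketch; the reducer window and the 5 % threshold are the knobs we can tune per page):

  WHEN avg() OF query(A, 6h, now) IS ABOVE 5
  AND  avg() OF query(B, 6h, now) IS ABOVE 5
  AND  avg() OF query(C, 6h, now) IS ABOVE 5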

Looks good to me!

A bit off topic, but in a way the Donald Trump page might not be a good pick because it's probably going to see a lot more editing for the foreseeable future than a regular page, resulting in drastic page size changes over time. The mix of the three pages is definitely a great idea to mitigate that, but I think for our purpose 3 "stable" pages might be better. I don't know if there is such a thing as a gigantic article with stable-ish content, though.

Yes, I agree about the Trump page; let's decide on another page in the next meeting.

Also: the stub page (Ingrid Vang Nyman) is good to have, but it doesn't work so well for alerts; we should have three articles that are more alike. With the current increase for Chrome the alert wasn't fired because that stub didn't pick up the same change:

Change 336605 had a related patch set uploaded (by Phedenskog):
Test Obama instead of Trump for more stable tests

https://gerrit.wikimedia.org/r/336605

Let's switch back to Obama since that served us well before, and then discuss what to do for Ingrid Vang Nyman.

Change 336605 merged by jenkins-bot:
Test Obama instead of Trump for more stable tests

https://gerrit.wikimedia.org/r/336605

Change 338706 had a related patch set uploaded (by Phedenskog):
Test longer articles instead if stubs for better alerting

https://gerrit.wikimedia.org/r/338706

With the latest change (to a new article) and the changes I've done to the graphs (going back 7 d in time and always comparing moving averages over 24 h), I think we should be ready to start testing the alerts by taking on T156245.

Change 338706 merged by jenkins-bot:
Test longer articles instead if stubs for better alerting

https://gerrit.wikimedia.org/r/338706

I think something is wrong in Grafana; I'll file an issue later today. We have alerts for WebPageTest too (not only Navigation Timing) that fire even though none of the graphs show that anything is wrong:

with a config where all queries should be over 5% (you can see that one query has had a max of 4%) and "keep last value" set for everything:

When I had finished creating the issue I saw something:

Firing the rule actually got one of them to alert because it is above 5%. When I zoom into the graph I can see that we are above 5% for that query:

but looking at the graph for 7 days, we haven't crossed the 5% line (max 4%).

We check the alerts with the median over a timespan of 3 hours, but that is not the same as looking at the graph for 7 days.

So the mismatch between alerts and the graph on the dashboard is due to rounding/sampling on the graph because it's "zoomed out"?

Yeah, but it is still strange that the max value changes depending on zooming? I'll change the timespan from 3 hours to something larger, and will check again for the Navigation Timing graphs.
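
One thing that could explain the max changing with zoom: when the graph is zoomed out, Graphite consolidates several datapoints into one (averaging them by default), which shaves off short peaks. If that's the cause, one option could be to wrap the panel's query in consolidateBy so the rendered graph keeps its peaks (just an idea, not something the dashboards use today; <percent-change expression> stands for the panel's existing query):

  consolidateBy(<percent-change expression>, 'max')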

Grabbing this, to try and copy over the alerts you've set up in Grafana to Puppet as graphite alerts.

Gilles triaged this task as High priority. Mar 9 2017, 1:34 PM

Looking at the first alert, SpeedIndex Desktop, I think I've found a graphite request to reproduce the logic of the Grafana alert, but I get a 503 error when I try it...

https://graphite.wikimedia.org/render?from=-9hours&until=now&width=1024&height=768&target=averageAbove(group(aliasByNode(asPercent(diffSeries(movingAverage(webpagetest.enwiki.anonymous.Barack_Obama.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%2724h%27),%20movingAverage(timeShift(webpagetest.enwiki.anonymous.Barack_Obama.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%277d%27),%20%2724h%27)),%20movingAverage(timeShift(webpagetest.enwiki.anonymous.Barack_Obama.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%277d%27),%20%2724h%27)),%203),aliasByNode(asPercent(diffSeries(movingAverage(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%2724h%27),%20movingAverage(timeShift(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%277d%27),%20%2724h%27)),%20movingAverage(timeShift(webpagetest.enwiki.anonymous.Facebook.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%277d%27),%20%2724h%27)),%203),aliasByNode(asPercent(diffSeries(movingAverage(webpagetest.enwiki.anonymous.Sweden.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%2724h%27),%20movingAverage(timeShift(webpagetest.enwiki.anonymous.Sweden.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%277d%27),%20%2724h%27)),%20movingAverage(timeShift(webpagetest.enwiki.anonymous.Sweden.us-east-1.Google_Chrome.firstView.SpeedIndex.median,%20%277d%27),%20%2724h%27)),%203)),%205)
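
URL-decoded and indented, that render target is (only the Barack_Obama branch shown in full; the Facebook and Sweden branches are identical apart from the page name):

  averageAbove(
    group(
      aliasByNode(
        asPercent(
          diffSeries(
            movingAverage(webpagetest.enwiki.anonymous.Barack_Obama.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '24h'),
            movingAverage(timeShift(webpagetest.enwiki.anonymous.Barack_Obama.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '7d'), '24h')),
          movingAverage(timeShift(webpagetest.enwiki.anonymous.Barack_Obama.us-east-1.Google_Chrome.firstView.SpeedIndex.median, '7d'), '24h')),
        3),
      <same for Facebook>,
      <same for Sweden>),
    5)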

OK, figured out what was wrong, but still haven't found how I can have the equivalent of the "if greater than and" mechanism. Summing the 3 metrics doesn't work, because any single one can be way above 5%, so no way to set a meaningful limit. averageAbove() allows me to filter out metrics that don't meet the "greater than" criteria, but then with what I have left I have no way to figure out how many got through the filter?

Maybe if I can compute the unconditional sum and subtract the averageAbove-filtered sum... It's going to be one hell of a long graphite URL :)
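
Something along these lines, writing pctObama, pctFacebook and pctSweden as shorthand for the three per-page percent-change expressions from the render URL above (just a sketch of the idea):

  diffSeries(
    sumSeries(pctObama, pctFacebook, pctSweden),
    sumSeries(averageAbove(group(pctObama, pctFacebook, pctSweden), 5)))

The difference would be the contribution of the series that did not pass the filter.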

sumSeries doesn't seem to work if one of them is "no data", which is the case after the averageAbove filtering. Sigh...

No chance we can just test fetching the alerts from the API? Or do we need to move to graphite alerts?

Yeah, I think this failed attempt to do it the existing way shows that T156245: Create Nagios Grafana alert checks is necessary. Grafana's conditional alerting is essential for us, as you discovered by creating those alerts, which means our use cases just don't fit in the simple graphite-based system we have for Nagios at the moment.

Here's a new alert I don't understand:


https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts?panelId=15&fullscreen

Also, something is wrong with the metrics for "Sweden". I can go back 6 days in time and they work, but going back 7 days gives 0 results. But when I look at the graph over time, we have had the metrics for a couple of weeks.

Well, I can answer that myself. The query goes back 9 hours and takes the median run.

Closing this as I think we're done with these, only the alert emailing remains, which is tracked on the Nagios task.

We need to address the flapping alert (Difference in size authenticated).

I've changed all of them to 20% instead of 5%; let's keep it like that. Also cleared the history for that graph.

Something is wrong with alerts for start render for Firefox. The first graph is SpeedIndex, the second is Start Render:

It looks like the alert rule is the same: it should fire if all URLs are above the limit. That works in the first graph but not on the second one. I'll try to recreate the rules and clear the history.

I also changed all alerts to fire on a 10% change instead of 5%. I want to decrease the 9 h timespan, but let's do that after Easter.

Gilles removed a project: Patch-For-Review.

Anything more needed before we close this, @Gilles? With the latest changes I feel it is ready: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts?refresh=5m&orgId=1

Continuous fine-tuning can be another task.

Gilles claimed this task.

Sure, I think we're done now that we have the email alerts.