
Grafana alerts for Navigation Timing and related
Closed, Resolved, Public


Alerts on:

  • responseStart (TTFB),
  • firstPaint,
  • loadEventEnd (by browser, by platform, median/p75/p95)

Group them like:
1 metric -> Ps, browsers, platforms
Ps -> metrics, browsers, platforms
Browser -> metrics, Ps, platforms
Platform -> metrics, Ps, browsers

Event Timeline

I don't see alerts on that dashboard?

That is right, there's no alerting yet. I've only set up the dashboard; let's look now at whether the metrics are stable.

I've been setting up graphs for our Navigation Timing metrics and first paint to see how the metrics change over the day. All graphs go back 7 days and use the average for 24 hours; we then compare that with the average from the last 24 hours. It's a big window that flattens the curve, but I think it's not important to find a regression immediately; it's better to make sure we get automatic alerts for real regressions within one day.
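A minimal sketch of the comparison described above, assuming the alert value is the relative change between the average of the last 24 hours and the average of the same 24 hours one week earlier. The function name and the sample values are hypothetical and not taken from the dashboard.

```python
from statistics import mean


def percent_change(current, baseline):
    """Relative change of the current window versus the baseline, in percent."""
    current_avg = mean(current)
    baseline_avg = mean(baseline)
    return (current_avg - baseline_avg) / baseline_avg * 100.0


# Hypothetical first paint samples in milliseconds (not real data).
last_24h = [820, 790, 860, 910]
same_24h_week_ago = [800, 780, 790, 830]

change = percent_change(last_24h, same_24h_week_ago)
print(f"first paint changed by {change:.3f}% compared to one week ago")
```

Comparing against a moving baseline like this means the limit is expressed as a percentage and does not need to be updated when the absolute values drift over time.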

An alternative is to choose a limit that we always compare to and then alert if we hit that limit (for example 1 s first paint for the median), but I really like the idea of comparing back in time. The dashboard is here and I will add a couple of screenshots. The time span is 7 days.

First paint for users that are not logged in:

Screen Shot 2017-01-17 at 1.48.27 PM.png (1×2 px, 298 KB)

First paint on mobile:

Screen Shot 2017-01-17 at 1.48.45 PM.png (530×2 px, 188 KB)

First paint North America:

Screen Shot 2017-01-17 at 1.48.58 PM.png (528×2 px, 162 KB)

Response start:

Screen Shot 2017-01-17 at 1.49.11 PM.png (1×2 px, 292 KB)


Screen Shot 2017-01-17 at 1.49.30 PM.png (1×2 px, 295 KB)

Especially considering that we'll have to define the alerts in puppet for now, comparing timespans is superior to a fixed limit that we would have to update over time.

Dashboards look good to me. Do we assume that the dip on LoadEventEnd on the high percentiles on the 14th is "natural" or some sort of low sampling artifact? It's quite sharp compared to the other changes seen on the dashboard.

I haven't looked into it yet, I've been busy setting it up. I've been adding alerts and fine-tuning them. I need to remove a couple of them, and then we also need to discuss when it is OK to have alerts (what values). For example, if the data we have goes up and down by up to 20% (and we see that this is natural), do we use 25% as the limit for an alert?

This is an example of when the panel isn't correct when we compare the alerts:

Screen Shot 2017-01-24 at 8.51.47 AM.png (1×2 px, 329 KB)

The current value was <3% but we still got an alert with the current config:

Screen Shot 2017-01-24 at 8.52.27 AM.png (264×1 px, 30 KB)

We take the median value of the query run over three hours. But the query has never been over the limit of 10%? Or should we use the average? Hmm.
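To illustrate the median-versus-average question, here is a small sketch of how the two reducers react to a single spike inside the evaluated window. The 10% limit comes from the comment above; the sample values and function name are made up for illustration.

```python
from statistics import mean, median

LIMIT_PERCENT = 10.0  # alert limit mentioned in the thread


def should_alert(series, reducer, limit=LIMIT_PERCENT):
    """Reduce the series to one number and compare it with the limit."""
    return reducer(series) > limit


# Hypothetical diff-in-percent values over the window, with one short spike.
diffs = [2.0, 3.0, 2.5, 60.0, 3.0, 2.0]

median_fires = should_alert(diffs, median)  # median ignores the single spike
mean_fires = should_alert(diffs, mean)      # average is pulled up by the spike

print("median reducer fires:", median_fires)
print("average reducer fires:", mean_fires)
```

With these numbers the median stays well below the limit while the average crosses it, so switching to the average would make the alert more sensitive to short spikes, not less.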

"If execution error or timeout" is set to "Alerting". Maybe it's a default, but I think we should always set that to "Keep Last State" instead.
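The difference between the two settings can be sketched as a tiny state function. This is only an illustration of the idea, not Grafana's implementation; the function, the option strings and the failing callback are hypothetical.

```python
def next_state(last_state, evaluate, on_error="keep_last_state"):
    """Return the new alert state after one evaluation run.

    evaluate is a callable returning True when the limit is breached.
    If it raises (for example on a timeout), on_error decides the outcome.
    """
    try:
        breached = evaluate()
    except Exception:
        if on_error == "alerting":
            return "alerting"   # an error alone is enough to fire the alert
        return last_state       # keep whatever state we had before
    return "alerting" if breached else "ok"


def failing_query():
    # Hypothetical stand-in for a query that times out.
    raise TimeoutError("query timed out")


print(next_state("ok", failing_query, on_error="alerting"))         # alerting
print(next_state("ok", failing_query, on_error="keep_last_state"))  # ok
```

With "keep last state", an intermittent query failure cannot by itself flip a healthy alert into the alerting state.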

There's a fix for that coming in 4.2.0, so you will know what triggered the alert.

I did a new remix and now there are 9 alerts for Navigation Timing (and related). Once we have set them up we can trim them.

I avoided Navigation Timing for authenticated users; the values are too random to make alerting possible. When WebPageTest works again for logged-in users we can add something there. We could also add some smart alerts for size, making it easy to get an alert for a big diff in size for JavaScript or CSS; that would be cool.

I have an example where I don't understand why alerts are being fired. The graph looks like this.

Screen Shot 2017-02-20 at 7.10.47 AM.png (1×2 px, 422 KB)

And we can see that the median has been at most 7%, and the config for it looks like this (only checking the median):

Screen Shot 2017-02-20 at 7.13.42 AM.png (786×1 px, 102 KB)

I'll change the "no data" setting to "Keep Last State", but I thought that was only needed for errors. However, I think something is wrong with the query, right?

It's hard to tell what happens if Graphite times out, for example.

Something is making that query fail intermittently. If you open the dashboard admin and keep hitting refresh, it will fail something like 1 out of 4 times, with an error message stating "Cannot read property 'length' of undefined".

It looks like a bug in Grafana's backend, though. When it fails, the "render" request returns a PNG binary instead of the usual JSON:

Capture.JPG (205×1 px, 43 KB)

This is the PNG in image form:

render.png (250×330 px, 1 KB)

It might be the symptom, not the cause, though. While the UI doesn't handle that render response correctly, the issue might be within the backend, unable to render the data occasionally.

Seems like we still get it even after we have set everything to "Keep Last State". It took 11 hours until it recovered (and the graph is not near the limit). I'll try to change the alert query to not end at now, and instead end one hour back in time or something like that (in case it somehow misses values).

I've changed everything to "keep last state" but still get the problem. We only get it for Navigation Timing and not WebPageTest so could it be related to the query?

wrongalerts.png (2×2 px, 489 KB)

When I try the rule manually, I get that the alert will not fire.

Maybe it has something to do with the query going back 3h -> now. I'll test whether we can leave out the last hour or so, so that the metrics have landed safely.

I've been using the wrong syntax. The correct form is now-1h, so query(D,4h, now-1h) will not take the last hour of metrics into consideration.
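A rough sketch of what such a window selects, assuming that query(D, 4h, now-1h) means "from four hours ago until one hour ago" (that reading of the arguments is an assumption here). The helper and the hourly sample points are hypothetical.

```python
from datetime import datetime, timedelta


def select_window(points, now, lookback, end_offset):
    """Keep values whose timestamp lies in [now - lookback, now - end_offset]."""
    start = now - lookback
    end = now - end_offset
    return [value for ts, value in points if start <= ts <= end]


now = datetime(2017, 2, 28, 12, 0)

# Hypothetical hourly datapoints: value h was recorded h hours before "now".
points = [(now - timedelta(hours=h), float(h)) for h in range(6)]

window = select_window(points, now, timedelta(hours=4), timedelta(hours=1))
print(window)  # the newest point (0 hours old) and the 5-hour-old point are excluded
```

Leaving out the most recent hour avoids evaluating datapoints that may still be incomplete when the alert rule runs.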

That query didn't make any difference (the limit here should be any query > 10%):

alerts.png (1×2 px, 339 KB)
And we have "Keep last state" for everything.

It is the same as the WPT graphs when you zoom into the issue:

Screen Shot 2017-02-28 at 1.46.35 PM.png (544×2 px, 122 KB)

The alerts are doing the right thing, and the "perspective from above" when zooming out gives us the wrong impression. Hmm.

OK. So we need to tweak the formula and threshold if it's being hit that often, right? Before copying that stuff over to puppet.

I've increased the timespan for the alerts; all of them now look like this:
B, 8h, now-1h

Hope that will make it cleaner and more obvious when we get an alert.

@Gilles some of the values are stable; how about I clean up the dashboard and we turn it on?

We will then get: overall diff in firstPaint, diff in firstPaint for mobile vs desktop, and overall diff in TTFB and loadEventEnd. Everything for anonymous users.

I think it will be better to have it running?

OK, fixed that now. I don't have the right privileges to remove alerts in Grafana, so we still have them there. I've also changed it so that p95 needs a 20% change for us to fire the alert.

You can create alerts but not delete them?

For the actual alerts, my Grafana user has too low privileges. Timo however can :)

I've changed the alert for Navigation Timing. We had a lot of alerts that were hard to understand because they didn't match when the change actually happened. Instead of going back 4-8 h as we do for WPT, we now go back 30 minutes, take the median there, and run the alert test every 30 min. We can make it run more often too; let's see how this works out.

I just tried the new alert functionality in grafana. Very, very promising! I set up a couple of alerts, and found the process overall quite intuitive. The limit of one alert per graph might be a bit limiting in the long run, but it's easy enough to work around that.

The bigger issue that seems to be blocking me is that sending of emails from notification channels throws errors when testing. I have not received any alert mails, so those errors are likely legit. Is our grafana install supposed to be able to send mail, or is this a known issue / limitation?

@Krinkle just pointed me to T153167, which was declined since ops were opposed to supporting email notifications. Hopefully this doesn't throw us back to the fairly cumbersome status quo of setting up raw icinga alerts via puppet.

I've pushed a new version following the same structure as the alert for WebPageTest: alerts to the left and real values to the right:

I've increased the alert limits for the 95th percentile to a 30-40% diff, since that value depends so much on which metrics we pick up. In this case we will at least catch fundamental problems like when we had that increase for opening a page in a new tab. I think the Navigation Timing alerts are ready to go too; I need your eyes on it @Krinkle @Gilles @aaron

We had some alerts over the last few days; the configuration isn't optimal:

Screen Shot 2017-04-24 at 10.40.02 AM.png (518×1 px, 97 KB)

Screen Shot 2017-04-24 at 10.39.50 AM.png (526×1 px, 99 KB)

When you zoom in you can see that we hit that 20%. BUT it is only for a short while (until the next time we run the alerts). We need to change this so that we only send an alert when we have hit the limit for multiple alert runs; there's nothing to act on right now.
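The "multiple alert runs" idea can be sketched generically as follows. This is not a Grafana feature or configuration, only an illustration of the logic; the function name and the run results are hypothetical.

```python
def fires_after_consecutive(run_results, required_runs):
    """Return True once the limit was breached in required_runs runs in a row.

    run_results is an iterable of booleans, one per alert run
    (True means the limit was exceeded in that run).
    """
    streak = 0
    for breached in run_results:
        streak = streak + 1 if breached else 0  # any clean run resets the streak
        if streak >= required_runs:
            return True
    return False


# A short spike that is gone by the next run does not fire...
print(fires_after_consecutive([False, True, False, True], required_runs=2))
# ...but a breach that persists across two runs does.
print(fires_after_consecutive([False, True, True, False], required_runs=2))
```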

I don't think it's possible to tell Grafana something like that in the alert configuration, though, is it? To only alert if the criteria matched N times in a row.

We could use a longer time span (right now it is 30 minutes if I remember correctly) and take the median of that. We can try a timespan of 2 hours or something; that should mean that we would need at least 1 hour of values over the alert limit.

I have now tried a 2 hour span taking the median, and I deleted the old alerts so it's easier to see what happens.

I've now changed it so that we can also compare the current value with the value one week back in time. This means that the graph is more cluttered and you need to choose the two metrics you wanna compare, but I still think it is good since you can now see the difference more easily and you don't need to go to another dashboard.

Screen Shot 2017-04-24 at 4.22.57 PM.png (712×1 px, 175 KB)

It looked good today, and I have now changed all dashboards to take the median of 2 hours, so we need one hour of values higher than our limit.
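A small sketch of why taking the median over a longer window filters out short spikes: the median only exceeds the limit when roughly half or more of the samples in the window do. The 20% limit is the one mentioned earlier in the thread; the sample values (one per 30 minutes over 2 hours) are hypothetical.

```python
from statistics import median

LIMIT_PERCENT = 20.0  # limit mentioned earlier in the thread


def window_breaches(samples, limit=LIMIT_PERCENT):
    """True when the median of the window is above the limit."""
    return median(samples) > limit


# Hypothetical diff-in-percent samples, one per 30 minutes over 2 hours.
short_spike = [5.0, 25.0, 6.0, 4.0]    # one sample above the limit
sustained = [5.0, 25.0, 26.0, 24.0]    # most of the window above the limit

print("short spike fires:", window_breaches(short_spike))
print("sustained change fires:", window_breaches(sustained))
```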

Peter triaged this task as High priority. May 30 2017, 9:39 AM

We still get alerts that are noise:

Screen Shot 2017-05-31 at 2.02.36 PM.png (1×2 px, 346 KB)

I think we can increase the timespan to even more than 5h (= 2.5 hours of median over our limit). 10h?

I think we can call it done (the rest is tuning), but we are still missing adding it to Icinga?

Change 356382 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Add Navigation Timing alerts to Icinga

Cool I'll do the same then for save timings. I'll increase the timespan to 10h now.

Change 356382 merged by Alexandros Kosiaris:
[operations/puppet@production] Add Navigation Timing alerts to Icinga

This is live now :)

Dzahn added a subscriber: Dzahn.

Current Status: CRITICAL
(for 3d 16h 12m 20s)
Status Information: CRITICAL: is alerting: Response start anonymous mobile.
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 2017-09-25 16:16:07

Sorry, we keep forgetting to acknowledge those in Icinga because the emails and IRC notifications are enough for us. This alert is well explained, it's the result of comparing to older data, in this case comparing to a period where our metric collection had issues. I wish there was a way to auto-acknowledge alerts in Icinga, because we really don't need that part, the notifications are enough for our need.

It's not about not understanding what the alert is for, it's about avoiding the situation that we have alerts that everybody just learns to ignore. Alerts should be actionable for us in some way or another. Meanwhile we have another alert: "CRITICAL: is alerting: Backend time median [ALERT] alert. "

These alerts were set up for our team, they're mostly un-actionable by Ops. We were told to rely on Icinga to avoid reinventing the wheel and setting up yet another alert system to support. But it's the emails and IRC notifications we really cared about in our workflow. The Icinga dashboard tends to get forgotten about.

Which is why I suggest that if there's a way for these alerts to not attract your attention, that'd be great. We'd still be leveraging the existing infrastructure for what we need without generating unnecessary noise for Ops.

@Gilles I understand, let me think about what we can do here long-term, maybe i'll bring it up in our monitoring meeting for ideas. auto-ACK may not be the best solution because ACK means also "no more notifications until the next status change" which means you'd get 1 mail about it and then never hear from it again unless it fixes itself. Or is that what you would actually like?

That's fine for what we do, yeah, we prefer to only get alerted on status changes.
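The "only on status changes" behaviour can be sketched generically like this. It is not Icinga configuration and does not use any Icinga API; the function name and the sequence of check results are hypothetical.

```python
def status_change_notifications(states, initial="OK"):
    """Return the states that would trigger a notification.

    A notification is sent only when a check result differs from the
    previous one; repeated results for an ongoing problem stay silent.
    """
    sent = []
    previous = initial
    for state in states:
        if state != previous:
            sent.append(state)
        previous = state
    return sent


# Hypothetical sequence of check results for one alert.
checks = ["OK", "CRITICAL", "CRITICAL", "CRITICAL", "OK", "OK"]
print(status_change_notifications(checks))  # one message when it breaks, one when it recovers
```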

I'll look into our options here, maybe using eventhandlers for that. I won't keep reopening the ticket for this one, but just ACK it with a reference to this ticket. or so..