
Grafana alerts for save timing
Closed, Resolved · Public

Authored By
Peter
Dec 14 2016, 5:29 AM
Referenced Files
F8942613: Screen Shot 2017-08-03 at 13.13.52.png
Aug 3 2017, 8:17 PM
F8459294: Grafana Save Timing Alerts.png
Jun 14 2017, 1:13 PM
F8459287: Screen Shot 2017-06-14 at 3.10.17 PM.png
Jun 14 2017, 1:13 PM
F8458591: Screen Shot 2017-06-14 at 11.43.34 AM.png
Jun 14 2017, 9:44 AM
F8227583: Save Timing Alerts.png
Jun 1 2017, 8:08 AM
F8223584: Screen Shot 2017-06-01 at 5.26.02 AM.png
Jun 1 2017, 3:27 AM

Description

Add alerts for save timing in frontend and backend (median and p75).

Event Timeline

aaron triaged this task as Medium priority. Dec 22 2016, 9:36 PM

@Peter please schedule a 1-on-1 with @aaron to walk him through this. This is one of our quarterly goals and Save Timing looks like it would be the easiest of all so far to set up alerts for.

I've done a version at https://grafana.wikimedia.org/dashboard/db/save-timing-alerts?orgId=1, following the same pattern as the navigation timing alerts. Let's use this one as a start, @aaron, and then we can tune and improve it.

Change 356449 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[operations/puppet@production] Add Save Timing alerts to Icinga

https://gerrit.wikimedia.org/r/356449

Let me redo the current version tomorrow: alerting on percentage change, as we do for navigation timing/WebPageTest, is probably not the best approach here. I think we can use hard limits (at least for backend time).

Hmm, I had a look now and it isn't super clear. @aaron, I need your input on what a good limit would be.

The alerts for save timing aren't working right now when I try to fire them manually:

Screen Shot 2017-06-01 at 5.26.02 AM.png (1×1 px, 151 KB)

I think the dashboard is OK now (I need @aaron's feedback on what kind of limits we should have):

Save Timing Alerts.png (2×2 px, 768 KB)

The problem that needs to be fixed is finding out why the alerts get a 500. It "only" happens when the alert fires for the query that takes the diff and expresses it as a percentage (the same pattern works for the navtiming/WebPageTest alerts). If I switch to alerting on the default query, without the diff/percentage, it works fine.

A quick Google search turns up https://github.com/grafana/grafana/issues/7061. It looks like the same issue, but there's no reply from the developers yet, just from other reporters.

Yes, there are a couple of reports that look similar, but either there's no answer or the issue has simply been closed. I changed the queries; it's something about the combination of diffSeries and asPercent that makes the backend fail. The current version (where the alert works) just uses divideSeries and adds an offset so we can see the change in percent. This works; now it's just a matter of finding good trigger values. I need your help there, @aaron!
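
For reference, the two query shapes look roughly like this (a sketch in Python building the Graphite expressions; the metric path and the 7-day baseline are hypothetical placeholders, not the exact series on the dashboard):

  # Hypothetical Graphite expressions for the two approaches.
  metric = "MediaWiki.timing.editResponseTime.p75"  # placeholder path
  baseline = f"timeShift({metric}, '7d')"           # assumed baseline window

  # Combination that makes Grafana's alert backend return a 500:
  # difference vs. baseline, expressed as a percentage.
  failing = f"asPercent(diffSeries({metric}, {baseline}), {baseline})"

  # Working combination: ratio vs. baseline, offset by -1 so that
  # 0 means "no change" and 0.5 means "+50%".
  working = f"offset(divideSeries({metric}, {baseline}), -1)"

  print(failing)
  print(working)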

FYI @aaron: I've set the limits really high for now so we can set things up in Icinga, and then you can think about what a reasonable limit would be.

Change 356449 merged by Alexandros Kosiaris:
[operations/puppet@production] Add Save Timing alerts to Icinga

https://gerrit.wikimedia.org/r/356449

I'll have to adjust the moving-average window. It will likely need to be much larger to avoid noisy alerts (and to allow lowering the threshold).
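
Something like the following illustrates widening the window (the '10min' and '30min' values are made up for illustration, not what was actually deployed):

  # Wrapping the (hypothetical) alert series in a larger smoothing window.
  ratio = ("offset(divideSeries(MediaWiki.timing.editResponseTime.p75, "
           "timeShift(MediaWiki.timing.editResponseTime.p75, '7d')), -1)")

  noisy = f"movingAverage({ratio}, '10min')"
  smoother = f"movingAverage({ratio}, '30min')"  # fewer false alarms, slower to fire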

There's a significant spike that started on June 9th (mostly seen on backend) and the alerts haven't caught it.

I think that is my fault. The alert threshold is set to 50 (as if it were a percentage), but since the query yields a fraction it should be 0.50 (you can see the values when you test-fire the alerts). I'll change that.
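
The mix-up is easy to make because the query yields a fraction rather than a percentage; a minimal sketch of the difference (values are illustrative):

  # The alert series is a ratio offset by -1, so +23% shows up as 0.23.
  current_p75, baseline_p75 = 670.0, 545.0  # illustrative values, in ms
  change = current_p75 / baseline_p75 - 1   # ~0.23

  print(change > 0.50)  # correct "+50%" threshold: False here
  print(change > 50)    # the buggy threshold: a ratio realistically never reaches 50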

Fixed now. The underlying problem is #3306827, which is why we can't use the same queries as for the other alerts.

I've changed it now so that the right-hand graphs show the latest 30 days (like the other graphs), which hopefully makes it easier to spot regressions:

Screen Shot 2017-06-14 at 11.43.34 AM.png (526×1 px, 110 KB)

Can we close this now?

No, we should decrease the alert thresholds; some still have a 50% limit. I've changed the two for backend timing to 20%, which seems more reasonable.

Is the alert that just fired due to the recovery improvement? It seems like backend save timing went back down significantly, almost to its level before June 9th.

No, it's because I changed the limit from 50% to 20%:

Screen Shot 2017-06-14 at 3.10.17 PM.png (534×1 px, 94 KB)

So it should have fired a lot earlier if we had configured it better:

Grafana Save Timing Alerts.png (518×1 px, 99 KB)

Should we increase the limit? Hmm.

ayounsi subscribed.

Reopening as the alert is going off again:

Status Information: CRITICAL: https://grafana.wikimedia.org/dashboard/db/save-timing-alerts is alerting: Backend time p75 [ALERT] alert.

Krinkle subscribed.

Reopening as the alert is going off again:

Status Information: CRITICAL: https://grafana.wikimedia.org/dashboard/db/save-timing-alerts is alerting: Backend time p75 [ALERT] alert.

Yeah, according to #wikimedia-perf-bots (and my Icinga inbox) it's been flapping back and forth between CRITICAL and RECOVERY many times over the past few days.

This is probably because it's too sensitive at this point. The data we are measuring is backend processing time for saving edits, which naturally varies a lot depending on the wiki in question (configuration and installed extensions) as well as the edit itself (large or small submission, lots of templates or not, etc.). Given about 500 edits per minute, the median isn't going to be stable to within 100ms, as the data set varies from one minute to the next.
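
As a rough illustration, simulating a minute of synthetic save times shows how much the median wanders (the log-normal distribution and its parameters are assumptions for the sketch, not measured values):

  import random
  import statistics

  random.seed(42)

  def minute_median(n=500):
      # ~500 edits/minute, each with a log-normal backend save time
      # (median ≈ exp(6.3) ≈ 545ms; the sigma of 0.9 is an assumed shape).
      return statistics.median(random.lognormvariate(6.3, 0.9) for _ in range(n))

  medians = [minute_median() for _ in range(60)]
  print(f"min {min(medians):.0f}ms, max {max(medians):.0f}ms")
  # The per-minute median typically swings by several tens of ms,
  # so a tight relative threshold on this series will flap.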

When the backend-timing p75 was closer to 800-900ms, 20% wasn't too sensitive, but with it at 400-600ms it probably is (the margin is now ±80ms instead of ±180ms).

For the time being, I've raised the backend save-timing alert threshold back to 0.5 (50%). Having said that, despite the recent flapping, even at 50% there is definitely also a "real" regression going on right now, which is worth investigating.

https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=1500879593680&to=1501791224192

Screen Shot 2017-08-03 at 13.13.52.png (1×2 px, 282 KB)

Closing this task, as it was about setting up these alerts, which was done. Filed T172447 about this regression.
