
Grafana alerts for save timing
Closed, ResolvedPublic

Description

Add alerts for save timing in frontend and backend (median and p75).

Details

Related Gerrit Patches:
operations/puppet (production): Add Save Timing alerts to Icinga

Event Timeline

Peter created this task. · Dec 14 2016, 5:29 AM
Restricted Application added a subscriber: Aklapper. · Dec 14 2016, 5:29 AM
Gilles assigned this task to aaron. · Dec 15 2016, 9:39 PM
aaron triaged this task as Medium priority. · Dec 22 2016, 9:36 PM
Gilles added a subscriber: Gilles. · May 31 2017, 1:38 PM

@Peter please schedule a 1-on-1 with @aaron to walk him through this. This is one of our quarterly goals and Save Timing looks like it would be the easiest of all so far to set up alerts for.

Peter added a comment. · May 31 2017, 7:19 PM

I've made a version at https://grafana.wikimedia.org/dashboard/db/save-timing-alerts?orgId=1 following the same pattern as for navigation timing. Let's use this one as a start, @aaron, and then we can tune it and make it better.

Change 356449 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[operations/puppet@production] Add Save Timing alerts to Icinga

https://gerrit.wikimedia.org/r/356449

Peter added a comment. · May 31 2017, 8:45 PM

Let me redo the current version tomorrow: using percentages, as we do for navigation timing/WebPageTest, is probably not the best approach here. I think we can use hard limits (at least for backend time).

Peter added a comment. · Jun 1 2017, 3:20 AM

Hmm, I had a look now and it isn't super clear. @aaron, I need your input on what a good limit could be.

Peter added a comment. · Jun 1 2017, 3:27 AM

The alerts for save timing aren't working right now when I try to fire them manually:

Peter added a comment. · Jun 1 2017, 8:08 AM

I think the dashboard is OK now (I need @aaron's feedback on what kind of limits we should have):

The problem that still needs to be fixed is finding out why the alerts get a 500. It "only" happens when the alert fires for the query that computes the diff as a percentage (which works for the navtiming/WebPageTest alerts). If I change it to alert on the default query, without the diff/percentage, it works fine.

A quick Google search turns up https://github.com/grafana/grafana/issues/7061. It looks like the same issue, but there's no reply from the developers yet, only from other reporters.

Peter added a comment. · Jun 2 2017, 8:25 AM

Yes, there are a couple of reports that look similar, but there's either no answer or the issue has simply been closed. I changed the queries; it is something about the combination of diffSeries and asPercent that makes the backend fail. The current version (where the alert works) just uses divideSeries and adds an offset so we can still see the change in percent. This works; it is just a matter of finding good trigger values. I need your help there, @aaron!
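For reference, a sketch of the arithmetic behind the two formulations, using made-up numbers (the real Graphite metric paths aren't spelled out in this task). The failing diffSeries + asPercent combination and the divideSeries + offset workaround express the same relative change, just in different units:

```
# Sketch of the two Graphite formulations discussed above, with made-up
# numbers rather than the real metric paths.
#
# Failing combination (triggers the 500 when the alert fires):
#   asPercent(diffSeries(current, baseline), baseline)
# Working workaround:
#   offset(divideSeries(current, baseline), -1)

current = 600.0   # hypothetical backend save-timing p75 now, in ms
baseline = 500.0  # hypothetical moving-average baseline, in ms

pct_change = (current - baseline) / baseline * 100  # diffSeries + asPercent -> ~20.0 (%)
ratio_change = current / baseline - 1               # divideSeries + offset(-1) -> ~0.20

# Same quantity, different units: percent vs. fraction.
assert abs(pct_change / 100 - ratio_change) < 1e-9
print(pct_change, ratio_change)
```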

Peter added a comment. · Jun 2 2017, 8:58 AM

FYI @aaron, I've set the limits really high for now so we can set the alerts up in Icinga, and then you can think about what a reasonable limit would be.

Change 356449 merged by Alexandros Kosiaris:
[operations/puppet@production] Add Save Timing alerts to Icinga

https://gerrit.wikimedia.org/r/356449

aaron added a comment. · Jun 2 2017, 7:08 PM

I'll have to adjust the moving-average window. It will likely need to be much larger to avoid noisy alerts (and to allow lowering the threshold).
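As an illustration of this point (not the production query), here is a quick sketch showing how a larger moving-average window damps minute-to-minute noise, which in turn allows a lower alert threshold; the series and its parameters are made up:

```
# Toy illustration: smoothing a noisy per-minute series with moving averages
# of different window sizes. Values and distribution are made up.
import random

random.seed(0)
series = [500 + random.gauss(0, 80) for _ in range(1440)]  # one day of 1-minute points, in ms

def moving_average(values, window):
    return [sum(values[i - window:i]) / window for i in range(window, len(values) + 1)]

for window in (5, 30, 120):  # window size in minutes
    smoothed = moving_average(series, window)
    spread = max(smoothed) - min(smoothed)
    print(f"{window:>3}-minute window: smoothed series spans {spread:.0f} ms")
```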

There's a significant spike that started on June 9th (mostly seen on the backend), and the alerts haven't caught it.

Peter added a comment. · Jun 13 2017, 9:40 AM

I think that is my fault. The alert is set to 50%, but it should be 0.50 (you can see the values when you test-fire the alerts). I'll change that.
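The unit confusion makes sense given the workaround query: offset(divideSeries(...), -1) returns a fraction, not a percentage, so the Grafana alert condition has to compare against 0.50 rather than 50. A quick sketch with made-up numbers:

```
# The workaround query returns a fraction: e.g. a 60% increase comes back
# as 0.6, not 60. Made-up p75 values in ms.
current, baseline = 800.0, 500.0
query_value = current / baseline - 1   # offset(divideSeries(...), -1) -> 0.6

print(query_value > 50)    # False: a threshold of 50 practically never fires
print(query_value > 0.50)  # True: the threshold must be in fractional units
```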

Peter added a comment. · Jun 13 2017, 9:44 AM

Fixed now. The problem from the beginning is #3306827, which means we can't use the same setup as for the other alerts.

Peter added a comment. · Jun 14 2017, 9:44 AM

I've changed it now so that the graphs on the right show the latest 30 days (like the other graphs), so it is hopefully easier to spot regressions:

Can we close this now?

No, we should lower the alert thresholds; for some we still have a 50% limit. I've changed the two for backend timing to 20%, which seems more reasonable.

Is the alert that just fired due to the recovery improvement? It seems like backend save timing went back down significantly, almost to its levels before June 9th.

Peter added a comment. · Jun 14 2017, 1:13 PM

No, it's because I changed the limit from 50% to 20%:

So it should have fired a lot earlier if we had configured it better:

Should we increase the limit? Hmm.

Gilles closed this task as Resolved. · Jul 12 2017, 7:15 PM
ayounsi reopened this task as Open. · Aug 3 2017, 6:36 PM
ayounsi added a subscriber: ayounsi.

Reopening as the alert is going off again:

Status Information: CRITICAL: https://grafana.wikimedia.org/dashboard/db/save-timing-alerts is alerting: Backend time p75 [ALERT] alert.

Krinkle closed this task as Resolved. · Aug 3 2017, 8:17 PM
Krinkle added a subscriber: Krinkle.

Reopening as the alert is going off again:

Status Information: CRITICAL: https://grafana.wikimedia.org/dashboard/db/save-timing-alerts is alerting: Backend time p75 [ALERT] alert.

Yeah, according to #wikimedia-perf-bots (and my Icinga inbox), it's been flapping back and forth between CRITICAL and RECOVERY many times over the past few days.

This is because the alert is probably too sensitive at this point. The data we are measuring is backend processing time for saving edits, which naturally varies a lot depending on the wiki in question (configuration and installed extensions) as well as on the edit itself (large or small submission, lots of templates or not, etc.). Given about 500 edits per minute, the median isn't going to be stable to within 100 ms, as the data set varies from one minute to the next.
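A rough simulation backs this up. Assuming a lognormal-ish distribution of save times with a median around 500 ms (the distribution and its parameters are assumptions, not measured data), the per-minute median of ~500 samples easily wanders by tens of ms:

```
# Rough sanity check of the stability claim. Distribution and parameters
# are assumptions, not measured data.
import random, statistics

random.seed(1)
minute_medians = [
    statistics.median(random.lognormvariate(6.2, 0.8) for _ in range(500))
    for _minute in range(60)  # one hour of 1-minute buckets
]
print(f"typical median: {statistics.median(minute_medians):.0f} ms")
print(f"range over an hour: {min(minute_medians):.0f}..{max(minute_medians):.0f} ms")
```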

When the backend save-timing p75 was closer to 800-900 ms, 20% wasn't too sensitive, but with it now at 400-600 ms it probably is (roughly ±80 ms now versus ±180 ms before).

For the time being, I've raised the backend save-timing alert back to 0.5 (50%). Having said that, despite the recent flapping, even at 50% there is definitely also a "real" regression going on right now, which is worth investigating.

https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=1500879593680&to=1501791224192

Closing this task, as it was about setting up these alerts, which was done. Filed T172447 about this regression.
