Description
Add alerts for save timing in frontend and backend (median and p75).
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add Save Timing alerts to Icinga | operations/puppet | production | +4 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | akosiaris | T140942 Tracking: Monitoring and alerts for "business" metrics
Resolved | | Gilles | T153166 Set up Grafana alerts for Web Performance metrics
Resolved | | aaron | T153170 Grafana alerts for save timing
Event Timeline
I've made a first version, https://grafana.wikimedia.org/dashboard/db/save-timing-alerts?orgId=1, following the same pattern as for navigation timing. @aaron, let's use this one as a start and then tune it and make it better.
Change 356449 had a related patch set uploaded (by Phedenskog; owner: Phedenskog):
[operations/puppet@production] Add Save Timing alerts to Icinga
Let me redo the current version tomorrow: alerting on a percentage change, as we do for navigation timing/WebPageTest, is probably not the best approach here. I think we can use hard limits (at least for backend time).
Hmm, I had a look now and it isn't super clear. @aaron, I need your input on what a good limit would be.
The alerts for save timing aren't working right now when I try to fire them manually:
I think the dashboard is OK now (I need @aaron's feedback on what kind of limits we should have):
The problem that needs to be fixed is finding out why the alerts get a 500. It "only" happens when the alert fires for the query that computes the diff as a percentage (the same approach works for the navtiming/WebPageTest alerts). If I change it to alert on the default query, without the diff/%, it works fine.
A quick Google search turns up https://github.com/grafana/grafana/issues/7061. It looks like the same issue, but there's no reply yet from the devs, just from other reporters.
Yes, there are a couple of reports that look similar, but with no answer, or the issue has just been closed. I changed the queries; it is the combination of diffSeries and asPercent that makes the backend fail. The current version (where the alert works) just uses divideSeries and adds an offset so we can see the change in percent. This works, so it is just a matter of finding good trigger values. I need your help there, @aaron!
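For illustration, a rough sketch of what the two query shapes might look like as Graphite expressions. The metric path, the p75 series, and the one-week timeShift baseline are hypothetical placeholders, and the `#` lines are annotations rather than part of the query; only the function combinations (diffSeries + asPercent vs. divideSeries + offset) come from the comments above.

```
# Variant that makes the Grafana backend return a 500 when the alert fires
# (hypothetical metric path and baseline):
asPercent(
  diffSeries(some.metric.savetiming.backend.p75,
             timeShift(some.metric.savetiming.backend.p75, "7d")),
  timeShift(some.metric.savetiming.backend.p75, "7d")
)

# Working variant: divide the current series by the baseline and offset by -1,
# so a value of 0.2 means "20% higher than the baseline".
offset(
  divideSeries(some.metric.savetiming.backend.p75,
               timeShift(some.metric.savetiming.backend.p75, "7d")),
  -1
)
```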
FYI @aaron, I've set the limits really high for the alerts for now, so we can set them up in Icinga, and then you can think about what a reasonable limit would be.
Change 356449 merged by Alexandros Kosiaris:
[operations/puppet@production] Add Save Timing alerts to Icinga
I'll have to adjust the moving average window. It will likely need to be much larger to avoid noise alerts (and to be able to lower the threshold).
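Continuing the hypothetical sketch from above, widening the window would mean wrapping the ratio query in movingAverage with a longer windowSize; the "30min" value here is just an illustrative example, not the actual setting:

```
movingAverage(
  offset(
    divideSeries(some.metric.savetiming.backend.p75,
                 timeShift(some.metric.savetiming.backend.p75, "7d")),
    -1
  ),
  "30min"
)
```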
There's a significant spike that started on June 9th (mostly seen on backend) and the alerts haven't caught it.
I think that is my fault. The alert is set to 50% but it should be 0.50 (you can see the values when you test-fire the alerts). I'll change that.
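To spell out the unit mismatch with made-up numbers (the 600 ms and 900 ms values are purely illustrative): the ratio-style query returns a fraction, so a 50% regression shows up as 0.5 and the Grafana threshold has to be entered on that scale.

```
# baseline p75 = 600 ms, current p75 = 900 ms (illustrative values)
# divideSeries(current, baseline)  -> 900 / 600 = 1.5
# offset(..., -1)                  -> 1.5 - 1   = 0.5
# so the threshold for "50% slower" is 0.5, not 50
```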
Fixed now. The problem from the beginning is #3306827, which is why we can't use the same setup as for the other alerts.
I've changed it now so the graphs on the right show the latest 30 days (like the other graphs), so it is hopefully easier to spot regressions:
Can we close this now?
No, we should decrease the alert limits; for some we have a 50% limit. I've changed the two for backend timing to 20%, which seems more reasonable.
Is the alert that just happened due to the recovery improvement? It seems like backend save timing went back down significantly, almost to its level before June 9th.
No, it's because I changed the limit from 50% to 20%:
So it should have fired a lot earlier if we had configured it better:
Should we increase the limit? Hmm.
Reopening as the alert is going off again:
Status Information: CRITICAL: https://grafana.wikimedia.org/dashboard/db/save-timing-alerts is alerting: Backend time p75 [ALERT] alert.
Yeah, according to #wikimedia-perf-bots (and my icinga inbox) it's been flapping back and forth between CRITICAL/RECOVERY lots of times over the past days.
This is because it's probably too sensitive at this point. The data we are measuring is backend processing time for saving edits, which naturally varies a lot depending on the wiki in question (config and installed extensions) as well as the edit itself (large or small submission, lots of templates or not, etc.). Given about 500 edits per minute, the median isn't going to be stable to within 100ms, as the data set will vary from one minute to the next.
When the backend-timing p75 was closer to 800-900ms, 20% wasn't too sensitive, but with it being 400-600ms it probably is (±80ms vs ±180ms).
For the time being, I've raised the backend-save-timing alert back to 0.5 (50%). Having said that, despite the recent flapping, even with 50%, there is definitely also a "real" regression going on right now, which is worth investigating.
Closing this task, as it was about setting up these alerts, which was done. Filed T172447 about this regression.