Onboard Perf Team to new Alerting Toolset
Closed, Resolved (Public)

Assigned To

	fgiunchedi

Authored By

	fgiunchedi
	Jan 26 2021, 1:40 PM

Description

This task tracks onboarding Performance team alerts to Alertmanager, specifically:

Add performance team "contacts" (irc/email) to Alertmanager configuration
Audit and move Grafana alerts from Icinga to Alertmanager. Each alert must have at least team and severity labels set
De-provision Icinga checks for Performance dashboards

Details

Subject	Repo	Branch	Lines +/-
alertmanager: send recovery emails for performance	operations/puppet	production	+1 -0
icinga: remove Grafana alerts for Performance	operations/puppet	production	+0 -112
alertmanager: route Performance team alerts	operations/puppet	production	+9 -0


Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T272979 Onboard Perf Team to new Alerting Toolset
Resolved	fgiunchedi	T278210 Change repeat interval for performance team alerts
Open	None	T278514 Wishlist for AlertManager alerts from Grafana
Declined	None	T278923 Resolved emails sometimes as new email threads and sometimes not

Event Timeline

fgiunchedi created this task. Jan 26 2021, 1:40 PM

Restricted Application added a subscriber: Aklapper. Jan 26 2021, 1:40 PM

dpifke subscribed. Jan 26 2021, 4:01 PM

fgiunchedi moved this task from Backlog to Up next on the User-fgiunchedi board. Jan 27 2021, 11:40 AM

@fgiunchedi I'm gonna do that on our side, so ping me when you are ready to talk.

Gilles moved this task from Inbox, needs triage to Radar on the Performance-Team board. Feb 1 2021, 7:54 PM

Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

In T272979#6793496, @Peter wrote:

@fgiunchedi I'm gonna do that on our side, so ping me when you are ready to talk.

Sounds good! To give an overview: to have proper routing, each alert will need to have at least team and severity labels. In this case the team value could be performance (or perf? I'm OK with either). The severity label is also not set in stone, but to ease the transition from Icinga I think it makes sense to have (some of) the following: page, critical, warning, unknown, task. The first and last are new, and can be used to issue pages to the team (if applicable) and to create tasks in one or more Phabricator projects.

The first step, I think, is to figure out which severity should correspond to which contact method for team=performance alerts. For a made-up example:

critical should IRC #wikimedia-perf and email performance-team@
warning should email performance-team@
otherwise open a task to PROJECT

Let me know what you think! Happy to set up a meeting as well if that's easier
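
To make the made-up example above concrete, here is a minimal Python sketch of the routing decision. It is only an illustration and not the actual Alertmanager configuration: the `route` function, the `PERFORMANCE_ROUTES` table and the contact strings are hypothetical names, and PROJECT stays the placeholder used above.

```python
# Hypothetical sketch of label-based routing; not the real Alertmanager config.
REQUIRED_LABELS = ("team", "severity")

# Mapping copied from the made-up example in this comment.
PERFORMANCE_ROUTES = {
    "critical": ["irc:#wikimedia-perf", "email:performance-team@"],
    "warning": ["email:performance-team@"],
}
# Any other severity opens a task (PROJECT is a placeholder).
FALLBACK = ["task:PROJECT"]


def route(labels):
    """Return the contact methods for an alert, based on its labels."""
    missing = [name for name in REQUIRED_LABELS if name not in labels]
    if missing:
        raise ValueError(f"alert is missing required labels: {missing}")
    if labels["team"] != "performance":
        return []  # other teams are out of scope for this sketch
    return PERFORMANCE_ROUTES.get(labels["severity"], FALLBACK)
```

For instance, `route({"team": "performance", "severity": "warning"})` yields only the email contact, while an alert without a severity label is rejected outright, which is why every alert needs both labels.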

fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. Feb 2 2021, 1:30 PM

@fgiunchedi great, let's set up a meeting as a start!

In T272979#6807753, @Peter wrote:

@fgiunchedi great, let's set up a meeting as a start!

Sweet! Meeting on Wed at 13 UTC (invite sent)

fgiunchedi moved this task from Inbox to In progress on the observability board. Feb 8 2021, 4:17 PM

Thanks for the meeting @fgiunchedi! I will change two alerts first thing tomorrow for the tests we run with WebPageReplay and set them up as an example, then make them fire and see if we can come up with a good way of grouping them.

Checked the docs for Grafana. To be able to add custom labels we need 7.4, but since it is based on tags for Graphite it is of no use to us right now for the synthetic tests.

The docs also say: "All alert notifications contain a link back to the triggered alert in the Grafana instance. This URL is based on the domain setting in Grafana." That would be sweet if we can use it.

Change 663238 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] alertmanager: route Performance team alerts

https://gerrit.wikimedia.org/r/663238

gerritbot added a project: Patch-For-Review. Feb 10 2021, 4:15 PM

In T272979#6819048, @Peter wrote:

Thanks for the meeting @fgiunchedi! I will change two alerts first thing tomorrow for the tests we run with WebPageReplay and set them up as an example, then make them fire and see if we can come up with a good way of grouping them.

Sweet! I've just published https://gerrit.wikimedia.org/r/c/operations/puppet/+/663238 for the routing as we discussed, PTAL. I'm not sure about the email address, for example.

In T272979#6819126, @Peter wrote:

Checked the docs for Grafana. To be able to add custom labels we need 7.4, but since it is based on tags for Graphite it is of no use to us right now for the synthetic tests.

The docs also say: "All alert notifications contain a link back to the triggered alert in the Grafana instance. This URL is based on the domain setting in Grafana." That would be sweet if we can use it.

Agreed that'd be sweet! We have a tracking task for Grafana upgrade (https://phabricator.wikimedia.org/T263747) FWIW

I've tested with two alerts just to see what it looks like:

Screenshot 2021-02-11 at 08.47.17.png (1×2 px, 1 MB)

So I need to be better at naming each metric :) The test we run checks three different URLs, and if all three are above the threshold, the alert is fired.

It seems that the runbook link isn't created? In Grafana it looks like this, is that correct or should I do something special for the link:

Screenshot 2021-02-11 at 08.52.47.png (430×1 px, 210 KB)

I think it would be great to find a way to include the URL to the dashboard to make the alerts in the Alert Manager more actionable.

For now I grouped the alerts per tool that runs the tests. I need to think about how best to do that before I move on.

Made some changes:

Screenshot 2021-02-11 at 09.33.13.png (1×2 px, 1 MB)

I've added the URL to the dashboard in the message, that's OK I think.

Then I feel we should group the tests per wiki instead. So for example we have one group for en.wiki (same for mobile/desktop?) and then all tools report under that name and tag which tool is alerting.

For the RUM metrics we don't need to group them, so there is one name per metric we test.

For reference: Each condition that we add to the alert in Grafana:

Screenshot 2021-02-11 at 09.55.42.png (448×1 px, 238 KB)

will have its own section with a description in the Alert Manager. In that example we test one URL per query, and the last query checks whether we show a banner or not. Each of these conditions will have its own section in the Alert Manager and reuse the message set in Grafana.

In T272979#6821662, @Peter wrote:

It seems that the runbook link isn't created? In Grafana it looks like this, is that correct or should I do something special for the link:

That's interesting! Yes I think the reason why the link isn't rendered is that alert "annotations" are given that treatment in Karma and not alert "labels", it seems Grafana sends only labels (named 'tags') and not arbitrary annotations :( The annotations that are sent however are: summary / description and image (which in theory should be enabled, not sure why we're not seeing the image URL included though)
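
To illustrate the labels versus annotations distinction, here is a rough Python sketch of the kind of alert object a client could send to Alertmanager. The `labels`, `annotations` and `generatorURL` fields follow the Alertmanager alert format as I understand it. The `build_alert` helper, the `runbook` annotation key and the example URLs are assumptions made up for this sketch, and this is not what Grafana actually sends.

```python
import json


def build_alert(alertname, team, severity, summary, dashboard_url, runbook_url=None):
    """Build one alert dict, keeping routing data and free-form text apart."""
    # Labels identify the alert and drive routing and grouping.
    labels = {"alertname": alertname, "team": team, "severity": severity}
    # Annotations carry human-readable extras such as links.
    annotations = {"summary": summary}
    if runbook_url is not None:
        annotations["runbook"] = runbook_url  # hypothetical annotation key
    return {
        "labels": labels,
        "annotations": annotations,
        "generatorURL": dashboard_url,
    }


if __name__ == "__main__":
    alert = build_alert(
        alertname="WebPageReplaySlowFirstVisualChange",
        team="performance",
        severity="warning",
        summary="First visual change is slow",
        dashboard_url="https://grafana.example.org/d/hypothetical",
        runbook_url="https://wiki.example.org/hypothetical-runbook",
    )
    print(json.dumps([alert], indent=2))
```

The point of the split is that a runbook link placed in `labels` becomes part of the alert's identity, whereas in `annotations` it is just extra context that a dashboard such as Karma can render.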

@fgiunchedi are images configured? I think they need to be uploaded somewhere. When I used this on my personal setup, I needed to configure it so that Chromium (PhantomJS before that) took a screenshot of the graph and then uploaded it somewhere so that the image is accessible.

In T272979#6822772, @Peter wrote:

@fgiunchedi are images configured? I think they need to be uploaded somewhere. When I used this on my personal setup, I needed to configure it so that Chromium (PhantomJS before that) took a screenshot of the graph and then uploaded it somewhere so that the image is accessible.

AFAICT it should work, in the sense that I can do direct-linking of renders of panels, e.g. this swift render. We have grafana-image-renderer plugin installed (thus chromium headless)

However, I just noticed that the direct render doesn't seem to work for all dashboards, e.g. edit-count doesn't work whereas webpagerelay ca alerts does.

In T272979#6821750, @Peter wrote:

Made some changes:

I've added the URL to the dashboard in the message, that's OK I think.

Agreed, that's likely good enough for now.

Then I feel we should group the tests per wiki instead. So for example we have one group for en.wiki (same for mobile/desktop?) and then all tools report under that name and tag which tool is alerting.

There are indeed many ways to solve this problem. As a matter of best practice, the alert name is usually a combination of "thing" and/or "symptom", e.g. in the case above one could have:

alertname: WebPageReplaySlowFirstVisualChange

alertname: WebPageReplaySpeedIndexRegression

across all relevant dashboards, each group then will have the tags e.g. per wiki

@fgiunchedi I have two questions:

Is there a way to convert/massage the data that comes from Grafana before it ends up in the Alert Manager? I'm wondering if there's a way to get rid of one message per alert condition as in https://phabricator.wikimedia.org/T272979#6821783. We need three or four different conditions, but since we combine them with AND they all need to match, so one message in the Alert Manager is enough and would make the alert easier to read and understand.

Is there a way today to escalate X number of warnings into a critical alert in the Alert Manager? For example, we test 12 different Wikipedias: if a couple fail it's a warning, but if all fail it's critical. I understand we can do that directly in Grafana, but the alert setup there is still pretty basic, so creating that many alert queries will be hard to get right.

Change 663238 merged by Filippo Giunchedi:
[operations/puppet@production] alertmanager: route Performance team alerts

https://gerrit.wikimedia.org/r/663238

Maintenance_bot removed a project: Patch-For-Review. Feb 12 2021, 9:10 AM

fgiunchedi updated the task description. Feb 12 2021, 1:28 PM

In T272979#6819126, @Peter wrote:

Checked the docs for Grafana. To be able to add custom labels we need 7.4, but since it is based on tags for Graphite it is of no use to us right now for the synthetic tests.

The docs also say: "All alert notifications contain a link back to the triggered alert in the Grafana instance. This URL is based on the domain setting in Grafana." That would be sweet if we can use it.

Grafana has been upgraded to 7.4 today, in case you'd like to do further tests.

In T272979#6825655, @Peter wrote:

@fgiunchedi I have two questions:

Is there a way to convert/massage the data that comes from Grafana before it ends up in the Alert Manager? I'm wondering if there's a way to get rid of one message per alert condition as in https://phabricator.wikimedia.org/T272979#6821783. We need three or four different conditions, but since we combine them with AND they all need to match, so one message in the Alert Manager is enough and would make the alert easier to read and understand.

Following up from IRC: there isn't such a built-in way in AM.

Is there a way today to escalate X number of warnings into a critical alert in the Alert Manager? For example, we test 12 different Wikipedias: if a couple fail it's a warning, but if all fail it's critical. I understand we can do that directly in Grafana, but the alert setup there is still pretty basic, so creating that many alert queries will be hard to get right.

The closest functionality (but not quite) would be to use the "inhibit" AM feature to not fire the warning alerts if there's already a critical alert, although the critical alert will still need to be sent somehow.
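
As a rough illustration of both ideas, the Python sketch below first picks a severity from the number of failing wikis (the escalation asked about, which would have to happen before the alert reaches AM) and then mimics inhibition by dropping warnings that share chosen labels with a firing critical alert. The function names `escalate` and `apply_inhibition` and the alert dicts are hypothetical, and this only models the idea rather than AM's actual implementation.

```python
def escalate(failing, total):
    """Pick a severity from how many of the tested wikis are failing."""
    if failing == 0:
        return None  # nothing to alert on
    return "critical" if failing == total else "warning"


def apply_inhibition(alerts, equal=("alertname",)):
    """Drop warning alerts when a critical alert shares the `equal` labels."""

    def key(alert):
        return tuple(alert.get(name) for name in equal)

    critical_keys = {key(a) for a in alerts if a.get("severity") == "critical"}
    return [
        a
        for a in alerts
        if not (a.get("severity") == "warning" and key(a) in critical_keys)
    ]
```

With 12 wikis tested, `escalate(2, 12)` gives a warning and `escalate(12, 12)` gives critical. Inhibition then only hides the redundant warnings, which is why the critical alert still has to be produced by whatever evaluates the queries.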

I'll start today and add some new alerts that will fire as critical if our measuring tools aren't working. Those queries will be easier and the output will look better and be more understandable in AlertManager.

For the other, more complicated queries, I want to get some feedback from the rest of my team. I'll do that on Monday.

I tried one of the new alerts and we got the email and the alert in IRC and Alert Manager (https://grafana.wikimedia.org/d/frWAt6PMz/synthetic-tool-alerts).

Yes I think the reason why the link isn't rendered is that alert "annotations" are given that treatment in Karma and not alert "labels", it seems Grafana sends only labels (named 'tags') and not arbitrary annotations

@fgiunchedi is that something you can fix so that they are linked and the alerts in the Alert Manager are readable?

In T272979#6843079, @Peter wrote:

I tried one of the new alerts and we got the email and the alert in IRC and Alert Manager (https://grafana.wikimedia.org/d/frWAt6PMz/synthetic-tool-alerts).

Yes I think the reason why the link isn't rendered is that alert "annotations" are given that treatment in Karma and not alert "labels", it seems Grafana sends only labels (named 'tags') and not arbitrary annotations

@fgiunchedi is that something you can fix so that they are linked and the alerts in the Alert Manager are readable?

Not AFAIK, it'd be nice though if Grafana could be instructed to send alertmanager annotations (not to be confused with grafana annotations) in addition to tags. I've just opened a Grafana feature request: https://github.com/grafana/grafana/issues/31345

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. Feb 23 2021, 6:10 AM

OK, so then there's nothing to do at the moment, right?

Then my plan is to start converting all our alerts tomorrow by adding the AlertManager notification channel and severity/team tags to all of them, and to tune the title/description to better reflect what is wrong.

And then I will let you know @fgiunchedi when I'm finished and you can verify.

I think one tool per day is a good pace :) Let me start with the WebPageTest alerts today.

Just ran into this blog post from Grafana that might be useful: https://grafana.com/blog/2021/02/24/you-should-know-about-transformations-in-grafana/

I haven't started to try out transformations yet, thanks.

Adding a couple of screenshots that we can talk about later today:

Screenshot 2021-02-25 at 09.52.37.png (238×2 px, 243 KB)

Screenshot 2021-02-25 at 10.11.46.png (682×2 px, 859 KB)

Here are my notes from the sync meeting with @fgiunchedi:

Make the name/subject of the alert descriptive, for example: First paint is slow on mobile (or something like that)
Add the tool name (that fired the alert) as a tag
Add links to dashboard(s) as tags and use the "message" part of Grafana to just have text.

Hi @fgiunchedi I have two questions:

Yesterday our mobile phone provider tests went down and in Grafana I can see that the alert started:

Screenshot 2021-03-08 at 09.14.44.png (600×1 px, 218 KB)

That looks good and the problem is still there (I reported it upstream). On the IRC channel I can see that the alerts keep firing but they are also resolved:

Screenshot 2021-03-08 at 09.01.24.png (964×1 px, 1 MB)

Does that mean that someone resolved it through the GUI at alerts.wikimedia.org?

Is it possible to configure the subject in the alert emails? On my phone it looks like this:

The subject starts with: [performance team] [FIRING 1] Andr ... or whatever could fit. I wonder if the email could start with the reason instead, so it's easy to just look at the subject and understand what's wrong?

In T272979#6891137, @Peter wrote:

Hi @fgiunchedi I have two questions:

Yesterday our mobile phone provider tests went down and in Grafana I can see that the alert started:

That looks good and the problem is still there (I reported it upstream). On the IRC channel I can see that the alerts keep firing but they are also resolved:

Does that mean that someone resolved it through the GUI at alerts.wikimedia.org?

I'm not sure why the alert would flap, can you post the timestamps as well? My suspicion now is that Grafana evaluates the alert every X, and thus at most sends alerts every X (in cases like this when the alert keeps firing). However AM requires clients to keep sending alerts that are firing (every 3m more or less IIRC), thus I think what's happening is that Grafana evaluates the alert, sends it, AM sees the alert and fires it, then some time passes and Grafana isn't sending the same alert again, AM considers the alert resolved. This repeats at the next Grafana evaluation cycle. The first test I suggest is to try with a shorter evaluation cycle for the alert!
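
To show why a long evaluation interval could look like flapping, here is a small Python simulation of that suspicion. It assumes the sender only re-sends the alert at each evaluation and that AM treats the alert as resolved once nothing has been received for `resolve_after` minutes. The 3 minute figure is just the recollection above, not a verified setting, and the function names are made up for this sketch.

```python
def alert_states(eval_interval, resolve_after, duration):
    """Per-minute state seen by AM for a check that fails the whole time.

    The sender re-sends the alert every `eval_interval` minutes; AM keeps it
    firing only while the last send is less than `resolve_after` minutes old.
    """
    states = []
    last_sent = None
    for minute in range(duration):
        if minute % eval_interval == 0:
            last_sent = minute  # evaluation runs and the alert is sent
        firing = last_sent is not None and minute - last_sent < resolve_after
        states.append("firing" if firing else "resolved")
    return states


def count_flaps(states):
    """Count transitions from firing to resolved."""
    return sum(
        1
        for before, after in zip(states, states[1:])
        if before == "firing" and after == "resolved"
    )
```

Under these assumptions, a 10 minute evaluation interval resolves the alert between every pair of evaluations, while an interval shorter than `resolve_after` keeps it firing continuously. That is the reasoning behind trying a shorter evaluation cycle first.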

Is it possible to configure the subject in the alert emails? On my phone it looks like this:

The subject starts with: [performance team] [FIRING 1] Andr ... or whatever could fit. I wonder if the email could start with the reason instead, so it's easy to just look at the subject and understand what's wrong?

I'll look into how to shorten the [FIRING 1] prefix; however, the [performance-team] prefix is a mailman setting AFAICT.

I've added AlertManager alerts for most of our alerts now. What's missing is the WebPageReplay alerts. Hopefully I will finish those tomorrow, and then I would love your input @fgiunchedi to see if there's something missing or something that should be changed. I'll ping you when I'm done.

Change 672346 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] icinga: remove Grafana alerts for Performance

https://gerrit.wikimedia.org/r/672346

gerritbot added a project: Patch-For-Review. Mar 15 2021, 9:10 AM

fgiunchedi updated the task description. Mar 15 2021, 9:35 AM

Change 672346 merged by Filippo Giunchedi:
[operations/puppet@production] icinga: remove Grafana alerts for Performance

https://gerrit.wikimedia.org/r/672346

Maintenance_bot removed a project: Patch-For-Review. Mar 15 2021, 11:11 AM

This is all completed now. Specifically, all Performance alerts that used to go through Icinga are now sent from Grafana itself to Alertmanager, show up on https://alerts.wikimedia.org, and have notifications routed accordingly.

fgiunchedi closed this task as Resolved. Mar 15 2021, 1:00 PM

fgiunchedi claimed this task.

fgiunchedi closed subtask T278210: Change repeat interval for performance team alerts as Resolved. Mar 24 2021, 7:37 AM

Change 674803 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] alertmanager: send recovery emails for performance

https://gerrit.wikimedia.org/r/674803

gerritbot added a project: Patch-For-Review. Mar 25 2021, 8:18 AM

Change 674803 merged by Filippo Giunchedi:
[operations/puppet@production] alertmanager: send recovery emails for performance

https://gerrit.wikimedia.org/r/674803

Maintenance_bot removed a project: Patch-For-Review. Mar 25 2021, 9:10 AM

Peter mentioned this in T278514: Wishlist for AlertManager alerts from Grafana. Mar 26 2021, 8:34 AM

Peter added a subtask: T278514: Wishlist for AlertManager alerts from Grafana.

fgiunchedi mentioned this in T281358: Move Performance Icinga alerts to AlertManager. Apr 28 2021, 12:34 PM

	F34144910: Screenshot 2021-03-08 at 09.01.24.png
	Mar 8 2021, 8:23 AM

	F34144903: Screenshot 2021-03-08 at 09.14.44.png
	Mar 8 2021, 8:23 AM

	F34121992: Screenshot 2021-02-25 at 09.52.37.png
	Feb 25 2021, 9:21 AM

	F34121993: Screenshot 2021-02-25 at 10.11.46.png
	Feb 25 2021, 9:21 AM

	F34100333: Screenshot 2021-02-11 at 09.55.42.png
	Feb 11 2021, 8:58 AM

	F34100322: Screenshot 2021-02-11 at 09.33.13.png
	Feb 11 2021, 8:37 AM

	F34100291: Screenshot 2021-02-11 at 08.47.17.png
	Feb 11 2021, 7:58 AM
