Migrate Zuul alerting to Grafana / AlertManager
Closed, Resolved (Public)

Description

We have an alarm firing when there are too many Gearman functions waiting. It relies on monitoring Graphite and is defined in Puppet via:

modules/zuul/manifests/monitoring/server.pp
monitoring::graphite_threshold{ 'zuul_gearman_wait_queue':
    ensure          => $ensure,
    description     => 'Work requests waiting in Zuul Gearman server',
    dashboard_links => ['https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1'],
    metric          => 'zuul.geard.queue.waiting',
    contact_group   => 'contint',
    from            => '10min',
    percentage      => 100,
    warning         => 90,
    critical        => 150,
    notes_link      => 'https://www.mediawiki.org/wiki/Continuous_integration/Zuul',
}
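One way to read the `percentage`, `warning` and `critical` parameters is that the check fires when at least `percentage` percent of the datapoints in the `from` window are above a threshold. The Python sketch below illustrates that reading only. It is an assumption inferred from the Icinga message format ("100.00% of data above the critical threshold"), not the actual check implementation, and the function name and sample values are hypothetical.

```python
from typing import Sequence


def graphite_threshold_state(
    datapoints: Sequence[float],
    warning: float,
    critical: float,
    percentage: float = 100.0,
) -> str:
    """Return 'OK', 'WARNING', 'CRITICAL' or 'UNKNOWN' for one window."""
    if not datapoints:
        # No data in the window: nothing can be concluded.
        return "UNKNOWN"

    def percent_above(threshold: float) -> float:
        above = sum(1 for value in datapoints if value > threshold)
        return 100.0 * above / len(datapoints)

    # Check the most severe level first.
    if percent_above(critical) >= percentage:
        return "CRITICAL"
    if percent_above(warning) >= percentage:
        return "WARNING"
    return "OK"


# Hypothetical samples of zuul.geard.queue.waiting over the last 10 minutes.
window = [160, 175, 190, 210, 230]
state = graphite_threshold_state(window, warning=90, critical=150)
```

With `percentage => 100`, a single datapoint dipping below the threshold is enough to keep the check from firing, which is why short spikes do not alert.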

The related Grafana dashboard is https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 and has an alert defined which is not necessarily consistent with what is defined in Puppet:

zuul-gearman-grafana-alert.png (520×767 px, 46 KB)

Guides from the Observability team:

https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts

If our team hasn't been onboarded to Alertmanager yet:

https://wikitech.wikimedia.org/wiki/Alertmanager#I'm_part_of_a_new_team_that_needs_onboarding_to_Alertmanager,_what_do_I_need_to_do
https://wikitech.wikimedia.org/wiki/Alertmanager#Onboard

Event Timeline

Change 725290 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: raise zuul queue alarm

https://gerrit.wikimedia.org/r/725290

Change 725294 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] alertmanager: add release engineering team

https://gerrit.wikimedia.org/r/725294

Change 725294 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: add release engineering team

https://gerrit.wikimedia.org/r/725294

Change 725290 merged by Giuseppe Lavagetto:

[operations/puppet@production] zuul: raise zuul queue alarm

https://gerrit.wikimedia.org/r/725290

hashar changed the task status from Open to Stalled. Oct 1 2021, 11:25 AM

The alert defined in Grafana should now be sent to Alertmanager and result in a notification on IRC. Once that is confirmed to work (and it should), we can drop the monitoring::graphite_threshold{ 'zuul_gearman_wait_queue': } resource from Puppet.

Marked stalled until confirmed.

hashar triaged this task as Medium priority.
hashar moved this task from INBOX to Doing on the Release-Engineering-Team board.

It is theoretically solved; I will reopen if it is not working.

Alertmanager hasn't issued a notification on IRC. Maybe it needs extra configuration to join our #wikimedia-releng channel, or the notification got swallowed / did not match in the routing system.

Context:

We had an alert on 2021-10-07 22:20:06 with Grafana flagging the waiting functions at 630 (above the 500 threshold).

I got an email from Icinga and IRC had:

22:27:34 <icinga-wm> PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1

The Grafana alarm should be tagged with team: releng.

hashar removed hashar as the assignee of this task. Edited Oct 11 2021, 11:14 AM
hashar added a project: observability.

+ observability since I could not find a trace of the alert in Alert Manager. I have asked the team over IRC.

The alert must be sent to "alertmanager", and must have the team and severity tags set as described in https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts
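As a rough illustration of why a missing tag makes an alert invisible to a team, the Python sketch below models label-based routing in a simplified way: an alert without both a `team` tag (such as the `team: releng` mentioned earlier) and a `severity` tag never matches a team route. The route table, receiver name and function are hypothetical and do not reflect the actual Alertmanager configuration.

```python
from typing import Dict, List

# Hypothetical routing table: team label -> receivers. The real routes are
# defined in the Alertmanager configuration, not here.
ROUTES: Dict[str, List[str]] = {
    "releng": ["irc:#wikimedia-releng"],
}

REQUIRED_LABELS = ("team", "severity")


def route_alert(labels: Dict[str, str]) -> List[str]:
    """Return the receivers for an alert, or an empty list if unroutable."""
    # In this simplified model, an alert missing a required label is dropped.
    if any(not labels.get(name) for name in REQUIRED_LABELS):
        return []
    return ROUTES.get(labels["team"], [])


untagged = route_alert({"alertname": "Queue alert"})
tagged = route_alert(
    {"alertname": "Queue alert", "team": "releng", "severity": "critical"}
)
```

Here `untagged` is empty while `tagged` reaches the hypothetical IRC receiver, which mirrors the symptom seen in this task: the Grafana alert fired but no notification reached the channel until the tags were set.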

2021-10-11-144624_568x262_scrot.png (262×568 px, 18 KB)

hashar changed the task status from Open to Stalled. Oct 11 2021, 1:08 PM

Set. We will see next time it triggers. Thank you.

BTW, I think this fired today in the releng IRC channel?

17:15:44	<jinxer-wm>	(Queue (Jenkins jobs + Zuul functions) alert) firing: Queue (Jenkins jobs + Zuul functions) alert   - https://alerts.wikimedia.org
17:25:43	<jinxer-wm>	(Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert   - https://alerts.wikimedia.org
17:45:43	<jinxer-wm>	(Queue (Jenkins jobs + Zuul functions) alert) resolved: Queue (Jenkins jobs + Zuul functions) alert   - https://alerts.wikimedia.org
hashar changed the task status from Stalled to Open. Oct 20 2021, 5:38 PM

Thank you @CDanis for noticing! Indeed it did trigger and again today:

16:56:40 <jinxer-wm> (Queue (Jenkins jobs + Zuul functions) alert) firing: Queue (Jenkins jobs + Zuul functions) alert   - https://alerts.wikimedia.org
17:01:40 <jinxer-wm> (Queue (Jenkins jobs + Zuul functions) alert) firing: (2) Queue (Jenkins jobs + Zuul functions) alert   - https://alerts.wikimedia.org

Not sure why the Icinga one hasn't shown up, but then it is defined slightly differently. I will look at improving the alert based on https://wikitech.wikimedia.org/wiki/Alertmanager#Create_alerts (notably adding a link back to the dashboard) and possibly adding an email notification.

Change 738381 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] alertmanager: send releng alerts to both irc and mail

https://gerrit.wikimedia.org/r/738381

Change 738381 merged by Btullis:

[operations/puppet@production] alertmanager: send releng alerts to both irc and mail

https://gerrit.wikimedia.org/r/738381

hashar added a subscriber: BTullis.

@BTullis gave me a few more explanations about the various notification mechanisms available, notably having different priorities trigger different kinds of alarms (filing a Phabricator task, SMS notifications via VictorOps, etc.). That is good to know for later.

For the scope of this task, IRC + email is good. We shall see later on whether we can add more alerting for various bits of our infra :]