Reformat IRC alerts to be more useful
Open, Needs Triage, Public

Description

14:58:15	<+jinxer-wm>	(JobrunnerPHPBusyWorkers) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
14:59:10	<akosiaris>	can't the "resolved" and firing be the first thing in those messages ^ and in caps ? 
14:59:28	<akosiaris>	it would make my IRC life a tad easier

Proposal:

14:58:15	<+jinxer-wm>	RESOLVED (JobrunnerPHPBusyWorkers): Not enough idle php7.4-fpm.service workers for Mediawiki jobrunner at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&viewPanel=54 - https://alerts.wikimedia.org/?q=alertname%3DJobrunnerPHPBusyWorkers
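
As a rough illustration of the proposed reordering, here is a minimal Python sketch that rewrites a message from the current layout into the state-first layout. It is not how the IRC relay formats messages; the regular expression and the `state_first` helper are hypothetical and assume every alert line looks like the example above.

```python
import re

# Assumed shape of the current layout: "(AlertName) state: rest of message".
# This pattern is inferred from the example lines in this task only.
_CURRENT = re.compile(
    r"^\((?P<name>[^)]+)\) (?P<state>firing|resolved): (?P<rest>.*)$"
)


def state_first(message: str) -> str:
    """Rewrite '(Name) state: rest' as 'STATE (Name): rest'.

    Lines that do not match the assumed layout are returned unchanged.
    """
    m = _CURRENT.match(message)
    if m is None:
        return message
    return f"{m['state'].upper()} ({m['name']}): {m['rest']}"


if __name__ == "__main__":
    print(state_first("(JobrunnerPHPBusyWorkers) resolved: Not enough idle workers"))
```

The alert name, summary and links are carried over untouched; only the state moves to the front and is upper-cased.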

Event Timeline

While considering this I'd also like to propose moving the (alert name) to the end of message at the same time. For example:

Original:

(ProbeDown) firing: Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - <links>

(ProbeDown) resolved: Service jobrunner:443 has failed probes (http_jobrunner_ip4) #page  - <links>

Proposed:

FIRING: Service jobrunner:443 has failed probes (http_jobrunner_ip4)(ProbeDown) #page - <links>

RESOLVED: Service jobrunner:443 has failed probes (http_jobrunner_ip4)(ProbeDown) #page - <links>
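
To make the summary-first ordering concrete, here is a small Python sketch that assembles a message from its parts. The `summary_first` function and its parameters are hypothetical names for illustration, not part of any existing tooling, and the sketch puts a space before the alert name, which is an assumption rather than something the examples above spell out.

```python
def summary_first(state, summary, alertname, tags=(), links="<links>"):
    """Build 'STATE: summary (alertname) tags - links'.

    All parameter names are illustrative; 'tags' is a sequence of
    hashtags such as '#page' that stay near the end of the message.
    """
    parts = [f"{state.upper()}: {summary} ({alertname})"]
    parts.extend(tags)
    return " ".join(parts) + f" - {links}"


if __name__ == "__main__":
    print(
        summary_first(
            "firing",
            "Service jobrunner:443 has failed probes (http_jobrunner_ip4)",
            "ProbeDown",
            tags=("#page",),
        )
    )
```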

I think beginning with a human-readable summary makes the message easier to read at a glance, and the config keywords would still be included at the end, along with links and hashtags.

I'm +1 on moving resolved/firing to the beginning and seeing what the feedback is.

Re: moving the alert name to the end, I don't think that will help readability, because IMO the alert name should already give operators a broad idea of what's happening at a glance.

I'll send out patches next week to move resolved/firing to the beginning.

Thanks for tackling this!

Regarding the proposal to move the alert name to the end: to me at least, it looks better. I prefer this order:

  1. state (failure/recovery)
  2. subject (what is failing/recovering)
  3. mode (how exactly is it failing/recovering)

It lets me assess the extent of the problem and its possible impact faster (e.g. whether or not the alert is about something that would impact end users).

Change #1019829 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: tweak irc alert message format

https://gerrit.wikimedia.org/r/1019829

Change #1019840 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] alertmanager: irc: move group name after summary and clarify count

https://gerrit.wikimedia.org/r/1019840

Change #1019844 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] alertmanager: irc: remove runbook and dashboard links from irc alerts

https://gerrit.wikimedia.org/r/1019844

My reasoning for keeping the alert group (i.e. the alert name) at the beginning is the following:

  • alerts.w.o is "keyed" to alert groups, not individual alerts
  • the optional number of alerts that are firing refers to the alert group as a whole, e.g. I find it confusing to have an alert count next to the individual alert (FIRING [4x] <summary> (<alert group>))
  • the alert group name already gives a broad indication of what's wrong
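
The count concern can be seen by rendering both layouts side by side. This is a hypothetical Python sketch: the function names and the exact group-first layout are assumptions, and only the second layout follows the `FIRING [4x] <summary> (<alert group>)` placeholder quoted above.

```python
def count_prefix(count):
    """Return '[Nx] ' when more than one alert in the group is firing."""
    return f"[{count}x] " if count > 1 else ""


def group_first(state, group, count, summary):
    # Count sits directly next to the group it describes (assumed layout).
    return f"{state.upper()}: {count_prefix(count)}({group}) {summary}"


def summary_first_with_count(state, group, count, summary):
    # Count sits next to a single alert's summary, although it counts
    # alerts in the whole group.
    return f"{state.upper()} {count_prefix(count)}{summary} ({group})"


if __name__ == "__main__":
    args = ("firing", "ProbeDown", 4, "Service jobrunner:443 has failed probes")
    print(group_first(*args))
    print(summary_first_with_count(*args))
```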

FWIW I think the current alert text makes sense on the premise that all alert recipients will/should know how the alerting system's internals are structured.

But overall I'm wondering a) whether that premise holds across the board, and b) if not, whether we could/should change the priority from presenting the group/key first to ordering from most to least specific info (e.g. <verb> <summary> <group>).

Change #1019829 abandoned by Filippo Giunchedi:

[operations/puppet@production] alertmanager: tweak irc alert message format

Reason:

Superseded by I39617c18921 and followups

https://gerrit.wikimedia.org/r/1019829