
Include #_page on host alerts that page SRE
Closed, Resolved (Public)

Description

As part of T228878: Reduce Icinga alert noise, we introduced #page to tag paging alerts as such. However, this applies only to Icinga services, not hosts; hosts should also carry #page where applicable.

The fix isn't as straightforward as it was for services (i.e. tweaking the description as needed) because hosts have no description. Using the host alias might work, though.

Event Timeline

Change 735695 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] icinga: use display_name for a HOST to add '#page' string where applicable

https://gerrit.wikimedia.org/r/735695

Change 735696 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] exported_nagios_host: add the display_name parameter

https://gerrit.wikimedia.org/r/735696

Change 735696 abandoned by Dzahn:

[operations/puppet@production] exported_nagios_host: add the display_name parameter

Reason:

merging into Iefb826e8005a93acc32

https://gerrit.wikimedia.org/r/735696

Legoktm renamed this task from Include #page on host alerts that page SRE to Include #_page on host alerts that page SRE. (Oct 29 2021, 9:08 PM)

On gerrit:735695, @herron said (saving here so discussion can continue off-gerrit):

AIUI display_name may be used in the CGI, but it would not be output in alerts since we don't currently use $HOSTDISPLAYNAME$ in notification commands.
Today we have e.g.:

command_line echo "$NOTIFICATIONTYPE$ - Host $HOSTALIAS$ is $HOSTSTATE$: $HOSTOUTPUT$ $HOSTACKAUTHOR$ $HOSTACKCOMMENT$" >> /var/log/icinga/irc.log

As-is, I think switching from display_name to alias would do the trick for inserting the page hashtag using the current notification commands, and since service commands use the $HOSTNAME$ macro, that shouldn't introduce the hashtag on non-critical service alerts on critical hosts.

An alternative to consider is using a combination of HOSTNOTES and HOSTNOTESURL along with updates to the notification commands, something like HOSTNOTES containing the hashtag string when critical. That would be more involved but would have the benefit of supporting links to runbooks and graphs for critical hosts.
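To illustrate the point about macros, here is a minimal Python sketch of why putting the tag in the alias reaches host notifications but not service notifications. It is not Icinga's implementation: the `expand` helper, the `MACROS` table, and the message templates are hypothetical stand-ins, and the fallback of alias to host_name is an assumption made for the illustration.

```python
import re

# Hypothetical host record; keys mirror the Icinga host directives discussed above.
host = {"host_name": "db1173", "alias": "db1173 #page"}

# Simplified stand-in for Icinga macro substitution. Only the two macros
# relevant to this discussion are modelled (assumption).
MACROS = {
    "HOSTNAME": lambda h: h["host_name"],
    # Assumption for this sketch: fall back to host_name when no alias is set.
    "HOSTALIAS": lambda h: h.get("alias", h["host_name"]),
}


def expand(template, h):
    """Replace known $MACRO$ tokens in template; leave unknown ones untouched."""
    def repl(match):
        resolver = MACROS.get(match.group(1))
        return resolver(h) if resolver else match.group(0)

    return re.sub(r"\$([A-Z]+)\$", repl, template)


# Host notifications use $HOSTALIAS$, so the tag shows up.
host_msg = expand("PROBLEM - Host $HOSTALIAS$ is DOWN", host)

# Service notifications use $HOSTNAME$, so the tag does not leak in.
service_msg = expand("PROBLEM - some_check on $HOSTNAME$ is WARNING", host)

print(host_msg)
print(service_msg)
```

With this model, only the host message carries the hashtag, which matches the reasoning that non-critical service alerts on critical hosts would stay untagged.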

Change 735695 abandoned by Dzahn:

[operations/puppet@production] icinga: use display_name for a HOST to add 'page' string where applicable

Reason:

https://phabricator.wikimedia.org/T236379#7517716

https://gerrit.wikimedia.org/r/735695

We tried to fix this with the alias parameter: https://gerrit.wikimedia.org/r/c/operations/puppet/+/799903
But we had to revert it because the alias is exported into PuppetDB as a list:

[...SNIP...]
"parameters":
    {
        "alias": ["db1173 #page"],
        "ensure": "present",
[...SNIP...]

Because alias is also a Puppet metaparameter, and so almost a reserved word for parameters, it's not clear whether this is unexpected behaviour that we should not rely on, or whether we should just "fix" it by tweaking naggen2 during the Icinga config generation.
Adding @jbond, who helped me with the change.
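For reference, the "fix it in the generator" idea could look roughly like the following Python sketch, which accepts the alias either as a plain string or as the single-element list seen above. The function name `normalize_alias` is hypothetical and this is not code from naggen2; taking the first list element is an arbitrary choice made for the illustration.

```python
def normalize_alias(value):
    """Return the alias as one string, or None if there is nothing usable.

    Accepts a plain string or a list/tuple of strings, as seen in the
    exported resource above.
    """
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        # Arbitrary choice for this sketch: use the first entry, if any.
        return str(value[0]) if value else None
    return str(value)


print(normalize_alias(["db1173 #page"]))
print(normalize_alias("db1173 #page"))
```

The downside raised above still applies: this relies on how Puppet happens to export a metaparameter-like name, which may not be stable behaviour.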

One option that comes to mind is to do the whole change in naggen2: if the "paging" contact is present, add the alias with the #page hashtag.
Others seem a bit more convoluted, like using the display_name parameter, which AFAICT is not exposed as a macro, and then, if that's set, having naggen2 add an alias with the same content.
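A rough Python sketch of the first option, under stated assumptions: the contact name `paging`, the `contact_groups` key, and the function `inject_page_alias` are all hypothetical placeholders, not the actual naggen2 code or the real contact configuration. The idea is only that the generator derives the alias itself instead of receiving it from Puppet.

```python
PAGING_CONTACT = "paging"  # hypothetical name for the contact that pages SRE
PAGE_TAG = "#page"


def inject_page_alias(resource_type, params):
    """Return a copy of params, adding a tagged alias for paging hosts.

    resource_type: "host" or "service" (assumed labels for this sketch).
    params: dict of Icinga directives; contact_groups is assumed to be a
    comma-separated string.
    """
    out = dict(params)
    # Services are left untouched; only host definitions get an alias.
    if resource_type != "host":
        return out
    groups = [g.strip() for g in params.get("contact_groups", "").split(",")]
    if PAGING_CONTACT in groups:
        out["alias"] = f"{params['host_name']} {PAGE_TAG}"
    return out


paging_host = {"host_name": "db1173", "contact_groups": "admins,paging"}
print(inject_page_alias("host", paging_host))
```

Because the alias is computed at config generation time, nothing named alias has to travel through PuppetDB, which sidesteps the metaparameter question entirely.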

Change 801388 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] naggen2: inject # page alias for critical hosts

https://gerrit.wikimedia.org/r/801388

One option that comes to mind is to do the whole change in naggen2: if the "paging" contact is present, add the alias with the #page hashtag.

This seems like a reasonable approach to me; I have created a quick CR.

Change 801388 merged by Jbond:

[operations/puppet@production] naggen2: inject # page alias for critical hosts

https://gerrit.wikimedia.org/r/801388

Change 801648 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] naggen: only apply alias injection for hosts

https://gerrit.wikimedia.org/r/801648

Change 801648 merged by Jbond:

[operations/puppet@production] naggen: only apply alias injection for hosts

https://gerrit.wikimedia.org/r/801648

fgiunchedi claimed this task.

This has been implemented! Thank you @Volans and @jbond