
blackbox prober alerts should be more user-friendly & application-oriented
Open, Medium, Public

Description

Currently, an example blackbox prober failure looks something like this:

23:22:18 <+jinxer-wm> (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown

  • All the links -- for both notes and graphs -- link to documentation about the network prober service itself, not to anything about the specific probe that failed for the specific application.
  • This suggests the responder should be debugging the prober rather than the service being tested, which is almost never the case for these kinds of alerts.
    • Those links should probably be reserved for whatever other 'internal' alerts the blackbox prober service has regarding its own health.
  • I would argue that the blackbox prober is fundamentally something provided 'as a service' to the rest of the team, i.e. something that supports multi-tenancy, but the current implementation doesn't reflect this.
    • It would be a lot friendlier to service owners and oncall responders if the notes links and the graph links were customizable for each probe. For the example alert above, I would expect a link to the notes page about appservers and to the Application Servers RED dashboard in Grafana.
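
To make the request concrete, here is a minimal sketch of per-probe link overrides with a fallback to the current generic links. It is not how the prober or the alert rules are actually implemented: `ProbeLinks`, `PROBE_LINKS`, and `alert_links` are hypothetical names, and the `example.org` URLs are placeholders standing in for the appservers notes page and the Application Servers RED dashboard.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Generic links, copied from the example alert above.
GENERIC_RUNBOOK = "https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown"
GENERIC_DASHBOARD = (
    "https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview"
    "?var-job=probes/service&var-module=All"
)


@dataclass(frozen=True)
class ProbeLinks:
    """Optional per-probe overrides (hypothetical structure)."""
    runbook: Optional[str] = None
    dashboard: Optional[str] = None


# Hypothetical overrides keyed by service name; URLs are placeholders.
PROBE_LINKS: Dict[str, ProbeLinks] = {
    "api-https": ProbeLinks(
        runbook="https://example.org/wiki/Application_servers",
        dashboard="https://example.org/d/appservers-red",
    ),
}


def alert_links(service: str) -> Tuple[str, str]:
    """Return (runbook, dashboard) for a service, falling back to the generic links."""
    override = PROBE_LINKS.get(service, ProbeLinks())
    return (
        override.runbook or GENERIC_RUNBOOK,
        override.dashboard or GENERIC_DASHBOARD,
    )
```

With this shape, a probe without an override keeps today's behaviour, so service owners could opt in one probe at a time.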

Event Timeline

@CDanis Thank you for the task and notes; agreed, there is room for improvement. I will discuss this with the rest of the team and propose some options for the task.

As one last note: it doesn't look like there's any documentation on Wikitech about how to create a new probe? Or at least, I couldn't find it with a quick search.

Thank you for the feedback, @CDanis! Reporting back from a discussion at the o11y team meeting: as a middle ground between implementation effort and user-friendliness, we were thinking of changing the runbook link to a generic one, for example https://wikitech.wikimedia.org/wiki/Runbooks#<service name>.

Users can then create entries in the Runbooks page at the correct anchor. The service section would explain what to do and add context, service-specific dashboards, and possibly other links based on the alert (we could also link to service name + alert name to get even more specific anchors). What do you think?
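
The anchor scheme described above could be sketched roughly as follows. This is an illustration only: `runbook_url` is a hypothetical helper, and both the `_` separator between service name and alert name and the percent-encoding of the anchor are assumptions, since the comment does not specify either.

```python
from typing import Optional
from urllib.parse import quote

# Base page taken from the proposal above.
RUNBOOKS_PAGE = "https://wikitech.wikimedia.org/wiki/Runbooks"


def runbook_url(service: str, alertname: Optional[str] = None) -> str:
    """Build a Runbooks link anchored at the service, optionally narrowed by alert name."""
    # Assumption: service and alert name are joined with an underscore.
    anchor = service if alertname is None else f"{service}_{alertname}"
    # Assumption: characters that are unsafe in a URL fragment are percent-encoded.
    return f"{RUNBOOKS_PAGE}#{quote(anchor, safe='')}"


# Service-only anchor, then the more specific service + alert anchor.
print(runbook_url("api-https"))
print(runbook_url("api-https", "ProbeDown"))
```

Service owners would then only need to create a section on the Runbooks page whose anchor matches the generated fragment.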

> As one last note: it doesn't look like there's any documentation on Wikitech about how to create a new probe? Or at least, I couldn't find it with a quick search.

Agreed, I'll add Wikitech documentation for both the probes in service::catalog and the puppet-based checks, and follow up here!

That sounds like a good start, thanks @fgiunchedi!

Change 816135 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: update blackbox check alerts runbook link

https://gerrit.wikimedia.org/r/816135

Change 816136 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: link to service-specific Runbook wikitech page

https://gerrit.wikimedia.org/r/816136

> As one last note: it doesn't look like there's any documentation on Wikitech about how to create a new probe? Or at least, I couldn't find it with a quick search.

> Agreed, I'll add Wikitech documentation for both the probes in service::catalog and the puppet-based checks, and follow up here!

A sketch of the documentation (with pointers to the puppet docs) is now live at https://wikitech.wikimedia.org/wiki/Prometheus#Network_probes_(blackbox_exporter). Let me know what you think!

Change 816136 merged by Filippo Giunchedi:

[operations/alerts@master] sre: link to service-specific Runbook wikitech page

https://gerrit.wikimedia.org/r/816136

Change 816135 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: update blackbox check alerts runbook link

https://gerrit.wikimedia.org/r/816135

lmata triaged this task as Medium priority. Sep 13 2022, 1:35 AM