Currently, an example alert from the blackbox prober looks something like this:
23:22:18 <+jinxer-wm> (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
- All the links -- for both notes and graphs -- point to documentation about the network prober service itself, not to anything about the specific probe that failed or the specific application it was probing.
- This suggests the responder should be debugging the prober rather than the service being tested, which is almost never the right call for these kinds of alerts.
- Those links should probably be reserved for whatever 'internal' alerts the blackbox prober service has about its own health.
- I would argue that the blackbox prober is fundamentally something provided 'as a service' to the rest of the team, i.e. something that supports multi-tenancy, but the current implementation doesn't reflect this.
- It would be a lot friendlier to service owners and oncall responders if the notes and graph links were customizable per probe. For the example alert above, I would expect a link to the notes page about the appservers and to the Application Servers RED dashboard in Grafana; a rough sketch of what that could look like follows this list.
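
As a minimal sketch of the idea, assuming probes are defined as Prometheus file_sd targets and the ProbeDown rule templates its annotations from target labels (the `runbook` and `dashboard` label names and the placeholder URLs below are illustrative, not existing configuration):

```yaml
# Hypothetical per-probe target definition, e.g. in a file_sd targets file:
# each probe carries its own documentation and dashboard links as labels.
- targets:
    - 'api-https:443'
  labels:
    runbook: 'https://wikitech.wikimedia.org/wiki/<appserver-notes-page>'
    dashboard: 'https://grafana.wikimedia.org/d/<appserver-RED-dashboard>'
---
# The ProbeDown rule then templates its annotations from those labels
# instead of hard-coding links about the prober itself.
groups:
  - name: probes
    rules:
      - alert: ProbeDown
        expr: probe_success == 0
        for: 5m
        annotations:
          summary: 'Service {{ $labels.instance }} has failed probes ({{ $labels.module }})'
          runbook: '{{ $labels.runbook }}'
          dashboard: '{{ $labels.dashboard }}'
```

The labels would need to survive relabeling so they end up on probe_success, but the point is just that the links become per-probe data owned by each service, rather than properties of the prober.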