The current (Oct 2022) thinking for meta-monitoring (i.e. tools that check the monitoring tools themselves) is to evaluate and then select an external vendor. To ease the selection, avoid too much lock-in, and keep things simple, I (Filippo) have suggested the following:
- In this context, "external monitoring" means meta-monitoring of the functionality/availability of our alerting and metrics systems (i.e. icinga, alertmanager, prometheus, graphite).
- We want to keep the external "surface" to be checked to a minimum; to this end we'll expose one or more endpoints for the vendor to check.
- Said endpoints are deployed to production and contain the logic to perform sanity/availability checking (e.g. reach out to icinga, etc; see the sketch after this list). This way we sidestep the whole issue of deploying code/complex logic outside of production, and the vendor's logic is limited to an HTTP request.
- The vendor checks said endpoints and alerts SRE if something goes wrong (e.g. error status, unreachability).
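To make the "logic lives in production" point concrete, here's a minimal sketch of such an endpoint in stdlib-only Python. The internal hostnames are hypothetical placeholders (prometheus and alertmanager do expose /-/healthy; the icinga URL stands in for whatever sanity check makes sense there):

```python
# Minimal sketch of the production meta-monitoring endpoint.
# Internal URLs are placeholders to be replaced with the real ones.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CHECKS = {
    "prometheus": "http://prometheus.internal:9090/-/healthy",
    "alertmanager": "http://alertmanager.internal:9093/-/healthy",
    "icinga": "http://icinga.internal/icinga/health",  # placeholder check
}

def run_checks():
    """Probe each internal system; a check passes on any HTTP 2xx."""
    results = {}
    for name, url in CHECKS.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                results[name] = 200 <= resp.status < 300
        except Exception:
            results[name] = False
    return results

class MetaMonitorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        results = run_checks()
        body = "\n".join(
            f"{name}: {'OK' if ok else 'FAIL'}" for name, ok in results.items()
        ).encode()
        # All the vendor needs to interpret is the status code:
        # 200 when everything is healthy, 503 otherwise.
        self.send_response(200 if all(results.values()) else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), MetaMonitorHandler).serve_forever()
```

The vendor only ever sees the status code, so all the interesting checking logic stays inside production where we can deploy and review it normally.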
All that said, the vendor must at least be able to:
- Perform HTTP requests from one or multiple locations
- Alert if said requests fail (e.g. one or two failures in a row, and/or failures from multiple locations)
- Send alerts to Splunk Oncall using its API (see the sketch after this list)
- Provide an authenticated API to downtime/silence alerts; we'll use this API to silence alerts during expected maintenance windows
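For reference, "send alerts to Splunk Oncall using its API" boils down to a single POST against the generic REST integration endpoint. The sketch below assumes that endpoint's URL format and field names, which should be double-checked against the current Splunk On-Call docs; the API key, routing key, and entity id are placeholders:

```python
# Sketch of posting an alert to the Splunk On-Call (VictorOps) generic
# REST integration. URL format and fields should be verified against
# the current vendor documentation.
import json
import urllib.request

ONCALL_URL = ("https://alert.victorops.com/integrations/generic/"
              "20131114/alert/{api_key}/{routing_key}")

def send_alert(api_key, routing_key, entity_id, message,
               message_type="CRITICAL"):
    """POST a JSON alert; message_type is one of CRITICAL, WARNING,
    INFO, ACKNOWLEDGEMENT, RECOVERY."""
    payload = {
        "message_type": message_type,
        "entity_id": entity_id,  # correlation/dedup key for the incident
        "state_message": message,
    }
    req = urllib.request.Request(
        ONCALL_URL.format(api_key=api_key, routing_key=routing_key),
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises on HTTP error statuses, so failures are loud.
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()

# Placeholder values for illustration only.
send_alert("EXAMPLE-KEY", "sre-team", "meta-monitoring/endpoint",
           "External check of the meta-monitoring endpoint failed")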
Those are the ideal requirements in my (Filippo's) opinion; however, we could get away with an even smaller set:
- Perform HTTP requests from one (or ideally multiple) locations
- Alert if said requests fail (e.g. one or two failures in a row, and/or multiple locations failing at the same time)
- Send alerts to Splunk Oncall via email instead of the API (email would be acceptable, though less preferred)
- We'll handle silencing by setting the related "routing key" to maintenance mode via the Splunk Oncall API instead (see the sketch below)
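A sketch of what that silencing could look like, under the assumption that the public maintenance-mode endpoint accepts a list of routing key names; the path, payload shape, and credentials are all assumptions to verify against the Splunk On-Call API docs:

```python
# Sketch: put a routing key into maintenance mode so its alerts don't
# page during an expected maintenance window. Endpoint path and payload
# are assumptions; credentials are placeholders.
import json
import urllib.request

API_URL = "https://api.victorops.com/api-public/v1/maintenancemode/start"

def start_maintenance(api_id, api_key, routing_key):
    """Start maintenance mode for a single routing key."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"names": [routing_key]}).encode(),
        headers={
            "Content-Type": "application/json",
            "X-VO-Api-Id": api_id,    # placeholder credential
            "X-VO-Api-Key": api_key,  # placeholder credential
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()

start_maintenance("EXAMPLE-ID", "EXAMPLE-KEY", "sre-team")
```

This keeps the silencing workflow entirely on our side of the fence: the vendor keeps checking and alerting as usual, and Splunk Oncall simply doesn't page while the routing key is in maintenance mode.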