The current (Oct 2022) thinking for external monitoring is to evaluate and then select an external vendor. To ease the selection, avoid vendor lock-in, and keep things simple, I (Filippo) have suggested the following:
* In this context "external monitoring" means meta-monitoring of the functionality/availability of our alerting and metrics systems (i.e. icinga, alertmanager, prometheus, graphite).
* We want to keep the external "surface" to be checked to a minimum; to this end we'll expose one or more endpoints to be checked.
* Said endpoints are deployed to production and contain the logic to perform sanity/availability checks (e.g. reach out to icinga, etc.). This way we sidestep the whole issue of deploying code/complex logic outside of production, and the vendor's logic is limited to an HTTP request.
* The vendor checks said endpoints and alerts SRE if something goes wrong (e.g. error status, unreachability).
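The endpoint logic described above can be sketched roughly as follows. This is a minimal illustration only, assuming the real sanity checks (icinga, alertmanager, etc.) are plugged in as callables; the check names and port are placeholders, not decisions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_icinga():
    # Placeholder: in production this would query the local icinga API
    # (or whatever sanity check we decide on for each system).
    return True


CHECKS = {"icinga": check_icinga}


def run_checks(checks):
    """Run every check; return (http_status, json_body) for the vendor to poll."""
    results = {name: bool(fn()) for name, fn in checks.items()}
    status = 200 if all(results.values()) else 500
    return status, json.dumps(results)


class MetaHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = run_checks(CHECKS)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())


# To serve (port is arbitrary here):
# HTTPServer(("", 8080), MetaHealthHandler).serve_forever()
```

The vendor then only needs to know the URL and alert on non-200 responses or timeouts; all the interesting logic stays in production.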
All that said, the vendor must be able to (at least):
* [ ] Perform HTTP requests from one/multiple locations
* [ ] Alert if said requests fail (e.g. one/two failures in a row, and/or failures from multiple locations)
* [ ] Be able to send alerts to Splunk Oncall using the API
* [ ] Provide an authenticated API to downtime/silence alerts. We'll use this API to silence alerts during expected maintenance periods
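For the Splunk Oncall requirement, a sketch of what sending an alert via the API looks like. The URL pattern and payload fields below follow the Splunk On-Call (formerly VictorOps) generic REST integration as I understand it; verify them against the current vendor documentation, and note the API key and routing key are placeholders.

```python
import json
import urllib.request

# Generic REST integration URL pattern (per Splunk On-Call / VictorOps docs;
# double-check before relying on it). api_key/routing_key are placeholders.
ALERT_URL = (
    "https://alert.victorops.com/integrations/generic/"
    "20131114/alert/{api_key}/{routing_key}"
)


def build_alert(entity_id, message, message_type="CRITICAL"):
    """Build the JSON payload for a triggered alert."""
    return {
        "message_type": message_type,  # CRITICAL / WARNING / INFO / RECOVERY
        "entity_id": entity_id,        # stable ID so recoveries match triggers
        "state_message": message,
    }


def send_alert(api_key, routing_key, payload):
    req = urllib.request.Request(
        ALERT_URL.format(api_key=api_key, routing_key=routing_key),
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

A vendor that can POST arbitrary JSON on check failure could hit this endpoint directly, which is why API support is preferred over email.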
Those are the ideal requirements in my (Filippo's) opinion; however, we could get away with an even smaller set:
* [ ] Perform HTTP requests from one (or ideally multiple) locations
* [ ] Alert if said requests fail (e.g. one/two failures in a row, and/or multiple locations failing at the same time)
* [ ] Send alerts to Splunk Oncall via its email integration instead of the API (emails would be acceptable but less preferred)
* [ ] We'll handle silencing ourselves by setting the related "routing key" to maintenance mode via the Splunk Oncall API instead.
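The maintenance-mode fallback can be sketched as below. The endpoint path, header names, and payload shape are my reading of the public Splunk On-Call API and should be verified against the current API reference; the credentials and routing key are placeholders.

```python
import json
import urllib.request

# Public Splunk On-Call API base (verify against current vendor docs).
API_BASE = "https://api.victorops.com/api-public/v1"


def maintenance_request(api_id, api_key, routing_key):
    """Build the request that starts maintenance mode for one routing key.

    Payload shape is an assumption based on the public API docs; confirm
    the exact field names before using this in anger.
    """
    body = {"type": "RoutingKeys", "names": [routing_key]}
    return urllib.request.Request(
        f"{API_BASE}/maintenancemode/start",
        data=json.dumps(body).encode(),
        headers={
            "X-VO-Api-Id": api_id,
            "X-VO-Api-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_maintenance(api_id, api_key, routing_key):
    return urllib.request.urlopen(maintenance_request(api_id, api_key, routing_key))
```

This keeps silencing entirely on our side: before planned maintenance we flip the routing key into maintenance mode, so vendor-side downtiming support becomes optional.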