The current (Oct 2022) thinking for external monitoring is to evaluate and then select an external vendor. To ease the selection, avoid vendor lock-in, and keep things simple, I (Filippo) have suggested the following:
* In this context "external monitoring" means meta-monitoring of the functionality/availability of our alerting and metrics systems (i.e. icinga, alertmanager, prometheus, graphite).
* We want to keep the external "surface" to be checked to a minimum; to this end we'll expose one or more endpoints to be checked.
* Said endpoints are deployed to production and contain the logic to perform sanity/availability checks (e.g. reach out to icinga, etc.). This way we sidestep the whole issue of deploying code/complex logic outside of production, and the vendor's logic is limited to an HTTP request.
* The vendor checks said endpoints and alerts SRE if something goes wrong (e.g. error status, unreachability).
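The endpoint logic described above can be sketched roughly as follows. This is a minimal illustration only, assuming the real sanity checks (icinga, alertmanager, etc.) are plugged in as callables; the check names and port are placeholders, not decisions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_icinga():
    # Placeholder: in production this would query the local icinga API
    # (or whatever sanity check we decide on for each system).
    return True


CHECKS = {"icinga": check_icinga}


def run_checks(checks):
    """Run every check; return (http_status, json_body) for the vendor to poll."""
    results = {name: bool(fn()) for name, fn in checks.items()}
    status = 200 if all(results.values()) else 500
    return status, json.dumps(results)


class MetaHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = run_checks(CHECKS)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode())


# To serve (port is arbitrary here):
# HTTPServer(("", 8080), MetaHealthHandler).serve_forever()
```

The vendor then only needs to know the URL and alert on non-200 responses or timeouts; all the interesting logic stays in production.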
All that said, the vendor must be able to (at least):
* [ ] Perform HTTP requests from one/multiple locations
* [ ] Alert if said requests fail (e.g. one/two failures in a row, and/or failures from multiple locations)
* [ ] Be able to send alerts to Splunk Oncall using the API
* [ ] Provide an authenticated API to downtime/silence alerts. We'll use this API to silence alerts during expected maintenance periods
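For the Splunk Oncall requirement, a sketch of what sending an alert via the API looks like. The URL pattern and payload fields below follow the Splunk On-Call (formerly VictorOps) generic REST integration as I understand it; verify them against the current vendor documentation, and note the API key and routing key are placeholders.

```python
import json
import urllib.request

# Generic REST integration URL pattern (per Splunk On-Call / VictorOps docs;
# double-check before relying on it). api_key/routing_key are placeholders.
ALERT_URL = (
    "https://alert.victorops.com/integrations/generic/"
    "20131114/alert/{api_key}/{routing_key}"
)


def build_alert(entity_id, message, message_type="CRITICAL"):
    """Build the JSON payload for a triggered alert."""
    return {
        "message_type": message_type,  # CRITICAL / WARNING / INFO / RECOVERY
        "entity_id": entity_id,        # stable ID so recoveries match triggers
        "state_message": message,
    }


def send_alert(api_key, routing_key, payload):
    req = urllib.request.Request(
        ALERT_URL.format(api_key=api_key, routing_key=routing_key),
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

A vendor that can POST arbitrary JSON on check failure could hit this endpoint directly, which is why API support is preferred over email.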
Those are the ideal requirements in my (Filippo's) opinion; however, we could get away with an even smaller set:
* [ ] Perform HTTP requests from one (or ideally multiple) locations
* [ ] Alert if said requests fail (e.g. one/two failures in a row, and/or multiple locations failing at the same time)
* [ ] Send alerts to Splunk Oncall via its email integration instead of the API (emails would be acceptable but less preferred)
* [ ] We'll handle silencing ourselves by setting the related "routing key" to maintenance mode via the Splunk Oncall API instead.
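The maintenance-mode fallback can be sketched as below. The endpoint path, header names, and payload shape are my reading of the public Splunk On-Call API and should be verified against the current API reference; the credentials and routing key are placeholders.

```python
import json
import urllib.request

# Public Splunk On-Call API base (verify against current vendor docs).
API_BASE = "https://api.victorops.com/api-public/v1"


def maintenance_request(api_id, api_key, routing_key):
    """Build the request that starts maintenance mode for one routing key.

    Payload shape is an assumption based on the public API docs; confirm
    the exact field names before using this in anger.
    """
    body = {"type": "RoutingKeys", "names": [routing_key]}
    return urllib.request.Request(
        f"{API_BASE}/maintenancemode/start",
        data=json.dumps(body).encode(),
        headers={
            "X-VO-Api-Id": api_id,
            "X-VO-Api-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def start_maintenance(api_id, api_key, routing_key):
    return urllib.request.urlopen(maintenance_request(api_id, api_key, routing_key))
```

This keeps silencing entirely on our side: before planned maintenance we flip the routing key into maintenance mode, so vendor-side downtiming support becomes optional.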