Audit/Assess meta monitoring strategy
Open, Medium, Public


The current (Oct 2022) thinking for meta-monitoring (tools that watch over our monitoring tools) is to evaluate and then select an external vendor. To ease the selection, avoid too much lock-in, and keep things simple, I (Filippo) have suggested the following:

  • In this context, "external monitoring" is considered meta-monitoring of functionality/availability of alerting and metrics systems (i.e. icinga, alertmanager, prometheus, graphite).
  • We want to keep the external "surface" to be checked to a minimum. To this end, we'll expose one or more endpoints to be checked.
  • Said endpoints are deployed to production and contain the logic to perform sanity/availability checking (e.g. reach out to icinga, etc). This way we sidestep the whole issue of deploying code/complex logic outside of production, and the vendor logic is limited to an HTTP request.
  • The vendor checks said endpoints and alerts SRE if something goes wrong (e.g. error status, unreachability)
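
As an illustration of the endpoint idea above, here is a minimal sketch of what such a production-side check could look like, using only the Python standard library. The target URLs, the port, and the names `TARGETS`, `probe`, `evaluate` and `MetaMonitorHandler` are all hypothetical placeholders, not existing services or code; the real checks would likely need more than "does the URL answer with 2xx".

```python
import http.client
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical internal health URLs (placeholders, not real endpoints).
TARGETS = {
    "icinga": "http://icinga.example.internal/health",
    "alertmanager": "http://alertmanager.example.internal/-/healthy",
    "prometheus": "http://prometheus.example.internal/-/healthy",
}


def probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, HTTP error statuses, bad replies.
        return False


def evaluate(results):
    """Map per-system results to an HTTP status code and a JSON body."""
    healthy = all(results.values())
    status = 200 if healthy else 503
    body = json.dumps({"healthy": healthy, "checks": results}, sort_keys=True)
    return status, body


class MetaMonitorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # All checking logic stays in production; the vendor only sees the status.
        results = {name: probe(url) for name, url in TARGETS.items()}
        status, body = evaluate(results)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), MetaMonitorHandler).serve_forever()
```

With this shape the vendor only needs to treat any non-200 response (or no response at all) as a failure, which keeps the vendor-side configuration down to a single HTTP check per endpoint.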

All that said, the vendor must be able to (at least):

  • Perform HTTP requests from one/multiple locations
  • Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail)
  • Be able to send alerts to Splunk Oncall using the API
  • Provide an authenticated API to downtime/silence alerts. We'll use this API to silence alerts during expected maintenance periods

Those are the ideal requirements in my (Filippo's) opinion; however, we could get away with an even smaller set of requirements:

  • Perform HTTP requests from one (or ideally multiple) locations
  • Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail at the same time)
  • Send alerts to Splunk Oncall using the API (emails would be acceptable but less preferred)
  • We'll handle silencing by setting the related "routing key" to maintenance mode via the Splunk Oncall API instead
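
To make the "one/two in a row, and/or multiple locations" rule concrete, here is a rough sketch of the decision a vendor would be making on its side. This is not any vendor's actual implementation; the class name `FailureTracker`, the location labels, and the default thresholds are assumptions chosen for illustration only.

```python
from collections import defaultdict, deque


class FailureTracker:
    """Track recent check results per location and decide when to alert.

    Assumed rule: a location counts as failing once its last `consecutive`
    checks have all failed; alert when at least `min_locations` locations
    are failing at the same time.
    """

    def __init__(self, consecutive=2, min_locations=2):
        self.consecutive = consecutive
        self.min_locations = min_locations
        # Keep only the most recent `consecutive` results for each location.
        self._history = defaultdict(lambda: deque(maxlen=consecutive))

    def record(self, location, ok):
        """Store the outcome (True = success) of one check from a location."""
        self._history[location].append(ok)

    def failing_locations(self):
        """Return the sorted locations whose recent checks all failed."""
        return sorted(
            loc
            for loc, hist in self._history.items()
            if len(hist) == self.consecutive and not any(hist)
        )

    def should_alert(self):
        return len(self.failing_locations()) >= self.min_locations


# Usage: two locations, alert only once both have failed twice in a row.
tracker = FailureTracker(consecutive=2, min_locations=2)
for location, ok in [("eu", False), ("eu", False), ("us", False), ("us", False)]:
    tracker.record(location, ok)
print(tracker.should_alert())
```

Setting `min_locations=1` corresponds to the single-location case in the reduced requirement set, at the cost of being more sensitive to problems local to that one probe location.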

Event Timeline

lmata triaged this task as Low priority. Sep 30 2021, 9:48 PM
lmata moved this task from Up next to In progress on the SRE Observability (FY2022/2023-Q2) board.

I'm going to draft out my thought process for this task; the plan is to compose a bit of research into prior and new options.

We already have base requirements outlined in the description. This category of products falls under synthetic monitoring. Although product features vary from tool to tool, I feel there are a few general levels of product offerings that help break down vendors:

  • uptime/availability monitoring (pingdom/), which will do simple web requests: low cost, SaaS, globally distributed (within reason)
  • enterprise-grade, full-featured synthetic monitoring tools (New Relic, Dynatrace, Datadog), which will support regex, simple conditionals, and possibly JS injection
  • network monitoring tools with latency monitoring (Catchpoint, Kentik) will support network segment monitoring, last-mile latency, and some level of web request testing

Please look at the following for introductory material...



I am also weighing options on the following additional factors, which are good-to-have rather than hard requirements:

  • Uptime, availability
  • Pricing
  • Implementation overhead (time/materials)
  • Integration with existing tools (Splunk/email/webhooks)
  • Security* (authentication/SSO)
lmata lowered the priority of this task from High to Medium. Oct 25 2022, 2:27 AM
lmata renamed this task from Audit/Assess external monitoring strategy to Audit/Assess meta monitoring strategy. Oct 26 2022, 2:11 PM
lmata updated the task description.
lmata removed lmata as the assignee of this task. Nov 17 2023, 6:51 PM