
Audit/Assess meta monitoring strategy
Open, Medium, Public

Description

The current (Oct 2022) thinking for meta-monitoring (tools that check the monitoring tools themselves) is to evaluate and then select an external vendor. To ease the selection, avoid too much lock-in, and keep things simple, I (Filippo) have suggested the following:

  • In this context, "external monitoring" is considered meta-monitoring of functionality/availability of alerting and metrics systems (i.e. icinga, alertmanager, prometheus, graphite).
  • We want to keep the externally checked "surface" to a minimum; to this end we'll expose one or more endpoints for the vendor to check.
  • Said endpoints are deployed to production and contain the logic to perform sanity/availability checking (e.g. reach out to icinga, etc.). This way we sidestep the whole issue of deploying code/complex logic outside of production, and the vendor's side is limited to an HTTP request (a minimal sketch of such an endpoint follows this list).
  • The vendor checks said endpoints and alerts SRE if something goes wrong (e.g. error status, unreachability)
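
A minimal sketch of what such a production endpoint could look like, assuming a small Python service. The internal URLs below are placeholders (hypothetical prometheus/alertmanager health endpoints); the real checks would cover whichever alerting/metrics components we decide matter:

```
"""Meta-monitoring endpoint sketch: run internal sanity checks and
expose the combined result as a single HTTP status for the vendor."""
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical internal targets; the real endpoint would check whatever
# components we care about (icinga, alertmanager, prometheus, graphite, ...).
CHECKS = {
    "prometheus": "http://prometheus.example.internal:9090/-/healthy",
    "alertmanager": "http://alertmanager.example.internal:9093/-/healthy",
}

def run_checks() -> dict:
    """Return a per-component pass/fail map."""
    results = {}
    for name, url in CHECKS.items():
        try:
            with urlopen(url, timeout=5) as resp:
                results[name] = 200 <= resp.status < 300
        except (URLError, OSError):
            results[name] = False
    return results

class MetaMonitorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        results = run_checks()
        ok = all(results.values())
        body = "\n".join(f"{k}: {'OK' if v else 'FAIL'}" for k, v in results.items())
        # 200 when everything passes, 503 otherwise: the vendor only needs
        # to look at the status code of its HTTP request.
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), MetaMonitorHandler).serve_forever()
```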

All that said, the vendor must be able to (at least):

  • Perform HTTP requests from one or multiple locations
  • Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail)
  • Be able to send alerts to Splunk Oncall using its API (a sketch of such a call follows this list)
  • Provide an authenticated API to downtime/silence alerts. We'll use this API to silence alerts during expected maintenance periods
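
For the Splunk Oncall requirement, a sketch of the kind of call involved, assuming the generic REST integration endpoint of Splunk On-Call (VictorOps); the API key and routing key are placeholders:

```
"""Sketch: pushing an alert into Splunk On-Call (VictorOps) from the vendor side.
Assumes the generic REST integration endpoint; key and routing key are placeholders."""
import json
from urllib.request import Request, urlopen

API_KEY = "REPLACE_ME"             # per-integration key from Splunk On-Call
ROUTING_KEY = "sre-observability"  # hypothetical routing key

def send_alert(entity_id: str, message: str, severity: str = "CRITICAL") -> None:
    url = f"https://alert.victorops.com/integrations/generic/20131114/alert/{API_KEY}/{ROUTING_KEY}"
    payload = {
        "message_type": severity,   # CRITICAL / WARNING / INFO / RECOVERY
        "entity_id": entity_id,     # stable ID so a later RECOVERY resolves the same incident
        "entity_display_name": entity_id,
        "state_message": message,
    }
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=10) as resp:
        resp.read()  # a 2xx response means the alert was accepted

# e.g. send_alert("meta-monitoring/eqiad", "meta-monitoring endpoint unreachable from 2 locations")
```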

Those are the ideal requirements in my (Filippo's) opinion; however, we could get away with an even smaller set of requirements:

  • Perform HTTP requests from one (or ideally multiple) locations
  • Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail at the same time)
  • Send emails to Splunk Oncall instead of using its API (email integration would be acceptable, but is less preferred)
  • We'll handle silencing ourselves by setting the related "routing key" to maintenance mode via the Splunk Oncall API instead (a sketch of that call follows this list)
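
A sketch of the maintenance-mode approach. The X-VO-Api-Id / X-VO-Api-Key authentication headers are the standard Splunk On-Call public API pair, but the exact maintenance-mode path and payload below are my assumptions and would need to be verified against the Splunk On-Call API documentation:

```
"""Sketch: silencing pages by putting a routing key into maintenance mode
through the Splunk On-Call public API. Endpoint path/payload are assumptions."""
import json
from urllib.request import Request, urlopen

API_ID = "REPLACE_ME"
API_KEY = "REPLACE_ME"
BASE = "https://api.victorops.com/api-public/v1"

def start_maintenance(routing_key: str, purpose: str) -> None:
    # Assumed endpoint and payload shape -- verify against the API docs before use.
    req = Request(
        f"{BASE}/maintenancemode/start",
        data=json.dumps({"type": "RoutingKeys",
                         "names": [routing_key],
                         "purpose": purpose}).encode(),
        headers={
            "Content-Type": "application/json",
            "X-VO-Api-Id": API_ID,
            "X-VO-Api-Key": API_KEY,
        },
        method="POST",
    )
    with urlopen(req, timeout=10) as resp:
        resp.read()

# e.g. start_maintenance("sre-observability", "planned alerting stack maintenance")
```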

Event Timeline

lmata triaged this task as Low priority. Sep 30 2021, 9:48 PM
lmata moved this task from Up next to In progress on the SRE Observability (FY2022/2023-Q2) board.

I'm going to draft out my thought process for this task; the plan is to pull together some research on prior and new options.

Methodology:
We already have base requirements outlined in the description. This category of products falls under synthetic monitoring. However, product features vary from tool to tool, so I find it useful to break vendors down into a few general tiers of product offerings (a sketch of the simplest tier's style of check follows this list):

  • uptime/availability monitoring (Pingdom/uptime.com/StatusCake), which does simple web requests; low cost, SaaS, globally distributed (within reason)
  • enterprise-grade, full-featured synthetic monitoring tools, which support regex, simple conditionals, and possibly JS injection (New Relic, Dynatrace, Datadog)
  • network monitoring tools with latency monitoring (Catchpoint, Kentik), which support network segment monitoring, last-mile latency, and some level of web request testing
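
To make the simplest tier concrete, here is a rough sketch of the vendor-side behaviour those products provide: poll an exposed endpoint and alert only after N consecutive failures. The endpoint URL is a placeholder, and the "alert" step stands in for whatever notification path we pick (e.g. the Splunk On-Call call sketched earlier):

```
"""Sketch of an uptime-tier check: poll an endpoint from one location and
alert after N consecutive failures. URL and thresholds are placeholders."""
import time
from urllib.request import urlopen
from urllib.error import URLError

ENDPOINT = "https://meta-monitoring.example.org/check"  # hypothetical exposed endpoint
FAILURE_THRESHOLD = 2   # e.g. alert on two failures in a row
INTERVAL_SECONDS = 60

def endpoint_ok() -> bool:
    try:
        with urlopen(ENDPOINT, timeout=10) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        return False

def main() -> None:
    consecutive_failures = 0
    while True:
        if endpoint_ok():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURE_THRESHOLD:
                # A real checker would page here, e.g. via the Splunk On-Call API.
                print(f"ALERT: {ENDPOINT} failed {consecutive_failures} checks in a row")
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```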

Please look at the following for introductory material...

Sources:

History:

I'll also weigh options on the following additional, good-to-have factors:

  • Uptime, availability
  • Pricing
  • Implementation overhead (time/materials)
  • Integration with existing tools (Splunk/email/webhooks)
  • Security (authentication/SSO)

lmata lowered the priority of this task from High to Medium. Oct 25 2022, 2:27 AM
lmata renamed this task from Audit/Assess external monitoring strategy to Audit/Assess meta monitoring strategy. Oct 26 2022, 2:11 PM
lmata updated the task description.
lmata removed lmata as the assignee of this task. Nov 17 2023, 6:51 PM