
Audit/Assess meta monitoring strategy
Open, Medium, Public

Description

The current (Oct 2022) thinking for meta-monitoring (tools that check over the tools) is to evaluate and then select an external vendor. To ease the selection, avoid too much lock-in, and keep things simple, I (Filippo) have suggested the following:

  • In this context, "external monitoring" is considered meta-monitoring of functionality/availability of alerting and metrics systems (i.e. icinga, alertmanager, prometheus, graphite).
  • We want to keep the external "surface" to be checked to a minimum; to this end, we'll expose one or more endpoints to be checked.
  • Said endpoints are deployed to production and contain the logic to perform sanity/availability checking (e.g. reach out to icinga, etc). This way we sidestep the whole issue of deploying code/complex logic outside of production, and the vendor logic is limited to an HTTP request.
  • The vendor checks said endpoints and alerts SRE if something goes wrong (e.g. error status, unreachability).
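
To make the endpoint idea above concrete, here is a minimal sketch of what such an endpoint could look like. It is not the actual implementation: the check names, the placeholder lambdas, the port, and the choice of 200/503 as status codes are all assumptions made for illustration. The point is only that the sanity logic lives in production and the vendor sees a single HTTP status.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]

# Hypothetical registry: name -> callable returning True when healthy.
# Real checks would reach out to icinga, alertmanager, etc.; these
# placeholders always succeed.
CHECKS: Dict[str, Check] = {
    "icinga": lambda: True,
    "alertmanager": lambda: True,
}


def evaluate(checks: Dict[str, Check]) -> Tuple[int, Dict[str, str]]:
    """Run every check and collapse the outcome into one HTTP status."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "fail"
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = f"error: {exc}"
    # 200 only if every check passed; anything else is reported as 503.
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


class MetaMonitoringHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, results = evaluate(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Start the endpoint; host and port are illustrative defaults."""
    HTTPServer((host, port), MetaMonitoringHandler).serve_forever()
```

With this shape the vendor only needs to treat any non-200 response (or no response at all) as a failure; calling `serve()` starts the endpoint locally.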

Endpoints have been configured within the task T397003: Configure a prometheus dead man's snitch alert.
HetrixTools is the external monitoring tool identified to query these endpoints.

All that said, the vendor must be able to (at least):

  • Perform HTTP requests from one/multiple locations
  • Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail)
  • Be able to send alerts to Splunk Oncall using the API (--> it also supports PagerDuty)
  • Provide an authenticated API to downtime/silence alerts. We'll use this API to silence alerts during expected maintenance periods (--> API provided, but the functionality on the WMF side still needs to be implemented.)

Those are the ideal requirements in my (Filippo's) opinion; however, we could get away with an even smaller set of requirements:

  • Perform HTTP requests from one (or ideally multiple) locations
  • Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail at the same time)
  • Send alerts to Splunk Oncall, ideally using the API (emails would be acceptable but less preferred)
  • We'll handle silencing by setting the related "routing key" to maintenance mode via the Splunk Oncall API instead
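
The "one/two in a row, and/or multiple locations" rule from the lists above can be sketched as a small decision function. This is only an illustration of the requirement, not how any vendor implements it; the function name, the thresholds, and the location names in the usage example are hypothetical.

```python
from typing import Dict, List


def should_alert(
    history: Dict[str, List[bool]],
    consecutive: int = 2,
    min_locations: int = 2,
) -> bool:
    """Decide whether to alert from per-location probe results.

    `history` maps a probe location to its results, oldest first,
    where True means the HTTP request succeeded.
    """
    failing_locations = 0
    for results in history.values():
        recent = results[-consecutive:]
        # A location counts as failing only once it has `consecutive`
        # recent results and all of them are failures.
        if len(recent) == consecutive and not any(recent):
            failing_locations += 1
    return failing_locations >= min_locations


# Two locations, each with two failures in a row: alert.
print(should_alert({"ams": [True, False, False], "nyc": [False, False]}))
# One location recovered on its last probe: no alert with the defaults.
print(should_alert({"ams": [True, False, False], "nyc": [False, True]}))
```

Setting `min_locations=1` gives the simpler "N failures in a row from any location" behaviour, which is the minimum the reduced requirement set asks for.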

Event Timeline

lmata moved this task from Up next to In progress on the SRE Observability (FY2022/2023-Q2) board.

I'm going to draft out my thought process for this task; the plan is to compose a bit of research into prior and new options.

Methodology:
We already have base requirements outlined in the description. This category of products falls under synthetic monitoring. Although product features vary from tool to tool, I feel there are a few general levels of product offerings that help break down vendors:

  • uptime/availability monitoring (Pingdom/Uptime.com/StatusCake), which does simple web requests; low cost, SaaS, globally distributed (within reason)
  • enterprise-grade, full-featured synthetic monitoring tools (New Relic, Dynatrace, Datadog), which support regex, simple conditionals, and possibly JS injection
  • network monitoring tools with latency monitoring (Catchpoint, Kentik), which support network segment monitoring, last-mile latency, and some level of web request testing

Please look at the following for introductory material...

Sources:

History:

I am also weighing the options on the following additional factors, as good-to-have features:

  • Uptime, availability
  • Pricing
  • Implementation overhead (time/materials)
  • Integration with existing tools (Splunk/email/webhooks)
  • Security* (authentication/SSO)
lmata lowered the priority of this task from High to Medium. Oct 25 2022, 2:27 AM
lmata renamed this task from Audit/Assess external monitoring strategy to Audit/Assess meta monitoring strategy. Oct 26 2022, 2:11 PM
lmata updated the task description.

The HetrixTools-related "Business Arrangement Request Form v16.3" (#7716) has been submitted on Coupa.

The business arrangement request has been approved.

I’ve pushed the credentials to pwstore, set up the targets on HetrixTools, and run some tests using a dedicated rotation/escalation policy/route set.
Everything worked fine.
Currently, the targets are in maintenance mode without notifications, but everything else is ready to page on-callers.

tappof added a subtask: Unknown Object (Task). Nov 13 2025, 2:44 PM
Andrew changed the status of subtask Unknown Object (Task) from Stalled to Open. Jan 7 2026, 8:34 PM