Audit/Assess meta monitoring strategy
Open, Medium, Public

Assigned To

None

Authored By

	fgiunchedi
	May 4 2021, 10:22 AM

Description

The current (Oct 2022) thinking for meta-monitoring (tools that check on the monitoring tools themselves) is to evaluate and then select an external vendor. To ease the selection, avoid too much lock-in, and keep things simple, I (Filippo) have suggested the following:

In this context, "external monitoring" means meta-monitoring of the functionality/availability of our alerting and metrics systems (i.e. icinga, alertmanager, prometheus, graphite).
We want to keep the external "surface" to be checked to a minimum. To this end, we'll expose one or more endpoints to be checked.
Said endpoints are deployed to production and contain the logic to perform sanity/availability checking (e.g. reach out to icinga, etc). This way we sidestep the whole issue of deploying code/complex logic outside of production, and the vendor logic is limited to an HTTP request.
The vendor checks said endpoints and alerts SRE if something goes wrong (e.g. error status, unreachability).
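
To make the endpoint idea above more concrete, here is a minimal sketch of what such a production-side endpoint could look like. It is only an illustration under stated assumptions: the names `http_ok`, `run_checks`, `make_handler` and `serve`, the check names, and the choice of 200/503 status codes are all hypothetical and not part of any existing deployment.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Hypothetical helper: True if an internal URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Connection errors, timeouts and HTTP error statuses all count as failure.
        return False


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every check and return (HTTP status, per-check results).

    Assumption: 200 means all checks passed, 503 means at least one failed,
    so the vendor only needs to look at the status code.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed check.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


def make_handler(checks: Dict[str, Callable[[], bool]]):
    """Build a request handler class bound to the given checks."""

    class MetaMonitoringHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, results = run_checks(checks)
            body = json.dumps(results, sort_keys=True).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Keep the sketch quiet; a real service would log requests.
            pass

    return MetaMonitoringHandler


def serve(checks: Dict[str, Callable[[], bool]], port: int = 8080) -> None:
    """Serve the endpoint until interrupted (port is an arbitrary example)."""
    HTTPServer(("", port), make_handler(checks)).serve_forever()
```

With this shape, each monitored system would be registered as a zero-argument callable (for example a lambda wrapping `http_ok` with an internal health URL), and the vendor side stays limited to a single GET plus a status code comparison.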

All that said, the vendor must be able to (at least):

Perform HTTP requests from one or more locations
Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail)
Send alerts to Splunk Oncall using the API
Provide an authenticated API to downtime/silence alerts. We'll use this API to silence alerts during expected maintenance periods
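
As an illustration of the "N in a row and/or multiple locations" alerting requirement above, here is a small sketch of one possible decision rule. The function names and the default thresholds are hypothetical; an actual vendor would implement its own variant of this logic.

```python
from typing import Dict, List


def consecutive_failures(history: List[bool]) -> int:
    """Count trailing failures in a probe history.

    history is ordered oldest first; True means the HTTP request succeeded.
    """
    count = 0
    for ok in reversed(history):
        if ok:
            break
        count += 1
    return count


def should_alert(
    results_by_location: Dict[str, List[bool]],
    min_consecutive: int = 2,  # hypothetical threshold: failures in a row
    min_locations: int = 2,    # hypothetical threshold: locations failing at once
) -> bool:
    """Alert when enough locations each see enough consecutive failures."""
    failing = [
        location
        for location, history in results_by_location.items()
        if consecutive_failures(history) >= min_consecutive
    ]
    return len(failing) >= min_locations
```

Setting `min_locations=1` reduces this to the plain "one/two in a row" rule, while keeping it at 2 or more guards against a single probe location having network trouble of its own.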

Those are the ideal requirements in my (Filippo's) opinion; however, we could get away with an even smaller set of requirements:

Perform HTTP requests from one (or ideally multiple) locations
Alert if said requests fail (e.g. one/two in a row, and/or if multiple locations fail at the same time)
Send alerts to Splunk Oncall using the API (emails would be acceptable but less preferred)
We'll handle silencing by setting the related "routing key" to maintenance mode via the Splunk Oncall API instead

Related Objects

Mentioned In: T97099: Squeeze value out of external monitoring services
Mentioned Here: T85829: Identify deficiencies in Nimsoft Cloud User Experience Monitor (formerly WatchMouse)
T97099: Squeeze value out of external monitoring services
T292603: DX App Synthetic Monitoring App - watchmouse alert flapping due to CA expiration
T299147: Retire WatchMouse (CA DX APP)

Event Timeline

fgiunchedi created this task. May 4 2021, 10:22 AM

Restricted Application added a subscriber: Aklapper. May 4 2021, 10:22 AM

fgiunchedi added a subscriber: User-fgiunchedi. May 4 2021, 10:22 AM

taavi added a project: User-fgiunchedi. May 4 2021, 10:24 AM

taavi removed a subscriber: User-fgiunchedi.

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. May 6 2021, 8:48 AM

fgiunchedi moved this task from Inbox to In progress on the observability board. May 10 2021, 3:24 PM

lmata edited projects, added SRE Observability (FY2021/2022-Q1); removed observability. Jul 12 2021, 2:20 AM

lmata moved this task from Inbox to In progress on the SRE Observability (FY2021/2022-Q1) board.

fgiunchedi edited projects, added SRE Observability; removed SRE Observability (FY2021/2022-Q1). Aug 2 2021, 9:06 AM

fgiunchedi moved this task from Doing to Backlog on the User-fgiunchedi board. Aug 2 2021, 9:32 AM

lmata edited projects, added SRE Observability (FY2021/2022-Q2); removed SRE Observability. Aug 3 2021, 3:33 PM

fgiunchedi removed a project: User-fgiunchedi. Aug 3 2021, 3:53 PM

lmata triaged this task as Low priority. Sep 30 2021, 9:48 PM

scheduling

lmata edited projects, added Observability-Alerting; removed SRE Observability (FY2021/2022-Q3). Mar 25 2022, 2:04 PM

lmata updated the task description.

fgiunchedi updated the task description. Oct 20 2022, 9:33 AM

fgiunchedi updated the task description. Oct 21 2022, 8:33 AM

lmata edited projects, added SRE Observability (FY2022/2023-Q2); removed Observability-Alerting. Oct 24 2022, 5:36 AM

lmata moved this task from Inbox to Up next on the SRE Observability (FY2022/2023-Q2) board.

I'm going to draft out my thought process for this task; the plan is to compile some research on prior and new options.

Methodology:
We already have base requirements outlined in the description. This category of products falls under synthetic monitoring. Product features vary from tool to tool, but I feel there are a few general tiers of product offerings that help break down the vendors:

Uptime/availability monitoring (Pingdom, Uptime.com, StatusCake): simple web requests, low cost, SaaS, globally distributed (within reason)
Enterprise-grade, full-featured synthetic monitoring tools (New Relic, Dynatrace, Datadog): support for regex, simple conditionals, and possibly JS injection
Network monitoring tools with latency monitoring (Catchpoint, Kentik): network segment monitoring, last-mile latency, and some level of web request testing

Please look at the following for introductory material...

Sources:

History:

I am also weighing the options against the following additional factors, treated as good-to-have features:

Uptime, availability
Pricing
Implementation overhead (time/materials)
Integration with existing tools (Splunk/email/webhooks)
Security* (authentication/SSO)

lmata lowered the priority of this task from High to Medium. Oct 25 2022, 2:27 AM

lmata renamed this task from Audit/Assess external monitoring strategy to Audit/Assess meta monitoring strategy. Oct 26 2022, 2:11 PM

lmata updated the task description.

lmata edited projects, added SRE Observability (FY2022/2023-Q3); removed SRE Observability (FY2022/2023-Q2). Jan 12 2023, 12:18 AM

lmata moved this task from Inbox to Epics In Progress on the SRE Observability (FY2022/2023-Q3) board. Jan 12 2023, 8:18 PM

lmata edited projects, added Observability-Alerting; removed SRE Observability (FY2022/2023-Q3). May 2 2023, 1:37 PM

lmata removed lmata as the assignee of this task. Nov 17 2023, 6:51 PM
