metricsinfra: Build out default alert rules
Open, Medium, Public

Assigned To

None

Authored By

	taavi
	Aug 4 2021, 7:59 PM

Description

The default rules currently alert only on instance down and Puppet failure. I imagine we'll want to add other defaults as well, to detect general malfunctions on the instances:

disk space
mail queue length
NTP time sync status
systemd unit monitoring? at least for some critical services
CPU / RAM usage?
network usage?
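
As a rough illustration of what some of these defaults could look like, here is a minimal sketch that builds a Prometheus-style rule group in Python and prints it as JSON (which is also valid YAML). It assumes the instances expose node_exporter metrics, and that the optional systemd collector is enabled for the unit check. The alert names, thresholds, durations and the `build_rule_group` helper are all hypothetical and are not taken from the metricsinfra code. Mail queue length and network usage are left out because a suitable metric and threshold are not obvious from this task.

```python
import json

# Candidate defaults from the list above, as (alert, expr, for, summary).
# Metric names assume node_exporter; every threshold and duration here is
# a placeholder chosen for illustration only.
CANDIDATE_RULES = [
    (
        "DiskSpaceLow",
        'node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}'
        " / node_filesystem_size_bytes < 0.10",
        "15m",
        "Less than 10% disk space available",
    ),
    (
        "ClockNotSynchronised",
        "node_timex_sync_status == 0",
        "30m",
        "Kernel reports the clock is not synchronised",
    ),
    (
        # Needs the node_exporter systemd collector, which is off by default.
        "SystemdUnitFailed",
        'node_systemd_unit_state{state="failed"} == 1',
        "10m",
        "A systemd unit is in the failed state",
    ),
    (
        "MemoryAlmostExhausted",
        "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05",
        "15m",
        "Less than 5% memory available",
    ),
    (
        "CpuSaturated",
        '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'
        " > 0.95",
        "30m",
        "CPU more than 95% busy",
    ),
]


def build_rule_group(group_name, rules, severity="warning"):
    """Return a dict shaped like a Prometheus rule file with one group."""
    return {
        "groups": [
            {
                "name": group_name,
                "rules": [
                    {
                        "alert": alert,
                        "expr": expr,
                        "for": duration,
                        "labels": {"severity": severity},
                        "annotations": {"summary": summary},
                    }
                    for alert, expr, duration, summary in rules
                ],
            }
        ]
    }


if __name__ == "__main__":
    group = build_rule_group("default_instance_alerts", CANDIDATE_RULES)
    print(json.dumps(group, indent=2))
```

Keeping the candidates as plain tuples makes it easy to discuss and adjust individual thresholds before committing to any of them as defaults.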

Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T205862 Expand modern metrics infrastructure coverage (2018-19 Q2 goal)
Resolved	colewhite	T183454 Deprovision Diamond collectors no longer in use
Resolved	MoritzMuehlenhoff	T210993 Deprecate Diamond collectors in Cloud VPS
Open	None	T336774 Current status of cloudmetrics and its components
Resolved	taavi	T326266 Remove the WMCS statsd/Graphite service
Open	dcaro	T313444 Streamline WMCS Alerting and Paging
Resolved	taavi	T317032 Remove Diamond?
Resolved	taavi	T264920 Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus
Open	None	T194333 [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants
Resolved	taavi	T266050 Build Prometheus service for use by all Cloud VPS projects and their instances
Open	None	T288168 metricsinfra: Build out default alert rules

Event Timeline

taavi created this task. Aug 4 2021, 7:59 PM

Restricted Application added a subscriber: Aklapper. Aug 4 2021, 7:59 PM

taavi triaged this task as Medium priority. Aug 4 2021, 7:59 PM

taavi updated the task description. Aug 5 2021, 3:04 PM

taavi merged a task: T166845: monitor some things on all Cloud instances (discussion). Feb 7 2023, 6:44 PM

taavi added subscribers: Andrew, Paladox, gerritbot and 2 others.
