Maniphest T351936

Stop exporting unit state metrics for timers
Closed, Resolved (Public)

Assigned To: fgiunchedi
Authored By: fgiunchedi, Nov 24 2023, 3:49 PM

Description

node_systemd_unit_state has grown to be one of our biggest metrics. I've taken a quick look and I don't think having 5 metrics (one per state) for each timer is useful (or used). We're interested in failed units, and when a timer "fails" it is usually the underlying .service unit we're interested in.

For example:

node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="activating"}	0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="active"}	1
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="deactivating"}	0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="failed"}	0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="inactive"}	0

The .timer metrics alone represent 22% of node_systemd_unit_state in drmrs (see https://w.wiki/8FzF). I've picked a small site on purpose, as these are expensive queries to run in e.g. codfw/eqiad.
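
As a rough illustration of how such a share can be estimated without an expensive query, the sketch below counts node_systemd_unit_state series in a single host's text-format scrape and reports the fraction whose name label ends in .timer. The helper name timer_share and the sample input are hypothetical (labels are trimmed, and the .service unit is added only for contrast); this is not the query behind the w.wiki link.

```python
import re

# Matches the name="..." label in a Prometheus text-format sample line.
# The leading [{,] avoids matching labels that merely end in "name".
NAME_LABEL = re.compile(r'[{,]name="([^"]*)"')


def timer_share(lines, metric="node_systemd_unit_state"):
    """Return (timer_series, total_series, fraction) for the given metric."""
    total = 0
    timers = 0
    for line in lines:
        line = line.strip()
        # Skip comments, blank lines and other metrics.
        if not line.startswith(metric + "{"):
            continue
        match = NAME_LABEL.search(line)
        if match is None:
            continue
        total += 1
        if match.group(1).endswith(".timer"):
            timers += 1
    fraction = timers / total if total else 0.0
    return timers, total, fraction


# Hypothetical sample: one timer and its service, five states each.
STATES = ["activating", "active", "deactivating", "failed", "inactive"]
SAMPLE = [
    'node_systemd_unit_state{name="%s",state="%s"} %d'
    % (unit, state, 1 if state == "active" else 0)
    for unit in ("apt-daily-upgrade.timer", "apt-daily-upgrade.service")
    for state in STATES
]

timers, total, fraction = timer_share(SAMPLE)
print(f"{timers}/{total} series are timers ({fraction:.0%})")
```

On a real host the input would be the output of the node exporter's /metrics endpoint, one line per sample.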

Details

	Subject	Repo	Branch	Lines +/-
	prometheus: exclude timer units from systemd collector	operations/puppet	production	+3 -0
	prometheus: re-introduce distro-specific node-exporter arguments	operations/puppet	production	+43 -8


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T351935 Audit Prometheus metrics size/label values
		Resolved		fgiunchedi	T351936 Stop exporting unit state metrics for timers

Event Timeline

fgiunchedi created this task. Nov 24 2023, 3:49 PM

fgiunchedi added a project: User-fgiunchedi. Nov 27 2023, 8:42 AM

Change 977733 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: re-introduce distro-specific node-exporter arguments

https://gerrit.wikimedia.org/r/977733

Change 977734 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: exclude timer units from systemd collector

https://gerrit.wikimedia.org/r/977734

Change 977733 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: re-introduce distro-specific node-exporter arguments

https://gerrit.wikimedia.org/r/977733

Change 977734 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: exclude timer units from systemd collector

https://gerrit.wikimedia.org/r/977734
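
For readers unfamiliar with this kind of change: the node exporter's systemd collector accepts a regular expression of unit names to skip, passed via a command-line flag whose name differs between exporter versions (which is presumably why distro-specific arguments were needed). The exact pattern used in the merged change is not shown in this task, so the sketch below uses a hypothetical pattern, `.+\.timer`, and assumes the pattern must match the whole unit name.

```python
import re

# Hypothetical exclusion pattern; the regex in the merged change is not
# reproduced in this task.
EXCLUDE_PATTERN = r".+\.timer"


def is_excluded(unit_name, pattern=EXCLUDE_PATTERN):
    """True if the whole unit name matches the exclusion pattern."""
    # fullmatch mimics an anchored pattern (assumption about the exporter).
    return re.fullmatch(pattern, unit_name) is not None


units = [
    "apt-daily-upgrade.timer",
    "apt-daily-upgrade.service",
    "ssh.service",
    "timer-helper.service",  # contains "timer" but is not a .timer unit
]

kept = [u for u in units if not is_excluded(u)]
print(kept)
```

The underlying .service units are still exported, so a failed timer-triggered job remains visible through its service's failed state.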

Maintenance_bot removed a project: Patch-For-Review. Dec 4 2023, 2:10 PM

This is done, resulting in ~4% fewer samples ingested fleet-wide.

lmata added a project: SRE Observability (FY2023/2024-Q2). Dec 5 2023, 4:24 PM

lmata moved this task from Inbox to Done on the SRE Observability (FY2023/2024-Q2) board. Jan 26 2024, 1:08 AM
