Stop exporting unit state metrics for timers
Closed, Resolved, Public

Description

node_systemd_unit_state has grown into one of our biggest metrics. I've taken a quick look, and I don't think exporting five series for each timer is useful (nor are they used). We're interested in failed units, and when a timer "fails" it is usually the underlying .service unit that we're interested in.

For example:

node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="activating"}	0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="active"}	1
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="deactivating"}	0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="failed"}	0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="inactive"}	0

The .timer series alone account for 22% of node_systemd_unit_state in drmrs (see https://w.wiki/8FzF). I've picked a small site on purpose, as these queries are expensive to run in e.g. codfw/eqiad.
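For anyone who wants to reproduce this kind of breakdown offline, without running an expensive query, here is a minimal Python sketch that counts node_systemd_unit_state series per unit type from exposition-format text. The function name `unit_type_share` and the `.service` lines are hypothetical and exist only for illustration; only the five `.timer` lines come from the example above.

```python
import re
from collections import Counter

# The five .timer series from the example above (value separated by a space).
SAMPLE = """\
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="activating"} 0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="active"} 1
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="deactivating"} 0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="failed"} 0
node_systemd_unit_state{cluster="bastion",instance="bast6003:9100",job="node",name="apt-daily-upgrade.timer",site="drmrs",state="inactive"} 0
"""

NAME_RE = re.compile(r'name="([^"]+)"')


def unit_type_share(exposition, metric="node_systemd_unit_state"):
    """Return the fraction of `metric` series per unit type (timer, service, ...)."""
    counts = Counter()
    for line in exposition.splitlines():
        if not line.startswith(metric + "{"):
            continue  # skip comments and other metrics
        match = NAME_RE.search(line)
        if match is None:
            continue
        # The unit type is whatever follows the last dot in the unit name.
        counts[match.group(1).rsplit(".", 1)[-1]] += 1
    total = sum(counts.values())
    return {unit_type: n / total for unit_type, n in counts.items()} if total else {}


# Hypothetical input: pretend the matching .service unit exports the same five states.
HYPOTHETICAL_SERVICE = SAMPLE.replace(".timer", ".service")
print(unit_type_share(SAMPLE + HYPOTHETICAL_SERVICE))
```

With one timer and one service exporting five states each, the two unit types split the series evenly. The 22% figure for drmrs comes from the linked query, not from this sketch.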

Event Timeline

Change 977733 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: re-introduce distro-specific node-exporter arguments

https://gerrit.wikimedia.org/r/977733

Change 977734 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: exclude timer units from systemd collector

https://gerrit.wikimedia.org/r/977734

Change 977733 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: re-introduce distro-specific node-exporter arguments

https://gerrit.wikimedia.org/r/977733

Change 977734 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: exclude timer units from systemd collector

https://gerrit.wikimedia.org/r/977734

fgiunchedi claimed this task.

This is done, resulting in ~4% fewer samples ingested fleet-wide.
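As a rough illustration of the effect of the exclusion, the Python sketch below filters unit names with a fully anchored regular expression. It assumes the systemd collector matches the whole unit name against its exclude pattern. The pattern `.+\.timer` is a hypothetical example; the exact arguments passed to node-exporter are in the merged patches and are not reproduced in this task.

```python
import re

# Hypothetical exclusion pattern; the one used in the merged patch is not shown here.
EXCLUDE_PATTERN = r".+\.timer"


def is_excluded(unit_name, pattern=EXCLUDE_PATTERN):
    """Return True if the whole unit name matches the exclusion pattern."""
    # fullmatch emulates a pattern anchored at both ends.
    return re.fullmatch(pattern, unit_name) is not None


units = [
    "apt-daily-upgrade.timer",
    "apt-daily-upgrade.service",
    "ssh.service",
]
kept = [unit for unit in units if not is_excluded(unit)]
print(kept)
```

Only the .timer unit is dropped; the underlying .service unit keeps exporting its state, so a failed run is still visible.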