
node-exporter syslog spam filling up centrallog
Closed, Resolved (Public)

Description

Noticed today while another incident was in progress: centrallog1001 filled up its disk due to syslog spam from node-exporter:

Jul 28 06:49:20 cloudvirt1023 prometheus-node-exporter[1193]: cloudvirt1023\" > untyped:<value:0.102 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1020.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.112 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephmon1001.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.165 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephmon1002.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.194 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephmon1003.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.166 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1001.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.065 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1002.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.101 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1003.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.072 > } was collected before with the same name and label values\n* 
[from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1004.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.074 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1005.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.097 > } was collected before with the same name and

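The problem is that node-pinger only appends values to its .prom file and never replaces the file, so the same metric and label set shows up more than once and node-exporter logs an error for every duplicate.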
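To illustrate the difference between appending and replacing, here is a minimal sketch of writing a textfile-collector file atomically. It is not the actual node-pinger code: the function name `write_prom_atomically` and the `(src_host, dst_host, value)` sample format are assumptions made for the example; only the metric name and label names come from the log above.

```python
import os
import tempfile


def write_prom_atomically(path, samples):
    """Replace the .prom file at `path` with exactly the given samples.

    `samples` is a list of (src_host, dst_host, value) tuples
    (a hypothetical format chosen for this sketch).
    """
    directory = os.path.dirname(os.path.abspath(path))
    body = "".join(
        'node_ping_latency{dst_host="%s",src_host="%s"} %s\n' % (dst, src, value)
        for src, dst, value in samples
    )
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem. The ".tmp" suffix keeps it out of the
    # "*.prom" files that the textfile collector reads.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(body)
        os.chmod(tmp_path, 0o644)
        # os.replace swaps the file in one step, so a reader sees either
        # the old content or the new content, never an appended mix.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this approach, running the job twice leaves the file with one line per destination host instead of growing on every run.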

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 708462 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: temp disable node-pinger

https://gerrit.wikimedia.org/r/708462

Change 708462 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: temp disable node-pinger

https://gerrit.wikimedia.org/r/708462

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:07:39Z] <godog> remove cloud*/syslog.log from centrallog2001 - T287559

AFAICT the exec_start_pre option of systemd::timer::job is never rendered, either in the .service unit or in the .timer unit (where it wouldn't work anyway, afaik).
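One way to check for this kind of rendering gap is to look for the `ExecStartPre=` directive in the generated unit. The sketch below is an illustration only: the helper `unit_has_directive` and the sample unit text (including both command paths) are hypothetical and not taken from the Puppet code.

```python
def unit_has_directive(unit_text, section, directive):
    """Return True if `directive` is set inside `[section]` of a unit file."""
    current = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        # Track which [Section] we are in.
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        key, sep, _value = line.partition("=")
        if sep and current == section and key.strip() == directive:
            return True
    return False


# Hypothetical rendered units, before and after a fix.
BROKEN_UNIT = """\
[Unit]
Description=example timer job

[Service]
ExecStart=/usr/local/bin/example-job
"""

FIXED_UNIT = """\
[Unit]
Description=example timer job

[Service]
ExecStartPre=/bin/rm -f /tmp/example.prom
ExecStart=/usr/local/bin/example-job
"""

print(unit_has_directive(BROKEN_UNIT, "Service", "ExecStartPre"))
print(unit_has_directive(FIXED_UNIT, "Service", "ExecStartPre"))
```

The check is limited to the `[Service]` section on purpose, since that is where systemd expects `ExecStartPre=`.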

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:15Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 5:00:00 on 6 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:22Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 6 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:28Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 5:00:00 on 40 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:37Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 40 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:46Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 5:00:00 on 29 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:56Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 29 hosts with reason: T287559

Change 708465 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] systemd.timer_service: fix missing exec_start_pre

https://gerrit.wikimedia.org/r/708465

Change 708465 merged by David Caro:

[operations/puppet@production] systemd.timer_service: fix missing exec_start_pre

https://gerrit.wikimedia.org/r/708465

Change 708468 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] Revert "prometheus: temp disable node-pinger"

https://gerrit.wikimedia.org/r/708468

Change 708468 merged by David Caro:

[operations/puppet@production] Revert "prometheus: temp disable node-pinger"

https://gerrit.wikimedia.org/r/708468

Fix deployed and running:

root@cloudcephosd1001:~# wc /var/lib/prometheus/node.d/node_pinger.prom
  22   44 2046 /var/lib/prometheus/node.d/node_pinger.prom

root@cloudcephosd1001:~# systemctl start prometheus-node-pinger.service

root@cloudcephosd1001:~# wc /var/lib/prometheus/node.d/node_pinger.prom
  22   44 2046 /var/lib/prometheus/node.d/node_pinger.prom
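The stable `wc` output shows the file no longer grows between runs. A more direct check is to look for repeated series in the file, which is what node-exporter was complaining about. The sketch below is a rough illustration: the helper `duplicate_series` is hypothetical, and it assumes each sample line has the form `name{labels} value` with no trailing timestamp (consistent with 22 lines and 44 words above).

```python
from collections import Counter


def duplicate_series(prom_text):
    """Return {series: count} for every series that appears more than once.

    A "series" here is the metric name plus its label set, i.e. everything
    before the final whitespace-separated value on the line.
    """
    counts = Counter()
    for raw in prom_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        series = line.rsplit(None, 1)[0]
        counts[series] += 1
    return {series: n for series, n in counts.items() if n > 1}


# Example with made-up host names: the first series is repeated.
sample = (
    'node_ping_latency{dst_host="b",src_host="a"} 0.102\n'
    'node_ping_latency{dst_host="c",src_host="a"} 0.112\n'
    'node_ping_latency{dst_host="b",src_host="a"} 0.165\n'
)
print(duplicate_series(sample))
```

An empty result for `/var/lib/prometheus/node.d/node_pinger.prom` would mean every series appears exactly once.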

Thanks @fgiunchedi !

dcaro claimed this task.