node-exporter syslog spam filling up centrallog
Closed, Resolved (Public)

Assigned To

Authored By

	fgiunchedi
	Jul 28 2021, 6:56 AM

Description

Noticed today, while another incident was in progress, that centrallog1001 filled up its disk due to syslog spam from node-exporter:

Jul 28 06:49:20 cloudvirt1023 prometheus-node-exporter[1193]: cloudvirt1023\" > untyped:<value:0.102 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1020.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.112 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephmon1001.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.165 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephmon1002.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.194 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephmon1003.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.166 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1001.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.065 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1002.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.101 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1003.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.072 > } was collected before with the same name and label values\n* 
[from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1004.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.074 > } was collected before with the same name and label values\n* [from Gatherer #2] collected metric \"node_ping_latency\" { label:<name:\"dst_host\" value:\"cloudcephosd1005.eqiad.wmnet\" > label:<name:\"src_host\" value:\"cloudvirt1023\" > untyped:<value:0.097 > } was collected before with the same name and

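The problem is that node-pinger only appends values to its .prom file and never replaces the file, so node-exporter keeps finding metrics that were already collected with the same name and label values.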
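To illustrate the difference between appending and replacing, here is a minimal Python sketch of a writer that swaps in a freshly written file on every run. This is not the node-pinger code: the function name `write_prom_atomically` and its arguments are hypothetical, and it assumes the exporter only picks up files ending in `.prom`, so the temporary file is ignored.

```python
import os
import tempfile


def write_prom_atomically(path, lines):
    """Replace the file at `path` with `lines` instead of appending to it.

    Hypothetical helper for illustration; not taken from node-pinger.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that os.replace()
    # stays on one filesystem and readers never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # mkstemp() creates the file with mode 0600; make it readable by
        # the exporter process (assumed to run as a different user).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stale temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling it twice with the same samples leaves a file with the same number of lines, which is the behaviour the append-only version lacked.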

Details

Subject	Repo	Branch	Lines +/-
Revert "prometheus: temp disable node-pinger"	operations/puppet	production	+1 -1
systemd.timer_service: fix missing exec_start_pre	operations/puppet	production	+3 -3
prometheus: temp disable node-pinger	operations/puppet	production	+1 -1


Event Timeline

fgiunchedi created this task. Jul 28 2021, 6:56 AM

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Jul 28 2021, 6:56 AM

Restricted Application added a subscriber: Aklapper.

Change 708462 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: temp disable node-pinger

https://gerrit.wikimedia.org/r/708462

gerritbot added a project: Patch-For-Review. Jul 28 2021, 7:01 AM

Change 708462 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: temp disable node-pinger

https://gerrit.wikimedia.org/r/708462

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:07:39Z] <godog> remove cloud*/syslog.log from centrallog2001 - T287559

Maintenance_bot removed a project: Patch-For-Review. Jul 28 2021, 7:10 AM

AFAICT the exec_start_pre option of systemd::timer::job is never rendered, either in the .service unit or in the .timer unit (where it wouldn't work anyway, AFAIK).
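As a rough illustration of what "rendered in the .service" means, the sketch below builds a minimal service unit in Python and only emits `ExecStartPre=` when a value is supplied. It is not the Puppet template from operations/puppet: the function `render_service_unit`, its parameters, and the commands in the usage note are hypothetical, and the task does not say what command exec_start_pre runs. The one fact relied on is that `ExecStartPre=` is a `[Service]` setting in systemd, which is why it has no effect in a .timer unit.

```python
def render_service_unit(description, exec_start, exec_start_pre=None):
    """Return the text of a minimal oneshot .service unit.

    Hypothetical renderer for illustration only.
    """
    lines = [
        "[Unit]",
        f"Description={description}",
        "",
        "[Service]",
        "Type=oneshot",
    ]
    # The bug described above corresponds to this branch being absent:
    # the option was accepted but never written into the unit.
    if exec_start_pre is not None:
        lines.append(f"ExecStartPre={exec_start_pre}")
    lines.append(f"ExecStart={exec_start}")
    return "\n".join(lines) + "\n"
```

For example, `render_service_unit("node pinger", "/usr/local/bin/pinger", "/bin/true")` (placeholder commands) yields a unit whose `[Service]` section lists `ExecStartPre=` before `ExecStart=`.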

dcaro subscribed. Jul 28 2021, 7:15 AM

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:15Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 5:00:00 on 6 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:22Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 6 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:28Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 5:00:00 on 40 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:37Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 40 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:46Z] <dcaro@cumin1001> START - Cookbook sre.hosts.downtime for 5:00:00 on 29 hosts with reason: T287559

Mentioned in SAL (#wikimedia-operations) [2021-07-28T07:20:56Z] <dcaro@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 29 hosts with reason: T287559

Change 708465 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] systemd.timer_service: fix missing exec_start_pre

https://gerrit.wikimedia.org/r/708465

gerritbot added a project: Patch-For-Review. Jul 28 2021, 7:25 AM

Change 708465 merged by David Caro:

[operations/puppet@production] systemd.timer_service: fix missing exec_start_pre

https://gerrit.wikimedia.org/r/708465

Change 708468 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] Revert "prometheus: temp disable node-pinger"

https://gerrit.wikimedia.org/r/708468

Change 708468 merged by David Caro:

[operations/puppet@production] Revert "prometheus: temp disable node-pinger"

https://gerrit.wikimedia.org/r/708468

Fix deployed and running:

root@cloudcephosd1001:~# wc /var/lib/prometheus/node.d/node_pinger.prom
  22   44 2046 /var/lib/prometheus/node.d/node_pinger.prom

root@cloudcephosd1001:~# systemctl start prometheus-node-pinger.service

root@cloudcephosd1001:~# wc /var/lib/prometheus/node.d/node_pinger.prom
  22   44 2046 /var/lib/prometheus/node.d/node_pinger.prom
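The unchanged `wc` output shows the file no longer grows between runs. A complementary check is to look for repeated series directly. The following Python sketch does that under simplifying assumptions: each sample line has the form `name{labels} value` with no trailing timestamp, and the function name `duplicate_series` is hypothetical.

```python
from collections import Counter


def duplicate_series(prom_text):
    """Return the series (metric name plus labels) that appear more than once."""
    counts = Counter()
    for line in prom_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # Split off the sample value; assumes no timestamp follows it.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        counts[parts[0]] += 1
    return sorted(series for series, n in counts.items() if n > 1)
```

An empty result for the contents of node_pinger.prom would mean no series is listed twice, which is the condition the syslog errors were complaining about.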

Thanks @fgiunchedi !

dcaro closed this task as Resolved. Jul 28 2021, 7:40 AM

dcaro claimed this task.

Maintenance_bot removed a project: Patch-For-Review. Jul 28 2021, 8:10 AM
