Prometheus1005 out of disk on /
Closed, Resolved (Public)

Description

The root filesystem appears to have been filled mostly by prometheus-pushgateway logs in /var/log/syslog.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2025-06-27T23:21:09Z] <cwhite> truncate /var/log/syslog on prometheus1005 T398091

Change #1164862 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: split pushgateway logs

https://gerrit.wikimedia.org/r/1164862

Change #1164862 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: split pushgateway logs

https://gerrit.wikimedia.org/r/1164862

Mentioned in SAL (#wikimedia-operations) [2025-06-30T07:22:02Z] <godog> bounce prometheus-pushgateway on prometheus1005 - T398091

Mentioned in SAL (#wikimedia-operations) [2025-06-30T07:41:41Z] <godog> restart prometheus-pushgateway on prometheus1005 with fresh state - T398091

Thank you for taking a look, @colewhite! I have removed the pgw state (rm /var/lib/prometheus/pushgateway.data) and started the pgw again to clear the existing metrics, so new pushes won't conflict again. Of course this is not ideal; my understanding is that newer (trixie) versions of pgw fixed this logging, though that is still to be investigated.

Change #1165020 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] pushgateway: rotate logs hourly

https://gerrit.wikimedia.org/r/1165020

Change #1165020 merged by Tiziano Fogli:

[operations/puppet@production] pushgateway: rotate logs hourly

https://gerrit.wikimedia.org/r/1165020

To avoid future issues, Pushgateway now writes its logs to a separate log file. That file is managed by a dedicated logrotate rule that is based on file size (maximum 1 GB) and executed hourly.
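
For readers unfamiliar with size-based rotation, a rule of this kind might look roughly like the sketch below. This is not the content of the merged patch: the log path, the retention count, and the compression settings are assumptions chosen for illustration. Only the 1 GB size threshold comes from the description above.

```
# Hypothetical path: the actual log file name is not stated in this task.
/var/log/prometheus-pushgateway.log {
    # Rotate only when the file is larger than 1 GB at the time logrotate runs.
    size 1G
    # Assumed retention: keep four rotated files, compressed.
    rotate 4
    compress
    # Do not fail if the file is absent, and skip rotation if it is empty.
    missingok
    notifempty
    # Depending on which process holds the file open, a postrotate hook
    # (or copytruncate) may also be needed so that logging continues
    # into the new file.
}
```

Note that the `size` check only happens when logrotate is invoked, so the rule has to be run by something hourly (for example a cron job or a systemd timer), and the file can still grow past 1 GB between two runs.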

tappof claimed this task.