Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent
Open, Needs Triage, Public

Description

Gobblin metrics pushed to the pushgateway are generating tens of GB of logs per day. On 2025-06-27, prometheus1005 ran out of disk space because of the volume of these logs.

prometheus-pushgateway
{
  "ts": "2025-06-27T00:00:37.679Z",
  "caller": "diskmetricstore.go:166",
  "level": "info",
  "msg": "metric families inconsistent help strings",
  "err": "Metric families have inconsistent help strings. The latter will have priority. This is bad. Fix your pushed metrics!",
  "new": {
    "name": "gobblin_kafka_extracted_records_expected_count",
    "help": "Number of records expected to be extracted. This is expected high watermark - actual low watermark ",
    "type": "GAUGE",
    "metric": [
      [
        {
          "label": {
            "name": "instance",
            "value": ""
          }
        },
        {
          "label": {
            "name": "job",
            "value": "webrequest"
          }
        },
        {
          "label": {
            "name": "kafka_partition",
            "value": "20"
          }
        },
        {
          "label": {
            "name": "kafka_topic",
            "value": "webrequest_text"
          }
        },
        {
          "label": {
            "name": "reporter_type",
            "value": "EVENT"
          }
        },
        {
          "gauge": {
            "value": "1.0591e+06"
          }
        }
      ]
    ]
  },
  "old": {
    "name": "gobblin_kafka_extracted_records_expected_count",
    "help": "Number of records expected to be extracted. This is expected high watermark - actual low watermark",
    "type": "GAUGE",
    "metric": [
      [
        {
          "label": {
            "name": "instance",
            "value": ""
          }
        },
        {
          "label": {
            "name": "job",
            "value": "eventlogging_legacy"
          }
        },
        {
          "label": {
            "name": "kafka_partition",
            "value": "0"
          }
        },
        {
          "label": {
            "name": "kafka_topic",
            "value": "eventlogging_MediaWikiPingback"
          }
        },
        {
          "label": {
            "name": "reporter_type",
            "value": "EVENT"
          }
        },
        {
          "gauge": {
            "value": "27224"
          }
        }
      ],
      [
        {
          "label": {
            "name": "instance",
            "value": ""
          }
        },
        {
          "label": {
            "name": "job",
            "value": "webrequest_frontend"
          }
        },
        {
          "label": {
            "name": "kafka_partition",
            "value": "255"
          }
        },
        {
          "label": {
            "name": "kafka_topic",
            "value": "webrequest_frontend_text"
          }
        },
        {
          "label": {
            "name": "reporter_type",
            "value": "EVENT"
          }
        },
        {
          "gauge": {
            "value": "886341"
          }
        }
      ]
    ]
  }
}

Details

Event Timeline

Change #1164862 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: split pushgateway logs

https://gerrit.wikimedia.org/r/1164862

Change #1164862 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: split pushgateway logs

https://gerrit.wikimedia.org/r/1164862

This was caused by a single trailing whitespace character, as far as I can see (!)

"help": "Number of records expected to be extracted. This is expected high watermark - actual low watermark ",

vs

"help": "Number of records expected to be extracted. This is expected high watermark - actual low watermark",
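The conflict above can be illustrated with a small stdlib-only sketch: the two HELP strings compare unequal, and so the pushgateway treats them as inconsistent help texts for the same metric family, even though they differ only in trailing whitespace.

```python
# Illustration of the conflict in the log above: the two HELP strings
# differ only by a trailing space, which the pushgateway treats as two
# distinct (inconsistent) help texts for the same metric family.
new_help = ("Number of records expected to be extracted. "
            "This is expected high watermark - actual low watermark ")
old_help = ("Number of records expected to be extracted. "
            "This is expected high watermark - actual low watermark")

assert new_help != old_help           # what the pushgateway sees: a conflict
assert new_help.rstrip() == old_help  # yet only trailing whitespace differs
```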

The conflict is definitely a problem in itself; we'll be addressing the operational problem in the parent task. In the meantime, I've reset the pushgateway state so new pushes no longer conflict.

Filippo removed the pushgateway state (rm /var/lib/prometheus/pushgateway.data) and restarted the pushgateway to clear the existing metrics, so new pushes won't conflict again. Of course this is not ideal: newer (trixie) versions of the pushgateway may have fixed this logging (to be investigated).
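Pending a pushgateway-side fix, one pusher-side mitigation would be to normalize the text exposition payload before pushing, so that HELP lines differing only in trailing whitespace cannot collide. A minimal stdlib-only sketch (the function name is hypothetical and not part of Gobblin or prometheus_client):

```python
def normalize_help_lines(payload: str) -> str:
    """Strip trailing whitespace from '# HELP' lines of a Prometheus
    text-format payload, so help strings that differ only in trailing
    whitespace become identical before they reach the pushgateway."""
    lines = []
    for line in payload.splitlines():
        if line.startswith("# HELP "):
            line = line.rstrip()
        lines.append(line)
    # The Prometheus text format expects a trailing newline.
    return "\n".join(lines) + "\n"
```

Running the payload through such a normalizer at push time would have made the two help strings in this task identical, avoiding the conflict entirely.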

What's the latest status with this? Is there still an issue with gobblin metrics in particular, or is everything resolved? Thanks.

The issue was resolved after deleting the Pushgateway data.
However, the task remains valid to verify whether the Trixie version of Pushgateway fixes the issue without requiring data deletion.

OK, great. Thanks for that explanation. Would it be best to tag observability instead of Data-Platform-SRE - if you're planning to carry out the work?
If you would like us to do it, then let's discuss. Thanks.

Gehel added a project: observability.
Gehel subscribed.

Moving this to watching as I'm assuming that observability is moving this forward.

hnowlan claimed this task.
hnowlan subscribed.

Resolving this issue for now, in order to track work elsewhere.

hnowlan edited projects, added SRE Observability; removed observability, Data-Platform-SRE.

Reopening, moving to o11y backlog.

hnowlan renamed this task from Inconsistent Prometheus metrics generating many logs to Ensure pushgateway 1.11.0 avoids log spam when metric help strings are inconsistent. Sep 10 2025, 2:45 PM
hnowlan removed hnowlan as the assignee of this task.
hnowlan moved this task from Inbox to FY2025/2026-Q2 on the SRE Observability board.