
Port postgresql replication check to Prometheus/Alertmanager
Open, Needs Triage | Public | Goal

Description

Currently we deploy icinga checks for postgresql replication lag in postgresql::slave::monitoring:

    $icinga_command = "/usr/bin/check_postgres_hot_standby_delay \
--host=${pg_master},localhost --dbuser=${pg_user} \
--dbpass=${pg_password} --dbname=${pg_database} \
--warning=${warning} --critical=${critical}"

    nrpe::monitor_service { 'postgres-rep-lag':
        description  => $description,
        nrpe_command => $icinga_command,
        notes_url    => 'https://wikitech.wikimedia.org/wiki/Postgres#Monitoring',
        retries      => $retries,
    }

We also have a wrapper script for check_postgres_hot_standby_delay that exports its metrics/result for prometheus:

    file { '/usr/bin/prometheus_postgresql_replication_lag':
        owner   => 'root',
        group   => 'root',
        mode    => '0755',
        content => template('postgresql/prometheus/postgresql_replication_lag.sh.erb'),
    }

However, the script is executed periodically only by profile::maps::osm_replica, so not all pg slaves expose these metrics.

This task tracks extending the replication lag monitoring to all pg slaves (currently: profile::puppetdb::database, profile::netbox::db and profile::maps::osm_replica), then retiring the icinga check and deploying a generic prometheus/alertmanager alert instead.

! MIGRATION TABLE !

| Migrated? (Y/N) | Title | Resource Type | Command | File | Profiles |
| --- | --- | --- | --- | --- | --- |
| N | postgres-rep-lag | Nrpe::Monitor_service | /usr/bin/check_postgres_hot_standby_delay | modules/postgresql/manifests/slave/monitoring.pp:20 | profile::maps::osm_replica, profile::puppetdb::database, profile::netbox::db |

Event Timeline

Also note that prometheus-postgres-exporter has supported replication monitoring since 0.12.0. This is IMHO the proper solution, though it will require >= trixie unless we choose to backport the package instead: https://github.com/prometheus-community/postgres_exporter/blob/master/CHANGELOG.md#0120--2023-03-21
I think it is an option that can/should be considered, since it would future-proof the postgresql monitoring infrastructure.
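
To make the comparison with the icinga check concrete, here is a minimal sketch of how a lag value scraped from the exporter could be mapped onto the same warning/critical semantics. It is an illustration only: the metric name `pg_replication_lag_seconds` is assumed to be what the exporter's replication collector exposes, and the thresholds and function names are hypothetical. In practice this logic would live in an alerting rule rather than in a script.

```python
import math
import re

# Assumption: name of the lag metric exposed by the exporter's replication collector.
LAG_METRIC = "pg_replication_lag_seconds"

# Matches one sample line of the Prometheus text format: name, optional labels, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)'
)


def lag_seconds(exposition_text):
    """Return the first replication lag sample found in a scrape, or None."""
    for line in exposition_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group("name") == LAG_METRIC:
            return float(match.group("value"))
    return None


def classify(lag, warning, critical):
    """Map a lag value onto icinga-style states (thresholds in seconds)."""
    if lag is None or math.isnan(lag):
        return "UNKNOWN"
    if lag >= critical:
        return "CRITICAL"
    if lag >= warning:
        return "WARNING"
    return "OK"
```

For example, `classify(lag_seconds(scrape_text), warning=30, critical=60)` would report a state for a single replica, where the values 30 and 60 are placeholders and not the thresholds currently used by the icinga check.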

Change #1155129 had a related patch set uploaded (by Tiziano Fogli; author: Tiziano Fogli):

[operations/puppet@production] monitoring services: add migration task T374839 to instances

https://gerrit.wikimedia.org/r/1155129

Change #1155129 merged by Tiziano Fogli:

[operations/puppet@production] monitoring services: add migration task T374839 to instances

https://gerrit.wikimedia.org/r/1155129

tappof changed the subtype of this task from "Task" to "Goal". Sep 2 2025, 1:34 PM

I've upgraded the prometheus postgres exporters across the fleet to the version from trixie which is capable of replica monitoring. I also gave prometheus pg_monitor privileges where called for by the updated exporter. Next will be sorting out replication alerts using the updated metrics.

postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps1011:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="eqiad", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps1012:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="eqiad", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps1013:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="eqiad", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps1014:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="eqiad", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps2011:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="codfw", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps2012:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="codfw", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps2013:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="codfw", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="maps", goarch="amd64", goos="linux", goversion="go1.24.4", instance="maps2014:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="codfw", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="misc", goarch="amd64", goos="linux", goversion="go1.24.4", instance="netbox-dev2003:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="codfw", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="misc", goarch="amd64", goos="linux", goversion="go1.24.4", instance="netboxdb1003:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="eqiad", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="misc", goarch="amd64", goos="linux", goversion="go1.24.4", instance="netboxdb2003:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="codfw", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="puppet", goarch="amd64", goos="linux", goversion="go1.24.4", instance="puppetdb1003:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="eqiad", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="debian/sid", cluster="puppet", goarch="amd64", goos="linux", goversion="go1.24.4", instance="puppetdb2003:9187", job="postgresql", prometheus="ops", revision="0.17.1-1+b4", site="codfw", tags="unknown", version="0.17.1"}
postgres_exporter_build_info{branch="master", cluster="misc", goarch="amd64", goos="linux", goversion="go1.24.5", instance="gitlab1003:9187", job="postgresql", prometheus="ops", revision="68c176b8833b7580bf847cecf60f8e0ad5923f9a", site="eqiad", tags="unknown", version="0.15.0"}
postgres_exporter_build_info{branch="master", cluster="misc", goarch="amd64", goos="linux", goversion="go1.24.5", instance="gitlab1004:9187", job="postgresql", prometheus="ops", revision="68c176b8833b7580bf847cecf60f8e0ad5923f9a", site="eqiad", tags="unknown", version="0.15.0"}
postgres_exporter_build_info{branch="master", cluster="misc", goarch="amd64", goos="linux", goversion="go1.24.5", instance="gitlab2002:9187", job="postgresql", prometheus="ops", revision="68c176b8833b7580bf847cecf60f8e0ad5923f9a", site="codfw", tags="unknown", version="0.15.0"}

Note that gitlab was already at 0.15.0.
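
A listing like the one above can also be checked mechanically. The following is a small sketch, not part of the actual rollout, that reads `postgres_exporter_build_info` lines and reports instances running an exporter older than 0.12.0 (the first release with replication monitoring). The function names are hypothetical, and it assumes the `version` label is a plain dotted number as in the output above.

```python
import re

# First exporter release with replication monitoring, per the changelog linked earlier.
MIN_VERSION = (0, 12, 0)

# Extracts label="value" pairs from a metric line.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def version_tuple(version):
    """Turn '0.17.1' into (0, 17, 1); assumes a plain dotted numeric version."""
    return tuple(int(part) for part in version.split("."))


def instances_below(lines, minimum=MIN_VERSION):
    """Return the instance labels whose exporter version is below `minimum`."""
    outdated = []
    for line in lines:
        if not line.startswith("postgres_exporter_build_info"):
            continue
        labels = dict(LABEL_RE.findall(line))
        if version_tuple(labels["version"]) < minimum:
            outdated.append(labels["instance"])
    return outdated
```

Run against the output above, `instances_below(lines)` should return an empty list, since both 0.17.1 and 0.15.0 are newer than 0.12.0.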