Currently we deploy Icinga checks for PostgreSQL replication lag in postgresql::slave::monitoring:
$icinga_command = "/usr/bin/check_postgres_hot_standby_delay \
--host=${pg_master},localhost --dbuser=${pg_user} \
--dbpass=${pg_password} --dbname=${pg_database} \
--warning=${warning} --critical=${critical}"
nrpe::monitor_service { 'postgres-rep-lag':
    description  => $description,
    nrpe_command => $icinga_command,
    notes_url    => 'https://wikitech.wikimedia.org/wiki/Postgres#Monitoring',
    retries      => $retries,
}

We also have a wrapper script for check_postgres_hot_standby_delay that exports the metrics/result for Prometheus:
file { '/usr/bin/prometheus_postgresql_replication_lag':
    owner   => 'root',
    group   => 'root',
    mode    => '0755',
    content => template('postgresql/prometheus/postgresql_replication_lag.sh.erb'),
}

However, the script is executed periodically only by profile::maps::osm_replica, so not all PostgreSQL slaves export these metrics.
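For context, the export step of such a wrapper can be sketched roughly as below. This is not the content of postgresql_replication_lag.sh.erb: the function name `format_lag_metric` and the metric name `postgresql_replication_lag_example` are hypothetical, and the sketch only shows how a lag value already obtained from the check could be rendered in the Prometheus text exposition format.

```shell
#!/bin/sh
# Hypothetical sketch, not the real template: render a replication lag
# value in the Prometheus text exposition format.
# The metric name below is an assumption chosen for illustration.
format_lag_metric() {
    lag="$1"
    printf '# HELP postgresql_replication_lag_example Replication lag reported by the check\n'
    printf '# TYPE postgresql_replication_lag_example gauge\n'
    printf 'postgresql_replication_lag_example %s\n' "$lag"
}

# Example: print the metric for a lag value of 1024.
format_lag_metric 1024
```

A wrapper along these lines would usually redirect the output to a file picked up by node_exporter's textfile collector, writing to a temporary file and renaming it so that a partially written file is never scraped. Whether the existing script does this is not stated here.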
This task tracks extending the replication lag monitoring to all PostgreSQL slaves (currently: profile::puppetdb::database, profile::netbox::db and profile::maps::osm_replica), then retiring the Icinga check and deploying a generic Prometheus/Alertmanager alert instead.
! MIGRATION TABLE !
| Migrated? (Y/N) | Title | Resource Type | Command | File | Profiles |
|---|---|---|---|---|---|
| N | postgres-rep-lag | Nrpe::Monitor_service | /usr/bin/check_postgres_hot_standby_delay | modules/postgresql/manifests/slave/monitoring.pp:20 | profile::maps::osm_replica, profile::puppetdb::database, profile::netbox::db |