The lag icinga check generates a number of alerts which recover quickly. It looks like the lag suddenly jumps from 0 to a high value (multiple hours) and then quickly goes back to zero. We need to investigate why, and reduce the number of false positives so that they do not mask real issues.
|operations/puppet : production||maps - increase number of retries before alert for postgresql lag check|
|operations/puppet : production||postgresql - cleanup dead code after migration to check-postgres package|
|operations/puppet : production||postgresql - use the check-postgres package for icinga checks|
|operations/puppet : production||postgresql - introduce the check-postgres package for postgres monitoring|
|operations/puppet : production||maps - publish postgresql replication lag to prometheus|
Per discussion on IRC.
The plugin uses:
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
This is flawed: on a system with infrequent writes, now() - pg_last_xact_replay_timestamp() is not normally zero or even small, so as soon as the receive and replay locations differ at all, the monitored number jumps from zero to a very high value, whereas a true measure of replication lag cannot jump instantly.
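To see the raw values the check combines, one can run something like this on the replica (a minimal sketch, assuming a 9.x replica where the xlog functions below exist):

-- On a write-quiet master, pg_last_xact_replay_timestamp() stays at the last
-- replayed commit, so the ELSE branch reports hours of apparent lag the moment
-- the receive and replay locations differ at all.
SELECT pg_last_xlog_receive_location() AS receive_location,
       pg_last_xlog_replay_location()  AS replay_location,
       pg_last_xact_replay_timestamp() AS last_replayed_commit,
       EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) AS apparent_lag_seconds;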
There's also an inherent flaw: you can't monitor replication by looking only at the slave, because if replication breaks, the slave will always appear up to date relative to the latest data it has fetched.
What munin does is http://paste.debian.net/926232/. This fetches pg_current_xlog_location() from the master and pg_last_xlog_receive_location() and pg_last_xlog_replay_location() from the slave, converts the locations to kilobytes, and reports the differences. https://gist.github.com/Andor/2413097 is a nagios version written in Python.
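The location-to-bytes conversion those scripts do can also be done in plain SQL (a sketch, assuming 9.3+ semantics where a location 'X/Y' means byte X * 2^32 + Y; the literal below is just an example value):

-- Convert an xlog location of the form 'X/Y' (both hex) to an absolute byte
-- position; divide by 1024 to get the kB figure munin graphs.
SELECT ('x' || lpad(split_part('16/B374D848', '/', 1), 16, '0'))::bit(64)::bigint * 4294967296
     + ('x' || lpad(split_part('16/B374D848', '/', 2), 16, '0'))::bit(64)::bigint AS xlog_bytes;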
More recent versions of PostgreSQL have the pg_xlog_location_diff() function, which would help since it allows all the monitoring to be done by connecting only to the master.
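For example (a sketch; pg_xlog_location_diff() exists from 9.2 on, and the second argument would be a location reported by the slave rather than the example literal used here):

-- Bytes between the master's current write position and a given location.
SELECT pg_xlog_location_diff(pg_current_xlog_location(), '16/B374D848') AS lag_bytes;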
https://bucardo.org/wiki/Check_postgres (specifically https://github.com/bucardo/check_postgres/blob/master/check_postgres.pl#L5052)
I think it might happen when a VACUUM is running on the master. At least today, while we have a lot of apparent delay on the maps-test cluster, I've noticed that a VACUUM has been running for ~15 hours:
datid            | 16385
datname          | gis
pid              | 24208
usesysid         | 10
usename          | postgres
application_name |
client_addr      |
client_hostname  |
client_port      |
backend_start    | 2017-04-05 06:38:54.130002+00
xact_start       | 2017-04-05 06:38:54.152014+00
query_start      | 2017-04-05 06:38:54.152014+00
state_change     | 2017-04-05 06:38:54.152015+00
waiting          | f
state            | active
backend_xid      |
backend_xmin     | 28042809
query            | autovacuum: VACUUM public.planet_osm_ways
I think the nagios check we have is flapping because, whenever there is even a minimal difference between the transmitted binary (WAL) locations, it computes the lag from pg_last_xact_replay_timestamp(), which looks old, but I guess that's because we're not writing to the master at all. Is that correct? (It might depend on how we've configured replication, but it's usually based on the binary on-disk representation, not on the actual transactions.)
We already have the master position, because we query SELECT * FROM pg_stat_replication;, so we could just take the most appropriate *_location column printed there and use pg_xlog_location_diff() to get the difference in bytes and alert on that (or get a better position with another query).
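Something along these lines should work (a sketch against a 9.x master; replay_location is one of the *_location columns of pg_stat_replication):

-- Bytes of WAL each connected replica still has to replay, computed entirely
-- on the master.
SELECT application_name,
       client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;

Note that a replica whose connection has dropped simply disappears from pg_stat_replication, so the check should also alert when an expected replica is missing (which would also address the slave-only-monitoring flaw mentioned above).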
My two random cents from a quick look tonight, when I saw it flapping.