The lag Icinga check generates a number of alerts which recover quickly. It looks like the reported lag suddenly jumps from 0 to a high value (multiple hours) and then quickly drops back to zero. We need to investigate why this happens and reduce the number of false positives so that they don't mask real issues.
Event Timeline
Change 346710 had a related patch set uploaded (by Gehel):
[operations/puppet@production] maps - increase number of retries before alert for postgresql lag check
Change 346724 had a related patch set uploaded (by Gehel):
[operations/puppet@production] maps - publish postgresql replication lag to prometheus
Per discussion on IRC:
The plugin uses
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
This is flawed: on a system with infrequent updates, now() - pg_last_xact_replay_timestamp() is not normally zero or a small value, so the monitored number can go from zero to very high as soon as the receive and replay locations differ at all, whereas a true measure of replication lag can't jump instantly.
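To illustrate (hypothetical output, not taken from the maps cluster): on an otherwise idle replica, the replay timestamp is simply the commit time of the last transaction ever replayed, so the ELSE branch of the query above can report hours even when the slave is fully caught up:

SELECT pg_last_xlog_receive_location()         AS received,
       pg_last_xlog_replay_location()          AS replayed,
       now() - pg_last_xact_replay_timestamp() AS apparent_lag;
--   received   |  replayed   |  apparent_lag
--  16/B374D848 | 16/B374D848 | 05:23:41.123456   (hypothetical: no writes for ~5 hours)

The moment the master writes anything, the receive and replay locations briefly differ, the CASE falls through to that stale value, and the check jumps from 0 to several hours before dropping back once pg_last_xact_replay_timestamp() catches up.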
There's also an inherent flaw: you can't monitor replication by looking only at the slave, because if replication breaks, the slave will always consider itself up to date with respect to the latest data it has fetched.
What munin does is http://paste.debian.net/926232/. This fetches pg_current_xlog_location() from the master and pg_last_xlog_receive_location() / pg_last_xlog_replay_location() from the slave, converts the locations to kB, and reports the differences. https://gist.github.com/Andor/2413097 is a Nagios version written in Python.
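For reference, a rough SQL-only sketch of the same conversion, assuming the 9.3+ layout where a location 'high/low' corresponds to byte position high * 2^32 + low (the location value below is hypothetical; the munin plugin reads the real ones from the master and the slave):

-- convert the 'high/low' hex location into an absolute byte position
SELECT ('x' || lpad(split_part('16/B374D848', '/', 1), 8, '0'))::bit(32)::bigint * 4294967296
     + ('x' || lpad(split_part('16/B374D848', '/', 2), 8, '0'))::bit(32)::bigint
       AS wal_byte_position;

Doing this for both the master and the slave locations and subtracting the two numbers (then dividing by 1024) gives the lag in kB that munin graphs.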
More recent versions of PostgreSQL have the pg_xlog_location_diff() function, which would help since it would allow all the monitoring to be done by connecting only to the master.
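For example, with two hypothetical locations:

SELECT pg_xlog_location_diff('16/B374D848', '16/B3748000') AS lag_bytes;
-- lag_bytes = 22600 for these values

This replaces the manual hex arithmetic above, and can take the master's current location and a slave's replay location as its two arguments.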
Other links:
https://exchange.nagios.org/directory/Plugins/Databases/PostgresQL/check_pg_streaming_replication/details
https://www.enterprisedb.com/blog/monitoring-approach-streaming-replication-hot-standby-postgresql-93
https://bucardo.org/wiki/Check_postgres (specifically https://github.com/bucardo/check_postgres/blob/master/check_postgres.pl#L5052)
Change 346724 merged by Gehel:
[operations/puppet@production] maps - publish postgresql replication lag to prometheus
I think it might happen when a VACUUM is running on the master; at least today, when we have a lot of delay on the maps-test cluster, I noticed that a VACUUM has been running for ~15h:
datid            | 16385
datname          | gis
pid              | 24208
usesysid         | 10
usename          | postgres
application_name |
client_addr      |
client_hostname  |
client_port      |
backend_start    | 2017-04-05 06:38:54.130002+00
xact_start       | 2017-04-05 06:38:54.152014+00
query_start      | 2017-04-05 06:38:54.152014+00
state_change     | 2017-04-05 06:38:54.152015+00
waiting          | f
state            | active
backend_xid      |
backend_xmin     | 28042809
query            | autovacuum: VACUUM public.planet_osm_ways
I think that the Nagios check we have is flapping because, when there is a minimal difference in the transmitted binary (WAL) location, it actually calculates the difference with pg_last_xact_replay_timestamp(), which looks old, but I guess that's because we're not writing to the master at all. Is that correct? (It might depend on how we've configured replication, but usually it works on the on-disk binary representation, not on the actual transactions.)
We already have the master's position because we query SELECT * FROM pg_stat_replication; so we could just take the most useful *_location column printed there, use pg_xlog_location_diff() to get the difference in bytes, and alert on that (or get a better position with another query).
My two random cents from looking at it quickly tonight when I saw it flapping.
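A minimal sketch of that idea, run on the master and assuming the pre-10 pg_stat_replication column names (replay_location is probably the one to alert on, since it reflects what the slave has actually applied):

SELECT client_addr,
       application_name,
       pg_xlog_location_diff(pg_current_xlog_location(), sent_location)   AS sent_lag_bytes,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;

This also covers the "replication is broken" case from the master's side: a disconnected slave simply disappears from pg_stat_replication, which can be alerted on separately.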
Change 346962 had a related patch set uploaded (by Gehel):
[operations/puppet@production] postgresql - introduce the check-postgres package for postgres monitoring
Change 346963 had a related patch set uploaded (by Gehel):
[operations/puppet@production] [WIP] postgresql - use the check-postgres package for icinga checks
Change 346964 had a related patch set uploaded (by Gehel):
[operations/puppet@production] postgresql - cleanup dead code after migration to check-postgres package
Change 346962 merged by Gehel:
[operations/puppet@production] postgresql - introduce the check-postgres package for postgres monitoring
Change 346963 merged by Gehel:
[operations/puppet@production] postgresql - use the check-postgres package for icinga checks
Mentioned in SAL (#wikimedia-operations) [2017-04-10T14:31:12Z] <gehel> deploying new postgresql replication check, might generate a few icinga alerts - T162345
Change 346964 merged by Gehel:
[operations/puppet@production] postgresql - cleanup dead code after migration to check-postgres package
Change 346710 abandoned by Gehel:
maps - increase number of retries before alert for postgresql lag check
Reason:
A better solution is to check lag in bytes instead of time.