
Reduce number of false positive alerts on postgresql lag for maps
Closed, ResolvedPublic

Description

The PostgreSQL lag Icinga check generates a number of alerts which recover quickly. It looks like the reported lag suddenly jumps from 0 to a high value (multiple hours) and then drops back to zero shortly afterwards. We need to investigate why this happens and reduce the number of false positives so they don't mask real issues.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. Apr 6 2017, 9:02 AM

Change 346710 had a related patch set uploaded (by Gehel):
[operations/puppet@production] maps - increase number of retries before alert for postgresql lag check

https://gerrit.wikimedia.org/r/346710

Change 346724 had a related patch set uploaded (by Gehel):
[operations/puppet@production] maps - publish postgresql replication lag to prometheus

https://gerrit.wikimedia.org/r/346724

Per discussion on IRC:

The plugin uses the following query:

SELECT
  CASE
    WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
    ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
  END AS log_delay;

This is flawed because now() - pg_last_xact_replay_timestamp() is not normally zero or small on a system with infrequent updates: if the master has not committed anything for hours, the last replayed transaction timestamp is hours old, so as soon as the receive and replay locations differ even slightly, the monitored number jumps from zero to a very high value, whereas a true measure of replication lag cannot jump instantly.

There's also an inherent flaw: you can't monitor replication by looking only at the slave, because if replication breaks, the slave will always consider itself up to date with the latest data it has fetched.

What Munin does is http://paste.debian.net/926232/. This fetches pg_current_xlog_location() from the master and pg_last_xlog_receive_location() / pg_last_xlog_replay_location() from the slave, converts their format to kB, and reports the differences. https://gist.github.com/Andor/2413097 is a Nagios version written in Python.
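
To make that concrete, here is a rough sketch of the queries the approach boils down to (a sketch only, assuming a pre-10 PostgreSQL where the functions still carry the xlog naming used above; this is not the actual plugin code):

-- on the master: the current WAL write position
SELECT pg_current_xlog_location();

-- on the slave: last WAL position received from the master, and last position replayed
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();

-- the plugin then converts the 'X/Y' hex positions to bytes/kB and reports the differences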

More recent versions of PostgreSQL have the pg_xlog_location_diff() function, which would help since it would allow all the monitoring to be done by connecting only to the master.
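
As a minimal illustration (the positions below are made-up literals, assuming PostgreSQL 9.2+ where the function exists), pg_xlog_location_diff() returns the distance between two WAL positions in bytes:

-- 0x5000060 - 0x5000000 = 96 bytes
SELECT pg_xlog_location_diff('0/5000060', '0/5000000');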

Other links:
https://exchange.nagios.org/directory/Plugins/Databases/PostgresQL/check_pg_streaming_replication/details
https://www.enterprisedb.com/blog/monitoring-approach-streaming-replication-hot-standby-postgresql-93
https://bucardo.org/wiki/Check_postgres (specifically https://github.com/bucardo/check_postgres/blob/master/check_postgres.pl#L5052)

Change 346724 merged by Gehel:
[operations/puppet@production] maps - publish postgresql replication lag to prometheus

https://gerrit.wikimedia.org/r/346724

Graphs of the postgres replication lag are available on Grafana.

I think it might happen when a VACUUM is running on the master; at least today, when we have a lot of delay on the maps-test cluster, I've noticed that a VACUUM has been running for ~15 hours:

datid            | 16385
datname          | gis
pid              | 24208
usesysid         | 10
usename          | postgres
application_name |
client_addr      |
client_hostname  |
client_port      |
backend_start    | 2017-04-05 06:38:54.130002+00
xact_start       | 2017-04-05 06:38:54.152014+00
query_start      | 2017-04-05 06:38:54.152014+00
state_change     | 2017-04-05 06:38:54.152015+00
waiting          | f
state            | active
backend_xid      |
backend_xmin     | 28042809
query            | autovacuum: VACUUM public.planet_osm_ways
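
For reference, a quick way to spot such long-running autovacuum workers (a sketch, assuming PostgreSQL 9.2+, where pg_stat_activity exposes the statement text in the query column) is something like:

-- long-running (auto)vacuum backends, oldest first
SELECT pid, now() - xact_start AS xact_runtime, state, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%'
ORDER BY xact_start;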

I think the Nagios check we have is flapping because, when there is a minimal difference in the transmitted WAL positions, it calculates the lag from pg_last_xact_replay_timestamp(), which looks old, but I guess that's because we're not writing to the master at all. Is that correct? (It might depend on how we've configured the replication, but usually it's based on the binary on-disk representation, not the actual transactions.)
We already have the master position because we query SELECT * FROM pg_stat_replication;, so we could just take the most appropriate *_location column printed there and use pg_xlog_location_diff() to get the diff in bytes and alert on that (or get a better position with another query).

My 2 random cents from looking at it quickly tonight when I saw it was flapping.
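
A sketch of what that suggestion could look like, run on the master (assuming PostgreSQL 9.2+ and the pg_stat_replication column names of that release line; this is not the check that was eventually deployed):

-- byte lag of each connected standby, at the sent and replay stages
SELECT application_name,
       client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), sent_location)   AS sent_lag_bytes,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;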

Change 346962 had a related patch set uploaded (by Gehel):
[operations/puppet@production] postgresql - introduce the check-postgres package for postgres monitoring

https://gerrit.wikimedia.org/r/346962

Change 346963 had a related patch set uploaded (by Gehel):
[operations/puppet@production] [WIP] postgresql - use the check-postgres package for icinga checks

https://gerrit.wikimedia.org/r/346963

Change 346964 had a related patch set uploaded (by Gehel):
[operations/puppet@production] postgresql - cleanup dead code after migration to check-postgres package

https://gerrit.wikimedia.org/r/346964

Change 346962 merged by Gehel:
[operations/puppet@production] postgresql - introduce the check-postgres package for postgres monitoring

https://gerrit.wikimedia.org/r/346962

Change 346963 merged by Gehel:
[operations/puppet@production] postgresql - use the check-postgres package for icinga checks

https://gerrit.wikimedia.org/r/346963

Mentioned in SAL (#wikimedia-operations) [2017-04-10T14:31:12Z] <gehel> deploying new postgresql replication check, might generate a few icinga alerts - T162345

Change 346964 merged by Gehel:
[operations/puppet@production] postgresql - cleanup dead code after migration to check-postgres package

https://gerrit.wikimedia.org/r/346964

No new false positives since the changes have been deployed. Things are looking good!

Change 346710 abandoned by Gehel:
maps - increase number of retries before alert for postgresql lag check

Reason:
Better solution is to check lag in bytes instead of time.

https://gerrit.wikimedia.org/r/346710

debt added a subscriber: debt.

This has been deployed and is working well. :)