
Netbox: add Icinga check for PostgreSQL
Closed, Resolved (Public)

Description

We should add an Icinga check on the health of PostgreSQL on the Netbox instances and their replica.

Event Timeline

Volans triaged this task as Medium priority. Jan 22 2018, 4:22 PM
Volans created this task.

Change 425227 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: import check_postgres.pl

https://gerrit.wikimedia.org/r/425227

We already have some puppet code to monitor postgres replication lag (https://github.com/wikimedia/puppet/blob/production/modules/postgresql/manifests/slave/monitoring.pp). It might not be enabled for netbox. That check is based on the same plugin as proposed by @Dzahn above, but it uses the debian package. We don't have much monitoring of postgres masters, and should definitely add some!
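For reference, enabling that existing check for Netbox would presumably be a short class declaration along these lines. This is a sketch only: the class name comes from the linked module, but the parameter names are assumptions that need checking against the manifest.

```puppet
# Hypothetical sketch: wire up the existing replication-lag check for Netbox.
# Parameter names below are guesses -- verify against
# modules/postgresql/manifests/slave/monitoring.pp before use.
class { '::postgresql::slave::monitoring':
    pg_master   => 'netmon1002.wikimedia.org',
    pg_user     => 'replication',
    pg_password => $replication_password,
}
```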

Change 425227 abandoned by Dzahn:
icinga: import check_postgres.pl

https://gerrit.wikimedia.org/r/425227

Thanks Gehel for pointing that out. Using the existing package is the better option. Abandoned.

Dzahn renamed this task from Netbox: add Icinga check for PosgreSQL to Netbox: add Icinga check for PostgreSQL. Apr 14 2018, 12:13 AM

The package providing /usr/bin/check_postgres_hot_standby_delay is check-postgres

https://tracker.debian.org/pkg/check-postgres

Confirmed this is already installed on netmon1002.

Change 426326 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netbox: add postgreSQL slave monitoring

https://gerrit.wikimedia.org/r/426326

Change 426326 merged by Dzahn:
[operations/puppet@production] netbox: add postgreSQL slave monitoring

https://gerrit.wikimedia.org/r/426326

`ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on
FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL off`

Regarding a health check on the master itself: which type of check exactly would that be? So far we only have the replication check on the slaves, and check_postgres ships all of these aliases that we could use:

check_postgres_archive_ready
check_postgres_autovac_freeze
check_postgres_backends
check_postgres_bloat
check_postgres_checkpoint
check_postgres_cluster_id
check_postgres_commitratio
check_postgres_connection
check_postgres_custom_query
check_postgres_database_size
check_postgres_dbstats
check_postgres_disabled_triggers
check_postgres_disk_space
check_postgres_fsm_pages
check_postgres_fsm_relations
check_postgres_hitratio
check_postgres_hot_standby_delay
check_postgres_index_size
check_postgres_last_analyze
check_postgres_last_autoanalyze
check_postgres_last_autovacuum
check_postgres_last_vacuum
check_postgres_listener
check_postgres_locks
check_postgres_logfile
check_postgres_new_version_bc
check_postgres_new_version_box
check_postgres_new_version_cp
check_postgres_new_version_pg
check_postgres_new_version_tnm
check_postgres_pgagent_jobs
check_postgres_pgbouncer_backends
check_postgres_pgbouncer_checksum
check_postgres_pgb_pool_cl_active
check_postgres_pgb_pool_cl_waiting
check_postgres_pgb_pool_maxwait
check_postgres_pgb_pool_sv_active
check_postgres_pgb_pool_sv_idle
check_postgres_pgb_pool_sv_login
check_postgres_pgb_pool_sv_tested
check_postgres_pgb_pool_sv_used
check_postgres_prepared_txns
check_postgres_query_runtime
check_postgres_query_time
check_postgres_relation_size
check_postgres_replicate_row
check_postgres_same_schema
check_postgres_sequence
check_postgres_settings_checksum
check_postgres_slony_status
check_postgres_table_size
check_postgres_timesync
check_postgres_txn_idle
check_postgres_txn_time
check_postgres_txn_wraparound
check_postgres_version
check_postgres_wal_files

`ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on
FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL off`

It looks like the script is trying to connect over IPv6, but /etc/postgresql/9.6/main/pg_hba.conf only has entries for IPv4.

As to what to monitor on the master, I have no idea what makes the most sense. I'm sure that @Pnorman knows!

The answer in general is: it depends. Are you looking for monitoring to diagnose problems, or alarms for health?

I would recommend monitoring for maximum transaction length (for idle transactions), database size, and open connections (compare against max connections). These are all metrics which will go outside normal for a few common problems and are easy to interpret and alarm.
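Expressed as NRPE command definitions in the style of the existing replication-lag check, those three recommendations might look roughly like this. The thresholds are illustrative placeholders, not tuned values, and `<monitoring-user>` stands in for whatever monitoring role exists; check_postgres accepts time and size units for these actions.

```
# Hypothetical nrpe.d fragments -- thresholds are placeholders to be tuned.
# Longest transaction age (catches idle-in-transaction sessions):
command[check_postgres-txn-time]=/usr/bin/check_postgres_txn_time --host=localhost --dbuser=<monitoring-user> --warning='5 minutes' --critical='30 minutes'
# Total database size:
command[check_postgres-db-size]=/usr/bin/check_postgres_database_size --host=localhost --dbuser=<monitoring-user> --warning='10 GB' --critical='20 GB'
# Open connections, as a percentage of max_connections:
command[check_postgres-backends]=/usr/bin/check_postgres_backends --host=localhost --dbuser=<monitoring-user> --warning='80%' --critical='90%'
```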

There are other stats which are important, but require more knowledge to interpret. For an example of some munin-based monitoring on a busy postgres server with one database and a primarily read load, see https://munin.openstreetmap.org/openstreetmap/orm.openstreetmap/index.html#postgresql

Looking at https://bucardo.org/check_postgres/check_postgres.pl.html, the following look like they have reasonable defaults for simple alarming:

  • autovac_freeze
  • backends (a way of alarming on open connections)
  • txn_wraparound
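Since those three actions ship with built-in default thresholds, the corresponding NRPE definitions could presumably stay minimal (again a sketch; `<monitoring-user>` is a placeholder):

```
# Hypothetical nrpe.d fragments relying on check_postgres' built-in defaults.
command[check_postgres-autovac-freeze]=/usr/bin/check_postgres_autovac_freeze --host=localhost --dbuser=<monitoring-user>
command[check_postgres-backends]=/usr/bin/check_postgres_backends --host=localhost --dbuser=<monitoring-user>
command[check_postgres-txn-wraparound]=/usr/bin/check_postgres_txn_wraparound --host=localhost --dbuser=<monitoring-user>
```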

Adding to the above, I would say that most of the other monitoring that can be done can be broken down into performance related metrics, like transactions per second (tps), bgwriters, checkpoints, scans/s by type, tuples/s, etc; and vacuum related metrics, like time since last vacuum, bloat, etc.

Both require some knowledge of what is expected for your server and are hard to alarm on. E.g., we will have some static tables in maps. These only get vacuumed when they get reloaded by a cron job, so any vacuum alarming will have to be set up to be aware of that. Some tables are very small and can be very bloated by percentage while still only being a few kB. Your server might normally run between 100 and 200 tps; if it drops to 2 tps and stays there, you want to know, and if it goes to 1k tps, you also want to know. But those rates might be normal for another server.

Thanks @Gehel and @Pnorman!

I would say let's start with this one:

It looks like the script is trying to connect over IPv6, but /etc/postgresql/9.6/main/pg_hba.conf only has entries for IPv4.

I'll try to add those missing IPv6 entries (instead of trying to stop the script from using IPv6).

Looking at the netbox module it seems /etc/postgresql/9.6/main/pg_hba.conf isn't puppetized while it does contain custom "netbox" strings. So i would say first step is adding that file to puppet as a template.

Looking at the netbox module it seems /etc/postgresql/9.6/main/pg_hba.conf isn't puppetized while it does contain custom "netbox" strings. So i would say first step is adding that file to puppet as a template.

There are a bunch of (somewhat ugly) defines in the postgresql puppet module that can be used to manage users. postgresql::user might or might not do what you need...
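If that route is taken, usage might look vaguely like the following. This is entirely hypothetical: the resource title and every parameter name would need checking against modules/postgresql/manifests/user.pp.

```puppet
# Hypothetical sketch only -- verify all parameters against the actual define.
postgresql::user { 'replication@netmon2001':
    user     => 'replication',
    password => $replication_password,
    cidr     => '2620:0:860:4:208:80:153:110/128',
    database => 'template1',
}
```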

We've had the following Icinga UNKNOWN on netmon2001 for the past 6 days:

Postgres Replication Lag - ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on

I've noticed that /etc/postgresql/9.6/main/pg_hba.conf differs between netmon1002 and netmon2001 (among other things, the latter has no entry for the address mentioned in the Icinga error).

It's also unclear to me whether UNKNOWN is the proper severity for this issue, it should perhaps be CRITICAL instead.

It's also unclear to me whether UNKNOWN is the proper severity for this issue, it should perhaps be CRITICAL instead.

UNKNOWN seems reasonable to me. The script could not authenticate and thus the check was not performed. We might want to add a connection check ([[ https://bucardo.org/check_postgres/check_postgres.pl.html#connection | check_postgres_connection ]]) to test that the appropriate users can connect.
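Such a connection check could presumably be wired up the same way as the replication-lag check, for example (sketch only; `<monitoring-user>` is a placeholder):

```
# Hypothetical nrpe.d fragment for a basic "can we connect at all" check.
command[check_postgres-connection]=/usr/bin/check_postgres_connection --host=localhost --dbuser=<monitoring-user> --dbname=template1
```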

What is the current status on this?

I still see the UNKNOWN on the netmon2001 Icinga checks and no check on netmon1002 for the PostgreSQL process, but there might have been offline progress I'm not aware of.
Given that switching over to Netbox is part of this quarter's goals (see T199083), I think we should progress/complete this to have a stable check configuration, as required for a production service.

I fixed replication between netmon1002 and netmon2001.

The check turned green now:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon2001&service=Postgres+Replication+Lag

The reason was that the config had "USER DATABASE" instead of "DATABASE USER", and the database it tries to connect to is called "template1", not "replication"; only the user is called "replication".
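For reference, pg_hba.conf fields are positional (TYPE, DATABASE, USER, ADDRESS, METHOD, in that order), so the corrected entry for this check would be along these lines, using the replica's IPv6 address from the earlier error:

```
# TYPE  DATABASE   USER         ADDRESS                          METHOD
host    template1  replication  2620:0:860:4:208:80:153:110/128  md5
```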

We've had the following Icinga UNKNOWN on netmon2001

This works now:

@netmon2001:/etc/nagios/nrpe.d# cat check_postgres-rep-lag.cfg 

# File generated by puppet. DO NOT edit by hand
command[check_postgres-rep-lag]=/usr/bin/check_postgres_hot_standby_delay --host=netmon1002.wikimedia.org,localhost --dbuser=replication --dbpass=<redacted> -dbname=template1 --warning=1048576 --critical=16777216

@netmon2001:# /usr/bin/check_postgres_hot_standby_delay --host=netmon1002.wikimedia.org,localhost --dbuser=replication --dbpass=<redacted> -dbname=template1 --warning=1048576 --critical=16777216

POSTGRES_HOT_STANDBY_DELAY OK: DB "template1" (host:localhost) 0 and 1 seconds | time=0.10s replay_delay=0;1048576;16777216  receive-delay=0;1048576;16777216 time_delay=1;

But the pg_hba.conf file that was adjusted still needs to be puppetized in some way.

Dzahn raised the priority of this task from Medium to High. Aug 22 2018, 11:14 PM

Change 454723 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netbox: fix psql client auth for monitoring check

https://gerrit.wikimedia.org/r/454723

Change 454723 merged by Dzahn:
[operations/puppet@production] netbox: fix psql client auth for monitoring check

https://gerrit.wikimedia.org/r/454723

Fixed after the change above.

on netmon1002 (master), pg_hba.conf

+host	netbox	replication	2620:0:860:4:208:80:153:110/128	md5

on netmon2001 (slave), nagios nrpe config

+command[check_postgres-rep-lag]=/usr/bin/check_postgres_hot_standby_delay --host=netmon1002.wikimedia.org,localhost --dbuser=replication --dbpass=<redacted> -dbname=netbox --warning=1048576 --critical=16777216

result:

POSTGRES_HOT_STANDBY_DELAY OK: DB "netbox" (host:localhost) 0 and 95334 seconds | time=0.10s replay_delay=0;1048576;16777216  receive-delay=0;1048576;16777216 time_delay=95334;

-> https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon2001&service=Postgres+Replication+Lag
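As an aside, the perfdata after the pipe in that plugin output follows the standard Nagios format (label=value[;warn;crit] pairs), so it can be consumed programmatically. A minimal sketch in Python, using the output line above:

```python
# Minimal sketch: parse the Nagios-style perfdata section of a plugin output
# line such as the POSTGRES_HOT_STANDBY_DELAY one above.
def parse_perfdata(line):
    """Return {label: (value, warn, crit)} from 'LABEL OK: msg | k=v;w;c ...'."""
    if "|" not in line:
        return {}
    result = {}
    for token in line.split("|", 1)[1].split():
        label, _, thresholds = token.partition("=")
        parts = thresholds.split(";")
        value = parts[0]
        warn = parts[1] if len(parts) > 1 and parts[1] else None
        crit = parts[2] if len(parts) > 2 and parts[2] else None
        result[label] = (value, warn, crit)
    return result

output = ('POSTGRES_HOT_STANDBY_DELAY OK: DB "netbox" (host:localhost) 0 and 95334 seconds '
          '| time=0.10s replay_delay=0;1048576;16777216 receive-delay=0;1048576;16777216 '
          'time_delay=95334;')
print(parse_perfdata(output)["replay_delay"])  # -> ('0', '1048576', '16777216')
```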

Change 454730 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql::slave::monitoring: make check description configurable

https://gerrit.wikimedia.org/r/454730

Change 454730 merged by Dzahn:
[operations/puppet@production] postgresql::slave::monitoring: make check description configurable

https://gerrit.wikimedia.org/r/454730