Description
We should add an Icinga check for the health of PostgreSQL on the Netbox instances and their replica.
Details
Status | Assigned | Task
---|---|---
Resolved | ayounsi | T184634 Error in postgres puppettization for new installation (was Netbox: postgres cannot be restarted w/ current config)
Resolved | None | T199083 Migrate the hardware inventory from Racktables to Netbox
Resolved | Dzahn | T185504 Netbox: add Icinga check for PostgreSQL
Event Timeline
This one seems like the best option; it has the most votes:
https://exchange.nagios.org/directory/Plugins/Databases/PostgresQL/check_postgres/details
Latest version:
https://github.com/bucardo/check_postgres/blob/master/check_postgres.pl
- Greg Sabino Mullane <greg@endpoint.com>
- End Point Corporation http://www.endpoint.com/
- BSD licensed
Change 425227 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: import check_postgres.pl
We already have some puppet code to monitor postgres replication lag (https://github.com/wikimedia/puppet/blob/production/modules/postgresql/manifests/slave/monitoring.pp). It might not be enabled for netbox. That check is based on the same plugin as proposed by @Dzahn above, but it uses the debian package. We don't have much monitoring of postgres masters, and should definitely add some!
Thanks Gehel for pointing that out. Using the existing package is the better option. Abandoned.
The package providing /usr/bin/check_postgres_hot_standby_delay is check-postgres
https://tracker.debian.org/pkg/check-postgres
Confirmed this is already installed on netmon1002.
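For reference, a minimal way to confirm that on any host (nothing netbox-specific here, just standard dpkg queries):
```
# Which package provides the plugin (expected: check-postgres, as noted above)
dpkg -S /usr/bin/check_postgres_hot_standby_delay
# Confirm the package itself is installed
dpkg -l check-postgres
```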
Change 426326 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netbox: add postgreSQL slave monitoring
Change 426326 merged by Dzahn:
[operations/puppet@production] netbox: add postgreSQL slave monitoring
- puppet added the nrpe config part on netmon2001 (the slave) but not on netmon1002 (good)
- puppet added the icinga config part on einsteinium
- we now have this check, but it has status UNKNOWN: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon2001&service=Postgres+Replication+Lag
```
ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on
FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL off
```
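One way to confirm this is a pg_hba.conf/authentication problem rather than a plugin problem is to attempt the same connection by hand from the slave. A minimal sketch, with the password left to the interactive prompt:
```
# From netmon2001, mimic the connection the check plugin makes
psql "host=netmon1002.wikimedia.org dbname=template1 user=replication sslmode=prefer" -c 'SELECT 1;'
```
With sslmode=prefer, libpq retries without SSL after the first rejection, which is why the FATAL line appears twice ("SSL on" and "SSL off").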
Regarding a health check on the master itself: which type of check exactly would that be? So far we only have the replication check on slaves, and there are all these aliases for check_postgres that we could use:
check_postgres_archive_ready
check_postgres_autovac_freeze
check_postgres_backends
check_postgres_bloat
check_postgres_checkpoint
check_postgres_cluster_id
check_postgres_commitratio
check_postgres_connection
check_postgres_custom_query
check_postgres_database_size
check_postgres_dbstats
check_postgres_disabled_triggers
check_postgres_disk_space
check_postgres_fsm_pages
check_postgres_fsm_relations
check_postgres_hitratio
check_postgres_hot_standby_delay
check_postgres_index_size
check_postgres_last_analyze
check_postgres_last_autoanalyze
check_postgres_last_autovacuum
check_postgres_last_vacuum
check_postgres_listener
check_postgres_locks
check_postgres_logfile
check_postgres_new_version_bc
check_postgres_new_version_box
check_postgres_new_version_cp
check_postgres_new_version_pg
check_postgres_new_version_tnm
check_postgres_pgagent_jobs
check_postgres_pgbouncer_backends
check_postgres_pgbouncer_checksum
check_postgres_pgb_pool_cl_active
check_postgres_pgb_pool_cl_waiting
check_postgres_pgb_pool_maxwait
check_postgres_pgb_pool_sv_active
check_postgres_pgb_pool_sv_idle
check_postgres_pgb_pool_sv_login
check_postgres_pgb_pool_sv_tested
check_postgres_pgb_pool_sv_used
check_postgres_prepared_txns
check_postgres_query_runtime
check_postgres_query_time
check_postgres_relation_size
check_postgres_replicate_row
check_postgres_same_schema
check_postgres_sequence
check_postgres_settings_checksum
check_postgres_slony_status
check_postgres_table_size
check_postgres_timesync
check_postgres_txn_idle
check_postgres_txn_time
check_postgres_txn_wraparound
check_postgres_version
check_postgres_wal_files
It looks like the script is trying to connect over IPv6, but /etc/postgresql/9.6/main/pg_hba.conf only has entries for IPv4.
As to what to monitor on the master, I have no idea what makes the most sense. I'm sure that @Pnorman knows!
The answer in general is it depends. Are you looking for monitoring to diagnose problems, or alarms for health?
I would recommend monitoring for maximum transaction length (for idle transactions), database size, and open connections (compare against max connections). These are all metrics which will go outside normal for a few common problems and are easy to interpret and alarm.
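A minimal sketch of how those three metrics map onto check_postgres aliases; hosts, users and thresholds are placeholders rather than values tuned for these servers:
```
# Longest (idle) transaction age
/usr/bin/check_postgres_txn_idle --host=localhost --dbuser=postgres --warning='5 minutes' --critical='30 minutes'
# Database size (filtered to the netbox database)
/usr/bin/check_postgres_database_size --host=localhost --dbuser=postgres --include=netbox --warning='20 GB' --critical='30 GB'
# Open connections, compared against max_connections
/usr/bin/check_postgres_backends --host=localhost --dbuser=postgres --warning='80%' --critical='90%'
```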
There are other stats which are important, but require more knowledge to interpret. For an example of some munin-based monitoring on a busy postgres server with one database and a primarily read load, see https://munin.openstreetmap.org/openstreetmap/orm.openstreetmap/index.html#postgresql
Looking at https://bucardo.org/check_postgres/check_postgres.pl.html, the following look like they have reasonable defaults for simple alarming (see the sketch after this list):
- autovac_freeze
- backends (a way of alarming on open connections)
- txn_wraparound
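A minimal sketch of those three invoked with the plugin's built-in thresholds (connection parameters are placeholders):
```
/usr/bin/check_postgres_autovac_freeze --host=localhost --dbuser=postgres
/usr/bin/check_postgres_backends --host=localhost --dbuser=postgres
/usr/bin/check_postgres_txn_wraparound --host=localhost --dbuser=postgres
```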
Adding to the above, I would say that most of the other monitoring that can be done can be broken down into performance related metrics, like transactions per second (tps), bgwriters, checkpoints, scans/s by type, tuples/s, etc; and vacuum related metrics, like time since last vacuum, bloat, etc.
Both require some knowledge of what is expected for your server and are hard to alarm. e.g. we will have some static tables in maps. These only get vacuumed when they get reloaded with a cron job, so any vacuum alarming will have to be set up to be aware of that. Some tables are very small and can be very bloated by percentage, while still only being a few kb. Your server might normally go between 100-200 tps, and if it drops to 2 tps and stays there, you want to know, or if it goes to 1k tps, you also want to know. But those might be normal for another server.
I would say let's start with this one:
I'll try to add those missing IPv6 entries (instead of trying to stop the script from using IPv6).
Looking at the netbox module, it seems /etc/postgresql/9.6/main/pg_hba.conf isn't puppetized even though it contains custom "netbox" strings. So I would say the first step is adding that file to puppet as a template.
There are a bunch of (somewhat ugly) defines in the postgresql puppet module that can be used to manage users. postgresql::user might or might not do what you need...
We've had the following Icinga UNKNOWN on netmon2001 for the past 6 days:
Postgres Replication Lag - ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on
I've noticed that /etc/postgresql/9.6/main/pg_hba.conf differs between netmon1002 and netmon2001 (among other things, the latter has no entry for the address mentioned in the Icinga error).
It's also unclear to me whether UNKNOWN is the proper severity for this issue; perhaps it should be CRITICAL instead.
UNKNOWN seems reasonable to me. The script could not authenticate and thus the check was not performed. We might want to add a connection check ([[ https://bucardo.org/check_postgres/check_postgres.pl.html#connection | check_postgres_connection ]]) to test that the appropriate users can connect.
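If we do add it, a minimal sketch of the NRPE side, mirroring the existing rep-lag command (the command name and credentials are placeholders):
```
# nrpe.d/check_postgres-connection.cfg (hypothetical)
command[check_postgres-connection]=/usr/bin/check_postgres_connection --host=netmon1002.wikimedia.org --dbuser=replication --dbname=template1 --dbpass=<redacted>
```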
What is the current status on this?
I still see the UNKNOWN on the netmon2001 Icinga checks and no check on netmon1002 for the PostgreSQL process, but there might have been offline progress I'm not aware of.
Given that switching over to Netbox is part of this quarter's goals (see T199083), I think we should progress/complete this to have a stable check configuration, as required for a production service.
I fixed replication between netmon1002 and netmon2001.
The check turned green now:
The reason was that the config had "USER DATABASE" instead of "DATABASE USER" (the fields were in the wrong order), and the database it tries to connect to is called "template1", not "replication"; only the user is called "replication".
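For reference, the field order pg_hba.conf expects, and roughly what the corrected entry on the master looks like at this stage (the address is taken from the error above; the md5 auth method is assumed, matching the later netbox entry):
```
# TYPE  DATABASE   USER         ADDRESS                          METHOD
host    template1  replication  2620:0:860:4:208:80:153:110/128  md5
```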
This works now:
```
@netmon2001:/etc/nagios/nrpe.d# cat check_postgres-rep-lag.cfg
# File generated by puppet. DO NOT edit by hand
command[check_postgres-rep-lag]=/usr/bin/check_postgres_hot_standby_delay --host=netmon1002.wikimedia.org,localhost --dbuser=replication --dbpass=<redacted> --dbname=template1 --warning=1048576 --critical=16777216

@netmon2001:# /usr/bin/check_postgres_hot_standby_delay --host=netmon1002.wikimedia.org,localhost --dbuser=replication --dbpass=<redacted> --dbname=template1 --warning=1048576 --critical=16777216
POSTGRES_HOT_STANDBY_DELAY OK: DB "template1" (host:localhost) 0 and 1 seconds | time=0.10s replay_delay=0;1048576;16777216 receive-delay=0;1048576;16777216 time_delay=1;
```
But the pg_hba.conf file that was adjusted still needs to be puppetized in some way.
Change 454723 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netbox: fix psql client auth for monitoring check
Change 454723 merged by Dzahn:
[operations/puppet@production] netbox: fix psql client auth for monitoring check
Fixed after the change above.
On netmon1002 (master), pg_hba.conf:
```
+host netbox replication 2620:0:860:4:208:80:153:110/128 md5
```
On netmon2001 (slave), Nagios NRPE config:
```
+command[check_postgres-rep-lag]=/usr/bin/check_postgres_hot_standby_delay --host=netmon1002.wikimedia.org,localhost --dbuser=replication --dbpass=<redacted> --dbname=netbox --warning=1048576 --critical=16777216
```
Result:
```
POSTGRES_HOT_STANDBY_DELAY OK: DB "netbox" (host:localhost) 0 and 95334 seconds | time=0.10s replay_delay=0;1048576;16777216 receive-delay=0;1048576;16777216 time_delay=95334;
```
Change 454730 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql::slave::monitoring: make check description configurable
Change 454730 merged by Dzahn:
[operations/puppet@production] postgresql::slave::monitoring: make check description configurable
Netbox-related checks are now grouped together at:
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=netbox