
Postgres Replication broken for Puppetdb
Closed, Resolved · Public

Description

There is an ongoing alert for Postgres replication on puppetdb2003.

On puppetdb2003 I see errors starting from:

2025-01-06 22:56:09 GMT FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000804300000091 has already been removed

On puppetdb1003 I see an error from a bit earlier (probably a red herring):

2025-01-06 22:33:01.568 GMT [db:puppetdb,sess:677c5153.2527cd,pid:2435021,vtid:46/497365,tid:173777325] ERROR:  duplicate key value violates unique constraint "resource_params_cache_pkey"

But then:

2025-01-06 22:56:06.808 GMT [db:[unknown],sess:6758695a.3b0eab,pid:3870379,vtid:3/0,tid:0] ERROR:  requested WAL segment 000000010000804300000091 has already been removed

Event Timeline

Found some intro material in https://www.crunchydata.com/blog/how-to-recover-when-postgresql-is-missing-a-wal-file (I admit my ignorance about Postgres)

It seems that the only option we have is to resync Postgres on 2003 (which should be the current replica of 1003) and possibly tune the WAL retention configuration (IIUC, WAL files are basically a write-ahead journal).
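
For reference, this is the kind of check that confirms the broken state from the primary's side (a minimal sketch in psql on puppetdb1003; standard Postgres views and functions, nothing site-specific):

-- List connected standbys and their WAL positions; an empty result
-- means the replica is no longer streaming.
SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;

-- Current WAL position on the primary and the segment file it maps to,
-- to compare with the segment the replica is requesting in its logs.
SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());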

After a chat with Alex and Moritz on IRC, this is what came up:

  • If the WAL files have already been deleted, there is no other way than resyncing the replica.
  • PuppetDB on 1003 reads from and writes to its local Postgres instance (the primary); PuppetDB on 2003 reads from its local replica but writes to the remote primary on 1003.
  • PuppetDB on 2003 needs to be stopped alongside Postgres (since the latter handles the former's read traffic), and it is better to do it with puppet disabled fleetwide (to avoid failures etc.; we don't know how long the resync will take).
  • There is a cookbook that takes care of initializing a Postgres DB, basically calling /usr/local/bin/pg-resync-replica behind the scenes.

Proposed procedure:

  • Stop puppet fleetwide
  • Downtime puppetdb2003, stop puppetdb and postgres on it.
  • Run the postgres-init.py cookbook for 2003.
  • Start postgres on 2003 and verify that replication has resumed (see the check sketched right after this list).
  • Start puppetdb, and once it is up and running re-enable puppet fleetwide.
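
To verify the replication in the step above, something along these lines should do (a sketch using standard Postgres functions, not the exact commands we ran):

-- On the replica (puppetdb2003): confirm it is in recovery and that WAL
-- is being received and replayed again.
SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();

-- On the primary (puppetdb1003): the rebuilt replica should show up as 'streaming'.
SELECT client_addr, state FROM pg_stat_replication;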

Follow-ups:

  • What is the best configuration for WAL file retention to avoid this in the future?
  • Document what we learned on Wikitech

Icinga downtime and Alertmanager silence (ID=d97110e1-aa38-456a-80cf-1b4661fc7681) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: Resync postgres

puppetdb2003.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2025-01-07T13:39:54Z] <elukey> stop puppetdb and postgres on puppetdb2003 - T383114

> Document what we learned on Wikitech

Please add this to https://wikitech.wikimedia.org/wiki/Puppet#PuppetDB (or link it from there). The Puppet page is currently a bit of a mess, with lots of outdated docs mixing Puppet 5 and Puppet 7, but I'm slowly cleaning it up for Puppet 7, so let's ideally keep it as the central page for debugging/introductions to Puppet.

Next step: is there any config that we can add to postgres to avoid this issue in the future?

> Next step: is there any config that we can add to postgres to avoid this issue in the future?

Fundamentally in the current setup? No. It all boils down to a (series of) event(s) that make shipping WALs from the primary to the secondary impossible, thus stalling (and eventually breaking) streaming replication (Logical replication is a different story, one I am not really able to tell).

But what can be tuned is the amount of WAL kept around. The setting differs a little by version: it was called wal_keep_segments in the past (a count of 16MB segments) and is now named wal_keep_size (a size in MB); it works a bit differently but it's the same basic idea. A minimum amount of WAL is guaranteed to be kept around. By tuning that knob, one can weather an event that temporarily stops streaming replication.
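
For illustration, this is how the knob looks on PostgreSQL 13+ (the 12GB value is just an example; in our puppet-managed setup the setting lives in the config file, so the ALTER SYSTEM below only shows the mechanism):

-- Check the current retention floor.
SHOW wal_keep_size;

-- Raise it, then reload the configuration without a restart.
ALTER SYSTEM SET wal_keep_size = '12GB';
SELECT pg_reload_conf();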

The caveat is that this isn't time based (you can't say "keep X days") but rather size based (keep X GB). So it's not possible to truly guarantee that an X-hour-long event won't break replication, as an event with a sufficiently high data rate will consume any amount of buffer.
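
One way to size that buffer is to measure the actual WAL generation rate, e.g. with a psql sketch like this (the 60-second sample window is an arbitrary choice):

-- Sample the WAL insert position twice and compute the byte rate;
-- wal_keep_size divided by this rate is the time buffer it buys.
SELECT pg_current_wal_lsn() AS lsn_start \gset
SELECT pg_sleep(60);
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_start') / 60.0 AS wal_bytes_per_second;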

We currently default to 128 segments from what I can tell. There is puppetization support for this already, and it is used by the OpenStreetMap database, which sets it to 768. The downside is the amount of disk space required: each segment is 16MB, so 128 segments require at least 2GB of disk space just for the retained WAL, and 768 require 12GB, which might have sounded large back in the day but probably doesn't now?
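
The arithmetic as a trivial sanity check (the 12288 figure matches the wal_keep_size value verified further down):

SELECT 128 * 16 AS default_mb,   -- current default: 2GB
       768 * 16 AS proposed_mb;  -- OSM-style setting: 12GB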

Change #1108768 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::puppetdb: increase WAL kept segments

https://gerrit.wikimedia.org/r/1108768

Change #1108768 merged by Elukey:

[operations/puppet@production] role::puppetdb: increase WAL kept segments

https://gerrit.wikimedia.org/r/1108768

On puppetdb1003:

postgres=# SELECT pg_reload_conf();
 pg_reload_conf 
----------------
 t
(1 row)

postgres=# SELECT * from pg_settings WHERE name = 'wal_keep_size';
     name      | setting | unit |           category            |                      short_desc                       | extra_desc | context | vartype |       source       | min_val |  max_val   | enumvals | boot_val | reset_val |             sourcefile              | sourceline | pending_restart
---------------+---------+------+-------------------------------+-------------------------------------------------------+------------+---------+---------+--------------------+---------+------------+----------+----------+-----------+-------------------------------------+------------+-----------------
 wal_keep_size | 12288   | MB   | Replication / Sending Servers | Sets the size of WAL files held for standby servers. |            | sighup  | integer | configuration file | 0       | 2147483647 |          | 0        | 12288     | /etc/postgresql/15/main/master.conf |          5 | f
(1 row)

elukey claimed this task.

I think we can declare this solved; we'll see if the issue reappears in the future.

Change #1108798 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloud.yaml: add missing wal_keep_segments

https://gerrit.wikimedia.org/r/1108798

Change #1108798 merged by David Caro:

[operations/puppet@production] cloud.yaml: add missing wal_keep_segments

https://gerrit.wikimedia.org/r/1108798