[toolsbeta.harbor] trove postgres DB out of space, v2
Closed, ResolvedPublic

Description

This happened on toolsbeta again: the DB failed, so the harbor core service was up but returning login errors.

Opening this task to record what I did.

From the harbor core service logs:

root@toolsbeta-harbor-1:/srv/ops/harbor# docker logs --tail 1000 -f harbor-core
...
2023-08-07T07:23:21Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /service/notifications/tasks/41 request in transaction failed: failed to connect to `host=ttg4ncgzifw.svc.trove.eqiad1.wikimedia.cloud user=harbor database=harbor`: dial error (dial tcp 172.16.5.95:5432: connect: connection refused)"}]}

Then from the trove DB:

(guest-agent-venv) root@harbordb:/root# docker logs -n 100 -f database
...

2023-08-07 08:43:57.280 UTC [16] PANIC:  could not write to file "pg_logical/replorigin_checkpoint.tmp": No space left on device
2023-08-07 08:43:57.305 UTC [1] LOG:  startup process (PID 16) was terminated by signal 6: Aborted
2023-08-07 08:43:57.305 UTC [1] LOG:  aborting startup due to startup process failure
2023-08-07 08:43:57.320 UTC [1] LOG:  database system is shut down

There I looked for what was taking up space; it turned out to be the pg_wal directory (not wal_archive):

(guest-agent-venv) root@harbordb:/root# df -h
...
/dev/sdb        4.9G  4.7G     0 100% /var/lib/postgresql
...

(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata# du -hs *
...
4.6G    pg_wal
...
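
As a rough sketch of the same search, the helper below lists the largest entries directly under a directory, biggest first. The function name `largest_entries` is hypothetical and not part of any tooling on the host; sizes are printed in KiB so they sort numerically.

```shell
# Hypothetical helper: list the N largest entries directly under a directory.
# Sizes are in KiB (du -k) so that sort -n orders them correctly.
largest_entries() {
    dir=$1
    count=${2:-5}
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_entries /var/lib/postgresql/data/pgdata 3` would show the three largest entries in the data directory, which in this incident should put pg_wal at the top.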

So I manually changed the actual postgres configuration to reduce the WAL level to minimal and turn off archiving:

(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata/pg_wal# grep wal_level /etc/postgresql/postgresql.conf 
#wal_level = replica                    # minimal, replica, or logical
wal_level = minimal                     # minimal, replica, or logical

(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata/pg_wal# grep max_wal_senders /etc/postgresql/postgresql.conf 
#max_wal_senders = 10           # max number of walsender processes
max_wal_senders = 0             # max number of walsender processes

(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata/pg_wal# grep archive_mode /etc/postgresql/postgresql.conf 
archive_mode = on               # enables archiving; off, on, or always
archive_mode = off              # enables archiving; off, on, or always
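
The last grep shows two uncommented archive_mode lines. When a parameter appears more than once in postgresql.conf, PostgreSQL uses the last entry, so archive_mode = off is the one that applies. A minimal sketch to double-check which value wins in a file follows. The function name `effective_setting` is hypothetical, and the parsing is simplified: it assumes `name = value` lines, does not handle `#` inside quoted values, and ignores include files and postgresql.auto.conf. Running `SHOW archive_mode;` on the live server remains the authoritative check.

```shell
# Hypothetical helper: print the value a config file assigns to one
# parameter, taking the last uncommented "name = value" line.
effective_setting() {
    conf=$1
    key=$2
    awk -v key="$key" '
        /^[ \t]*#/ { next }                 # skip commented-out lines
        {
            line = $0
            sub(/#.*/, "", line)            # drop a trailing comment
            n = index(line, "=")
            if (n == 0) next                # not a "name = value" line
            name = substr(line, 1, n - 1)
            value = substr(line, n + 1)
            gsub(/[ \t]/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) last = value   # later entries override earlier ones
        }
        END { if (last != "") print last }
    ' "$conf"
}
```

For example, `effective_setting /etc/postgresql/postgresql.conf archive_mode` should print `off` for the file shown above.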

That leaves us without replication support (see https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-LEVEL) and without archiving (point-in-time backups), but we are not using either anyhow, and it frees a lot of space:

# had to clean up a few wal files to have some space to start the db:
(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata/pg_wal# rm 00000001000000010000001*

(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata/pg_wal# docker start database
...

(guest-agent-venv) root@harbordb:/var/lib/postgresql/data/pgdata/pg_wal# df -h
...
/dev/sdb        4.9G  523M  4.1G  12% /var/lib/postgresql
...
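
To notice the volume filling up before the DB goes down again, a check along these lines could be run periodically. This is only a sketch: the function names and the 80% default threshold are assumptions, not something that exists on the host today.

```shell
# Hypothetical helper: print the used-space percentage of the filesystem
# holding a path, as a bare integer (df -P gives one stable line per mount).
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: warn and return non-zero when usage reaches a limit.
check_usage() {
    path=$1
    limit=${2:-80}
    used=$(usage_percent "$path")
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

For example, `check_usage /var/lib/postgresql 80` would print a warning and return 1 once the volume is at least 80% full.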