The DBapp trove instance (id ee0c90b0-5d21-4d41-9abf-cdabca2787c3) has filled up it's disk, and is in a weird Restart_Required state which does not allow self-service resizing.
Description
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | BUG REPORT | None | T355082 GLAMWiki Dashboard not loading | ||
| Resolved | taavi | T355138 Rescue DBapp trove instance in glamwikidashboard project |
Event Timeline
I manually added 10GiB to the volume, and updated the Trove database to match. This gives some breathing room:
/dev/sdb 255G 233G 9.5G 97% /var/lib/postgresql
Now failing with:
chmod: changing permissions of '/var/run/postgresql': Operation not permitted ... 2024-01-16 12:56:25.997 UTC [1] FATAL: could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": Permission denied
ubuntu@dbapp:~$ sudo chmod 777 /var/run/postgresql/
After that it started but complained about an unclean shutdown:
2024-01-16 12:58:58.118 UTC [1] LOG: starting PostgreSQL 12.7 (Debian 12.7-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit 2024-01-16 12:58:58.118 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 2024-01-16 12:58:58.118 UTC [1] LOG: listening on IPv6 address "::", port 5432 2024-01-16 12:58:58.123 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" 2024-01-16 12:58:58.877 UTC [13] LOG: database system was shut down in recovery at 2024-01-15 22:21:07 UTC 2024-01-16 12:58:58.898 UTC [13] LOG: database system was not properly shut down; automatic recovery in progress 2024-01-16 12:58:58.937 UTC [13] LOG: invalid record length at 65E/8BFFF6D0: wanted 24, got 0 2024-01-16 12:58:58.937 UTC [13] LOG: redo is not required 2024-01-16 12:58:59.288 UTC [1] LOG: database system is ready to accept connections 2024-01-16 12:58:59.301 UTC [18] LOG: archive command failed with exit code 1 2024-01-16 12:58:59.301 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/000000010000065D00000033 2024-01-16 12:59:00.306 UTC [18] LOG: archive command failed with exit code 1 2024-01-16 12:59:00.306 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/000000010000065D00000033 2024-01-16 12:59:01.311 UTC [18] LOG: archive command failed with exit code 1 2024-01-16 12:59:01.311 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/000000010000065D00000033 2024-01-16 12:59:01.311 UTC [18] WARNING: archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later
That file already exists:
ubuntu@dbapp:~$ ls -la /var/lib/postgresql/data/wal_archive/000000010000065D00000033 -rw------- 1 database database 11395072 Jan 12 04:38 /var/lib/postgresql/data/wal_archive/000000010000065D00000033
Trove still thinks it needs a reboot, I'll see if that works now.
The reboot finished cleanly and the instance is now showing as Active/Healthy, although the database is still complaining about those archive files already existing.
@Yonatan.Bendahan could you please have a look at the instance on your side? Note that the disk is still almost full, but you should now be able to free up some space or re-size the instance.
I have stopped the daily service from adding more data to the DB to prevent it from filling up again. I think now, we should rescue the DB again (hopefully can be done on your side), resize it to at least 500GB and in my opinion also have a 500GB volume for backup. The space needed for the project is growing quickly and we might even need more space in the future.
This is done. For those following along at home, here's what I did:
- Logged into the VM (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Accessing_Trove_guest_VMs)
- Stopped the docker container so that that the Trove server wouldn't be upset about it being in an error state
- Resized the volume using the Trove/Horizon UI
- Discovered that resizing failed during remount due to the sdb/vdb thing (Trove expects a newish linux distro with volumes under /dev/vdb but this is an older VM)
- mount /var/lib/postgresql
- resize2fs /dev/sdb
@YonatanWMIL, please keep an eye on this to avoid it filling up again :)
This looks to be full again.
/dev/sdb 492G 469G 1.8M 100% /var/lib/postgresql
This issue is much easier to manage if caught before it runs out of space... do you have any sense of what size it will reach before it stops growing? And/or can you do some db optimization so as to not consume so much space so rapidly?
I can stop the DB from growing by stopping the daily data insertion. I just didn't expect it to fill up so fast after doubling the disk space. Going to have to do some investigation in order to understand what's going on but when the DB is down I can't do anything of course. Is it possible to give me access to the VM (since I have no access to the SSH key as of now) instead of me having to trouble you guys every time something like this happens?
Starting a VACUUM everywhere to hopefully free up some space:
ubuntu@dbapp:~$ docker exec -it database vacuumdb -U postgres --all --echo
Postgres seems somewhat unhappy with persisting data to disk.
2024-02-16 14:22:55.253 UTC [18] WARNING: archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later 2024-02-16 14:22:55.273 UTC [18] LOG: archive command failed with exit code 1 2024-02-16 14:22:55.273 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000 0065D00000033 2024-02-16 14:22:55.743 UTC [18] LOG: archive command failed with exit code 1 2024-02-16 14:22:55.743 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000 0065D00000033 2024-02-16 14:22:56.282 UTC [18] LOG: archive command failed with exit code 1 2024-02-16 14:22:56.282 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000 0065D00000033 2024-02-16 14:22:56.282 UTC [18] WARNING: archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later 2024-02-16 14:22:56.303 UTC [18] LOG: archive command failed with exit code 1 2024-02-16 14:22:56.303 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000 0065D00000033 2024-02-16 14:22:56.840 UTC [18] LOG: archive command failed with exit code 1 2024-02-16 14:22:56.840 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000 0065D00000033 2024-02-16 14:22:57.360 UTC [18] LOG: archive command failed with exit code 1 2024-02-16 14:22:57.360 UTC [18] DETAIL: The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000 0065D00000033 2024-02-16 14:22:57.360 UTC [18] WARNING: archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later
In addition, the VACUUM is at least so far only reducing the amount of available free space, not increasing as I'd hope.
That WAL file exists both in pg_wal and wal_archive:
ubuntu@dbapp:/var/lib/postgresql/data$ sudo ls -la pgdata/pg_wal/000000010000065D00000033 wal_archive/000000010000065D00000033 -rw-r-xr-x 1 database database 16777216 Jan 12 04:45 pgdata/pg_wal/000000010000065D00000033 -rw------- 1 database database 11395072 Jan 12 04:38 wal_archive/000000010000065D00000033
I tried removing the duplicate file from the archive:
ubuntu@dbapp:/var/lib/postgresql/data$ sudo mv wal_archive/000000010000065D00000033 /root
I have no name!@550490a61acf:/var/lib/postgresql/data$ pg_controldata | grep "Latest checkpoint's REDO WAL file" Latest checkpoint's REDO WAL file: 00000001000006990000002B I have no name!@550490a61acf:/var/lib/postgresql/data$ pg_archivecleanup -d wal_archive/ 00000001000006990000002B
Ok, the database is back up now with some breathing room.
/dev/sdb 501G 348G 131G 73% /var/lib/postgresql
wal_archive itself is tiny, but seems to be increasing. I'll check later where that and the total disk space usage ends up.
ubuntu@dbapp:~$ du -sh /var/lib/postgresql/data/wal_archive/ 5.8G /var/lib/postgresql/data/wal_archive/
@Andrew seems like Trove configures PostgreSQL to just keep WALs indefinitely? Or at least that's what https://docs.openstack.org/trove/latest/admin/database_management.html suggests to me.
The archive folder is growing at a very worrying rate:
root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size Used Avail Use% Mounted on /dev/sdb 501G 428G 51G 90% /var/lib/postgresql root@dbapp:/var/lib/postgresql/data# du -sh pgdata/ wal_archive/ 320G pgdata/ 109G wal_archive/
I repeated T355138#9550687.
root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size Used Avail Use% Mounted on /dev/sdb 501G 323G 156G 68% /var/lib/postgresql root@dbapp:/var/lib/postgresql/data# du -sh * 320G pgdata 3.2G wal_archive
Disk usage seems to finally have stabilized at this:
root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size Used Avail Use% Mounted on /dev/sdb 501G 398G 81G 84% /var/lib/postgresql root@dbapp:/var/lib/postgresql/data# du -sh * 320G pgdata 79G wal_archive
Something has happened and now most of the disk is free:
root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/ Filesystem Size Used Avail Use% Mounted on /dev/sdb 501G 189G 290G 40% /var/lib/postgresql root@dbapp:/var/lib/postgresql/data# du -sh * 111G pgdata 79G wal_archive
In particular the pg_wal folder has shrunk by quite a bit.
root@dbapp:/var/lib/postgresql/data/pgdata# du -sh pg_wal 1.1G pg_wal
So it seems like the real reason for the doubled disk space usage was that the disk initially filling up caused a WAL file archival to get stuck, and due to that it wasn't getting cleaned up properly. Me unblocking that started archival of a backlog of WAL files that took quite a bit of space initially, but now that they were properly archived and cleaned up the disk is back to an expected level of usage.
@YonatanWMIL I think you can now re-enable the the daily data generating jobs, and I'll keep an eye on the instance for a few days just in case something is still broken (although I doubt that's the case anymore).
Still looks good:
ubuntu@dbapp:~$ df -h /var/lib/postgresql/ Filesystem Size Used Avail Use% Mounted on /dev/sdb 501G 189G 290G 40% /var/lib/postgresql
@YonatanWMIL did you get around enabling the previously disabled cron jobs again?
I'm starting to fill the gaps that we have due to the DB going down. Should take about a day. Afterwards i'll reactivate the cron job and update.
And the disk space usage still looks okay:
ubuntu@dbapp:~$ df -h /var/lib/postgresql Filesystem Size Used Avail Use% Mounted on /dev/sdb 501G 238G 241G 50% /var/lib/postgresql
And in guest-agent.service logs I see that it's pruning some older WAL files. So everything seems to be working normally now.