Page MenuHomePhabricator

Rescue DBapp trove instance in glamwikidashboard project
Closed, ResolvedPublic

Description

The DBapp trove instance (id ee0c90b0-5d21-4d41-9abf-cdabca2787c3) has filled up it's disk, and is in a weird Restart_Required state which does not allow self-service resizing.

Event Timeline

I manually added 10GiB to the volume, and updated the Trove database to match. This gives some breathing room:

/dev/sdb        255G  233G  9.5G  97% /var/lib/postgresql

Now failing with:

chmod: changing permissions of '/var/run/postgresql': Operation not permitted
...
2024-01-16 12:56:25.997 UTC [1] FATAL:  could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": Permission denied
ubuntu@dbapp:~$ sudo chmod 777 /var/run/postgresql/

After that it started but complained about an unclean shutdown:

2024-01-16 12:58:58.118 UTC [1] LOG:  starting PostgreSQL 12.7 (Debian 12.7-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2024-01-16 12:58:58.118 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-01-16 12:58:58.118 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-01-16 12:58:58.123 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-01-16 12:58:58.877 UTC [13] LOG:  database system was shut down in recovery at 2024-01-15 22:21:07 UTC
2024-01-16 12:58:58.898 UTC [13] LOG:  database system was not properly shut down; automatic recovery in progress
2024-01-16 12:58:58.937 UTC [13] LOG:  invalid record length at 65E/8BFFF6D0: wanted 24, got 0
2024-01-16 12:58:58.937 UTC [13] LOG:  redo is not required
2024-01-16 12:58:59.288 UTC [1] LOG:  database system is ready to accept connections
2024-01-16 12:58:59.301 UTC [18] LOG:  archive command failed with exit code 1
2024-01-16 12:58:59.301 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/000000010000065D00000033
2024-01-16 12:59:00.306 UTC [18] LOG:  archive command failed with exit code 1
2024-01-16 12:59:00.306 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/000000010000065D00000033
2024-01-16 12:59:01.311 UTC [18] LOG:  archive command failed with exit code 1
2024-01-16 12:59:01.311 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/000000010000065D00000033
2024-01-16 12:59:01.311 UTC [18] WARNING:  archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later

That file already exists:

ubuntu@dbapp:~$ ls -la /var/lib/postgresql/data/wal_archive/000000010000065D00000033
-rw------- 1 database database 11395072 Jan 12 04:38 /var/lib/postgresql/data/wal_archive/000000010000065D00000033

Trove still thinks it needs a reboot, I'll see if that works now.

The reboot finished cleanly and the instance is now showing as Active/Healthy, although the database is still complaining about those archive files already existing.

@Yonatan.Bendahan could you please have a look at the instance on your side? Note that the disk is still almost full, but you should now be able to free up some space or re-size the instance.

I have stopped the daily service from adding more data to the DB to prevent it from filling up again. I think now, we should rescue the DB again (hopefully can be done on your side), resize it to at least 500GB and in my opinion also have a 500GB volume for backup. The space needed for the project is growing quickly and we might even need more space in the future.

Andrew subscribed.

I have stopped the daily service from adding more data to the DB to prevent it from filling up again. I think now, we should rescue the DB again (hopefully can be done on your side), resize it to at least 500GB and in my opinion also have a 500GB volume for backup. The space needed for the project is growing quickly and we might even need more space in the future.

This is done. For those following along at home, here's what I did:

  • Logged into the VM (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Accessing_Trove_guest_VMs)
  • Stopped the docker container so that that the Trove server wouldn't be upset about it being in an error state
  • Resized the volume using the Trove/Horizon UI
  • Discovered that resizing failed during remount due to the sdb/vdb thing (Trove expects a newish linux distro with volumes under /dev/vdb but this is an older VM)
  • mount /var/lib/postgresql
  • resize2fs /dev/sdb

@YonatanWMIL, please keep an eye on this to avoid it filling up again :)

This looks to be full again.

/dev/sdb 492G 469G 1.8M 100% /var/lib/postgresql

This issue is much easier to manage if caught before it runs out of space... do you have any sense of what size it will reach before it stops growing? And/or can you do some db optimization so as to not consume so much space so rapidly?

I can stop the DB from growing by stopping the daily data insertion. I just didn't expect it to fill up so fast after doubling the disk space. Going to have to do some investigation in order to understand what's going on but when the DB is down I can't do anything of course. Is it possible to give me access to the VM (since I have no access to the SSH key as of now) instead of me having to trouble you guys every time something like this happens?

fnegri triaged this task as Medium priority.
fnegri subscribed.

The database is back up with about 10G of empty space.

Starting a VACUUM everywhere to hopefully free up some space:

ubuntu@dbapp:~$ docker exec -it database vacuumdb -U postgres --all --echo

Postgres seems somewhat unhappy with persisting data to disk.

2024-02-16 14:22:55.253 UTC [18] WARNING:  archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later                                                                                        
2024-02-16 14:22:55.273 UTC [18] LOG:  archive command failed with exit code 1
2024-02-16 14:22:55.273 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000
0065D00000033                                                                                                                                                                                                                           
2024-02-16 14:22:55.743 UTC [18] LOG:  archive command failed with exit code 1
2024-02-16 14:22:55.743 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000
0065D00000033                                                                                                                                                                                                                           
2024-02-16 14:22:56.282 UTC [18] LOG:  archive command failed with exit code 1
2024-02-16 14:22:56.282 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000
0065D00000033                                                                                                       
2024-02-16 14:22:56.282 UTC [18] WARNING:  archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later                                                                                        
2024-02-16 14:22:56.303 UTC [18] LOG:  archive command failed with exit code 1
2024-02-16 14:22:56.303 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000
0065D00000033                                                                                                                                                                                                                           
2024-02-16 14:22:56.840 UTC [18] LOG:  archive command failed with exit code 1
2024-02-16 14:22:56.840 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000
0065D00000033                                                                                                                                                                                                                           
2024-02-16 14:22:57.360 UTC [18] LOG:  archive command failed with exit code 1
2024-02-16 14:22:57.360 UTC [18] DETAIL:  The failed archive command was: test ! -f /var/lib/postgresql/data/wal_archive/000000010000065D00000033 && cp pg_wal/000000010000065D00000033 /var/lib/postgresql/data/wal_archive/00000001000
0065D00000033                                                                                                       
2024-02-16 14:22:57.360 UTC [18] WARNING:  archiving write-ahead log file "000000010000065D00000033" failed too many times, will try again later

In addition, the VACUUM is at least so far only reducing the amount of available free space, not increasing as I'd hope.

That WAL file exists both in pg_wal and wal_archive:

ubuntu@dbapp:/var/lib/postgresql/data$ sudo ls -la pgdata/pg_wal/000000010000065D00000033 wal_archive/000000010000065D00000033
-rw-r-xr-x 1 database database 16777216 Jan 12 04:45 pgdata/pg_wal/000000010000065D00000033
-rw------- 1 database database 11395072 Jan 12 04:38 wal_archive/000000010000065D00000033

I tried removing the duplicate file from the archive:

ubuntu@dbapp:/var/lib/postgresql/data$ sudo mv wal_archive/000000010000065D00000033 /root

That did something, and now the disk space is going down rapidly.

I have no name!@550490a61acf:/var/lib/postgresql/data$ pg_controldata | grep "Latest checkpoint's REDO WAL file"
Latest checkpoint's REDO WAL file:    00000001000006990000002B
I have no name!@550490a61acf:/var/lib/postgresql/data$ pg_archivecleanup -d wal_archive/ 00000001000006990000002B

Ok, the database is back up now with some breathing room.

/dev/sdb        501G  348G  131G  73% /var/lib/postgresql

wal_archive itself is tiny, but seems to be increasing. I'll check later where that and the total disk space usage ends up.

ubuntu@dbapp:~$ du -sh /var/lib/postgresql/data/wal_archive/
5.8G    /var/lib/postgresql/data/wal_archive/

@Andrew seems like Trove configures PostgreSQL to just keep WALs indefinitely? Or at least that's what https://docs.openstack.org/trove/latest/admin/database_management.html suggests to me.

The archive folder is growing at a very worrying rate:

root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        501G  428G   51G  90% /var/lib/postgresql
root@dbapp:/var/lib/postgresql/data# du -sh pgdata/ wal_archive/
320G	pgdata/
109G	wal_archive/

I repeated T355138#9550687.

root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        501G  323G  156G  68% /var/lib/postgresql
root@dbapp:/var/lib/postgresql/data# du -sh *
320G	pgdata
3.2G	wal_archive

Disk usage seems to finally have stabilized at this:

root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        501G  398G   81G  84% /var/lib/postgresql
root@dbapp:/var/lib/postgresql/data# du -sh *
320G	pgdata
79G	wal_archive

Something has happened and now most of the disk is free:

root@dbapp:/var/lib/postgresql/data# df -h /var/lib/postgresql/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        501G  189G  290G  40% /var/lib/postgresql
root@dbapp:/var/lib/postgresql/data# du -sh *
111G	pgdata
79G	wal_archive

In particular the pg_wal folder has shrunk by quite a bit.

root@dbapp:/var/lib/postgresql/data/pgdata# du -sh pg_wal
1.1G	pg_wal

So it seems like the real reason for the doubled disk space usage was that the disk initially filling up caused a WAL file archival to get stuck, and due to that it wasn't getting cleaned up properly. Me unblocking that started archival of a backlog of WAL files that took quite a bit of space initially, but now that they were properly archived and cleaned up the disk is back to an expected level of usage.

@YonatanWMIL I think you can now re-enable the the daily data generating jobs, and I'll keep an eye on the instance for a few days just in case something is still broken (although I doubt that's the case anymore).

Still looks good:

ubuntu@dbapp:~$ df -h /var/lib/postgresql/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        501G  189G  290G  40% /var/lib/postgresql

@YonatanWMIL did you get around enabling the previously disabled cron jobs again?

I'm starting to fill the gaps that we have due to the DB going down. Should take about a day. Afterwards i'll reactivate the cron job and update.

Took longer than expected, daily update is running now

And the disk space usage still looks okay:

ubuntu@dbapp:~$ df -h /var/lib/postgresql
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        501G  238G  241G  50% /var/lib/postgresql

And in guest-agent.service logs I see that it's pruning some older WAL files. So everything seems to be working normally now.