
Convert Netbox data (PostgreSQL) long-term storage backups (bacula) into full backups rather than incrementals
Closed, Resolved · Public

Description

Backups to Bacula for long-term storage work nicely. However, because PostgreSQL dumps look like this:

/srv/postgres-backup/psql-all-dbs-2022-08-23-21-37.sql.gz

There are few space savings from doing incremental backups (Bacula only detects changes at the file level), while they limit the retention of those backups (currently, with present hardware limitations, there is a 2-month retention for backups, which means the first backup in the log is from the 5th of June):

check_bacula.py netboxdb1002.eqiad.wmnet-Monthly-1st-Wed-production-netbox-postgres
id: 465248, ts: 2022-06-05 11:27:19, type: I, status: T, bytes: 7082896

However, because it is an incremental and requires a full backup to be restored, the first usable backup is from the 6th of July:

id: 457448, ts: 2022-07-06 02:18:18, type: F, status: T, bytes: 135995152

Ideally, and to simplify many issues, dump-like backups would use daily or weekly full backups only. That is feasible for a ~40 MB backup, but it may require a few changes to the script so that we back up only the latest dump each time.

We should also make sure we cannot back up partial or ongoing dumps, so that dumping and Bacula don't run at the same time.
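
A minimal sketch of the intended flow (directory names and commands are illustrative, not the current script): dump into a staging location first, then move the finished file into the only path Bacula reads, so a partial dump is never backed up.

#!/bin/bash
# Illustrative only; real paths, file names and users may differ.
set -euo pipefail                                  # abort if any pipeline stage fails
ts="$(date +%F-%H-%M)"
staging="/srv/postgres-backup/ongoing"
latest="/srv/postgres-backup/latest"

# Dump all databases into the staging area (run as the postgres user).
pg_dumpall | gzip > "${staging}/psql-all-dbs-${ts}.sql.gz"

# Keep a single file in the directory Bacula reads: drop the previous dump,
# then move the finished one in place (atomic within the same filesystem).
rm -f "${latest}"/psql-all-dbs-*.sql.gz
mv "${staging}/psql-all-dbs-${ts}.sql.gz" "${latest}/"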

Event Timeline

jcrespo renamed this task from Convert Netox postgresql longterm storage (bacula) backups into full backups to Convert Netbox data (PostgreSQL) long-term storage backups (bacula) into full backups rather than incrementals. Aug 30 2022, 1:42 PM
jcrespo updated the task description.
jcrespo added a subscriber: ayounsi.

because PostgreSQL dumps look like this:
/srv/postgres-backup/psql-all-dbs-2022-08-23-21-37.sql.gz

We can make the dumps look like anything we want :)
We could even have fixed file names (psql-all-dbs-00.sql.gz, psql-all-dbs-01.sql.gz, ..., psql-all-dbs-monday.sql.gz, ... ) :D

But to keep things simple, if we're moving the logic of retaining backups from the local disks to Bacula, I'm wondering if we could actually have a single file in Bacula, /srv/postgres-backup/psql-all-dbs.sql.gz, that is sym/hard-linked to the latest hourly backup.

This should allow:

  • to keep on disk the last N days of hourly backups for quick restores in case of mistakes or similar issues; those will not be in Bacula (see the retention sketch below)
  • to optimize space on Bacula by having a single, fixed file with different "revisions" over time, using mostly incrementals and taking a full backup of it every now and then. A daily backup of this single file should be enough IMHO.

This would basically keep a very short-term (a couple of days) set of hourly backups locally and delegate any longer-term persistence to Bacula.
Do we need to keep any older data than what's in Bacula? If so, we could keep monthly backups locally too...
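
For the first point, the local retention could be as simple as the following sketch (the path, the file pattern and the 3-day window are only placeholders, not agreed values):

# Keep only the timestamped dumps from the last ~3 days on local disk;
# anything older would live only in Bacula. Illustrative values only.
find /srv/postgres-backup -maxdepth 1 -name 'psql-all-dbs-20*.sql.gz' -mtime +3 -delete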

We can make the dumps look like anything we want :)

No issue with the naming; I meant the format (it is a compressed, single-file bundle, so Bacula will just copy the whole thing even if only one line changed). This is not specific to PostgreSQL; both GitLab and dbbackups suffer from the same limitation, so, in the long term, I would like to move those to a common framework (wmfbackups), especially for the free monitoring they will get.

For the short term, the idea is to have:

/srv/postgres-backup/ongoing/ (for running backups)
/srv/postgres-backup/latest/ (for the latest backup, to be used by bacula)
/srv/postgres-backup/archive/ (for older backups - a few days would be enough)

while keeping the same name.

We should also improve the error handling a little. I haven't checked, but I think the Postgres dump could fail while the script still succeeds, as long as the compression step is successful; I may be wrong, though.
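
That failure mode is real for a plain pipeline: by default the exit status of pg_dumpall | gzip is gzip's, so a failed dump can still leave a "successful", truncated file behind. A hedged sketch of a guard (paths and messages are illustrative, not the actual script):

#!/bin/bash
# Sketch only: do not let a failing dump hide behind a successful gzip.
set -o pipefail                            # the pipeline now fails if any stage fails
dump="/srv/postgres-backup/ongoing/psql-all-dbs-$(date +%F-%H-%M).sql.gz"

if ! pg_dumpall | gzip > "${dump}"; then
    echo "postgres dump failed, removing partial file" >&2
    rm -f "${dump}"                        # never leave a partial dump around
    exit 1
fi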

If we use a single file for bacula we can keep it uncompressed I guess. It's ~10 times larger though.

I'm wondering if we could actually have a single file in bacula /srv/postgres-backup/psql-all-dbs.sql.gz that is sym/hard-linked to the latest hourly backup.

Symlinks (at least soft ones; I haven't checked hard ones) are copied as-is, so those don't work. We tested this for GitLab and they ended up doing something similar to the above, plus full backups.

If we use a single file for bacula we can keep it uncompressed I guess. It's ~10 times larger though.

No need; just the directory structure, so that only one file is copied every time, would be enough. Other options are possible too, but I think this will make it easier to later integrate this into wmfbackups, in the long run.

@jcrespo resuming this from our backlog. Could you please re-check what's the current status of those backups and advise on next steps (based also on the above discussion)?

This is the list of backups we keep:

$ check_bacula.py netboxdb1002.eqiad.wmnet-Monthly-1st-Wed-production-netbox-postgres
id: 465248, ts: 2022-06-05 11:27:19, type: I, status: T, bytes: 7082896
id: 466887, ts: 2022-06-10 05:21:28, type: I, status: T, bytes: 4013749
id: 467085, ts: 2022-06-11 04:57:34, type: I, status: T, bytes: 4030085
id: 467263, ts: 2022-06-12 04:30:01, type: I, status: T, bytes: 4026933
id: 468851, ts: 2022-06-17 05:14:16, type: I, status: T, bytes: 7599733
id: 469041, ts: 2022-06-18 04:45:48, type: I, status: T, bytes: 12314453
id: 470585, ts: 2022-06-23 05:04:47, type: I, status: T, bytes: 9418837
id: 470728, ts: 2022-06-24 05:17:42, type: I, status: T, bytes: 10282149
id: 470919, ts: 2022-06-25 05:20:36, type: I, status: T, bytes: 11086469
id: 472362, ts: 2022-06-30 05:37:32, type: I, status: T, bytes: 15051829
id: 472534, ts: 2022-07-01 05:26:32, type: I, status: T, bytes: 15861173
id: 455848, ts: 2022-07-02 05:52:55, type: I, status: T, bytes: 16661312
id: 455989, ts: 2022-07-03 10:42:56, type: I, status: T, bytes: 35698800
id: 456128, ts: 2022-07-04 04:58:00, type: I, status: T, bytes: 0
id: 457383, ts: 2022-07-05 17:21:41, type: I, status: T, bytes: 38518496
id: 457448, ts: 2022-07-06 02:18:18, type: F, status: T, bytes: 135995152
id: 457535, ts: 2022-07-06 05:03:20, type: I, status: T, bytes: 0
id: 457683, ts: 2022-07-07 06:36:17, type: I, status: T, bytes: 19975552
id: 457814, ts: 2022-07-08 04:56:18, type: I, status: T, bytes: 20518272
id: 457945, ts: 2022-07-09 05:03:02, type: I, status: T, bytes: 21026432
id: 458077, ts: 2022-07-10 04:44:28, type: I, status: T, bytes: 21470384
id: 458210, ts: 2022-07-11 04:32:06, type: I, status: T, bytes: 21896720
id: 459440, ts: 2022-07-12 05:01:12, type: I, status: T, bytes: 22328144
id: 459578, ts: 2022-07-13 05:25:26, type: I, status: T, bytes: 22763232
id: 459711, ts: 2022-07-14 05:03:12, type: I, status: T, bytes: 23007392
id: 459865, ts: 2022-07-15 05:03:31, type: I, status: T, bytes: 23150256
id: 460007, ts: 2022-07-16 04:37:38, type: I, status: T, bytes: 23266496
id: 460151, ts: 2022-07-17 04:45:41, type: I, status: T, bytes: 23389184
id: 460293, ts: 2022-07-18 04:23:00, type: I, status: T, bytes: 23470704
id: 461526, ts: 2022-07-19 04:48:46, type: I, status: T, bytes: 47146080
id: 461593, ts: 2022-07-20 03:05:51, type: D, status: T, bytes: 209862400
id: 461683, ts: 2022-07-20 05:18:40, type: I, status: T, bytes: 0
id: 461836, ts: 2022-07-21 04:52:57, type: I, status: T, bytes: 23781408
id: 461967, ts: 2022-07-22 04:47:36, type: I, status: T, bytes: 23893824
id: 462097, ts: 2022-07-23 04:34:46, type: I, status: T, bytes: 24031600
id: 462228, ts: 2022-07-24 04:28:32, type: I, status: T, bytes: 24124352
id: 462360, ts: 2022-07-25 05:29:20, type: I, status: T, bytes: 24221024
id: 463535, ts: 2022-07-26 05:00:32, type: I, status: T, bytes: 24309456
id: 463672, ts: 2022-07-27 05:23:41, type: I, status: T, bytes: 24404128
id: 463804, ts: 2022-07-28 05:01:50, type: I, status: T, bytes: 24899856
id: 463937, ts: 2022-07-29 05:04:08, type: I, status: T, bytes: 25345264
id: 464070, ts: 2022-07-30 04:57:28, type: I, status: T, bytes: 26000336
id: 464204, ts: 2022-07-31 04:38:57, type: I, status: T, bytes: 26453712
id: 464348, ts: 2022-08-01 04:51:49, type: I, status: T, bytes: 26898464
id: 465491, ts: 2022-08-03 07:06:50, type: I, status: T, bytes: 55215760
id: 465560, ts: 2022-08-03 07:52:45, type: F, status: T, bytes: 237572992
id: 465650, ts: 2022-08-03 09:34:23, type: I, status: T, bytes: 83571232
id: 465799, ts: 2022-08-04 06:10:17, type: I, status: T, bytes: 0
id: 465955, ts: 2022-08-05 05:05:19, type: I, status: T, bytes: 28848880
id: 466095, ts: 2022-08-06 05:17:29, type: I, status: T, bytes: 29366880
id: 466240, ts: 2022-08-07 11:00:52, type: I, status: T, bytes: 60187440
id: 466372, ts: 2022-08-08 04:20:03, type: I, status: T, bytes: 0
id: 467459, ts: 2022-08-09 05:00:42, type: I, status: T, bytes: 123335296
id: 467596, ts: 2022-08-10 05:01:06, type: I, status: T, bytes: 31332400
id: 467728, ts: 2022-08-11 04:53:34, type: I, status: T, bytes: 31863616
id: 467858, ts: 2022-08-12 04:38:37, type: I, status: T, bytes: 32453776
id: 467986, ts: 2022-08-13 04:39:41, type: I, status: T, bytes: 32939568
id: 468114, ts: 2022-08-14 04:27:53, type: I, status: T, bytes: 33411664
id: 468252, ts: 2022-08-15 04:30:51, type: I, status: T, bytes: 33901184
id: 469280, ts: 2022-08-16 04:39:34, type: I, status: T, bytes: 34387568
id: 469348, ts: 2022-08-17 03:05:50, type: D, status: T, bytes: 296000768
id: 469434, ts: 2022-08-17 05:32:26, type: I, status: T, bytes: 0
id: 469581, ts: 2022-08-18 04:54:43, type: I, status: T, bytes: 35381280
id: 469732, ts: 2022-08-19 04:37:01, type: I, status: T, bytes: 35880480
id: 469866, ts: 2022-08-20 04:35:48, type: I, status: T, bytes: 36347632
id: 470009, ts: 2022-08-21 04:20:31, type: I, status: T, bytes: 36834704
id: 470138, ts: 2022-08-22 04:26:50, type: I, status: T, bytes: 37295168
id: 471108, ts: 2022-08-23 04:48:51, type: I, status: T, bytes: 37782544
id: 471242, ts: 2022-08-24 05:21:17, type: I, status: T, bytes: 38216256
id: 471369, ts: 2022-08-25 05:05:27, type: I, status: T, bytes: 38696192
id: 471498, ts: 2022-08-26 04:55:31, type: I, status: T, bytes: 39151152
id: 471625, ts: 2022-08-27 04:42:30, type: I, status: T, bytes: 39612720
id: 471753, ts: 2022-08-28 04:28:22, type: I, status: T, bytes: 40090576
id: 471882, ts: 2022-08-29 04:27:55, type: I, status: T, bytes: 40551264
id: 472823, ts: 2022-08-30 04:49:17, type: I, status: T, bytes: 41039552
id: 472958, ts: 2022-08-31 05:23:41, type: I, status: T, bytes: 41540912

The advice at T316655#8198609 still fully applies, as I don't like how that list looks, and it would be, I think, two small patches. Is anything unclear about what the ideal backup workflow should be here? If it is just a question of a refresher because too much time has passed, feel free to ping me on IRC.

It was mostly a refresher because too much time had passed, and a double check on whether anything had changed since then.
Just to understand correctly: for Bacula it is better/required to have a directory, right? (your proposed /srv/postgres-backup/latest/)
Because from our point of view we could even have a single file with the latest backup.

File or directory is the same, but it has to have an exact name, contain no other files and not use soft links. Then we will switch to full backups rather than incrementals. I suggested the directory model because at some point we would like to integrate this into wmfbackups, but that is out of scope here.

Perfect. Last question: should we keep it compressed, or would it make sense to have it uncompressed to allow the deltas to work efficiently?

If the idea is to create full backups, it should be compressed; deltas won't work unless files are byte-identical. That is why I want to switch to full backups: in general, each dump will be different.
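
You can see this quickly with checksums (paths illustrative): gzip stores a timestamp in its header, and any real data change shifts the rest of the compressed stream, so consecutive dumps are never byte-identical and a file-level incremental ends up copying the whole file anyway.

sha256sum /srv/postgres-backup/psql-all-dbs-2022-08-*.sql.gz
# every dump produces a different checksum, even for near-identical contents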

Volans triaged this task as Medium priority.

Adding Data-Engineering, as the change will also affect an-db[1001-1002].eqiad.wmnet in addition to the Netbox DBs.

Change 994184 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] postgres backups: add hard link for latest

https://gerrit.wikimedia.org/r/994184

Change 994184 merged by Volans:

[operations/puppet@production] postgres backups: add hard link for latest

https://gerrit.wikimedia.org/r/994184

Let me know when you have a new backup in the new format, so we can test and deploy the Bacula changes.

Change 994742 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] netboxdb: increase local backup retention

https://gerrit.wikimedia.org/r/994742

Change 994743 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] netboxdb: change bacula settings for DB backup

https://gerrit.wikimedia.org/r/994743

@jcrespo we do have our first backup in netboxdb2002:

-rw-r--r-- 1 postgres postgres 46395077 Jan 31 13:37 psql-all-dbs-2024-01-31-13-37.sql.gz
-rw-r--r-- 2 postgres postgres 46415716 Jan 31 14:37 psql-all-dbs-latest.sql.gz
-rw-r--r-- 2 postgres postgres 46415716 Jan 31 14:37 psql-all-dbs-2024-01-31-14-37.sql.gz

and I forced one in netboxdb1002:

-rw-r--r-- 1 postgres postgres  45M Jan 31 01:23 psql-all-dbs-2024-01-31-01-23.sql.gz
-rw-r--r-- 2 postgres postgres  45M Jan 31 14:55 psql-all-dbs-2024-01-31-14-55.sql.gz
-rw-r--r-- 2 postgres postgres  45M Jan 31 14:55 psql-all-dbs-latest.sql.gz

I've sent the above patch (that might need fine-tuning) to start using it.
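
For reference, the hard-link step boils down to something like this sketch (not the literal content of the patch); the link count of 2 on the two newest entries in the listings above shows that both names point at the same data:

cd /srv/postgres-backup
latest="$(ls -t psql-all-dbs-20*.sql.gz | head -n1)"   # newest timestamped dump
ln -f "${latest}" psql-all-dbs-latest.sql.gz           # hard link (not a symlink), replacing the previous one
ls -l psql-all-dbs-*                                   # both names now show a link count of 2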

Let me do a manual run first and then we can do the follow-ups (I also need to deploy the backup-setting change to default to full backups).

If it is a manual, extra run and not too inconvenient, can we delete both the original file and its hard link after the backup, to simulate a full loss? (We can always do a new run afterwards.)

Change 994742 merged by Volans:

[operations/puppet@production] netboxdb: increase local backup retention

https://gerrit.wikimedia.org/r/994742

Change 994743 merged by Volans:

[operations/puppet@production] netboxdb: change bacula settings for DB backup

https://gerrit.wikimedia.org/r/994743

All tests with @jcrespo for Netbox were successful and the patches are merged, so this should be solved on the Netbox side of things.

@BTullis @brouberol I'll leave this task open for you to decide what to do regarding an-db[1001-1002].eqiad.wmnet. Once you have your first backup that includes the hard link with the 'latest' filename, you can prepare a patch like https://gerrit.wikimedia.org/r/c/operations/puppet/+/994743 and test it with @jcrespo.

Volans reopened this task as Open.
Volans moved this task from Backlog to Complete on the netbox board.
Volans moved this task from Backlog to Done on the Infrastructure-Foundations board.

Sorry, moving it across dashboard columns automatically resolved it; re-opening.

Change 997935 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] [DPE Postgres] Only backup the latest postgres dump file

https://gerrit.wikimedia.org/r/997935

Change 998337 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update data-engineering-postgresql bacula job defaults

https://gerrit.wikimedia.org/r/998337

Change 997935 merged by Btullis:

[operations/puppet@production] [DPE Postgres] Only backup the latest postgres dump file

https://gerrit.wikimedia.org/r/997935

Change 998337 merged by Btullis:

[operations/puppet@production] Update data-engineering-postgresql bacula job defaults

https://gerrit.wikimedia.org/r/998337

This is looking good. I will check in the following days that the an-db backups are being produced (that is, the data copy to Bacula; the correctness of the dump itself should, at the moment, be the service owner's responsibility 0:-) ).

Do you know if there are other Postgres DBs using this tool for backups that we should convert, or is it OK to resolve this for now?

@BTullis thanks for the follow-up on your hosts.

@jcrespo I think we're good: this specific backup approach is set up via the postgresql::backup class in Puppet, and that matches only hosts that have already been converted.

$ sudo cumin 'C:postgresql::backup'
5 hosts will be targeted:
an-db[1001-1002].eqiad.wmnet,netbox-dev2002.codfw.wmnet,netboxdb2002.codfw.wmnet,netboxdb1002.eqiad.wmnet

Resolving for now, feel free to re-open if there is any issue with the latest change in the next few days.

an-db backups looking good:

✔️ root@backup1001:~$ check_bacula.py an-db1001.eqiad.wmnet-Daily-productionEqiad-data-engineering-postgres
id: 552007, ts: 2024-02-08 04:05:00, type: F, status: T, bytes: 643284416
id: 552157, ts: 2024-02-09 04:05:01, type: F, status: T, bytes: 645145616
id: 552309, ts: 2024-02-10 04:05:00, type: F, status: T, bytes: 646974416
id: 552458, ts: 2024-02-11 04:05:00, type: F, status: T, bytes: 648762976
id: 552609, ts: 2024-02-12 04:05:00, type: F, status: T, bytes: 650570592
id: 552760, ts: 2024-02-13 04:05:00, type: F, status: T, bytes: 652389680
id: 552918, ts: 2024-02-14 04:05:01, type: F, status: T, bytes: 654207808
id: 553088, ts: 2024-02-15 04:05:00, type: F, status: T, bytes: 655958880
id: 553260, ts: 2024-02-16 04:05:01, type: F, status: T, bytes: 657687120
id: 553428, ts: 2024-02-17 04:05:00, type: F, status: T, bytes: 659326944
id: 553597, ts: 2024-02-18 04:05:00, type: F, status: T, bytes: 660873424
id: 553760, ts: 2024-02-19 04:05:00, type: F, status: T, bytes: 662617920
id: 553924, ts: 2024-02-20 04:05:00, type: F, status: T, bytes: 665207840
id: 554109, ts: 2024-02-21 04:05:01, type: F, status: T, bytes: 630091008
✔️ root@backup1001:~$ check_bacula.py an-db1002.eqiad.wmnet-Daily-productionEqiad-data-engineering-postgres
id: 552008, ts: 2024-02-08 04:05:00, type: F, status: T, bytes: 182492896
id: 552158, ts: 2024-02-09 04:05:01, type: F, status: T, bytes: 193882176
id: 552310, ts: 2024-02-10 04:05:01, type: F, status: T, bytes: 178497152
id: 552459, ts: 2024-02-11 04:05:00, type: F, status: T, bytes: 171867232
id: 552610, ts: 2024-02-12 04:05:00, type: F, status: T, bytes: 182933552
id: 552761, ts: 2024-02-13 04:05:00, type: F, status: T, bytes: 177721968
id: 552919, ts: 2024-02-14 04:05:01, type: F, status: T, bytes: 178077264
id: 553089, ts: 2024-02-15 04:05:08, type: F, status: T, bytes: 193632432
id: 553261, ts: 2024-02-16 04:05:01, type: F, status: T, bytes: 179027088
id: 553429, ts: 2024-02-17 04:05:00, type: F, status: T, bytes: 179340992
id: 553598, ts: 2024-02-18 04:05:00, type: F, status: T, bytes: 189788240
id: 553761, ts: 2024-02-19 04:05:00, type: F, status: T, bytes: 179704320
id: 553925, ts: 2024-02-20 04:05:08, type: F, status: T, bytes: 180750272
id: 554110, ts: 2024-02-21 04:05:02, type: F, status: T, bytes: 180212720