
Netbox: setup backups
Closed, Resolved (Public)

Description

Before we start using Netbox as a source of truth in production, we need to set up its backups. I think only the PostgreSQL DB needs backing up, but I'll let @ayounsi confirm.

Event Timeline

Volans triaged this task as Medium priority. Mar 20 2018, 4:48 PM
Volans created this task.

Correct, plus:
/srv/deployment/netbox/deploy/netbox/netbox/media/
and
/srv/deployment/netbox/deploy/netbox/netbox/reports/

As Netbox will be in full production as part of the quarterly goal, this needs to be done before that.

Change 447744 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netbox: have Bacula backups of netbox hosts

https://gerrit.wikimedia.org/r/447744

Change 447747 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netbox: define a Bacula fileset and apply it

https://gerrit.wikimedia.org/r/447747

The changes above will a) turn netbox hosts into "backup::hosts" via profile so they get all the needed Bacula agent stuff and then b) define a new fileset (paths to be backed up) for netbox and apply it.

The missing part would then just be a cron job that dumps pgsql into a path, plus adding that path to the fileset.

Change 447744 merged by Dzahn:
[operations/puppet@production] netbox: have Bacula backups of netbox hosts

https://gerrit.wikimedia.org/r/447744

Change 447747 merged by Dzahn:
[operations/puppet@production] netbox: define a new Bacula fileset and apply it

https://gerrit.wikimedia.org/r/447747

Change 447842 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] netbox: add psql dump cron and back it up

https://gerrit.wikimedia.org/r/447842

Change 447844 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql: add defined type to create db backups

https://gerrit.wikimedia.org/r/447844

Change 447844 merged by Dzahn:
[operations/puppet@production] postgresql: add class to create db backups

https://gerrit.wikimedia.org/r/447844

Change 447842 merged by Dzahn:
[operations/puppet@production] netbox: add psql backups

https://gerrit.wikimedia.org/r/447842

Change 449387 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql::backups: fix typo and amend usage notes

https://gerrit.wikimedia.org/r/449387

Change 449387 merged by Dzahn:
[operations/puppet@production] postgresql::backups: fix typo and amend usage notes

https://gerrit.wikimedia.org/r/449387

There is now a puppetized dir and cron job on the netbox machines. Copy/pasting the command from there and executing it manually as user postgres works and creates a gzipped dump file in /srv/postgres-backup (which I deleted again).

sudo -u postgres /usr/bin/pg_dumpall | gzip > /srv/postgres-backup/psql-all-dbs-`date "+%Y%m%d"`.sql.gz
root@netmon1002:/srv/postgres-backup# ls
psql-all-dbs-20180731.sql.gz
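A minimal sketch of how the one-liner above could be wrapped in a script, assuming it runs as user postgres. The function names and the overridable BACKUP_DIR variable are made up for illustration and are not what the puppetized cron job uses. Unlike the pipe, it writes the plain dump first so that a failing pg_dumpall is noticed before anything lands under the final name.

```shell
#!/bin/bash
# Sketch only: BACKUP_DIR and both function names are illustrative assumptions.
BACKUP_DIR="${BACKUP_DIR:-/srv/postgres-backup}"

# Print the date-stamped dump path, e.g. .../psql-all-dbs-20180731.sql.gz
# Optional first argument: the day as YYYYMMDD (defaults to today).
dump_path() {
    local day="${1:-$(date +%Y%m%d)}"
    printf '%s/psql-all-dbs-%s.sql.gz\n' "$BACKUP_DIR" "$day"
}

# Dump all databases, compress, and rename into place only if both steps succeed.
dump_all_dbs() {
    local target tmp
    target="$(dump_path)"
    tmp="${target%.gz}.tmp"          # uncompressed temporary file
    if /usr/bin/pg_dumpall > "$tmp" && gzip -f "$tmp"; then
        mv "${tmp}.gz" "$target"     # gzip -f replaced $tmp with $tmp.gz
    else
        rm -f "$tmp" "${tmp}.gz"     # leave no partial dump behind
        return 1
    fi
}
```

The trade-off is temporary disk space for the uncompressed dump; the original pipe avoids that but only reports gzip's exit status unless the shell's pipefail option is set.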

The path /srv/postgres-backup has been added to the Bacula fileset called "netbox".

Tomorrow I will double-check that the dump got created (at a random time), shows up in the Bacula console and can be restored from there.

Change 449409 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Fix typo in postgresql::backup

https://gerrit.wikimedia.org/r/449409

Change 449409 merged by Volans:
[operations/puppet@production] Fix typo in postgresql::backup

https://gerrit.wikimedia.org/r/449409

Change 449411 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Fix typo in postgresql::backup (2)

https://gerrit.wikimedia.org/r/449411

Change 449411 merged by Volans:
[operations/puppet@production] Fix typo in postgresql::backup (2)

https://gerrit.wikimedia.org/r/449411

Change 449607 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] bacula/postgresql: add a generic fileset for psql

https://gerrit.wikimedia.org/r/449607

Change 449607 merged by Dzahn:
[operations/puppet@production] bacula/postgresql: add a generic fileset for psql

https://gerrit.wikimedia.org/r/449607

Change 449874 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql::backup: fix "EOF in backquote substitution"

https://gerrit.wikimedia.org/r/449874

Change 449874 merged by Dzahn:
[operations/puppet@production] postgresql::backup: fix "EOF in backquote substitution"

https://gerrit.wikimedia.org/r/449874

Change 450061 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] posgresql::backup: fix "Unterminated quoted string" in cron tab

https://gerrit.wikimedia.org/r/450061

Change 450061 merged by Dzahn:
[operations/puppet@production] posgresql::backup: fix "Unterminated quoted string" in cron tab

https://gerrit.wikimedia.org/r/450061

Change 450257 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql::backup: don't run both crons at same minute

https://gerrit.wikimedia.org/r/450257

Change 450261 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql::dump: unify commands in a single cron job

https://gerrit.wikimedia.org/r/450261

Change 450257 merged by Dzahn:
[operations/puppet@production] postgresql::backup: don't run both crons at same minute

https://gerrit.wikimedia.org/r/450257

Finally there should be no more cronspam :p ... and we have actual files:

root@netmon1002:/srv/postgres-backup# ls
psql-all-dbs-20180803.sql.gz

root@netmon2001:/srv/postgres-backup# ls
psql-all-dbs-20180803.sql.gz

Also the dump and the cleanup cron don't run at the same time anymore.
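For illustration, a cleanup step of the kind mentioned above could look like the sketch below. The function name, the 7-day default and the crontab lines in the comments are assumptions, not the contents of the actual Puppet class. The comments also note the crontab rule that an unescaped % ends the command line, which typically produces shell errors like the ones named in the patch titles earlier; whether that was the cause here is a guess.

```shell
#!/bin/bash
# Sketch only: prune_old_dumps and its 7-day default are illustrative assumptions.

# Delete dump files in the given directory that are older than roughly N days.
# Only files matching the dump naming scheme are touched; deleted paths are printed.
prune_old_dumps() {
    local dir="$1" days="${2:-7}"
    find "$dir" -maxdepth 1 -type f -name 'psql-all-dbs-*.sql.gz' \
        -mtime +"$days" -print -delete
}

# Hypothetical system crontab lines (minutes and script paths are made up).
# Different minutes keep the dump and the cleanup from running together:
#   17 3 * * *  postgres  /usr/local/bin/pg-dump-all
#   47 3 * * *  postgres  /usr/local/bin/pg-prune-dumps
#
# If the date format is written inline in the crontab instead, every %
# has to be escaped, because cron treats a bare % as the end of the command:
#   ... psql-all-dbs-$(date "+\%Y\%m\%d").sql.gz
```

Calling a small script from cron, rather than putting the whole pipeline in the crontab entry, sidesteps the % escaping entirely.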

In bconsole on the Bacula host I can confirm the new netbox backup sets exist and /srv/postgres-backup is part of a fileset that can be restored from/to netmon1002 and netmon2001. The dump files that exist now should appear shortly, in the next incremental backup.

Files have been created for each day and are showing up in Bacula (helium) in bconsole now.

(The how-to is at: https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode))

bconsole:

$ cd postgres-backup/ 
cwd is: /srv/postgres-backup/
$ ls
psql-all-dbs-20180803.sql.gz
psql-all-dbs-20180804.sql.gz
psql-all-dbs-20180805.sql.gz
psql-all-dbs-20180806.sql.gz

I started a restore to netmon1002.

Using Catalog "production"
Run Restore job
JobName:         RestoreFiles
Bootstrap:       /var/lib/bacula/helium.eqiad.wmnet.restore.4.bsr
Where:           /var/tmp/bacula-restores
..
Backup Client:   netmon1002.wikimedia.org-fd
Restore Client:  netmon1002.wikimedia.org-fd
..
Job queued. JobId=103268

Mentioned in SAL (#wikimedia-operations) [2018-08-07T18:24:08Z] <mutante> netbox - restored database from dump file - backed up and back-up (T190184)

I also tested an actual restore of the database:

Dropped the live prod database, confirmed the web UI was down, then restored the DB from the dump file taken from backups, and the application was up again.
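As a rough sketch of the restore step, assuming the dump was produced by pg_dumpall as above and the function runs as user postgres: the function name is made up, and the procedure actually used is the one documented on the wikitech page, which may differ.

```shell
#!/bin/bash
# Sketch only: restore_all_dbs is an illustrative name, not an existing tool.

# Feed a gzipped pg_dumpall file back into the cluster via psql.
# Refuses to start if the file is unreadable or not a valid gzip archive.
restore_all_dbs() {
    local dump="$1"
    if [ ! -r "$dump" ]; then
        echo "cannot read dump file: $dump" >&2
        return 1
    fi
    gzip -t "$dump" || return 1
    # "-f -" makes psql read the SQL script from standard input;
    # "postgres" is the maintenance database to connect to initially.
    gunzip -c "$dump" | psql -f - postgres
}
```

A possible invocation, using a file name from the restore listing above and the restore location Bacula reported (the combined path is an assumption): `restore_all_dbs /var/tmp/bacula-restores/srv/postgres-backup/psql-all-dbs-20180806.sql.gz`. Messages about roles that already exist are expected when replaying a pg_dumpall file into a running cluster, which is why the sketch does not stop on the first SQL error.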

I created a wikitech page for Netbox incl. a section on backups and restore:

https://wikitech.wikimedia.org/wiki/Netbox#Restore

This should resolve the ticket.

Change 463820 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Netbox: set media directory

https://gerrit.wikimedia.org/r/463820

Change 463820 abandoned by Volans:
Netbox: set media directory

Reason:
Superseded by other patches by Cas that move directly to Swift

https://gerrit.wikimedia.org/r/463820