Netbox: generate CSV backups
Closed, Resolved (Public)

Description

Currently we're performing PostgreSQL backups of Netbox, but if a wrong edit is made it would be pretty hard to manually find the right values for a quick revert without restoring the whole DB, potentially losing changes made by others.
In addition to the DB backup, we could also back up the data in CSV form, using the export-to-CSV function in Netbox.
A script in the netbox-deploy repo that uses the already existing token should do the job and should be fairly simple to add.

Caveat: for the DCIM devices we should use the custom "export all fields" CSV method instead of the default one.

Things to be decided:

  • which objects to export (all?)
  • how frequently to perform the backup
  • in which structure
    • For this my suggestion would be something like:
netbox-csv-backups/
    2019-05-14/
        dcim.devices.csv
        dcim.sites.csv
        ....
  • how/when to rotate/compress the files
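A minimal sketch of the proposed layout, assuming the rows for each object type have already been fetched as dictionaries. The function name `write_dump` and the shape of `tables` are hypothetical and only illustrate the one-directory-per-day structure suggested above.

```python
import csv
import datetime
from pathlib import Path


def write_dump(base_dir, day, tables):
    """Write one CSV per object type under base_dir/YYYY-MM-DD/.

    `tables` maps a file stem such as "dcim.devices" to a list of row dicts.
    """
    day_dir = Path(base_dir) / day.isoformat()
    day_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for stem, rows in sorted(tables.items()):
        path = day_dir / f"{stem}.csv"
        # Union of keys in first-seen order, so rows with missing fields still fit.
        fields = list(dict.fromkeys(key for row in rows for key in row))
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written
```

Calling `write_dump("netbox-csv-backups", datetime.date.today(), tables)` repeatedly on the same day overwrites that day's files in place.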

Event Timeline

Volans triaged this task as Medium priority. May 14 2019, 3:47 PM

Just to follow up on this: I spent some time trying to figure out how to initiate a template-based export by hitting a URL. It seems there's no way to do it through the API, and as far as I can tell hitting the URL endpoint doesn't work with token authentication.

That said, it'd be relatively trivial to use the API to export things, so I propose that we proceed by basically implementing the same functionality except via the API.
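As a rough illustration of the API-based approach (not the actual dumpbackup.py), the sketch below walks a paginated Netbox REST list endpoint with token authentication. It assumes the standard list response containing `results` and `next`; the function names and the injectable `fetch` parameter are hypothetical.

```python
import json
import urllib.request


def fetch_json(url, token):
    """GET one page of a Netbox REST API list endpoint using token auth."""
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Token {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_objects(first_url, token, fetch=fetch_json):
    """Yield every object from a paginated list endpoint, following `next`."""
    url = first_url
    while url:
        page = fetch(url, token)
        yield from page["results"]
        # `next` is null on the last page, which ends the loop.
        url = page.get("next")
```

The first URL would be something like `https://<netbox host>/api/dcim/devices/`; passing a different `fetch` makes the pagination logic testable without a live server.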

Change 518166 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/software/netbox-deploy@master] Add new dumpbackup.py script

https://gerrit.wikimedia.org/r/518166

Change 518166 merged by CRusnov:
[operations/software/netbox-deploy@master] Add new dumpbackup.py script

https://gerrit.wikimedia.org/r/518166

Mentioned in SAL (#wikimedia-operations) [2019-08-13T18:31:47Z] <crusnov@deploy1001> Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292

Mentioned in SAL (#wikimedia-operations) [2019-08-13T18:32:23Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (duration: 00m 36s)

Mentioned in SAL (#wikimedia-operations) [2019-08-13T18:32:31Z] <crusnov@deploy1001> Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292

Mentioned in SAL (#wikimedia-operations) [2019-08-13T18:33:09Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (duration: 00m 43s)

Mentioned in SAL (#wikimedia-operations) [2019-08-13T18:33:39Z] <crusnov@deploy1001> Started deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (fix perms)

Mentioned in SAL (#wikimedia-operations) [2019-08-13T18:33:49Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@367ca84]: Update Netbox to v2.6.1-wmf3 affects: T223292 (fix perms) (duration: 00m 09s)

This has now been fully deployed and tested. It is automated.

faidon raised the priority of this task from Medium to High.

It looks like this just dumps files flat, without keeping any archives. We've already lost a bunch of history unfortunately :(

The original description stated:

netbox-csv-backups/
    2019-05-14/
        dcim.devices.csv
        dcim.sites.csv
        ...

…and I think that's a much better idea. Let's do that ASAP.

You are correct that this was the spec; however, for a time Riccardo and I agreed that using bacula for history would be preferable.

Based on your feedback after the release, I have implemented a rotation strategy for this. I need to revisit the patch, but it has been signed off on, so I will get it merged this coming week.

Ping! I'd like to start killing old entries from esams, but I'd like to make sure we have them backed up first.

Ah roger, just need a quick deploy and done.

Mentioned in SAL (#wikimedia-operations) [2019-10-25T14:27:07Z] <crusnov@deploy1001> Started deploy [netbox/deploy@690f9ae]: deploy netbox scripts T223292

Mentioned in SAL (#wikimedia-operations) [2019-10-25T14:28:09Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@690f9ae]: deploy netbox scripts T223292 (duration: 01m 02s)

Mentioned in SAL (#wikimedia-operations) [2019-10-25T14:30:48Z] <crusnov@deploy1001> Started deploy [netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) T223292

Mentioned in SAL (#wikimedia-operations) [2019-10-25T14:30:53Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) T223292 (duration: 00m 05s)

Mentioned in SAL (#wikimedia-operations) [2019-10-25T14:31:45Z] <crusnov@deploy1001> Started deploy [netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) -T223292

Mentioned in SAL (#wikimedia-operations) [2019-10-25T14:32:29Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@690f9ae]: deploy netbox scripts (netbox2001) -T223292 (duration: 00m 44s)

Mentioned in SAL (#wikimedia-operations) [2019-10-25T16:04:08Z] <crusnov@deploy1001> Started deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox2001) T223292

Mentioned in SAL (#wikimedia-operations) [2019-10-25T16:04:51Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox2001) T223292 (duration: 00m 43s)

Mentioned in SAL (#wikimedia-operations) [2019-10-25T16:05:50Z] <crusnov@deploy1001> Started deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox1001) T223292

Change 545123 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] netbox: Enable CSV dump rotations.

https://gerrit.wikimedia.org/r/545123

Change 545123 merged by CRusnov:
[operations/puppet@production] netbox: Enable CSV dump rotations.

https://gerrit.wikimedia.org/r/545123

Mentioned in SAL (#wikimedia-operations) [2019-10-25T16:19:20Z] <crusnov@deploy1001> Finished deploy [netbox/deploy@0f4c92d]: deploy netbox scripts update (netbox1001) T223292 (duration: 13m 31s)

Okey-doke, rotations are in place. Just a note: the old pre-rotation dumps are still backed up in bacula, so those are available for historical data. From now on we'll have timestamped dumps, with several historical dumps retained, in addition to being backed up to bacula.

After running the numbers and looking at how the rotations work: the currently deployed version dumps 24 times a day and keeps 16 of them, which doesn't seem that useful. There is a patch, https://gerrit.wikimedia.org/r/#/c/operations/software/netbox-deploy/+/546241, that changes the script to overwrite the daily directory named after today's date (so it'll dump repeatedly to 2019-10-25 until 2019-10-26) and then rotates after 365 such dumps. This seems more useful in general.

Also, it is trivial to create a persistent dump, for example on the occasion of a major change to Netbox: do a manual dump and rename it to something like "esams_purge-2019-10-25". This preserves it from rotation, since the rotate script only looks at directories that start with 20*.
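The rotation rule described here can be sketched as follows. This is an illustration under assumptions, not the deployed rotate script: the function name `rotate` and its parameters are hypothetical, with `keep=365` and the `20` prefix taken from the comments above.

```python
import shutil
from pathlib import Path


def rotate(base_dir, keep=365, prefix="20"):
    """Delete the oldest dated dump directories, keeping the newest `keep`.

    Only directories whose name starts with `prefix` are candidates, so a
    renamed dump such as "esams_purge-2019-10-25" is never touched.
    """
    dated = sorted(
        p for p in Path(base_dir).iterdir() if p.is_dir() and p.name.startswith(prefix)
    )
    # ISO dates sort chronologically as plain strings, so the oldest come first.
    doomed = dated[:-keep] if keep > 0 else dated
    for path in doomed:
        shutil.rmtree(path)
    return [p.name for p in doomed]
```

Matching names in Python avoids relying on shell glob expansion for the `20*` pattern.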

As I've not yet fully understood the use case for those files, given that AFAIK most of them cannot be re-imported as-is into Netbox, it's hard for me to give feedback on the frequency of the backups and their retention.
If I have to ballpark it while keeping it simple, then the standard "hourly for a week, daily for the rest of the retention period" might be a good compromise.
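To make the "hourly for a week, daily afterwards" idea concrete, here is a small sketch of how the set of dumps to keep could be selected. The function name, the choice to keep the last dump of each day, and the 365-day retention default are all assumptions made for illustration.

```python
import datetime


def select_to_keep(stamps, now, hourly_days=7, retention_days=365):
    """Return the dump timestamps to keep under a two-tier policy.

    Every dump younger than `hourly_days` is kept; for older dumps only the
    last one of each calendar day is kept, until `retention_days` is reached.
    """
    keep = set()
    last_of_day = {}
    for stamp in sorted(stamps):
        age = now - stamp
        if age < datetime.timedelta(days=hourly_days):
            keep.add(stamp)
        elif age < datetime.timedelta(days=retention_days):
            # Stamps are visited in ascending order, so later ones overwrite earlier ones.
            last_of_day[stamp.date()] = stamp
    keep.update(last_of_day.values())
    return sorted(keep)
```

Anything not returned by `select_to_keep` would be a candidate for deletion.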

I believe there has been progress here since the last update. @crusnov what's the latest?

The above patch [1] has not yet been merged.

To summarize the current status:

  • we have hourly dumps at minute 37 from both netbox hosts at the same time (not optimal!) since Oct 25th
  • the garbage collection has no effect because of an error (globbing inside double quotes), so we are keeping all the hourly backups since Oct. 25th. If it were working, we would be keeping only the last 16 hours of dumps
  • we're also trying to garbage collect *.json files, but I don't see any in the dumps directory

[1] https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/546241

Yep, revisiting the rotation right now. We in any case have not *lost* anything, it is just non-optimal.

An update on this: after having had to restore a few entries and writing a script to make the CSV dumps actually usable for import, I will do some more work to make this a more convenient process overall.

General plan:

  • Push rotation changes into production
  • Integrate dumper into a customscript so that it's not killing the servers to do dumps
  • Make a version of the CSVs more appropriate for restoration.
    • The Netbox CSV import tries to do a lot of magic, so fields that one would presume are meant to be slugs are instead expected to be the full names of things. So we need at least one version of devices_full that is restorable.
  • Make a script to grep out the most recent version of a particular device record.
crusnov lowered the priority of this task from High to Medium. May 21 2020, 11:36 PM
crusnov moved this task from Patches / Testing / Pending to Complete on the netbox board.

Most of the above has been done in separate tasks; only the script to grep out the most recent version of a particular device record is missing, and it is not really needed right now. Resolving.
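For reference, the lookup of the most recent version of a device record could look roughly like the sketch below. It assumes the dated directory layout from the description and a `name` column in the devices CSV; the function name, file name and column name are hypothetical and may not match the real dumps.

```python
import csv
from pathlib import Path


def latest_record(base_dir, name, filename="dcim.devices.csv", column="name"):
    """Return (dump directory name, row) for the newest dump containing `name`."""
    dated = sorted(
        (p for p in Path(base_dir).iterdir() if p.is_dir() and p.name.startswith("20")),
        reverse=True,  # newest dated directory first
    )
    for day_dir in dated:
        path = day_dir / filename
        if not path.is_file():
            continue
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                if row.get(column) == name:
                    return day_dir.name, row
    return None
```

Because the search starts from the newest directory, a device deleted from Netbox is still found in the last dump that contained it.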