UPDATE: Data has been migrated to backup[12]00[89]. The only pending tasks are to migrate the bacula-director and to send backup[12]001 and its arrays for decommissioning.
We are running low on resources for backups: retention had to be reduced from 3 to 2 months to make sure we didn't run out of space.
backup[12]008 and backup[12]009 were recently purchased and set up. The plan is to migrate the production backups, alongside the director, to backup[12]009, and the database backups to backup[12]008. This will allow us to increase the retention back to 3 months, and potentially to increase the database backup coverage (longer retention, more frequent backups, long retention of snapshots, etc., TBD by the DBA team), as the older hosts (backup[12]001) will be free for other uses.
At the same time as the hardware maintenance, we want to increase the redundancy and reliability of backups across datacenters. While backups are currently copied redundantly to backup1001, we still have a single director. Although it is trivial in theory to set up an additional one if the existing one fails, it would be 1) easier if a second one were already active, and 2) more transparent to be able to create independent copies and run recoveries without the original one having to fail first. This model was tested successfully when database backups were set up, and we want to extend it to all other backups.
In order to make this change, we will have to set up a parallel, full bacula stack on codfw and test that it works before modifying the active eqiad one. We should also keep existing backed-up files for the retention they are expected to have: no existing backup should be deleted unless it is redundant or expired (>2-3 months).
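The deletion rule above could be written down as a small check to run before anything is removed during the migration. This is only a sketch under assumptions: the function name, the `has_redundant_copy` flag, and the 90-day approximation of "3 months" are all hypothetical, and the real retention is whatever the bacula pools are configured with.

```python
from datetime import datetime, timedelta

# Assumption: "3 months" is approximated as 90 days for illustration only.
RETENTION = timedelta(days=90)


def may_delete(finished_at: datetime, now: datetime, has_redundant_copy: bool) -> bool:
    """Return True only if a backup is expired or redundant (hypothetical helper)."""
    expired = now - finished_at > RETENTION
    # A backup that is neither expired nor duplicated elsewhere must be kept.
    return expired or has_redundant_copy
```

A backup that is still within retention and has no second copy is kept. How a "redundant" copy would actually be detected is left open here.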
Additionally, we may or may not need an extra active database for bacula on codfw (for mysql redundancy, too).