
Define future design of GitLab backups
Closed, ResolvedPublic

Description

Current backup issues

GitLab backups were configured in T274463. After some tweaks they now work reliably. However, during the implementation it became clear that we will reach scaling limits with the current solution. As adoption and usage of GitLab grow, the key problems are:

  • backups take quite a lot of space ("more" in comparison to Gerrit)
  • backups need even more space during creation
  • backup creation takes quite a lot of time
  • backups are only done every 24h

So a long-term plan is needed for how we want to do backups for GitLab. This design should cover how the backups are created, backup frequency, storage location, rotation and, if possible, monitoring of backups. Furthermore, we need estimates of disk usage and of backup and restore times for future backup sizes.

Potential solutions

Some ideas discussed in T274463 and other tasks which could influence the design:

  • order hosts with bigger disks/get a second pair of disks (and tweak disk layout)
  • Setup partial/incremental backups for GitLab T316935 / T324506
  • use alternative backup strategies for example bacula, rsync, pgbouncer/dump - Introduce a lot of complexity and are not officially supported by GitLab
  • use alternative backup storage locations like a dedicated s3 bucket/minio/cloud or a locally mounted share - Doesn't solve the problem of local disk space needed to create the backup files

Final backup design

Some experimentation and discussion happened to evaluate the different options. See T330172#8657895 and related changes. We agreed to add additional disks to the GitLab hosts to mitigate the disk space issue (for creation and storage of a single backup).

Backup storage

Two additional 1.7TB disks were added to all GitLab hosts (except gitlab2002, which will be done after the switchover). These disks are configured as RAID1 and mounted at the default GitLab backup location /srv/gitlab-backup.
The additional disk space gives us plenty of room for future growth and for the migration to GitLab. Furthermore, the root partition also has more space now. According to the estimates in T330172#8657895, the space should last for at least two more years.

Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/gitlab1004--vg-root  813G   79G  693G  11% /
/dev/md1                         1.8T   65G  1.6T   4% /srv/gitlab-backup
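
As a rough illustration of how this headroom could be checked before a backup run, here is a minimal sketch. It is not part of the existing backup tooling: the function names are hypothetical, the factor of four is the worst-case creation peak from the estimates referenced above, and treating the 65G currently used on the partition as the size of the latest backup is an assumption.

```python
import shutil

GB = 1000**3  # decimal gigabytes; df -h rounds, so the figures are approximate


def free_bytes(path):
    """Free space on the filesystem containing `path`."""
    return shutil.disk_usage(path).free


def has_room_for_backup(free, last_backup_bytes, peak_factor=4):
    """True if free space covers the assumed peak usage during creation."""
    return free >= last_backup_bytes * peak_factor


# Figures taken from the df output above (1.6T available, 65G used).
print(has_room_for_backup(1600 * GB, 65 * GB))  # True
```

On a GitLab host the first argument could come from `free_bytes("/srv/gitlab-backup")` instead of a hard-coded value.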

Backup frequency

Additional disks don't solve the issue of backup duration and frequency. The plan here is to implement incremental backups (T324506)/partial backups (T316935). These backups will contain repository data and database changes but no CI artifacts (builds, packages). The latter make up most of the space and backup runtime.

We already have a partial backup in the backup script (T316935). Furthermore, some research on incremental backups is happening in T324506. These efforts should be consolidated. The goal is to keep doing full backups every 24 hours, with incremental backups in between (maybe every 4 hours in the beginning).
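
The intended cadence can be sketched as follows. This is only an illustration of the 24h/4h pattern described above; the function name is hypothetical, and the example date and the choice of midnight for the full backup are assumptions, not decisions made in this task.

```python
from datetime import datetime, timedelta


def backup_schedule(day_start, full_every_h=24, incremental_every_h=4):
    """Return (time, kind) tuples for one day of backups."""
    schedule = []
    for hour in range(0, 24, incremental_every_h):
        # A slot that lines up with the full-backup interval is a full backup.
        kind = "full" if hour % full_every_h == 0 else "incremental"
        schedule.append((day_start + timedelta(hours=hour), kind))
    return schedule


# Arbitrary example day; prints one full backup followed by five incrementals.
for when, kind in backup_schedule(datetime(2023, 5, 1)):
    print(when.strftime("%H:%M"), kind)
```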

Event Timeline

Jelto triaged this task as High priority.Feb 21 2023, 3:34 PM
Jelto moved this task from Incoming to Backlog on the collaboration-services board.

I've done some more research on the above ideas:

Setup partial/incremental backups for GitLab T316935 / T324506

This seems to work only for repositories and a subset of data sources. So it would reduce the size of backups at the cost of doing incomplete backups. For a long-term solution we have to be able to do full backups (for example for a switchover), so this does not solve our problem. Partial/incremental backups can still be an option to optimize backups and reduce the delta between backups.

use alternative backup strategies for example bacula, rsync, pgbouncer/dump

This seems to be possible but would introduce quite a lot of complexity. A major refactoring of the backup and restore process would be needed, and we would have to move away from the built-in backup and restore tooling and use other tools like rsync, LVM snapshots and Postgres backups. This is not officially supported by GitLab and needs a custom implementation.

use alternative backup storage locations like a dedicated s3 bucket/minio/cloud or a locally mounted share

Storing backups in a different location would allow us to use more space. However, backups are generated locally, which needs quite a lot of space. So this doesn't solve the problem of the local disk space needed to create the backup files. We are already doing something quite similar with Bacula, by sending multiple backups to Bacula and storing them there.

backup size vs backup creation size estimates

Regarding the problem of actual backup size and peak usage during backup creation I created the following dashboard:

gitlab_peak_disk_usage.png (881×1 px, 65 KB)

There you can see that backups grow by about 13.5GB/month, with a current backup size of around 50GB (yellow line). If backups keep growing at a similar rate over the next year, we should see a backup size of roughly 200GB in 2024.

The problem is that backups need significantly more space during creation, because data is copied and zipped (green line). The green line shows the peak usage during a single day. You can see that this peak is roughly three times the actual backup size. During version upgrades, where additional backups are created, we see even bigger peaks in disk usage (four times the size).
This means that for an estimated backup size of 200GB in 2024, we would need at least four times that size for backup storage to make sure we are able to do backups properly. So the estimated peak usage would be around 800GB.
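
The estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch using the figures from this comment (50GB current size, 13.5GB/month growth, peak factor of four); the function names are hypothetical and constant linear growth is an assumption.

```python
def projected_backup_size_gb(current_gb, growth_gb_per_month, months):
    """Linear projection of backup size (assumes constant monthly growth)."""
    return current_gb + growth_gb_per_month * months


def peak_disk_usage_gb(backup_gb, peak_factor=4):
    """Space needed while a backup is being created.

    peak_factor=4 is the worst case seen during version upgrades;
    regular daily backups peak at roughly three times the backup size.
    """
    return backup_gb * peak_factor


size_2024 = projected_backup_size_gb(50, 13.5, 12)
print(size_2024)                # 212.0, rounded to roughly 200GB in the text
print(peak_disk_usage_gb(200))  # 800
```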

Order hosts with bigger disks/get a second pair of disks (and tweak disk layout)

This leads to the last and most promising option of scaling the backup disk by either adding bigger disks or ordering new hosts with bigger disks.

We would need at least two additional 800GB disks per host (in RAID1 to make sure we can fail-over to another machine with a failed disk). This needs to be discussed with DC-Ops first and new hardware has to be ordered.

Other options, like optimizing the partman config or changing the existing backup strategy, do not seem to free up enough space for the 2024 estimates.

The above estimates consider just the current GitLab usage. At some point we may migrate all projects from Gerrit to GitLab. The sizes on gerrit1001 are:

Repositories at /srv/gerrit/git need 45G currently
Additional /srv/gerrit/plugins/lfs need 20G currently.

So we need roughly an additional 65G of space if we want to store the same repository data in GitLab.

In total this would be 265G of estimated GitLab usage (old and new repo data). Factoring in the backup creation peak usage we would end up at around 1060G (4 times 265G). The new disks should give us at least this capacity.
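
The same calculation as a short sketch, using only the numbers from this comment. The variable names are illustrative, and it assumes that the Gerrit data would take up the same space once stored in GitLab.

```python
PEAK_FACTOR = 4        # worst-case multiplier observed during version upgrades

gitlab_2024_gb = 200   # estimated GitLab backup size in 2024
gerrit_repos_gb = 45   # /srv/gerrit/git on gerrit1001
gerrit_lfs_gb = 20     # /srv/gerrit/plugins/lfs on gerrit1001

total_gb = gitlab_2024_gb + gerrit_repos_gb + gerrit_lfs_gb
peak_gb = total_gb * PEAK_FACTOR

print(total_gb)  # 265
print(peak_gb)   # 1060
```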

LSobanski mentioned this in Unknown Object (Task).Mar 9 2023, 7:01 PM

Change 898791 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: use second pair of disks for /srv/gitlab-backup

https://gerrit.wikimedia.org/r/898791

@LSobanski additional ssd drives added to gitlab servers in eqiad

Change 898791 merged by Jelto:

[operations/puppet@production] install_server: use second pair of disks for /srv/gitlab-backup

https://gerrit.wikimedia.org/r/898791

Change 906030 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: simplify gitlab disk layout, drop lvm, use four SSDs

https://gerrit.wikimedia.org/r/906030

Change 906030 merged by Jelto:

[operations/puppet@production] install_server: simplify gitlab disk layout, drop lvm, use four SSDs

https://gerrit.wikimedia.org/r/906030

Change 906565 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: hard code raid sizes for gitlab partman recipe

https://gerrit.wikimedia.org/r/906565

Change 906565 merged by Jelto:

[operations/puppet@production] install_server: hard code raid sizes for gitlab partman recipe

https://gerrit.wikimedia.org/r/906565

Change 906596 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: fix line break in gitlab parman recipe

https://gerrit.wikimedia.org/r/906596

Change 906596 merged by Jelto:

[operations/puppet@production] install_server: fix line break in gitlab parman recipe

https://gerrit.wikimedia.org/r/906596

Change 907819 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: start gitlab raids with smaller minimum size

https://gerrit.wikimedia.org/r/907819

Change 907819 merged by Jelto:

[operations/puppet@production] install_server: start gitlab raids with smaller minimum size

https://gerrit.wikimedia.org/r/907819

Change 907893 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: change device names in gitlab-raid1

https://gerrit.wikimedia.org/r/907893

Change 907893 merged by Jelto:

[operations/puppet@production] install_server: change device names in gitlab-raid1

https://gerrit.wikimedia.org/r/907893

Change 908491 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: use raidid in gitlab-raid1 recipe

https://gerrit.wikimedia.org/r/908491

Change 908491 merged by Jelto:

[operations/puppet@production] install_server: use raidid in gitlab-raid1 recipe

https://gerrit.wikimedia.org/r/908491

Change 908832 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: configure root raid only on gitlab-raid1

https://gerrit.wikimedia.org/r/908832

Change 908832 merged by Jelto:

[operations/puppet@production] install_server: configure root raid only on gitlab-raid1

https://gerrit.wikimedia.org/r/908832

Change 909749 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: add script to create fs and raid for backup partition

https://gerrit.wikimedia.org/r/909749

Change 909749 merged by Jelto:

[operations/puppet@production] gitlab: add script to create fs and raid for backup partition

https://gerrit.wikimedia.org/r/909749

I'm closing this task as all design and research work has been done. Implementation has also happened with the installation of the additional disks. I also updated GitLab/Backup_and_Restore.

Some implementation work is still open around increasing the backup frequency and implementing incremental backups. This will happen in T324506 and T316935.