
Investigate incremental backups for GitLab
Closed, Resolved · Public

Description

One of the prerequisites before performing GitLab upgrades is creating a full backup. While performing the last upgrade, I noticed this step takes a long time, which is bound to increase with more GitLab usage. I was wondering if there is a way we can optimize this step.

One feature to test is the recently released incremental backup option.

Event Timeline

As an idea: I was combing through the docs (https://docs.gitlab.com/ee/raketasks/backup_gitlab.html#incremental-repository-backups) and noticed GitLab supports incremental backups:

sudo gitlab-backup create INCREMENTAL=yes PREVIOUS_BACKUP=<timestamp_of_backup>

@Jelto Is it possible that we adopt this going forward? Since the backup task runs daily, we would always have a near-fresh backup each time we perform an upgrade.
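
For context, PREVIOUS_BACKUP takes the leading portion of an existing archive's file name (everything before _gitlab_backup.tar). A sketch of how the value for the latest daily backup could be derived (the path and file names match the tests later in this task):

# newest archive in the backup directory, e.g. 1675123457_2023_01_31_15.7.5_gitlab_backup.tar
sudo ls -t /srv/gitlab-backup/ | head -1
# strip the trailing "_gitlab_backup.tar" to get the PREVIOUS_BACKUP value
sudo gitlab-backup create INCREMENTAL=yes PREVIOUS_BACKUP=1675123457_2023_01_31_15.7.5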

LSobanski triaged this task as Medium priority. Dec 12 2022, 4:55 PM
LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

> As an idea: I was combing through the docs (https://docs.gitlab.com/ee/raketasks/backup_gitlab.html#incremental-repository-backups) and noticed GitLab supports incremental backups:
>
> sudo gitlab-backup create INCREMENTAL=yes PREVIOUS_BACKUP=<timestamp_of_backup>
>
> @Jelto Is it possible that we adopt this going forward? Since the backup task runs daily, we would always have a near-fresh backup each time we perform an upgrade.

Thanks for opening the task! Incremental backups seem to be a fairly new feature (enabled in 14.10).

I like the idea and we should try out incremental backups. This would speed up the upgrade process but could also be useful in general. We should run some tests and compare the duration of a full and an incremental backup (and also of a restore). Those tests can happen on the replicas.

I have some concerns because the docs state "The chosen previous backup is overwritten." So we might see quite a lot of I/O here from extracting the tar archive and writing it again. But the tests will show.

This is also somehow related to T316935.

Alternatively, as discussed in the past, we could do incremental backups in a raw way: if we take a filesystem snapshot (e.g. with LVM) and perform hourly incrementals, we could get back to the same level of support that Gerrit currently has. Recovery should be much faster than reimporting data into a database. This would not be a substitute method, but complementary to the daily dumps.
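
A rough sketch of what the snapshot approach could look like with LVM (the volume group and logical volume names below are hypothetical placeholders, not the actual host layout):

# hourly: take a copy-on-write snapshot of the volume holding the GitLab data
# ("vg0" and "gitlab-data" are made-up names; --size reserves space for changes)
sudo lvcreate --snapshot --size 20G --name gitlab-snap-$(date +%Y%m%d%H) /dev/vg0/gitlab-data
# recovery: mount a snapshot read-only and copy the needed data back
sudo mount -o ro /dev/vg0/gitlab-snap-2023011710 /mnt/gitlab-snap
# cleanup: drop snapshots that fall outside the retention window
sudo lvremove -y /dev/vg0/gitlab-snap-2023011609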

LSobanski renamed this task from Optimize Gitlab Backups to Investigate incremental backups for GitLab. Jan 17 2023, 4:34 PM
LSobanski updated the task description.

Thanks @Jelto

> I have some concerns because the docs state "The chosen previous backup is overwritten." So we might see quite a lot of I/O here from extracting the tar archive and writing it again. But the tests will show.

Yes, you are absolutely right about this. I performed the incremental backup on gitlab1003 and it pretty much maxed out the disk usage. The related data can be viewed here: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=gitlab1003&var-datasource=thanos&var-cluster=misc&viewPanel=6&from=1675183965032&to=1675188746668

Honestly, I also did not notice any improvement time-wise compared to performing the backup the way we normally do. So maybe the improvements are almost unnoticeable with larger amounts of data?

> Yes, you are absolutely right about this. I performed the incremental backup on gitlab1003 and it pretty much maxed out the disk usage. The related data can be viewed here: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=gitlab1003&var-datasource=thanos&var-cluster=misc&viewPanel=6&from=1675183965032&to=1675188746668
>
> Honestly, I also did not notice any improvement time-wise compared to performing the backup the way we normally do. So maybe the improvements are almost unnoticeable with larger amounts of data?

From looking at the graphs and logs, it seems that simply a second backup was created: 1675185603_2023_01_31_15.7.5_gitlab_backup.tar (the timestamp is Tue Jan 31 2023 17:20:03 GMT+0000).
Disk space usage also shows that we stored a second backup: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=gitlab1003&var-datasource=thanos&var-cluster=misc&from=1675182752066&to=1675215588052&viewPanel=12
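
For reference, the leading number in the archive name is a Unix epoch timestamp, so the creation time can be double-checked with date:

date -u -d @1675185603
# Tue Jan 31 17:20:03 UTC 2023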

For documentation, here is the backup command that was executed:

/usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE=yes SKIP=builds,artifacts,registry GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1 INCREMENTAL=yes PREVIOUS_BACKUP=1675123457_2023_01_31_15.7.5

I just looked at the documentation again and it states:

> Only repositories support incremental backups. Therefore, if you use INCREMENTAL=yes, the task creates a self-contained backup tar archive. This is because all subtasks except repositories are still creating full backups.

So it seems we have to exclude everything except repositories to make incremental backups work (see excluding-specific-directories-from-the-backup); otherwise we will get a full backup.

The benefits here are limited if we can back up repositories only. Even for upgrades, I'd want to back up at least the database as well.
There is an open issue to enable incremental backups for all other sources too: https://gitlab.com/gitlab-org/gitlab/-/issues/19256

Maybe you can test the incremental backup again with repositories only, so we can see how long that takes?

@Jelto I tested this on gitlab1004.

aokoth@gitlab1004:~$ time sudo /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE=yes \
SKIP=db,uploads,builds,artifacts,lfs,terraform_state,registry,pages,packages \
GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1 INCREMENTAL=yes \
PREVIOUS_BACKUP=1679875463_2023_03_27_15.7.8

real    4m55.773s
user    0m24.317s
sys     2m0.759s

One new file was created in /srv/gitlab-backup, significantly smaller than the original backup. That means the new file only contains a repository backup, right?

aokoth@gitlab1004:~$ sudo ls -lah /srv/gitlab-backup/
total 71G
<redacted>...
-rw------- 1 git  git   57G Mar 27 00:47 1679875463_2023_03_27_15.7.8_gitlab_backup.tar
-rw------- 1 git  git  6.9G Mar 27 16:16 1679933743_2023_03_27_15.7.8_gitlab_backup.tar
<redacted>...

Thanks @Arnoldokoth for the tests. Yes, it seems the "incremental" backup is significantly smaller and faster than a full backup. But it's not fully clear to me what's happening internally and which backup file we need for a restore. Documentation around the incremental backup feature is quite limited. The output of backup jobs is also very limited (when the CRON=1 parameter is left in).

If I run the same command without INCREMENTAL=yes and without PREVIOUS_BACKUP=1679875463_2023_03_27_15.7.8, I get a similarly sized additional backup file in even less time:

time /usr/bin/gitlab-backup create STRATEGY=copy GZIP_RSYNCABLE=yes SKIP=db,uploads,builds,artifacts,lfs,terraform_state,registry,pages,packages GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1 PREVIOUS_BACKUP=1679961868_2023_03_28_15.7.8

real    2m9.635s
user    2m34.140s
sys     1m5.647s

If I run incremental backups without the SKIP parameter, I get a full backup:

time /usr/bin/gitlab-backup create STRATEGY=copy GZIP_RSYNCABLE=yes GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1 INCREMENTAL=yes
real    58m6.713s
user    48m1.385s
sys     9m50.958s

So I'm not sure what benefit we get from running incremental backups if we just end up with a dedicated repository backup. Maybe you can do some more tests regarding restoring the backup: how do we restore an incremental backup in comparison to a full backup, and which files do we need for a restore?
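
For reference, per the upstream restore documentation the restore task takes the same leading file-name portion; restoring the repositories-only archive from the test above would presumably look like this (whether that archive alone is sufficient is exactly the open question):

# the docs call for stopping the services that touch the database first
sudo gitlab-ctl stop puma
sudo gitlab-ctl stop sidekiq
sudo gitlab-backup restore BACKUP=1679933743_2023_03_27_15.7.8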

I think doing more frequent repository backups is still a good idea. But at the moment I don't understand the benefits of incremental backups over just backing up repositories (as discussed in T316935).

Yeah, I've also poked around and I'm not really seeing the value of incremental backups at the moment. They only work with repositories, as you said. I tested this on the cloud instance: I created a normal backup, created a new repository, then created an incremental backup. I also went through the contents of that backup and, of course, it only contained repositories. So I think this would only be viable if it supported all the objects.
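
For reference, a quick way to inspect what such an archive actually contains (the file name here is the repositories-only archive from the earlier gitlab1004 test):

sudo tar -tf /srv/gitlab-backup/1679933743_2023_03_27_15.7.8_gitlab_backup.tar | head
# a repositories-only archive should list repository bundles plus the backup_information.yml metadata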

I think we can close this for now.

Then let's close the task and focus on T316935.