
Migrate gitlab storage to apus (also: backups from S3?)
Closed, Resolved (Public)

Description

With the "apus" multi-site S3 cluster ready (per T279621), we're going to migrate gitlab's storage to it as a first production user. I understand the plan is to detach one of the production replicas and move it first by way of testing.

This task is to track the migration work (and any other related work that becomes necessary).

Todo:

  • Rename the thanos_storage_enabled Hiera flag
  • Talk to @jcrespo about the backup strategy. Are we better off backing up from apus, or using the existing /srv/gitlab-backups arrangement – either way the amount of space would seem to be roughly the same
  • Detach an instance for testing and revive the test instances. Use this to test moving data over and back, and any backup cron jobs that we decide on
    • enable object storage on gitlab1003
    • sync all packages and artifacts
    • do some tests
      • backup runtime: around 11 minutes, backup size: around 30GB
      • download old packages
      • latency etc.
      • test read-only credentials
    • disable object storage again
  • Deployment plan
    • empty object storage again and remove test/replica files
    • create buckets and ACLs again
    • enable object storage on the production host
      • artifacts
      • packages
    • wait until sync is done
      • artifacts
      • packages
    • verify download of artifacts/packages still works
      • artifacts
      • packages
    • enable object storage for replicas/cleanup replicas
      • artifacts
      • packages
    • double check backups and bacula is capturing the packages-mirror
    • remove objects from automation (failover cookbooks)
      • artifacts
      • packages
    • update docs

Details

Other Assignee
Jelto
Related Changes in Gerrit:
Repo                 Branch       Lines +/-
operations/puppet    production   +6 -0
operations/puppet    production   +5 -33
operations/puppet    production   +2 -2
operations/puppet    production   +1 -2
operations/puppet    production   +1 -4
operations/puppet    production   +6 -1
operations/puppet    production   +79 -0
operations/puppet    production   +1 -1
operations/puppet    production   +2 -2
operations/puppet    production   +5 -1
operations/puppet    production   +1 -4
operations/puppet    production   +4 -1
operations/puppet    production   +1 -1
operations/puppet    production   +0 -8
operations/puppet    production   +2 -4
operations/puppet    production   +1 -0
operations/puppet    production   +2 -2
operations/puppet    production   +1 -1
labs/private         master       +0 -0
labs/private         master       +0 -0
operations/puppet    production   +13 -3
operations/puppet    production   +3 -0
operations/puppet    production   +30 -17
labs/private         master       +11 -0
operations/puppet    production   +12 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

gitlab-artifacts backups are failing quite a lot - there are many entries in the log about missing files. I'm unsure whether this is due to the operation mentioned at T378922#10842696 or to the nature of the data changing, but I wanted to flag that the backup there can't be trusted anymore (we have millions of errors in the Bacula log right now).

Please don't run backup tests against the main production backup storage - files cannot really be deleted there once stored, and large increases in metadata can really slow down backups in general (not only for GitLab, but for all other jobs). I can provide a dedicated host for GitLab/Gerrit, as promised, but I need some time to set it up (as planned); had I known this was coming, I could have sped up its setup.

Can you provide some more context here? I'm not aware of any dedicated gitlab-artifacts backup. The production backup in /srv/gitlab-backup/ should be one .tar file without artifacts. Are there backups in place for the apus s3://gitlab-artifacts/ bucket already?

21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85593": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85499": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85708": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85582": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85575": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85742": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85802": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85504": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85590": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85767": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85684": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85795": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85752": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85729": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85490": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85522": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85726": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85600": ERR=No such file or directory
21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/f9/02f99d2002c703f1669e358989f1663e1e38e96297dcb3bb70fb67b0d74fb877/2023_03_25/85793": ERR=No such file or directory

Interesting, thank you. I think this has nothing to do with the ongoing work here. Bacula is trying to back up the directory while a GitLab backup is being created. Unfortunately, GitLab dumps all files into the folder, tars everything, and then removes the other files again. That causes the No such file or directory error.

Surprisingly, our incremental backup still contains artifacts (whereas the full backup is excluding them). I'll fix that in a second.

@jcrespo is there a fixed schedule when Bacula visits the GitLab machine? Then we might be able to shift the backup schedule a bit. Or maybe it's possible to include only .tar files in the bacula::director::fileset, although that would not prevent Bacula from backing up incomplete .tar files.

Once we're on object storage, GitLab backup runtime should be significantly lower and the chance of overlap should also be smaller.

Thank you. I will try to separate those jobs to the dedicated storage hosts asap.

@jcrespo is there a fixed schedule when Bacula visits the GitLab machine?

They are scheduled at 4am UTC; however, sometimes there are delays. For example, today they ran at 7 due to unexpected clogging caused by T394883.

Change #1148804 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: also exclude artifacts from partial backups

https://gerrit.wikimedia.org/r/1148804

s3://gitlab-packages is empty after several hours (s3cmd del --force --recursive s3://gitlab-packages/). Using the same approach for s3://gitlab-artifacts is not really feasible due to the high latency and the huge number of objects. I was thinking about deleting the bucket directly, but I'm not sure what exactly happens to the objects in that case, and whether there are any runtime benefits.

I'll just leave the artifacts in the bucket for now and start a new sync on the production host once object storage is enabled again.

If it's helpful, there are admin commands for bucket deletion (that also remove the contained objects) which are, I think, both quicker and kinder to the cluster (radosgw-admin bucket rm --bucket=<bucket name> --bypass-gc --purge-objects); I could run that for you.

That would be great, so we can start with a fresh bucket and versioning.

So radosgw-admin bucket rm --bucket=gitlab-artifacts --bypass-gc --purge-objects would be the command needed, then. I already emptied gitlab-packages manually with the s3cmd commands, so it's not strictly needed there. But for consistency we could do the bucket rm for both gitlab-artifacts and gitlab-packages. Then I can re-create the buckets and apply the ACLs again.
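
For reference, re-creating the buckets and re-applying the ACLs could look roughly like this with s3cmd (a sketch: the read-only grantee is a placeholder, and the actual grants should mirror whatever the original buckets used):

# re-create the buckets
s3cmd mb s3://gitlab-artifacts
s3cmd mb s3://gitlab-packages
# re-apply a read grant for the read-only credentials (placeholder account ID)
s3cmd setacl s3://gitlab-artifacts --acl-grant=read:<read-only-account-id>
s3cmd setacl s3://gitlab-packages --acl-grant=read:<read-only-account-id>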

Mentioned in SAL (#wikimedia-operations) [2025-05-21T09:32:24Z] <Emperor> radosgw-admin bucket rm --bucket=gitlab-packages --bypass-gc --purge-objects T378922

Mentioned in SAL (#wikimedia-operations) [2025-05-21T09:34:06Z] <Emperor> radosgw-admin bucket rm --bucket=gitlab-artifacts --bypass-gc --purge-objects T378922

@Jelto both buckets deleted.

Thanks a lot for the help with radosgw-admin! Unfortunately s3://gitlab-artifacts still exists:

jelto@cumin2002:~$ s3cmd ls
2025-04-15 08:46  s3://gitlab-artifacts
jelto@cumin2002:~$ s3cmd ls s3://gitlab-artifacts/ --recursive | wc -l
400253

Is the bucket rm command still running/pending? If not, could you re-try the delete?

Ah, the bucket is gone from eqiad, but codfw is still catching up:

root@moss-be2001:/# radosgw-admin bucket sync status --bucket=gitlab-artifacts
          realm 5d3dbc7a-7bbf-4412-a33b-124dfd79c774 (apus)
      zonegroup 19f169b0-5e44-4086-9e1b-0df871fbea50 (apus_zg)
           zone acc58620-9fac-476b-aee3-0f640100a3bb (codfw)
         bucket :gitlab-artifacts[64f0dd71-48bf-45aa-9741-69a51c083556.75705.1])
   current time 2025-05-22T09:33:40Z

    source zone 64f0dd71-48bf-45aa-9741-69a51c083556 (eqiad)
  source bucket :gitlab-artifacts[64f0dd71-48bf-45aa-9741-69a51c083556.75705.1])
                incremental sync on 11 shards
                bucket is behind on 11 shards
                behind shards: [0,1,2,3,4,5,6,7,8,9,10]

This is annoying (and perhaps my fault for trying to be too clever); I presume that while the removal on the master zone was quick, it still ends up being replicated to the secondary zone as a long series of individual object deletions. I'll monitor how that is progressing (I'm hoping this is a case of "wait, it'll get there"), and also see if I can figure out a better approach for next time we want to delete a bucket with O(100k) objects in it.

Change #1148804 merged by Jelto:

[operations/puppet@production] gitlab: also exclude artifacts from partial backups

https://gerrit.wikimedia.org/r/1148804

Great, thanks! Running the zone sync over the weekend is also fine from my side. I have some concerns it might take much longer than the weekend, but let's see. I also opened a subtask to double-check GitLab's artifacts retention policy (T395014). The policy should delete artifacts older than 7 days, but 400k sounds like a bit too many.

I am working on setting up the dedicated gitlab/gerrit storage host, but at the moment not yet on a specific backup solution for this ticket; I'm waiting first to get some guidelines from the team about the requirements and the kind of recovery strategy needed.

I understand that the actual migration is the priority first (not intending to rush it); I just want to clarify that I won't have started until we speak again about an analysis from your team on how to approach it, and after that I will talk to Matthew on how to best implement it. This is more about setting expectations that this part is not progressing at the moment (and that is OK for me!).

The dedicated backup host is my priority for now, as its resources will be needed no matter which solution is chosen.

This went nowhere over the weekend, so I deployed Large Hammers: radosgw-admin bucket rm --bypass-gc --purge-objects --bucket=gitlab-artifacts --yes-i-really-mean-it in the secondary zone (codfw), followed by radosgw-admin metadata sync init to sort out the metadata inconsistency, and then radosgw-admin data sync init --source-zone eqiad to re-sync the data (and then also a restart of the rgws in eqiad, per the docs). I think we are now back to being properly in sync.
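
For reference, the recovery sequence described above as commands (exactly as named in this comment; the bucket removal runs against the secondary zone):

# in the secondary zone (codfw): force-remove the bucket and its objects
radosgw-admin bucket rm --bucket=gitlab-artifacts --bypass-gc --purge-objects --yes-i-really-mean-it
# re-initialise metadata and data sync from the primary zone
radosgw-admin metadata sync init
radosgw-admin data sync init --source-zone eqiad
# then restart the radosgw daemons in eqiad, per the multi-site docs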

Thanks a lot @MatthewVernon, I can confirm the buckets are gone. I'll re-create the buckets soon and apply the ACLs. Before enabling object storage for the artifacts, I'll spend a day or two on T395014. Maybe we can avoid such hard measures in the future if it's possible to delete some outdated artifact files.

Yeah, the apus cluster isn't ideal for buckets with a very large number of objects in them (if we wanted to start aiming to support such a use case, we'd want to use some SSD/NVMe specifically for bucket indexes, which I think would involve some hardware hassle), so if it's straightforward to keep the artifacts bucket to a smaller number of objects, that'd be nice.

Change #1148796 merged by Jelto:

[operations/puppet@production] gitlab: enable object storage for gitlab-artifacts in production

https://gerrit.wikimedia.org/r/1148796

The total number of artifacts was reduced from 400k to around 100k in T395014. I enabled object storage for the artifacts and triggered a sync to object storage using sudo gitlab-rake gitlab:artifacts:migrate. I'll monitor Ceph and GitLab during the migration.
Once the migration is finished, I'll enable object storage for artifacts on the replicas as well.
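
A simple way to watch the migration progress is to compare the number of artifact files still on local disk with the number of objects in the bucket (a sketch; the local path is the Omnibus default and is an assumption here):

# artifacts still on local disk (default Omnibus path; adjust if different)
find /var/opt/gitlab/gitlab-rails/shared/artifacts -type f | wc -l
# artifacts already migrated to object storage
s3cmd ls --recursive s3://gitlab-artifacts/ | wc -l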

Change #1153655 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "gitlab: enable object storage for gitlab-artifacts in production"

https://gerrit.wikimedia.org/r/1153655

Change #1153655 merged by Jelto:

[operations/puppet@production] Revert "gitlab: enable object storage for gitlab-artifacts in production"

https://gerrit.wikimedia.org/r/1153655

Change #1153942 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable object storage for gitlab-artifacts in production

https://gerrit.wikimedia.org/r/1153942

Change #1153942 merged by Jelto:

[operations/puppet@production] gitlab: enable object storage for gitlab-artifacts in production

https://gerrit.wikimedia.org/r/1153942

The artifact upload issues have been resolved (T396018).

CI job logs and metrics look normal, so I'll trigger the migration to move the production GitLab CI artifacts to object storage.

All CI artifacts have been successfully migrated to object storage. The overall disk usage on the GitLab host has already dropped from 50% to 39%, even though packages are still not using object storage.

I'll enable object storage on the replicas as well and will follow up soon regarding the backup situation for gitlab-packages.

Change #1154020 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable object storage on all hosts

https://gerrit.wikimedia.org/r/1154020

Change #1154020 merged by Jelto:

[operations/puppet@production] gitlab: enable object storage on all hosts

https://gerrit.wikimedia.org/r/1154020

Change #1155146 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: remove artifacts from failover backup

https://gerrit.wikimedia.org/r/1155146

Change #1155146 merged by Jelto:

[operations/puppet@production] gitlab: remove artifacts from failover backup

https://gerrit.wikimedia.org/r/1155146

Thank you for the work on dedicated hardware. In T378922#10804784 I tried to outline the current requirements for the backup needs for all GitLab data on object storage. Where do you need more details?

One addition to the requirements:

Artifacts" include about 400k smaller objects totaling around 40GB. (some of those 400k files could probably be deleted/expired).

Artifacts were reduced to a bit under 100k objects and 10GB in size in T395014: Check GitLab artifact retention time.

I'm sorry, but I thought that was an "outline", a summary of our discussion, with some, but not a lot of relevant data about how to implement backups. For example, it says: "Data can be restored"

But what data? And what does it mean to restore it? An object store is like a database, an ever-changing storage system. So, how frequently should it be backed up? How does it change (e.g. what percentage of files are added/updated/removed)? When restoring, at what point in time should it be brought back? E.g. let's say daily backups are OK - does that mean we have to "snapshot" it every day, restore it in full, and keep it for 3 months? That would make Bacula an actual option, as it works fine for "full-like" backups, but it would probably be quite slow, because the data is stored in a very serialized format. It would also mean losing up to 24 hours of writes.

On the other hand, if only a small number of files change each time, then we could back up in a diff-like way, maybe using timestamps and some object storage system to save space. Other storage systems could allow a parallel restore, which may be a requirement to do it quickly enough.

Again, that depends on what the recoveries look like. What data is expected to be recovered? What does it look like, and what does the process look like? Would you expect to recover only the whole of the data, or individual objects, or something partial? How would you control that? What is the maximum amount of time reasonable for any of those operations? Those are the main answers I need to design how the backups are done. E.g. if 200 GB are created daily, we would require ~20TB of space over 90 daily backups.

I'm sorry, but I thought that was an "outline", a summary of our discussion, with some, but not a lot of relevant data about how to implement backups. For example, it says: "Data can be restored"

The full sentence was focusing on the restore scenario and what happens to GitLab's availability during a restore: "Data can be restored without downtime, but a higher 404 rate may occur." So we don't have to schedule downtime for GitLab during the restore.

But what data?

The data in object storage primarily consists of CI-generated objects (job logs, artifacts, builds, etc.). More specifically, "Packages" consist of around 4,000 larger objects totaling approximately 140GB, and "Artifacts" include about 100k smaller objects totaling around 10GB.

And what does it mean to restore it?

The ideal restore scenario would be to just copy the backed-up data into the current object storage buckets in apus. It would also be possible to create new buckets with the backed-up data, or to copy the data directly onto the GitLab host, in which case we would have to migrate it to object storage manually again.
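
A minimal sketch of that "copy back into the bucket" path, assuming the backed-up data sits in a local mirror directory such as the one later used for the packages backup:

# push the backed-up files back into the bucket, overwriting existing objects
s3cmd sync /srv/gitlab-backup/packages-mirror/ s3://gitlab-packages/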

An object store is like a database, an ever-changing storage system. So, how frequently should it be backed up?

Given the expendable nature of the data, a weekly backup schedule seems sufficient.

How does it change (e.g. what percentage of files are added/updated/removed)?

About 0.5% of the data changes daily; the rest stays the same. We keep only 180 days of CI files in GitLab, so that's roughly 1/180 per day.

When restoring, at what point in time should it be brought back? E.g. let's say daily backups are OK - does that mean we have to "snapshot" it every day, restore it in full, and keep it for 3 months? That would make Bacula an actual option, as it works fine for "full-like" backups, but it would probably be quite slow, because the data is stored in a very serialized format. It would also mean losing up to 24 hours of writes.

A restore should happen within a few days, with no strong preference for exactly when. I don't expect big disruptions from artifacts or packages being missing for a few days. However, losing them completely would be disruptive for end users and deployments. Losing some hours or days of writes is also fine. We just should not lose all of the data.

On the other hand, if only a small number of files change each time, then we could back up in a diff-like way, maybe using timestamps and some object storage system to save space. Other storage systems could allow a parallel restore, which may be a requirement to do it quickly enough.

As mentioned above, roughly 0.5% of the data changes daily, and the data consists of small individual files which could be diffed; some files might be a bit bigger (100-500MB) for big release binaries.

Again, that depends on what the recoveries look like. What data is expected to be recovered? What does it look like, and what does the process look like? Would you expect to recover only the whole of the data, or individual objects, or something partial?

I would expect all of the backed-up CI files (artifacts and packages) to be restored from the backup and written to the object storage bucket at once, in bulk. All existing files in object storage (if any) could be deleted/overwritten. So just a full restore, no restore of individual files.

How would you control that? What is the maximum amount of time reasonable for any of those operations?

Restore should happen within a few days, with no strong preference for exactly when. An acceptable restore time could be anything between a few hours and one or two days. I'm open to how the restore is controlled. Anything from a cookbook, a custom script, pinging Data Persistence, or opening a Phab task is fine, I think. I don't expect the restore to happen more than once a year, at most.

Those are the main answers I need to design how the backups are done. E.g. if 200 GB are created daily, we would require ~20TB of space over 90 daily backups.

A rough estimate would be 10GB of new data per week (weekly backups).

Thanks, that's more insightful and helpful. I will give it a think, maybe talk to Matthew, and try to work on a proposal that will work for those constraints.

I want to give you a (non-)update that I haven't forgotten about this. Sadly, there was a need to do some basic infrastructure refactoring (setting up new servers, decommissioning old ones and migrating data, upgrading to bookworm) that was a blocker for this (no GitLab backup storage server could be set up without doing that), but after that finishes (hopefully next week), this will be my next task.

I leave you with some homework meanwhile: T387833#10952842

Thanks for the update, @jcrespo!

T387833#10952842 should be unrelated to the effort of migrating GitLab to object storage. But thanks for raising this issue; we will look into it for Gerrit.

My proposal to move forward is to sync the files from object storage to a local folder on the GitLab host. Ideally, we could sync the files to a folder like /srv/gitlab-backups/packages, which would already be part of a bacula::director::fileset. So we would need a job that regularly runs s3cmd sync s3://gitlab-packages /srv/gitlab-backups/packages (like every 24h or every 7d).

Currently, there are roughly 4800 package files with a total size of 264GB, so mostly big files. The performance issues we saw with 100k+ artifacts should not be a problem here. Also, the diff between each sync should be relatively small, with only a dozen new packages per day. So Bacula would just see a few new packages each day instead of a single big archive blob which can't be diffed properly. The regular GitLab backup would be significantly smaller after that change, so I'd expect less disk usage for the GitLab backups with the new approach, without knowing the details of Bacula.

The proposed solution is quite GitLab-specific. However, we could try to implement a profile::ceph::s3::client::sync_local profile to abstract the use case of syncing a bucket to a local folder. This profile could then be reused in more generic backup tooling later on.
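
A sketch of what such a job could run periodically (the --delete-removed flag and the exact paths are assumptions; note that trailing slashes matter for how s3cmd compares source and destination):

# bucket-to-local sync, run e.g. from a systemd timer or cron entry
s3cmd sync --delete-removed s3://gitlab-packages/ /srv/gitlab-backups/packages/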

If that unblocks you, I am OK with that. Sadly, because other priorities keep entering Data Persistence with unbreak-now priority, and given the very limited people resources we have now, there was no time to work on this yet. So please feel free to go ahead with a quick but useful solution for you, and at a later time we will be able to take over that automation and own it in a more unified and general way in the backup infrastructure, so it can also be applied to other similar needs.

Please understand that this is one of the big tasks we have pending, and it is documented and communicated, but sadly only a very limited amount of project work is possible at the moment, and we are currently just reactive rather than proactive. Hopefully things will change when/if more hiring happens.

Change #1189120 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] ceph: add module to sync a bucket locally

https://gerrit.wikimedia.org/r/1189120

Change #1189444 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable object storage for packages

https://gerrit.wikimedia.org/r/1189444

Change #1189120 merged by Jelto:

[operations/puppet@production] ceph: add module to sync a bucket locally

https://gerrit.wikimedia.org/r/1189120

Change #1191984 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] ceph::client::sync_local: fix ensure for directory

https://gerrit.wikimedia.org/r/1191984

Change #1191984 merged by Jelto:

[operations/puppet@production] ceph::client::sync_local: fix ensure for directory

https://gerrit.wikimedia.org/r/1191984

Sync from object storage to a local folder works with the new ceph::client::sync_local module. I tested this on GitLab replica gitlab2002:

jelto@gitlab2002:~$ sudo systemctl start s3-sync-bucket.service
jelto@gitlab2002:~$ sudo ls -l /srv/gitlab-backup/packages-mirror
total 4
-rw-r--r-- 1 jelto wikidev 6 Sep 29 07:31 test.txt

The test.txt file which I added to the s3://gitlab-packages bucket was synced to the GitLab host.

I'll disable the sync on the replica again (it was just enabled as a test) and enable it for the production host. Then I'll enable usage of object storage for GitLab packages and trigger a migration.
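
To check when the sync last ran and whether it succeeded, the usual systemd tooling applies (the timer name is an assumption based on the service name above):

# show the timer schedule and the last run's log output
systemctl list-timers 's3-sync-bucket*'
journalctl -u s3-sync-bucket.service --since today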

Change #1192060 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: enable bucket sync on production host

https://gerrit.wikimedia.org/r/1192060

Change #1192060 merged by Jelto:

[operations/puppet@production] gitlab: enable bucket sync on production host

https://gerrit.wikimedia.org/r/1192060

Change #1189444 merged by Jelto:

[operations/puppet@production] gitlab: enable object storage for packages

https://gerrit.wikimedia.org/r/1189444

Jelto updated the task description. (Show Details)

Change #1192506 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: fix s3 bucket sync re-download

https://gerrit.wikimedia.org/r/1192506

Change #1192506 merged by Jelto:

[operations/puppet@production] gitlab: fix s3 bucket sync re-download

https://gerrit.wikimedia.org/r/1192506

All GitLab packages have been migrated to apus object storage. The additional sync from the bucket to the backup directory /srv/gitlab-backup also works, and the performance of the apus cluster is great. A small fix was needed (see patch above) to make sure s3cmd does not pull all packages on every run. But after fixing the trailing slashes in the folder path, the sync works without generating too much traffic.

The GitLab internal backup is significantly faster (50 minutes instead of 6 hours) and smaller (55GB instead of 400GB).

Also, bconsole shows the packages in the most recent backup:

$ cd /srv/gitlab-backup/packages-mirror
cwd is: /srv/gitlab-backup/packages-mirror/
$ ls
00/
03/
...
ef/
f6/
fa/
test.txt

The backup metrics look as expected. I'll monitor the metrics for the next few runs.

I'll continue with some cleanup which should reduce the backup size in Bacula even more. We don't need a dedicated "partial" backup anymore (that backup excluded packages, and we no longer need it). Also, the replicas still have the packages on disk, so a bit of cleanup is needed there as well.

I'll also update https://wikitech.wikimedia.org/wiki/GitLab/Object/Storage.

Do we need daily full backups for objects? Assuming only a few objects change per day, can't we do incrementals? We currently don't have the space to back up 500 GB daily until we dedicate storage for GitLab.

Sorry, I am not sure whether you are referring to the GitLab package backup or the object backup. The packages should be full, but I was hoping to do incrementals for objects.

Change #1192535 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: remove packages from daily full backups

https://gerrit.wikimedia.org/r/1192535

Do we need daily full backups for objects? Assuming only a few objects change per day, can't we do incrementals? We currently don't have the space to back up 500 GB daily until we dedicate storage for GitLab.

Sorry, I am not sure whether you are referring to the GitLab package backup or the object backup. The packages should be full, but I was hoping to do incrementals for objects.

Daily full backups are not needed for objects/packages. You are right, just a few files are added to the packages per day and 99%+ of the files are not changing. I opened https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192535 to create a dedicated fileset for the packages mirror on the GitLab host. I'm not sure which job defaults are used then, but incremental should be fine; weekly backups would also work. I'm not sure if there is such a policy.

I'm not sure which job defaults are used then, but incremental should be fine

The default config "Monthly" has daily incrementals and weekly differentials/fulls. We use the "Daily" config only for compressed packages such as GitLab exports & database dumps.

Change #1192562 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: disable and remove partial backup

https://gerrit.wikimedia.org/r/1192562

Change #1192562 merged by Jelto:

[operations/puppet@production] gitlab: disable and remove partial backup

https://gerrit.wikimedia.org/r/1192562

Change #1192535 merged by Jelto:

[operations/puppet@production] gitlab: remove packages from daily full backups

https://gerrit.wikimedia.org/r/1192535

Deployment-wise everything is done; GitLab artifacts and packages now use the object storage backend.

I also updated the docs in https://wikitech.wikimedia.org/wiki/GitLab/Backup_and_Restore and https://wikitech.wikimedia.org/wiki/GitLab/Object/Storage.

Talk to @jcrespo about the backup strategy. Are we better off backing up from apus, or using the existing /srv/gitlab-backups arrangement – either way the amount of space would seem to be roughly the same

For the backup part, I'm not fully sure whether we should resolve this task. GitLab uses a new generic Puppet module to clone the bucket into a Bacula-monitored folder for backups. This solution is not ideal, but it works for our use case. So maybe it makes sense to open a follow-up task for a more generic approach. Also, some backup refactoring is happening in T403946, which should be independent of this task.

So I'm inclined to resolve this task and address the backup topics in T403946 and another follow-up task.

All good with me to close and follow up later, but please let's merge T403946 ASAP (not asking you - I've been the one who is always too busy to finish it :-( ).

Great, then I'll resolve this task. I opened T406824: Evaluate generic backup tooling for object storage buckets as a follow-up to track the object storage backup work in a more generic way.

I'll keep an eye on T403946. Unfortunately there is some planned Gerrit maintenance this week, so this is blocked (at least for today). We could move forward with GitLab and delay Gerrit a bit.

Thanks again @MatthewVernon and @jcrespo for all the support here. This was a big step forward for GitLab and resolved the long-running, large backups and other pain points we had with GitLab. Backup, restore and failover times are now significantly reduced (from 5+ hours to around 1h).