Investigate object storage for Gitlab
Closed, Resolved, Public

Description

Gitlab supports using an object storage service for hosting packages, artefacts, etc. One of the modes it supports is Swift. We should determine whether this is viable for our use as we scale Gitlab.

Docs are at https://docs.gitlab.com/ee/administration/object_storage.html

Why

In a backup of Gitlab, object storage makes up >85% of the total [0]. When we do a backup/restore, the backup process takes close to an hour, and the restore process takes 40-45 minutes. As we migrate from Gerrit to Gitlab, the backups will only get larger and the process will approach being unsustainable.

Moving object storage away from the on-host storage would free up resources, speed up switchovers/backups, and allow us to grow usage.

Questions to answer by Collab

  • Will this work from the Gitlab side?
  • Does this reduce our backup/restore time? By how much?
  • If we switch between datacentres, are references to objects maintained?

Questions to answer by Data Persistence

  • Is it feasible for us to use Swift for this purpose?
  • What would our limits on storage be? Is 100 GB reasonable? 200 GB?
  • How do we track our usage and help capacity plan?
  • Is the storage backed up?
  • Is it replicated across sites?

[0] ~60 GB as of the end of April

Event Timeline

eoghan triaged this task as Medium priority. May 9 2023, 9:24 AM

I think here we are talking about using the S3 protocol? That is currently only enabled on the thanos cluster (MOSS is a maybe-next-FY sort of thing, but will also do S3, but I think you probably don't want to wait for that). Thanos is also the only replicated cluster.

I don't think thanos is currently backed up; @jcrespo is maestro of backups.

Thanos is currently 80% full, and I don't think there is any expansion planned in the coming FY (@fgiunchedi might know otherwise); I think 100g (x3 since that's the replication factor of thanos) would be OK; but we don't want the cluster to get to close-to-full (I think performance slows and we become less fault-tolerant).

More generally, capacity-planning for swift is still very ad-hoc (and fixing that has yet to make it high enough up the priority ladder to get done).
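To make the "100g (x3)" arithmetic above concrete, here is a minimal sketch. The replication factor of 3 and the 80% starting point come from the comment; the cluster capacity is a made-up placeholder, since the real figure isn't stated in this task, and the function names are illustrative only.

```python
def raw_footprint_gb(logical_gb: float, replicas: int = 3) -> float:
    """Raw disk consumed across the cluster for a given logical size."""
    return logical_gb * replicas


def projected_utilisation(capacity_gb: float, used_fraction: float,
                          logical_gb: float, replicas: int = 3) -> float:
    """Fraction of raw capacity in use after adding logical_gb of new data."""
    used_gb = capacity_gb * used_fraction
    return (used_gb + raw_footprint_gb(logical_gb, replicas)) / capacity_gb


if __name__ == "__main__":
    # ASSUMPTION: placeholder capacity, NOT the real size of the thanos cluster.
    HYPOTHETICAL_CAPACITY_GB = 100_000
    for logical in (100, 200):
        after = projected_utilisation(HYPOTHETICAL_CAPACITY_GB, 0.80, logical)
        print(f"{logical} GB logical -> {raw_footprint_gb(logical):.0f} GB raw, "
              f"{after:.1%} of capacity used")
```

Plugging in the real raw capacity would show how much headroom a 100 GB or 200 GB allocation actually consumes.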

I don't think thanos is currently backed up; @jcrespo is maestro of backups.

I suggested doing it but the answer I got was no, so no current backups.

Thanos is currently 80% full, and I don't think there is any expansion planned in the coming FY (@fgiunchedi might know otherwise); I think 100g (x3 since that's the replication factor of thanos) would be OK; but we don't want the cluster to get to close-to-full (I think performance slows and we become less fault-tolerant).

That's correct re: expansion, I'll follow up in a separate task re: trimming thanos data retention so we don't run out of space

I don't think thanos is currently backed up; @jcrespo is maestro of backups.

I suggested doing it but the answer I got was no, so no current backups.

The naming is conflated, though: the answer I gave re: thanos backups was in the context of thanos metric data. With respect to other users of thanos (thanos the swift cluster), the backup decision should be made with the users themselves, I think.

That's correct re: expansion, I'll follow up in a separate task re: trimming thanos data retention so we don't run out of space

This is done, we're at ~72% used on Thanos now

Thanks for the feedback! Is there a test cluster that wmcs can connect to that we might be able to use with a test instance of gitlab in order to give it a try before we do this on any of the existing production nodes/replicas?

I'm afraid not (unless there's a thanos setup in beta); you could spin one up in a pontoon stack, but that might be more work than you wanted!

Hey @MatthewVernon, we're picking up on some of this work again and we'd like to test migrating our object storage to thanos, with a view to running it in production shortly afterwards if it does give us the wins we need. Can we get access to the cluster for use in Gitlab so we can run some tests? I'm working on a test plan now, I'll update the ticket with that shortly.

Change 944163 had a related patch set uploaded (by MVernon; author: MVernon):

[labs/private@master] thanos: fake credential for gitlab account

https://gerrit.wikimedia.org/r/944163

Change 944164 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/puppet@production] thanos: add gitlab user

https://gerrit.wikimedia.org/r/944164

Change 944164 merged by MVernon:

[operations/puppet@production] thanos: add gitlab user

https://gerrit.wikimedia.org/r/944164

Change 944163 merged by MVernon:

[labs/private@master] thanos: fake credential for gitlab account

https://gerrit.wikimedia.org/r/944163

@eoghan the account is created in thanos-swift and ready for use (and the credential can be templated via puppet).

If for whatever reason you decide not to go ahead with using thanos-swift for storage, can you let us know so I can remove the account again, please?

We will of course, thanks for getting that done so fast!

We have successfully transferred the first CI artefacts from the test server over to remote storage!

# gitlab-rake gitlab:artifacts:migrate
I, [2023-08-11T13:20:20.670527 #501050]  INFO -- : Starting transfer to object storage
I, [2023-08-11T13:20:22.958586 #501050]  INFO -- : Transferred Ci::JobArtifact ID 1 with size 2082 to object storage

Testing is going well so far. Next week, we plan to test failing over to a different datacentre to confirm that we can read objects from Swift in codfw that were written in eqiad.

Here's a rundown of things we learnt after testing this week.

Key points:

  • ✅ It works well
  • ✅ It makes backup/restores a lot faster
  • ✅ We can write in one datacentre and read in another
  • ‼️ We need a backup solution for packages/artifacts
  • ❌ Moving from local -> remote is easy, other way is harder (not a blocker)

We can move objects to Swift slowly

Importantly, we don't need to take the service down (aside from a configuration restart) to enable/disable remote object storage. Packages and artifacts will continue to be served by gitlab from both local and remote storage, since the Gitlab DB will still have the path to the old

Migration from local disk to swift is done with gitlab-rake gitlab:packages:migrate or gitlab-rake gitlab:artifacts:migrate. Migrating back from swift to local storage is also possible with gitlab-rake gitlab:artifacts:migrate_to_local.

Artifacts take several hours to transfer (many small files), while Packages are done within an hour (a few very large files).
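As a rough sketch of how the migration could be scripted, the wrapper below runs the rake tasks in sequence and stops on the first failure. It only uses the task names mentioned in this comment; the wrapper itself, its function names, and the dry-run default are assumptions for illustration, not something we ran during testing.

```python
import shlex
import subprocess

# Rake tasks named in this task; no other task names are assumed to exist.
TO_REMOTE = ["gitlab:packages:migrate", "gitlab:artifacts:migrate"]
TO_LOCAL = ["gitlab:artifacts:migrate_to_local"]


def build_commands(tasks):
    """Turn rake task names into gitlab-rake argument lists."""
    return [["gitlab-rake", task] for task in tasks]


def run_migration(tasks, dry_run=True):
    """Print each command and, unless dry_run, execute it.

    check=True makes a failing task raise, so later tasks are not started.
    Returns the list of commands for inspection.
    """
    commands = build_commands(tasks)
    for cmd in commands:
        print("+", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_migration(TO_REMOTE)  # dry run: prints the commands only
```

Packages are listed first because they finished within an hour in our test; the much longer artifacts transfer can then run without holding them up.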

We can build things

This wasn't a hugely scientific test, since the builds didn't all pass (probably missing different runner types). However, we switched runner-1021 over to point at gitlab-replica-old.wm.o and ran some builds, and saw that artifacts were correctly uploaded to swift. The process for changing the host a runner is connected to is simple, so this can be tested further in the future.

Backups are WAY faster and smaller

root@gitlab1003:/srv/gitlab-backup# time ./gitlab-backup.sh full

real    3m37.761s
user    4m35.737s
sys     1m37.851s

root@gitlab1003:/srv/gitlab-backup# du -sh 1692355388_2023_08_18_16.0.8_gitlab_backup.tar
9.7G    1692355388_2023_08_18_16.0.8_gitlab_backup.tar

root@gitlab1003:/srv/gitlab-backup# time ./gitlab-restore.sh
No REQUESTED_BACKUP provided, using latest backup /srv/gitlab-backup/1692355388_2023_08_18_16.0.8_gitlab_backup.tar
[snip]

real    11m34.955s
user    4m45.459s
sys     1m19.640s

This is compared to a full backup on another host (gitlab2002):

root@gitlab2002:/srv/gitlab-backup# time ./gitlab-backup.sh full

real    69m14.435s
user    62m27.256s
sys     12m27.617s

root@gitlab2002:/srv/gitlab-backup# du -sh 1692355785_2023_08_18_16.0.8_gitlab_backup.tar
90G     1692355785_2023_08_18_16.0.8_gitlab_backup.tar
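For a quick comparison of the two runs, the sketch below converts the `real` times from the transcripts into seconds and computes the ratios. The figures are copied from the output above; the helper name is illustrative. The two backups ran on different hosts, so treat the ratios as indicative rather than a controlled benchmark.

```python
import re


def parse_time(value: str) -> float:
    """Convert a `time` duration such as '69m14.435s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures taken from the transcripts above (sizes as reported by du -sh).
REMOTE_BACKUP_S = parse_time("3m37.761s")   # gitlab1003, objects in Swift
LOCAL_BACKUP_S = parse_time("69m14.435s")   # gitlab2002, objects on local disk
REMOTE_SIZE_G = 9.7
LOCAL_SIZE_G = 90.0

time_ratio = LOCAL_BACKUP_S / REMOTE_BACKUP_S
size_ratio = LOCAL_SIZE_G / REMOTE_SIZE_G

if __name__ == "__main__":
    print(f"backup time ratio: {time_ratio:.1f}x")
    print(f"backup size ratio: {size_ratio:.1f}x")
```

That works out to a backup roughly 19 times faster and 9 times smaller with objects held in Swift.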

Turning it off isn't entirely easy

There is no way of making an object storage bucket read-only. Thus, if we want to disable this/roll back to local storage, we need to migrate everything back to a local host first. This could take many hours to roll back (using the rake tasks as before). An option is to pause all runners, run the migration rake tasks, then add the following config lines to gitlab.rb:

gitlab_rails['object_store']['objects']['artifacts']['enabled'] = false
gitlab_rails['object_store']['objects']['packages']['enabled'] = false

We can write in one DC and read in another

We haven't tested the replication lag, but to test that replication works as expected, we did the following:

  1. Backed up gitlab1003.eqiad (to make sure the metadata was present on the codfw host)
  2. Restored the backup onto gitlab2003.codfw
  3. Updated the gitlab2003.codfw config to enable object storage
  4. Ensured that jobs loaded artifacts (e.g., job logs) correctly
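The check in step 4 was done by hand. One possible way to make it repeatable is to compare an object listing as written in one datacentre with the listing visible from the other. The sketch below assumes such listings can be obtained as key-to-checksum mappings; how to fetch them from the cluster is not covered here, and the function name and sample keys are hypothetical.

```python
def find_replication_gaps(written: dict, visible: dict) -> dict:
    """Compare two listings that map object key -> checksum.

    Returns a dict of key -> reason for every object in `written` that is
    missing from `visible` or has a different checksum there.
    """
    problems = {}
    for key, checksum in written.items():
        if key not in visible:
            problems[key] = "missing"
        elif visible[key] != checksum:
            problems[key] = "checksum differs"
    return problems


if __name__ == "__main__":
    # Hypothetical sample listings, for illustration only.
    eqiad = {"artifacts/1/job.log": "aaa", "artifacts/2/job.log": "bbb"}
    codfw = {"artifacts/1/job.log": "aaa"}
    for key, reason in find_replication_gaps(eqiad, codfw).items():
        print(f"{key}: {reason}")
```

Running the comparison repeatedly shortly after a write would also give a rough idea of replication lag, which we haven't measured.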

We'll need a backup solution

After moving packages/artifacts to remote storage, the gitlab-backup command no longer includes the actual files, only the metadata from the database. Since thanos is not backed up, we would need to create our own backup solution. This probably wouldn't be a huge deal, but it is something we need to do before we move forward with this.

We've wrapped up testing on this for the moment, and we're fairly happy that it's where we want to go in the future. We're going to hold off until a little later in the FY to do this, though, and look at this as part of a larger scaling project.

I've freed up the space in the thanos-swift cluster; the gitlab-artifacts and gitlab-packages containers have been removed. @MatthewVernon, I'm not sure if you'd prefer to disable the credentials for the Gitlab user, but we're finished with it for the next few months at least.

Thanks for letting me know; given you're planning to come back to this credential later in the FY, I'm going to leave it in place (since the process for adding/removing swift credentials is rather hasslesome).

Sounds good to us. Thanks for your help!

Closing this in favour of a more detailed rollout plan to come later this year.