
Check GitLab artifact retention time
Closed, ResolvedPublic

Description

GitLab has accumulated over 400k CI artifacts which causes problems with storage operations like backups or object storage migration.

GitLab is configured for a 7 day and 1 day retention; https://gitlab.wikimedia.org/admin/application_settings/ci_cd also says 1 day. So the number of artifacts should be limited to the most recent ones. However, the number of artifacts appears to grow linearly and keeps increasing.

So we should double check

  • check the default artifact expiration at the instance level
  • check the default artifact expiration at the project level
  • check that artifacts are actually expired and deleted
  • ensure artifacts are deleted
  • maybe manually delete old artifacts with a script or custom job

There is a job for the artifact expiry, Sidekiq::Cron::Job.find('expire_build_artifacts_worker'), and recent log entries can be found with grep 'ExpireBuildArtifactsWorker' /var/log/gitlab/sidekiq/current.

Todos:

  • test lower ci_delete_pipelines_in_seconds, default_artifacts_expire_in and archive_builds_in_human_readable on the replica
  • set reasonable value (6mo) for settings above
    • replicas
    • production
  • update gitlab-settings
  • automate configuration of expiry policy somewhere (gitlab-settings ci, puppet, configure-projects, ...?)
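
As a rough illustration of the todo items above, the three settings take the same retention period in two different forms: one in seconds and two as human-readable durations. The sketch below assumes that "6mo" is approximated as 180 days (the value used later in this task) and that the human-readable settings accept a string like "180 days". The helper name retention_values is hypothetical.

```python
from datetime import timedelta


def retention_values(days):
    """Express one retention period in the forms the three settings expect."""
    # ci_delete_pipelines_in_seconds takes an integer number of seconds.
    seconds = int(timedelta(days=days).total_seconds())
    # Assumption: the other two settings accept a duration string like "180 days".
    human = f"{days} days"
    return {
        "ci_delete_pipelines_in_seconds": seconds,
        "default_artifacts_expire_in": human,
        "archive_builds_in_human_readable": human,
    }


values = retention_values(180)
for name, value in values.items():
    print(f"{name} = {value!r}")
```

Keeping all three values derived from one number makes it harder for the pipeline cleanup, artifact expiry and job archiving periods to drift apart.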

Event Timeline

Jelto triaged this task as High priority.
Jelto moved this task from Incoming to Work in Progress on the collaboration-services board.

I've done a bit more research using https://docs.gitlab.com/user/storage_management_automation/#manage-cicd-pipeline-storage on one of the replicas.

Instance-wide settings

The instance-wide setting Default artifacts expiration is set to 1 day, as expected (which is already quite aggressive and probably not what we want).

I also found the instance-wide setting Archive jobs in https://gitlab-replica-b.wikimedia.org/admin/application_settings/ci_cd. This was introduced in 17.9, so it's a relatively new feature, see https://docs.gitlab.com/ci/pipelines/settings/#automatic-pipeline-cleanup. I'll test if this setting has any effect on the number of artifacts.

Unfortunately, this setting does not propagate to existing projects or newly created ones, similar to Default artifacts expiration. So we might have to adjust the per-project settings for existing projects manually (with a small script maybe). I'll do a bit more research for the instance-wide settings.

Per-project settings

When looking at a random, rather big project like airflow-dags, I see over 28k artifact objects (not packages, just artifacts).

I found the per-project option Automatic pipeline cleanup, which was not set (so basically no cleanup at all). I set the cleanup policy to 15d as a test, which decreased the artifact files from 28698 to under 10000 (cleanup jobs are still running).

So this looks like a quite promising approach. But I still have to check if this setting is different from the instance-wide one or if it's the same mechanism.

If it works as expected, we could start with a conservative value like 180d and see how that affects the artifact counts. Setting it too low might be annoying for users who need to access older jobs for troubleshooting or via saved links.

My tests with airflow-dags look quite promising. After setting the cleanup to 15d the number of artifacts went down from 28698 to 1232. I also tested setting the automatic pipeline cleanup for all projects on the replica with a small snippet:

for project in projects:
    try:
        project.ci_delete_pipelines_in_seconds = 180*24*60*60 # 180d
        project.save()
        print(f"Updated project {project.path_with_namespace}")
    except Exception as e:
        print(f"Failed to update project {project.path_with_namespace}: {e}")

This already removed over 100k artifacts and freed 20+GB of disk space (cleanup jobs are still running).
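
The snippet above does not show how projects is obtained. A slightly more defensive variant could look like the following sketch. It assumes the project objects come from the python-gitlab client and expose path_with_namespace and save(), as in the snippet. The function name set_pipeline_cleanup, the dry_run flag and the placeholder URL and token are hypothetical and not part of the original script.

```python
def set_pipeline_cleanup(projects, days, dry_run=True):
    """Set ci_delete_pipelines_in_seconds on each project.

    Returns (updated, failed): the paths that were processed and a list of
    (path, error message) tuples for projects whose save() raised.
    """
    seconds = days * 24 * 60 * 60
    updated, failed = [], []
    for project in projects:
        try:
            project.ci_delete_pipelines_in_seconds = seconds
            if not dry_run:
                project.save()  # only talk to the API when explicitly asked to
            updated.append(project.path_with_namespace)
        except Exception as exc:
            # Keep going: one failing project should not stop the whole run.
            failed.append((project.path_with_namespace, str(exc)))
    return updated, failed


# Hypothetical wiring with python-gitlab (URL and token are placeholders):
#   import gitlab
#   gl = gitlab.Gitlab("https://gitlab.example.org", private_token="...")
#   projects = gl.projects.list(iterator=True)
#   updated, failed = set_pipeline_cleanup(projects, days=180, dry_run=False)
```

Defaulting to dry_run=True means a first run on a replica only reports which projects would be touched. Returning the failures instead of only printing them makes it easier to retry them later.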

The only downside is that, for some reason, jobs and job logs are no longer available, even for jobs newer than 15d. GitLab reports "This job does not have a trace." when a job is opened.

I'll restore the replica again and try to set all cleanup and expire policies to a consistent 180d. I suspect the instance-wide Default artifacts expiration might cause job deletions.

> The only downside is that, for some reason, jobs and job logs are no longer available, even for jobs newer than 15d. GitLab reports "This job does not have a trace." when a job is opened.

This has nothing to do with my cleanups but with excluding artifacts from the backups last week: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148804. Artifacts are just copied to the replicas for a full failover and will move to object storage soon (T378922). So this is expected and I forgot about that. Jobs that ran before artifacts were excluded from the backups are still available on the replica.

I'll do one last test with setting all retention and expiry dates to 180d on gitlab1003.

Jelto added subscribers: dancy, brennen.

I’ve set all retention and expiry dates to 180d on gitlab1003 (replica), both instance-wide and for all projects. This brought the number of artifacts down from 400k to 180k, though cleanup is still in progress. Disk usage dropped by around 40GB. I’ll keep an eye on how the cleanup continues and what the final numbers look like. I’ll also double-check job logs and try accessing jobs from before and after the 180-day mark.

@brennen @dancy do you have any concerns about setting Automatic pipeline cleanup to 180d across all projects on the production host? This will keep CI jobs for 180 days; after that, the job console output and files can no longer be accessed. I’d prefer to rely on simple instance-wide settings, but unfortunately they don’t seem to apply properly to either existing or new projects. So we’d need a small job somewhere to adjust the cleanup config. I’d use my ad-hoc script mentioned earlier and probably add a job in https://gitlab.wikimedia.org/repos/releng/gitlab-settings.

@Jelto Your plan looks reasonable to me. If it turns out that there are projects that have different retention requirements, we can add code to handle those exceptions.

Thanks for the feedback!

I created a dedicated artifact backup on gitlab2002 in case we need to restore anything:

jelto@gitlab2002:~$ sudo gitlab-backup create SKIP=db,repositories,uploads,pages,lfs,terraform_state,registry,packages,ci_secure_files,external_diffs BACKUP=artifacts

The summary for the test on gitlab1003 (replica) is: artifacts went down from 400k to 97k individual files, and disk usage dropped by around 55GB. Opening old jobs and artifacts still works for jobs newer than 6 months (I spot-checked that on the replica).

So I'll go ahead and set the expiry dates to 180d also on the production host. This should unblock the artifact migration to object storage then (T378922).

Mentioned in SAL (#wikimedia-operations) [2025-06-02T09:10:51Z] <jelto> update gitlab-settings artifact retention to 6 month - T395014

I updated the retention and cleanup settings on all GitLab instances and projects. I'll wait until the automatic cleanup jobs catch up and will monitor the progress.

The cleanup of old artifacts is done; the number of artifacts was reduced from 400k to around 100k. With 6mo of retention time there is still room for improvement, but we decided to start with a rather conservative value here.

The last step for this task is to automate the project configuration and set the Automatic pipeline cleanup for all projects (especially new projects). I'll try to integrate this into configure-projects because this script already runs regularly on the GitLab host.
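
Since such a script would run regularly, it could skip projects that already have the desired value and only touch new or drifted ones. The following is a minimal sketch of that check, not the code that ended up in gitlab-settings or configure-projects. It assumes that an unset policy shows up as None on the project object. The names DESIRED_SECONDS and projects_needing_update, and the demo projects, are made up for illustration.

```python
from types import SimpleNamespace

DESIRED_SECONDS = 180 * 24 * 60 * 60  # 180d, the value chosen in this task


def projects_needing_update(projects, desired=DESIRED_SECONDS):
    """Yield projects whose cleanup setting is unset or differs from desired."""
    for project in projects:
        # Assumption: a project without a cleanup policy reports None here.
        current = getattr(project, "ci_delete_pipelines_in_seconds", None)
        if current != desired:
            yield project


# Illustrative stand-ins for real project objects (hypothetical names).
demo = [
    SimpleNamespace(path_with_namespace="demo/new-project",
                    ci_delete_pipelines_in_seconds=None),
    SimpleNamespace(path_with_namespace="demo/test-15d",
                    ci_delete_pipelines_in_seconds=15 * 24 * 60 * 60),
    SimpleNamespace(path_with_namespace="demo/already-set",
                    ci_delete_pipelines_in_seconds=DESIRED_SECONDS),
]
for project in projects_needing_update(demo):
    print(project.path_with_namespace)
```

Filtering first keeps a recurring run cheap and makes its output a useful signal: anything it lists is either a new project or one whose setting was changed by hand.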

Change #1155152 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump gitlab-settings to v1.8.0

https://gerrit.wikimedia.org/r/1155152

Change #1155152 merged by Jelto:

[operations/puppet@production] gitlab: bump gitlab-settings to v1.8.0

https://gerrit.wikimedia.org/r/1155152

Change #1155542 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump gitlab-settings to v1.9.0

https://gerrit.wikimedia.org/r/1155542

Change #1155542 merged by Jelto:

[operations/puppet@production] gitlab: bump gitlab-settings to v1.9.0

https://gerrit.wikimedia.org/r/1155542

Change #1155545 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump gitlab-settings to v1.10.0

https://gerrit.wikimedia.org/r/1155545

Change #1155545 merged by Jelto:

[operations/puppet@production] gitlab: bump gitlab-settings to v1.10.0

https://gerrit.wikimedia.org/r/1155545

Change #1155568 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: bump gitlab-settings to v1.11.0

https://gerrit.wikimedia.org/r/1155568

Change #1155568 merged by Jelto:

[operations/puppet@production] gitlab: bump gitlab-settings to v1.11.0

https://gerrit.wikimedia.org/r/1155568

Automation to adjust the retention policy on a per-project level was added to gitlab-settings, so all projects now use a 6-month retention time.

I'll resolve the task.