
Jelto (jwodstrcil)
User

User Details

User Since
Jun 7 2021, 7:25 AM (234 w, 5 d)
Availability
Available
IRC Nick
jelto
LDAP User
Jelto
MediaWiki User
JWodstrcil (WMF) [ Global Accounts ]

Recent Activity

Yesterday

Jelto closed T411452: No space left on device on VRTS host as Resolved.

Thanks @Arnoldokoth for enabling the cleanup job again. The inode metrics look much better now and have stabilized at around 2.5% usage. Also thanks for the new alert, which should warn us before we see "No space left on device" errors again.

Fri, Dec 5, 2:44 PM · Wikimedia-Incident, collaboration-services, SRE, vrts, Znuny
Jelto reopened T409835: Apt-staging: add alerting as "Open".

An alert has been open for 11 days: Linting problems found for GitlabPackagePullerFailedOnRun:

Fri, Dec 5, 2:25 PM · collaboration-services

Thu, Dec 4

Jelto added a comment to T411240: SystemdUnitFailed - sync-gitlab-group-with-ldap.service on gitlab1004:9100.

I don't think a check will help. We can either

Thu, Dec 4, 10:19 AM · collaboration-services
Jelto added a project to T403125: Investigate WMCS Magnum for GitLab runners: collaboration-services.
Thu, Dec 4, 9:42 AM · collaboration-services, Release-Engineering-Team (Priority Backlog 📥), GitLab (CI & Job Runners)

Wed, Dec 3

Jelto added a comment to T411452: No space left on device on VRTS host.

Inode usage has already grown from 2% to 10% in the past day: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=now-2d&to=now&timezone=utc&var-node=vrts1003&viewPanel=panel-30. So I like the idea of enabling the cleanup job again.
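
For a quick spot check on the host itself, the same percentage can be derived from the filesystem counters. This is a minimal sketch, assuming Python is available on the host; the helper names and the `/` path are illustrative and not part of the VRTS setup or the Grafana panel.

```python
import os


def inode_usage_percent(total_inodes: int, free_inodes: int) -> float:
    """Return used inodes as a percentage of the total."""
    if total_inodes <= 0:
        # Some filesystems do not report a fixed inode count.
        return 0.0
    used = total_inodes - free_inodes
    return 100.0 * used / total_inodes


def inode_usage_for_path(path: str) -> float:
    """Read the inode counters of the filesystem that contains `path`."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree)


if __name__ == "__main__":
    # "/" is only an example; use the mount point that is filling up.
    print(f"{inode_usage_for_path('/'):.1f}% of inodes in use")
```

Keeping the arithmetic in its own function makes it easy to check without touching a real filesystem.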

Wed, Dec 3, 1:58 PM · Wikimedia-Incident, collaboration-services, SRE, vrts, Znuny

Tue, Dec 2

Jelto added a comment to T406762: gerrit2003 is trying to backup incrementally 3.5 million files every hour, clogging backus and filling in available disk space.

This task was for gerrit2003 which had a 55GB backup because, as part of a migration, the git repositories have been entirely rewritten. That specific cause has been resolved for now (but will trigger again next time the repos are wiped and regenerated via parent T387833).

@jcrespo for gerrit1003, that is the primary server and that is certainly a different issue. Can you please file a new task for that? If you have examples of files being touched that would help. Thanks!

Tue, Dec 2, 2:11 PM · Patch-For-Review, collaboration-services, Gerrit
Jelto added a comment to T409189: Free up inodes on vrts1003.

This has happened again in T411452

Tue, Dec 2, 8:38 AM · collaboration-services

Mon, Dec 1

Jelto added a comment to T409832: Apt-staging: add error handling to gitlab_package_puller.

@MoritzMuehlenhoff as @Dzahn mentioned, you might be interested to review what's been done on the gitlab-package-puller script. I'd be happy to have a chat about it if needed, let me know if you see any missing things or any room for improvement

The changes to the puller script and the alerting look good to me. I'm wondering, however, whether this should alert the "sre" team rather than just collaboration. If the alerts are flagged in #wikimedia-operations, the people who uploaded a broken package would also see that something failed.

Mon, Dec 1, 4:41 PM · collaboration-services
Jelto added a comment to T410756: SystemdUnitFailed - zuul-executor.service on zuul2002.

Ah, I moved this host back into the "insetup" role in October: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198128

That removed the docker.io package but it did not delete zuul services.

Not exactly sure why the alert started only now.

But first fix is to manually delete the service files:

root@zuul2001:/lib/systemd/system# rm zuul-
zuul-nodepool.service   zuul-scheduler.service  zuul-web.service
Mon, Dec 1, 3:32 PM · collaboration-services
Jelto updated Jelto.
Mon, Dec 1, 8:54 AM

Oct 29 2025

Jelto updated Jelto.
Oct 29 2025, 5:02 PM

Oct 28 2025

Jelto created T408527: Cleanup incoming folder on apt-staging regularly.
Oct 28 2025, 9:11 AM · collaboration-services, GitLab (Infrastructure)
Jelto closed T408109: python3-wmfmariadbpy fails to import on apt-staging with "already registered with different checksums" as Resolved.

After discussing with @MoritzMuehlenhoff I removed the duplicate package from /incoming on apt-staging2001. We also talked about cleaning the /incoming folder regularly with a job to prevent issues like this in the future.
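
A regular cleanup of that kind could look roughly like the following. This is only a sketch under assumptions: the function name, the age threshold, and the choice to key on modification time are hypothetical, and the actual job for apt-staging is not described here.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_files(directory: str, max_age_days: float,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files in `directory` older than `max_age_days`.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in Path(directory).iterdir():
        # Only touch regular files; leave subdirectories and symlinks alone.
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return sorted(removed)
```

A call such as `remove_stale_files("/path/to/incoming", max_age_days=7)` would then be run from a timer or cron job; both the path and the seven-day window are placeholders.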

Oct 28 2025, 9:00 AM · GitLab, collaboration-services, Infrastructure-Foundations, DBA
Jelto closed T408109: python3-wmfmariadbpy fails to import on apt-staging with "already registered with different checksums", a subtask of T407845: Build python3-wmfmariadbpy and wmfmariadbpy-common for Debian Trixie, as Resolved.
Oct 28 2025, 9:00 AM · Infrastructure-Foundations, DBA

Oct 27 2025

Jelto renamed T408303: ProbeDown (gerrit1003:443) from ProbeDown to ProbeDown (gerrit1003:443).
Oct 27 2025, 8:56 AM · Release-Engineering-Team, collaboration-services
Jelto added a parent task for T408303: ProbeDown (gerrit1003:443): Unknown Object (Task).
Oct 27 2025, 8:55 AM · Release-Engineering-Team, collaboration-services
Jelto closed T407993: Gitlab logo in light theme is not very defined as Resolved.

Thank you! I think that can be resolved, the logo looks good to me in a fresh browser. My current session still uses the old logo somehow.

Oct 27 2025, 8:55 AM · GitLab
Jelto added a comment to T407993: Gitlab logo in light theme is not very defined.

Thanks @Aklapper for submitting the change. I tweaked the colors a bit more, see my comment in the change.

Oct 27 2025, 8:48 AM · GitLab

Oct 23 2025

Jelto added a parent task for T408113: ProbeDown - gerrit1003: Unknown Object (Task).
Oct 23 2025, 1:23 PM · Release-Engineering-Team, collaboration-services
Jelto added a comment to T408109: python3-wmfmariadbpy fails to import on apt-staging with "already registered with different checksums".

Yes that was one of the files I was looking at. However the wmfmariadbpy package creates some more .deb files, see:

Oct 23 2025, 1:20 PM · GitLab, collaboration-services, Infrastructure-Foundations, DBA
Jelto created T408109: python3-wmfmariadbpy fails to import on apt-staging with "already registered with different checksums".
Oct 23 2025, 12:29 PM · GitLab, collaboration-services, Infrastructure-Foundations, DBA
Jelto updated the task description for T408064: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy).
Oct 23 2025, 8:16 AM · Patch-For-Review, collaboration-services, vm-requests, Infrastructure-Foundations, SRE
Jelto added a parent task for T408050: ProbeDown (gerrit1003): Unknown Object (Task).
Oct 23 2025, 7:16 AM · Release-Engineering-Team, collaboration-services
Jelto renamed T408050: ProbeDown (gerrit1003) from ProbeDown to ProbeDown (gerrit1003).
Oct 23 2025, 7:16 AM · Release-Engineering-Team, collaboration-services
Jelto added a project to T408064: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy): collaboration-services.
Oct 23 2025, 7:03 AM · Patch-For-Review, collaboration-services, vm-requests, Infrastructure-Foundations, SRE
Jelto added a parent task for T408064: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy): Unknown Object (Task).
Oct 23 2025, 7:03 AM · Patch-For-Review, collaboration-services, vm-requests, Infrastructure-Foundations, SRE
Jelto created T408064: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy).
Oct 23 2025, 7:03 AM · Patch-For-Review, collaboration-services, vm-requests, Infrastructure-Foundations, SRE

Oct 22 2025

Jelto renamed T407973: SystemdUnitFailed (full-backup.service on gitlab1004:9100) from SystemdUnitFailed to SystemdUnitFailed (full-backup.service on gitlab1004:9100).
Oct 22 2025, 12:24 PM · collaboration-services
Jelto added a subtask for T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5: T407973: SystemdUnitFailed (full-backup.service on gitlab1004:9100).
Oct 22 2025, 12:24 PM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto added a parent task for T407973: SystemdUnitFailed (full-backup.service on gitlab1004:9100): T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.
Oct 22 2025, 12:24 PM · collaboration-services
Jelto closed T407973: SystemdUnitFailed (full-backup.service on gitlab1004:9100) as Resolved.

Happened because of the upgrade T407943

Oct 22 2025, 12:24 PM · collaboration-services
Jelto closed T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5 as Resolved.

All instances were updated successfully. I'll resolve the task.

Oct 22 2025, 12:18 PM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto updated the task description for T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.
Oct 22 2025, 12:17 PM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto updated the task description for T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.
Oct 22 2025, 12:13 PM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto updated the task description for T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.
Oct 22 2025, 10:55 AM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto closed T407952: SystemdUnitFailed (rsync-lfs_replica_sync.service on gerrit2002) as Resolved.

This happened because of the reboot in T407110. It resolved automatically.

Oct 22 2025, 8:28 AM · collaboration-services
Jelto renamed T407952: SystemdUnitFailed (rsync-lfs_replica_sync.service on gerrit2002) from SystemdUnitFailed to SystemdUnitFailed (rsync-lfs_replica_sync.service on gerrit2002).
Oct 22 2025, 8:28 AM · collaboration-services
Jelto added a comment to T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.

Test hosts done, I'll proceed with the replicas soon.

Oct 22 2025, 7:45 AM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto updated the task description for T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.
Oct 22 2025, 7:44 AM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto moved T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5 from Incoming to Work in Progress on the collaboration-services board.
Oct 22 2025, 7:07 AM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto claimed T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.

I'll bump the apt package to 18.3* and start upgrading the test hosts.

Oct 22 2025, 7:07 AM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security
Jelto created T407943: GitLab Patch Release: 18.5.1, 18.4.3, 18.3.5.
Oct 22 2025, 7:06 AM · Vuln-VulnComponent, SecTeam-Processed, collaboration-services, GitLab (Infrastructure), Security

Oct 21 2025

Jelto added a comment to T407557: OpenSSH 10.1+ warns that Wikimedia SSH does not use post-quantum key exchange algorithm.

Assuming the gitlab ssh config needs changing, this needs the attention of the collaboration-services team. Production seems to be handled already.

Oct 21 2025, 8:54 AM · Release-Engineering-Team, Infrastructure-Foundations, GitLab

Oct 20 2025

Jelto closed T407606: Enroll Jeltos YubiKey for production access as Resolved.

My new FIDO ssh key was added and works and the old ssh key was removed. I'll resolve the task.

Oct 20 2025, 2:58 PM · SRE, SRE-Access-Requests

Oct 17 2025

Jelto created T407606: Enroll Jeltos YubiKey for production access.
Oct 17 2025, 9:07 AM · SRE, SRE-Access-Requests

Oct 13 2025

Jelto closed T406823: Remove gitlab-ce and gitlab-runner bullseye package and component as Resolved.

The gitlab-ce and gitlab-runner packages were removed from the Update section in the apt repo config. So the old buster and bullseye packages should no longer be updated. We agreed to keep the old packages around to make sure reprepro is not complaining.

Oct 13 2025, 12:08 PM · collaboration-services, GitLab
Jelto updated the task description for T406823: Remove gitlab-ce and gitlab-runner bullseye package and component.
Oct 13 2025, 12:05 PM · collaboration-services, GitLab
Jelto moved T406823: Remove gitlab-ce and gitlab-runner bullseye package and component from Incoming to Work in Progress on the collaboration-services board.
Oct 13 2025, 11:52 AM · collaboration-services, GitLab
Jelto updated the task description for T406823: Remove gitlab-ce and gitlab-runner bullseye package and component.
Oct 13 2025, 11:51 AM · collaboration-services, GitLab
Jelto claimed T406823: Remove gitlab-ce and gitlab-runner bullseye package and component.
Oct 13 2025, 11:51 AM · collaboration-services, GitLab
Jelto added a project to T406823: Remove gitlab-ce and gitlab-runner bullseye package and component: collaboration-services.
Oct 13 2025, 7:53 AM · collaboration-services, GitLab

Oct 9 2025

Jelto updated subscribers of T403946: Split repository backups (gerrit and gitlab) into its own dedicated storage daemons.

Is it ok to deploy now?

Oct 9 2025, 3:54 PM · collaboration-services, GitLab, bacula, Data-Persistence-Backup
Jelto closed T378922: Migrate gitlab storage to apus (also: backups from S3?) as Resolved.

Great, then I'll resolve this task. I opened T406824: Evaluate generic backup tooling for object storage buckets as a follow up to track the object storage backup work in a more generic way.

Oct 9 2025, 8:48 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto updated the task description for T378922: Migrate gitlab storage to apus (also: backups from S3?).
Oct 9 2025, 8:42 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto created T406824: Evaluate generic backup tooling for object storage buckets.
Oct 9 2025, 8:42 AM · Data-Persistence-Backup, collaboration-services, SRE-swift-storage, Ceph
Jelto added a comment to T403946: Split repository backups (gerrit and gitlab) into its own dedicated storage daemons.

Today is another scheduled maintenance for Gerrit: T387833. So this task is blocked again.

Oct 9 2025, 8:30 AM · collaboration-services, GitLab, bacula, Data-Persistence-Backup
Jelto created T406823: Remove gitlab-ce and gitlab-runner bullseye package and component.
Oct 9 2025, 8:22 AM · collaboration-services, GitLab
Jelto closed T406810: SystemdUnitFailed (backup-restore.service on gitlab2002) as Resolved.

Related to the upgrade in T406801.

Oct 9 2025, 8:01 AM · collaboration-services
Jelto added a parent task for T406810: SystemdUnitFailed (backup-restore.service on gitlab2002): Unknown Object (Task).
Oct 9 2025, 8:00 AM · collaboration-services
Jelto renamed T406810: SystemdUnitFailed (backup-restore.service on gitlab2002) from SystemdUnitFailed to SystemdUnitFailed (backup-restore.service on gitlab2002).
Oct 9 2025, 8:00 AM · collaboration-services

Oct 8 2025

Jelto added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

Deployment-wise everything is done, gitlab artifacts and packages use the object storage backend now.

Oct 8 2025, 2:03 PM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto updated the task description for T378922: Migrate gitlab storage to apus (also: backups from S3?).
Oct 8 2025, 1:57 PM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto moved T406733: registry.cloud.releng.team returning 503s from Incoming to Consultation on the collaboration-services board.
Oct 8 2025, 1:43 PM · Patch-For-Review, collaboration-services, GitLab (CI & Job Runners)
Jelto added a project to T406733: registry.cloud.releng.team returning 503s: collaboration-services.
Oct 8 2025, 1:43 PM · Patch-For-Review, collaboration-services, GitLab (CI & Job Runners)
Jelto added a comment to T406733: registry.cloud.releng.team returning 503s.

reggie is quite a critical component of the non-production/dev CI stack. So maybe it's possible to make it a bit more resilient or HA (which is probably tricky because of the persistent volume).

Oct 8 2025, 1:43 PM · Patch-For-Review, collaboration-services, GitLab (CI & Job Runners)
Jelto added a comment to T406733: registry.cloud.releng.team returning 503s.

The reggie-01 pod was moved to another node and the pvc was still stuck on the old node. I guess that happened during automatic scale down of the cluster.

Oct 8 2025, 1:33 PM · Patch-For-Review, collaboration-services, GitLab (CI & Job Runners)

Oct 7 2025

Jelto added projects to T405742: tofu-provisioning: Failed to install provider: collaboration-services, GitLab (CI & Job Runners).
Oct 7 2025, 2:24 PM · GitLab (CI & Job Runners), collaboration-services, cloud-services-team (FY2025/26-Q1-Q2), Toolforge
Jelto closed T406234: Puppet failure on gitlab-1002.devtools.eqiad1.wikimedia.cloud as Resolved.

The puppet issue is fixed, I'll resolve the task.

Oct 7 2025, 10:18 AM · collaboration-services, GitLab, VPS-project-devtools
Jelto closed T406338: ProbeDown (phab1004) as Resolved.

The change above increased the paging delay to 15 minutes, so short traffic bursts should not page immediately. For a page to fire, Phabricator has to be slow (latency 6s) for at least 15 minutes.
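
The effect of such a delay can be illustrated with a small sketch. It is not the actual alerting rule or its configuration; the function name and the sample format are assumptions, and only the 6s latency and 15 minute figures come from the comment above.

```python
from typing import Iterable, Tuple

LATENCY_THRESHOLD_S = 6.0   # "slow" threshold quoted in the comment
PAGE_AFTER_S = 15 * 60      # paging delay quoted in the comment


def should_page(samples: Iterable[Tuple[float, float]]) -> bool:
    """Return True if latency stayed above the threshold for PAGE_AFTER_S.

    `samples` is a time-ordered iterable of (unix_timestamp, latency_seconds).
    Any sample at or below the threshold resets the timer.
    """
    slow_since = None
    for timestamp, latency in samples:
        if latency > LATENCY_THRESHOLD_S:
            if slow_since is None:
                slow_since = timestamp
            if timestamp - slow_since >= PAGE_AFTER_S:
                return True
        else:
            slow_since = None
    return False
```

With one sample per minute, a five minute burst of slow responses returns `False`, while sixteen minutes of uninterrupted slowness returns `True`.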

Oct 7 2025, 9:19 AM · collaboration-services

Oct 6 2025

Jelto added a comment to T405940: eqiad row C/D Collaboration Services host migrations.

In my opinion no downtime is needed for gitlab-runners.

Oct 6 2025, 3:47 PM · collaboration-services, SRE, DC-Ops, ops-eqiad
Jelto moved T406338: ProbeDown (phab1004) from Incoming to Work in Progress on the collaboration-services board.
Oct 6 2025, 3:27 PM · collaboration-services
Jelto reopened T406338: ProbeDown (phab1004) as "Open".

I'll bump the alert_after threshold from 5m to 15m (also to reduce alerting noise for the new 24/7 shift).

Oct 6 2025, 3:27 PM · collaboration-services
Jelto moved T406234: Puppet failure on gitlab-1002.devtools.eqiad1.wikimedia.cloud from Incoming to Work in Progress on the collaboration-services board.
Oct 6 2025, 2:04 PM · collaboration-services, GitLab, VPS-project-devtools
Jelto claimed T406234: Puppet failure on gitlab-1002.devtools.eqiad1.wikimedia.cloud.
Oct 6 2025, 2:04 PM · collaboration-services, GitLab, VPS-project-devtools
Jelto added a comment to T387833: Gerrit failover process.

I tried to extract the error from the failed cookbook execution from the cumin log:

Oct 6 2025, 1:21 PM · Patch-For-Review, collaboration-services
Jelto added a subtask for T387833: Gerrit failover process: T406482: SystemdUnitFailed (jenkins.service on contint1002).
Oct 6 2025, 12:54 PM · Patch-For-Review, collaboration-services
Jelto added a parent task for T406482: SystemdUnitFailed (jenkins.service on contint1002): T387833: Gerrit failover process.
Oct 6 2025, 12:54 PM · collaboration-services
Jelto renamed T406482: SystemdUnitFailed (jenkins.service on contint1002) from SystemdUnitFailed to SystemdUnitFailed (jenkins.service on contint1002).
Oct 6 2025, 12:54 PM · collaboration-services
Jelto added a parent task for T406338: ProbeDown (phab1004): Unknown Object (Task).
Oct 6 2025, 9:29 AM · collaboration-services
Jelto closed T406338: ProbeDown (phab1004) as Resolved.

This was a small traffic burst which automatically resolved. I'll resolve the task, we can do more investigation if that happens again.

Oct 6 2025, 9:29 AM · collaboration-services
Jelto added a comment to T403946: Split repository backups (gerrit and gitlab) into its own dedicated storage daemons.

@Jelto @Dzahn The backup seems to be progressing correctly. I will revert the test and we can try to deploy the new configuration on Monday.

Oct 6 2025, 9:23 AM · collaboration-services, GitLab, bacula, Data-Persistence-Backup
Jelto renamed T406338: ProbeDown (phab1004) from ProbeDown to ProbeDown (phab1004).
Oct 6 2025, 6:32 AM · collaboration-services

Oct 2 2025

Jelto added a project to T406234: Puppet failure on gitlab-1002.devtools.eqiad1.wikimedia.cloud: collaboration-services.

Thank you @dancy , I'll take a look!

Oct 2 2025, 2:59 PM · collaboration-services, GitLab, VPS-project-devtools
Jelto added a comment to T403946: Split repository backups (gerrit and gitlab) into its own dedicated storage daemons.

I don't have the full context here, but if the new gitlab-packages fileset (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192535) causes any trouble, this can be disabled for a few days/weeks from our side. The gitlab fileset is much more important and should be much smaller now, around 50-100GBs instead of 400GB+ like in the past.

Oct 2 2025, 2:48 PM · collaboration-services, GitLab, bacula, Data-Persistence-Backup
Jelto added a project to T403946: Split repository backups (gerrit and gitlab) into its own dedicated storage daemons: collaboration-services.
Oct 2 2025, 11:45 AM · collaboration-services, GitLab, bacula, Data-Persistence-Backup
Jelto added a comment to T397003: Configure a prometheus dead man's snitch alert.

Today the metamonitoring paged on-call SREs. When checking the metamonitoring_public_endpoint.service on alert1002 I see a python stack trace during the time of the alert:

Oct 2 2025, 6:54 AM · Observability-Alerting

Oct 1 2025

Jelto updated the task description for T405703: Update wikikube eqiad to kubernetes 1.31.
Oct 1 2025, 3:13 PM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops
Jelto added a comment to T405703: Update wikikube eqiad to kubernetes 1.31.

@BTullis @bking there is an undeployed change for flink in the dse-k8s-eqiad cluster which made it a bit tricky to deploy the new wikikube-eqiad CIDRs (to update network policies).

Oct 1 2025, 2:45 PM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops
Jelto added a comment to T405703: Update wikikube eqiad to kubernetes 1.31.

The update is blocked by T406094 currently

Oct 1 2025, 10:05 AM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops

Sep 30 2025

Jelto added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

Do we need daily full backups for objects? Assuming only a few objects change per day, can't we do incrementals? We currently don't have the space to back up 500 GB daily until we dedicate the storage for gitlab.

Sorry, I am not sure if you refer to gitlab package or object backup. The packages should be full, but I was hoping to do incrementals for objects.

Sep 30 2025, 12:01 PM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto updated the task description for T378922: Migrate gitlab storage to apus (also: backups from S3?).
Sep 30 2025, 10:32 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

All GitLab packages are migrated to APUS object storage. The additional sync from bucket to the backup directory /srv/gitlab-backup also works and performance of the APUS cluster is great. A small fix was needed (see patch above) to make sure s3cmd does not pull all packages on every run. But after fixing the trailing slashes in the folder path the sync works without generating too much traffic.
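
The patch itself is not shown here, so the following is only an assumption about the kind of normalization involved: it builds an `s3cmd sync` argument list in which both the source prefix and the destination folder end in exactly one slash. The bucket name, paths, and helper names are hypothetical.

```python
from typing import List


def ensure_trailing_slash(path: str) -> str:
    """Return `path` with exactly one trailing slash."""
    return path.rstrip("/") + "/"


def build_sync_command(source: str, destination: str) -> List[str]:
    """Build an argument list for `s3cmd sync` with normalized paths.

    The list is only constructed here, not executed.
    """
    return [
        "s3cmd",
        "sync",
        ensure_trailing_slash(source),
        ensure_trailing_slash(destination),
    ]


if __name__ == "__main__":
    # Hypothetical bucket and target folder, for illustration only.
    print(build_sync_command("s3://example-bucket/packages",
                             "/srv/gitlab-backup/packages//"))
```

Normalizing both sides in one place keeps the paths consistent between runs, which is what lets a sync tool recognize files it has already copied.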

Sep 30 2025, 10:32 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE

Sep 29 2025

Jelto updated the task description for T378922: Migrate gitlab storage to apus (also: backups from S3?).
Sep 29 2025, 2:49 PM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto updated the task description for T378922: Migrate gitlab storage to apus (also: backups from S3?).
Sep 29 2025, 2:48 PM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto updated the task description for T405703: Update wikikube eqiad to kubernetes 1.31.
Sep 29 2025, 1:48 PM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops
Jelto updated the task description for T405703: Update wikikube eqiad to kubernetes 1.31.
Sep 29 2025, 1:41 PM · Discovery-Search (2025.09.26 - 2025.10.17), Data-Platform-SRE (2025.09.26 - 2025.10.17), Patch-For-Review, collaboration-services, Kubernetes, Prod-Kubernetes, serviceops
Jelto renamed T405868: SystemdUnitFailed (s3-sync-bucket.service) from SystemdUnitFailed to SystemdUnitFailed (s3-sync-bucket.service).
Sep 29 2025, 1:16 PM · collaboration-services
Jelto closed T405868: SystemdUnitFailed (s3-sync-bucket.service) as Resolved.

This happened because of T378922, resolved after deploying a fix

Sep 29 2025, 1:15 PM · collaboration-services
Jelto updated the task description for T378922: Migrate gitlab storage to apus (also: backups from S3?).
Sep 29 2025, 12:15 PM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
Jelto added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

Sync from object storage to a local folder works with the new ceph::client::sync_local module. I tested this on GitLab replica gitlab2002:

Sep 29 2025, 8:16 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE