User Details
- User Since: Jun 7 2021, 7:25 AM (234 w, 5 d)
- Availability: Available
- IRC Nick: jelto
- LDAP User: Jelto
- MediaWiki User: JWodstrcil (WMF)
Yesterday
Thanks @Arnoldokoth for enabling the cleanup job again. The inode metrics look much better now and have stabilized at around 2.5% usage. Also thanks for the new alert, which should warn us before we see "no space left on device" errors.
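As a footnote, a minimal sketch of how an inode usage figure like the 2.5% above can be computed, using Python's os.statvfs; the /srv path is an assumption for illustration, not necessarily the real mount point:

```python
import os

def inode_usage_percent(path: str) -> float:
    """Return the percentage of used inodes on the filesystem containing `path`."""
    st = os.statvfs(path)
    used = st.f_files - st.f_ffree  # total inodes minus free inodes
    return 100.0 * used / st.f_files

print(f"{inode_usage_percent('/srv'):.1f}% of inodes used")
```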
There has been an open alert, "Linting problems found for GitlabPackagePullerFailedOnRun", for 11 days:
Thu, Dec 4
I don't think a check will help. We can either
Wed, Dec 3
Inode usage already grew from 2% to 10% in the past day: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=now-2d&to=now&timezone=utc&var-node=vrts1003&viewPanel=panel-30. So I like the idea of enabling the cleanup job again.
Tue, Dec 2
This has happened again in T411452.
Mon, Dec 1
Oct 29 2025
Oct 28 2025
After discussing with @MoritzMuehlenhoff I removed the duplicate package from /incoming on apt-staging2001. We also talked about cleaning the /incoming folder regularly with a job to prevent issues like this in the future.
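A rough sketch of the kind of periodic cleanup job discussed above; the retention window is an assumption, and the real job would handle paths and errors more carefully:

```python
import os
import time

INCOMING = "/incoming"     # folder on the apt-staging host, as referenced above
MAX_AGE_S = 7 * 24 * 3600  # assumed retention of one week

now = time.time()
for entry in os.scandir(INCOMING):
    # Remove plain files that have not been touched within the retention window.
    if entry.is_file() and now - entry.stat().st_mtime > MAX_AGE_S:
        os.remove(entry.path)
```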
Oct 27 2025
Thank you! I think this can be resolved; the logo looks good to me in a fresh browser. My current session somehow still uses the old logo.
Thanks @Aklapper for submitting the change. I tweaked the colors a bit more, see my comment in the change.
Oct 23 2025
Yes, that was one of the files I was looking at. However, the wmfmariadbpy package creates some more .deb files (a single Debian source package can build several binary packages), see:
Oct 22 2025
This happened because of the upgrade in T407943.
All instances were updated successfully. I'll resolve the task.
This happened because of the reboot in T407110. It resolved automatically.
Test hosts done, I'll proceed with the replicas soon.
I'll bump the apt package to 18.3* and start upgrading the test hosts.
Oct 21 2025
Oct 20 2025
My new FIDO SSH key was added and works, and the old SSH key was removed. I'll resolve the task.
Oct 17 2025
Oct 13 2025
The gitlab-ce and gitlab-runner packages were removed from the Update section of the apt repo config, so the old buster and bullseye packages will no longer be updated. We agreed to keep the old packages around so that reprepro does not complain.
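For illustration, a hypothetical sketch of what such a change looks like in a reprepro setup (codename and rule names are assumptions, not the actual WMF config): dropping the rules from the Update: field stops `reprepro update` from importing new gitlab packages, while the already-imported ones stay in the pool.

```
# conf/distributions, before:
Codename: buster-wikimedia
Components: main
Architectures: amd64 source
Update: main gitlab-ce gitlab-runner

# conf/distributions, after (gitlab rules removed, existing packages kept):
Codename: buster-wikimedia
Components: main
Architectures: amd64 source
Update: main
```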
Oct 9 2025
Great, then I'll resolve this task. I opened T406824: Evaluate generic backup tooling for object storage buckets as a follow-up to track the object storage backup work in a more generic way.
Another Gerrit maintenance is scheduled for today: T387833. So this task is blocked again.
Related to the upgrade in T406801.
Oct 8 2025
Deployment-wise everything is done; GitLab artifacts and packages now use the object storage backend.
reggie is quite a critical component of the non-production/dev CI stack, so maybe it's possible to make it a bit more resilient or HA (which is probably tricky because of the persistent volume, since such a volume can usually only be attached to one node at a time).
The reggie-01 pod was moved to another node while its PVC was still stuck on the old node. I guess that happened during an automatic scale-down of the cluster.
Oct 7 2025
The puppet issue is fixed, I'll resolve the task.
The change above increased the paging delay to 15 minutes, so short traffic bursts should no longer page immediately. To trigger a page, Phabricator has to be slow (6s latency) for at least 15 minutes.
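A minimal sketch (hypothetical, not the actual alerting rule) of what the 15-minute paging delay means: the latency condition has to hold for the entire trailing window before a page fires.

```python
from collections import deque

THRESHOLD_S = 6.0  # Phabricator counts as "slow" above 6s latency
WINDOW_MIN = 15    # paging delay: the condition must hold for 15 minutes

def should_page(latency_samples_s, samples_per_minute=1):
    """Return True once every sample in a trailing 15-minute window is slow."""
    window = deque(maxlen=WINDOW_MIN * samples_per_minute)
    for sample in latency_samples_s:
        window.append(sample)
        if len(window) == window.maxlen and all(s > THRESHOLD_S for s in window):
            return True
    return False

# A 10-minute burst alone never fills the window; a sustained slowdown does:
print(should_page([7.0] * 10 + [0.5] * 20))  # False
print(should_page([7.0] * 15))               # True
```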
Oct 6 2025
In my opinion no downtime is needed for gitlab-runners.
I'll bump the alert_after threshold from 5m to 15m (also to reduce alerting noise for the new 24/7 shift).
I tried to extract the error of the failed cookbook execution from the cumin log:
This was a small traffic burst which resolved automatically. I'll resolve the task; we can investigate further if it happens again.
Oct 2 2025
Thank you @dancy, I'll take a look!
I don't have the full context here, but if the new gitlab-packages fileset (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192535) causes any trouble, it can be disabled for a few days/weeks from our side. The gitlab fileset is much more important and should be much smaller now, around 50-100 GB instead of 400+ GB as in the past.
Today the metamonitoring paged on-call SREs. When checking metamonitoring_public_endpoint.service on alert1002, I see a Python stack trace from the time of the alert:
Oct 1 2025
The update is currently blocked by T406094.
Sep 30 2025
All GitLab packages have been migrated to APUS object storage. The additional sync from the bucket to the backup directory /srv/gitlab-backup also works, and performance of the APUS cluster is great. A small fix was needed (see patch above) to make sure s3cmd does not pull all packages on every run. After fixing the trailing slashes in the folder path, the sync works without generating too much traffic.
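For reference, a minimal sketch of the sync invocation described above (bucket and target directory names are assumptions, only /srv/gitlab-backup is from the task); the trailing slashes on both paths are what keep the source and destination names aligned so only new objects are transferred:

```python
import subprocess

SRC = "s3://gitlab-packages/"          # hypothetical bucket name
DST = "/srv/gitlab-backup/packages/"   # hypothetical target under the backup dir

# s3cmd sync compares the remote and local listings and only fetches what is
# missing; with mismatched (slash-less) paths it re-downloaded everything.
subprocess.run(["s3cmd", "sync", SRC, DST], check=True)
```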
Sep 29 2025
This happened because of T378922 and was resolved after deploying a fix.
Sync from object storage to a local folder works with the new ceph::client::sync_local module. I tested this on GitLab replica gitlab2002: