User Details
- User Since
- Jun 7 2021, 7:25 AM (257 w, 3 d)
- Availability
- Available
- IRC Nick
- jelto
- LDAP User
- Jelto
- MediaWiki User
- JWodstrcil (WMF)
Yesterday
Thank you very much for the detailed explanation!
Tue, May 12
I manually removed the design-strategy helm release in all environments (which includes the ingress mapping for /strategy). The /strategy endpoint should serve new content from the landing page now. Let me know if you need anything else.
@dduvall Blubber's BuildKit frontend uses native LLB directly since T345458: Refactor Blubber's BuildKit frontend gateway to use LLB directly. So my understanding is that dockerfile-copy is no longer used or needed by Blubber?
Yes, during the object storage migration the default deletion policy was lowered to 180 days in https://gitlab.wikimedia.org/repos/releng/gitlab-settings/-/commit/a34e01d2ca41cf3adbced35fa7e2fa20c98c384e. So "artifacts" (CI job logs) use a reasonable cleanup policy. Old CI jobs are not available anymore in GitLab for example.
Mon, May 11
current quota usage:
This issue should be addressed in T423984: Add retries, error handling and metrics to sync-gitlab-group-with-ldap
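For illustration, the retry and error handling idea named in that task title could look roughly like the sketch below. This is only a minimal example under assumptions: `TransientError`, `call_with_retries` and all parameters are hypothetical names, not taken from the sync-gitlab-group-with-ldap script, and the task itself may settle on a different approach.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, e.g. an HTTP 500 response."""


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientError with exponential backoff.

    The delay doubles after every failed attempt. The last failure is
    re-raised so the caller can still log it or alert on it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. Only errors explicitly marked as transient are retried, so permanent failures such as permission errors still fail immediately.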
The script failed because GitLab returned a 500 error:
Fri, May 8
@A_smart_kitten thank you for reporting the issue. I'll resolve the task, codesearch is back and works normally.
Thu, May 7
Wed, May 6
^ this happened again in T425476
Alert resolved after 5 minutes. The traffic was a short spike from known sources (seen before):
Thu, Apr 30
All hosts are migrated and cleanup has happened on all hosts, I'll resolve the task.
This happened because of a service restart in gerrit in T333143: Move Gerrit data out of root partition. Resolved after a few minutes.
Wed, Apr 29
All hosts are done, I'll resolve this task. Thank you @MoritzMuehlenhoff for the detailed checklist!
releases done and it's using the new discovery2026 cert:
doc hosts done and it's using the new discovery2026 cert:
resolved after 5 minutes and GitLab was responding normally for me. I linked this to T423985.
peopleweb done and it's using the new discovery2026 cert:
Phabricator done and it's using the new discovery2026 cert:
Aphlict done and it's using the new discovery2026 cert:
Etherpad done and it's using the new discovery2026 cert:
The sre.gitlab.upgrade cookbook creates a downtime now for the gitlab-backup-restore.service:
Thu, Apr 23
I'll resolve this task. Sub-tasks have been opened for the follow-up action items. The successful migration in T333143: Move Gerrit data out of root partition will prevent the root disk from filling up again with Gerrit cache files. We are also looking into improving the existing documentation and alerting (thanks @ABran-WMF).
All hosts are migrated and cleanup has happened on gerrit-replica and gerrit-spare. I'll wait until next week for the gerrit production host cleanup (removing the symlink and the temporary backup folder /srv/gerrit/var-lib-gerrit-backup).
the alert resolved after 5 minutes, I'll not investigate this
Data on the production Gerrit host gerrit2003 has also been moved to /srv/gerrit.
Runbook for the production host:
Tue, Apr 21
The replica host gerrit1003 has been migrated to /srv/gerrit.
The alert mentioned above is resolved after removing the manually created systemd unit file in /etc/systemd/system/gerrit.service on gerrit2002. All gerrit hosts use the puppet-managed unit file again.
Mon, Apr 20
I can't see the patches above and don't have the full context, but I think the user agent in https://requestctl.wikimedia.org/pattern/ua/gerrit_gitiles_legitimate_access_user_agent has to be escaped properly
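As a generic illustration of why escaping matters here: characters such as `.`, `(` and `)` are regex metacharacters, so a user agent string used unescaped as a pattern may not match itself literally. The sketch below uses Python's `re.escape` with a made-up user agent. It assumes nothing about the actual pattern value, and requestctl's own regex syntax and escaping rules may differ from Python's.

```python
import re

# Hypothetical user agent, NOT the value from the requestctl pattern.
user_agent = "ExampleBot/1.0 (compatible)"

# Unescaped: "." matches any character and "(...)" is parsed as a group,
# so the literal parentheses in the user agent are never matched.
unescaped = re.compile(user_agent)

# Escaped: every metacharacter is matched literally.
escaped = re.compile(re.escape(user_agent))

print("unescaped matches literal UA:", unescaped.search(user_agent) is not None)
print("escaped matches literal UA:  ", escaped.search(user_agent) is not None)
```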
There is an active Icinga alert for the migrated spare host:
Thu, Apr 16
I successfully migrated the spare host gerrit2002 to /srv/gerrit.
I did another migration attempt and puppet keeps re-creating the /var/lib/gerrit folder even when profile::gerrit::gerrit_site: /srv/gerrit/site_path is set. This happens because it's hardcoded in init.pp:
I've tested the migration snippet on the spare instance gerrit2002. Besides fixing a typo, there is a problem with puppet re-creating the /var/lib/gerrit folder:
Apr 14 2026
This happened because of permission errors in https://gitlab.wikimedia.org/repos/projects/wmf-navigator, related to T414405.
Apr 13 2026
This alert recovered, probably related to T423027 or previous problems. I'll resolve the task.
This has caused disk space issues again in T423027. This task is already tagged with Sustainability (Incident Followup) .
Thank you to everyone who helped troubleshoot and mitigate this issue on a weekend.
Thank you for opening this task. For additional context: paging alerts for Gerrit were also previously discussed in T365148.
This issue is tracked in T423027, I'll close this task.
Apr 10 2026
Puppet is happy on the test instance now.
I think this is related to enabling QoS for rsync in production: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1234984