User Details
- User Since: Aug 29 2023, 8:30 AM
- Availability: Available
- IRC Nick: arnaudb
- LDAP User: Arnaudb
- MediaWiki User: ABran-WMF
Fri, Nov 28
@Jelto do you think it could be worth implementing a check of this service at the end of the upgrade job on a primary gitlab instance?
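For illustration, a minimal sketch of what such a check could look like as a last step of the upgrade job, assuming the service runs as a systemd unit; the unit name below is a placeholder, not the actual one:

#!/bin/bash
# Hypothetical post-upgrade sanity check: fail the job if the unit did not
# come back healthy. "gitlab-package-puller.service" is a placeholder name.
set -euo pipefail

unit="gitlab-package-puller.service"

if ! systemctl is-active --quiet "${unit}"; then
    echo "${unit} is not active after the upgrade" >&2
    systemctl status "${unit}" --no-pager >&2 || true
    exit 1
fi
echo "${unit} is active, post-upgrade check passed"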
Tue, Nov 25
the error from T409835: Apt-staging: add alerting has been fixed via https://gerrit.wikimedia.org/r/c/operations/alerts/+/1211001
@MoritzMuehlenhoff as @Dzahn mentioned, you might be interested in reviewing what's been done on the gitlab-package-puller script. I'd be happy to have a chat about it if needed; let me know if you see anything missing or any room for improvement
as mentioned in T409833: Apt-staging: fix logging
We now have a bit more observability via metrics and logging:
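(The actual metrics and log output are in the task. Purely as an illustration of the general technique, one common way for a script like this to export run metrics is the node_exporter textfile collector; the metric name and collector directory below are assumptions, not what the puller really emits:)

# Illustrative sketch only: write a .prom file atomically (temp file, then
# rename) into the node_exporter textfile collector directory.
textfile_dir="/var/lib/prometheus/node-exporter"
tmp="$(mktemp "${textfile_dir}/gitlab_package_puller.prom.XXXXXX")"
cat <<EOF > "${tmp}"
# HELP gitlab_package_puller_last_run_seconds Unix timestamp of the last run.
# TYPE gitlab_package_puller_last_run_seconds gauge
gitlab_package_puller_last_run_seconds $(date +%s)
EOF
mv "${tmp}" "${textfile_dir}/gitlab_package_puller.prom"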
Mon, Nov 24
Docker has released a test version that now supports nftables
Thu, Nov 20
I've added alerts for more general reprepro script failures in operations/alerts/+/1207791
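As a rough illustration of the signal such an alert can key on (the real rules are in the patch above): the node_exporter systemd collector exposes node_systemd_unit_state, so a failed unit is directly queryable. The Prometheus host and the unit-name regex here are placeholders:

# Hypothetical query, not the actual alert rule: list reprepro-related
# units currently in the "failed" state via the Prometheus HTTP API.
curl --silent --get 'http://prometheus.example.org/api/v1/query' \
    --data-urlencode 'query=node_systemd_unit_state{name=~"reprepro.*", state="failed"} == 1'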
Tue, Nov 18
basic metrics are visible in T409833#11382292:
The updated version of the log output is a bit more verbose at info level:
{P85360}
Mon, Nov 17
T409832: Apt-staging: add error handling to gitlab_package_puller could be bundled with this task
Oct 17 2025
I think that can be closed now, feel free to reopen if needed!
Oct 15 2025
related to {T365259}
Oct 14 2025
same pattern as the one I was trying to fix in https://phabricator.wikimedia.org/T406403#11253645 → https://grafana.wikimedia.org/goto/ENnpKa6NR?orgId=1
This will be fixed by {T365259}
Oct 13 2025
Given T387833#11267438, that should not be an issue anymore; I'll send a patch to re-enable backups
things are looking better now:
arnaudb@gerrit2003:git $ fd | wc -l
236218
arnaudb@gerrit2003:git $ pwd
/srv/gerrit/git
arnaudb@gerrit2003:git $ ls -l /srv/backup/
total 0
vs the previous situation:
Oct 7 2025
DRY-RUN: Executing commands ['/usr/bin/rsync -avpPz --stats --delete /var/lib/gerrit2/review_site rsync://gerrit2003.wikimedia.org/gerrit-var-lib/ --no-o --no-g --chown=gerrit:gerrit '] on 1 hosts: gerrit1003.wikimedia.org
DRY-RUN: Releasing lock for key sre.gerrit.failover with ID abe63737-d8ed-498c-af9d-f71d5fe4d64c
forced a stop/start of exim4 and wmf_auto_restart_exim4.service
Oct 6 2025
thanks for that @Jelto, I'll try to reproduce the error in a controlled environment
Oct 2 2025
good catch, indeed!
Oct 1 2025
mtail was not the source of the observability issue. I've fixed the PromQL query that renders the QoS event rates, using this uptick from the logs as a reference.
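For context, the usual shape of such a fix is recomputing the panel from the raw counter with a windowed rate(); the metric name and Prometheus host below are placeholders, not the actual fixed query:

# Illustrative only: a per-instance event rate computed as a windowed
# rate() over a raw mtail counter, via the Prometheus HTTP API.
curl --silent --get 'http://prometheus.example.org/api/v1/query' \
    --data-urlencode 'query=sum by (instance) (rate(qos_events_total[5m]))'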
Sep 30 2025
This was a side effect of {T402611}; it's been worked around for now
change applied on gerrit1003; it should be back to normal now
I think I found the issue:
Sep 29 2025
the dry run switchover output looks alright {P83480}
CI went through, thanks for the fix!
Sep 24 2025
root@lists1004:~ $ sudo systemctl stop exim4
root@lists1004:~ $ ps auxf|rg -i exim
mtail        836  0.4  0.0 3308356 27304 ?  Ssl  May26 719:22 /usr/bin/mtail --progs /etc/mtail --logtostderr --address :: --port 3903 --logs /var/log/exim4/mainlog,/var/log/mailman/smtp,/var/log/mailman/subscribe -disable_fsnotify
root     1201963  0.0  0.0    9124  5960 pts/0  S+  10:20   0:00  \_ rg -i exim
root     1193401  0.0  0.0   47848 27964 ?   S   10:14   0:00 /usr/sbin/exim4 -q
root     1193491  0.0  0.0   47848 22400 ?   S   10:14   0:00  \_ /usr/sbin/exim4 -q
Debian-+ 1193492  0.0  0.0   48004 24288 ?   S   10:14   0:00      \_ /usr/sbin/exim4 -q
root     1200450  0.8  0.0   47840 28080 ?   S   10:19   0:00 /usr/sbin/exim4 -q
root     1201885  0.0  0.0   47840 22524 ?   S   10:20   0:00  \_ /usr/sbin/exim4 -q
Debian-+ 1201886  0.0  0.0   47844 23988 ?   S   10:20   0:00      \_ /usr/sbin/exim4 -q
root     1201639  1.2  0.0   47844 28008 ?   S   10:20   0:00 /usr/sbin/exim4 -q
Debian-+ 1201964  0.0  0.0       0     0 ?   R   10:20   0:00  \_ [exim4]
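Worth noting about the output above: the surviving /usr/sbin/exim4 -q processes are queue runners that were most likely spawned outside the exim4 unit's control group (e.g. by cron or an auto-restart helper), which is why they outlive the systemctl stop. A hypothetical cleanup step, shown only for illustration:

# Reap any stray exim4 queue runners left behind after stopping the unit.
pkill -f '/usr/sbin/exim4 -q' || true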