User Details
- User Since
- Aug 29 2023, 8:30 AM (137 w, 4 d)
- Availability
- Available
- IRC Nick
- arnaudb
- LDAP User
- Arnaudb
- MediaWiki User
- ABran-WMF [ Global Accounts ]
Fri, Apr 17
I've also updated the "restart gerrit" part: https://wikitech.wikimedia.org/wiki/Gerrit/Operations#Restarting
I've added a bit of documentation on gerrit ssh commands: https://wikitech.wikimedia.org/wiki/Gerrit/Operations#Use_gerrit_ssh_commands
We should maybe expose a Prometheus metric for 500 errors thrown from LDAP and/or failed syncs. That error looks transient because it recovers quickly (this was also the case this morning with T423674: SystemdUnitFailed - sync-gitlab-group-with-ldap.service on gitlab1004:9100), so it should probably be caught and retried, generating an alert only if it lasts more than a few minutes or hours.
I have tried limiting max_concurrent_streams to 50; the results are still inconclusive for the connection interruptions in CI.
A 500 error was thrown and the script crashed. It was a transient error.
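The catch-and-retry idea above could look roughly like the sketch below. It is only an illustration: `TransientError`, `retry_transient`, and all parameter names and default values are hypothetical, not taken from the actual sync script. The error is re-raised (so the unit fails and alerts) only once retries would exceed the time budget.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. an HTTP 500 from LDAP)."""


def retry_transient(
    operation: Callable[[], T],
    max_duration: float = 300.0,   # assumed budget: give up after ~5 minutes
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Run operation, retrying TransientError with exponential backoff.

    The last error is re-raised once the next wait would pass the deadline,
    so a persistent failure still surfaces to the caller.
    """
    deadline = clock() + max_duration
    delay = initial_delay
    while True:
        try:
            return operation()
        except TransientError:
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The `sleep` and `clock` parameters are injectable only so the behaviour can be exercised without real waiting; a real script would likely also increment a failure counter in the `except` branch.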
Thu, Apr 16
I have tried to increase per_connection_buffer_limit_bytes first to 4MB, then to 16MB, to see whether the stream watermark messages were a symptom of the 502s. The message rate has dropped, but 502s are still happening (https://integration.wikimedia.org/ci/view/All/job/quibble-vendor-mysql-php83/lastFailedBuild/console). I'll keep on debugging.
Wed, Apr 15
quibble-with-gated-extensions-vendor-mysql-php83/27335 also failed during the clone phase with the same error. The failure happens in prepareRepo() / repo.prune() / git remote prune --dry-run origin:
Verbose logging has been enabled on Envoy on gerrit2003:
curl -i -X POST 'http://localhost:9631/logging?paths=http:debug,router:debug,connection:debug'
Tue, Apr 14
linked to T423156: setup phab1006
We might want to stop puppet first and have a merged change that will update httpd's config
Thu, Apr 9
No impact from that change over the past few hours:
root@contint1002:/srv/jenkins/builds# for i in $(fdfind --type file --changed-within 2h '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
7
I will keep on digging
Envoy now stops reusing connections to httpd on our Gerrit primary instance. I'll monitor the impact on the GnuTLS recv error rate.
A first manual test on gerrit-spare shows no breakage, I will now try to apply that change manually on the primary Gerrit instance before merging.
Transactions: 408 hits Availability: 100.00 %
Thanks @DLynch @ArthurTaylor for reporting these builds.
Wed, Apr 8
Connection reuse has been disabled again. Please let me know if the 502 errors are still happening.
Tue, Apr 7
alerts were stemming from esams maintenance.
thanks for highlighting this @taavi, the change has been merged to move these entries to state: production, feel free to reopen that issue if needed.
I forgot to add on my previous comment:
# for i in $(fdfind --type f --change-newer-than 2026-03-25 '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
114
There is a clear increase of these errors in the logs on contint, after we introduced Envoy and enabled connection-reuse.
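For repeated checks, the same count can be sketched in Python. This is a rough equivalent of the one-liner above, not the command that was actually run: `count_logs_with_error` and its parameters are hypothetical names, and unlike `rg -z` it assumes the `log` files are uncompressed.

```python
import time
from pathlib import Path


def count_logs_with_error(root, needle="GnuTLS recv error", newer_than=0.0):
    """Count files named exactly 'log' under root that contain needle.

    newer_than is a Unix timestamp; older files are skipped.
    Matching is case-insensitive, like `rg -i`.
    """
    needle_bytes = needle.lower().encode()
    count = 0
    for path in Path(root).rglob("log"):
        if not path.is_file() or path.stat().st_mtime < newer_than:
            continue
        # Whole-file read keeps the sketch short; fine for modest log sizes.
        if needle_bytes in path.read_bytes().lower():
            count += 1
    return count
```

For example, `count_logs_with_error("/srv/jenkins/builds", newer_than=time.time() - 2 * 3600)` would approximate the "changed within 2h" variant.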
Fri, Apr 3
I think that was resolved by @Dzahn
Thanks for the follow-up @SomeRandomDeveloper! Given the results of our recent tweaks on Envoy, it is a bit hard to debug further without more details.
Would it be possible to run that job with a bit more verbosity? Something like GIT_CURL_VERBOSE=1 GIT_TRACE_CURL=1 git $command -v.
Thu, Apr 2
@Dzahn closing that one because we removed these lines with T420909: gerrit: Add Envoy in Gerrit's stack
As mentioned in T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs:
We need to update permissions on wmf-navigator, marking this Stalled until then
thanks for these @SomeRandomDeveloper @DLynch, I've merged a config update. Please let me know if you still see these 502 errors.
got it: 2026-04-02 07:44:44,661 ERROR [gitlab_package_puller.GitlabPackagePuller] Skipping project repos/projects/wmf-navigator after GitLab API error while preparing package fetch: 403: 403 Forbidden
agreed @Arnoldokoth, it seems a permission on GitLab might have been changed, preventing the script from accessing a repo. The error is thrown after the Zotero line:
Wed, Apr 1
The change has been merged, please let us know if that does not fix the situation.
@ArthurTaylor @SomeRandomDeveloper is it OK for me to retry some of these jobs to test my change?
thanks for raising these, I'll check
Tue, Mar 31
follow up done in T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026 and T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs:
Envoy upstream timeout has been updated to allow longer git fetch commands, let me know if that issue is still present after that.
after merging that config change, mediawiki-core-phpmetrics (from T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026) now shows a successful build, @Jdlrobson @DLynch please let me know if your issue is still present after that change.
@dancy I've updated Envoy's configuration to increase the upstream_response_timeout to 300s instead of 120s. I've triggered a mediawiki-core-phpmetrics build and it went OK: https://integration.wikimedia.org/ci/job/mediawiki-core-phpmetrics/705/console
Envoy has been added to our stack, we need to follow up in T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs to adapt configurations
Mon, Mar 30
thanks @dancy for highlighting this, that is probably stemming from T420909: gerrit: Add Envoy in Gerrit's stack.
I think operations/puppet/+/1262020 could fix it, wdyt @hashar?
The new training flow keeps the existing VRTS export unchanged: vrts.TicketExport2Mbox.pl still produces mbox input in /var/spool/spam/{spam,ham}, and SpamAssassin is still trained with sa-learn --mbox and spamassassin --add-to-blacklist/--add-to-whitelist --mbox.
Fri, Mar 27
We tweaked several knobs on httpd and Envoy and still have the same underlying issue. I think aligning Jetty's timers with the rest could yield better results.
We moved all gerrit instances behind CDN, they all have the same DNS pattern
Thu, Mar 26
I've been able to test my change on Pontoon:
Hey @neriah, thanks for raising that issue. With {T420595} and T420909: gerrit: Add Envoy in Gerrit's stack the situation should be improved.
I'm marking this task as resolved. Please let us know if the situation has not improved, or not sufficiently, and feel free to reopen this task.