I've recently been pinged by @Jdlrobson and @dancy on 2 different CI jobs that were failing because of timeouts being reached.
T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026 has been created for that matter.
I think https://gerrit.wikimedia.org/r/1262020 could help with that, but there are probably other timers to adapt.
Event Timeline
Change #1262020 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: adjust idleTimeout on Jetty
Change #1265322 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: update Envoy upstream response timeout
Change #1265322 merged by Arnaudb:
[operations/puppet@production] gerrit: update Envoy upstream response timeout
After merging that config change, mediawiki-core-phpmetrics (from T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026) now shows a successful build. @Jdlrobson @DLynch please let me know if your issue is still present after that change.
Saw this issue again this morning: https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/24578/console
Also seen a couple hours ago in https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/24509/console, https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/24504/console and https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/24538/console
And this - different but maybe related: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php83-selenium/38340/console
Change #1266181 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: update upstream_response_timeout for Envoy
@ArthurTaylor @SomeRandomDeveloper is it OK for me to retry some of these jobs to test my change?
@ABran-WMF sure, but I think my patches have now been merged (by retrying a bunch of times)
The jobs I linked are also associated with patches that are already merged. For me at least, the errors don't seem to occur consistently (the jobs usually pass after rerunning them once), so retrying probably wouldn't provide any information on whether the issue is fixed or not
That is Envoy cutting the connection because the response takes too long to start. That blame URL probably takes more than 2 or 5 minutes to respond. This is probably due to upstream_response_timeout, which used to be set at 120 and got bumped to 300 (Envoy's default is 0: never time out while waiting for the backend).
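For what it's worth, that timeout firing should be visible in Envoy's admin stats. A minimal check, assuming the admin interface listens on localhost:9631 (the port used later in this task) and that the standard per-cluster stats are exposed:
# Count requests Envoy gave up on while waiting for the backend to start responding;
# cluster.<name>.upstream_rq_timeout is incremented each time the response timeout fires.
curl -s http://localhost:9631/stats | grep 'upstream_rq_timeout'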
Change #1266181 merged by Arnaudb:
[operations/puppet@production] gerrit: update upstream_response_timeout for Envoy
Change #1266950 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: fix Envoy idle timeout handling for slow HTTPS git request
Change #1266950 merged by Arnaudb:
[operations/puppet@production] gerrit: fix Envoy idle timeout handling for slow HTTPS git requests
thanks for these @SomeRandomDeveloper @DLynch, I've merged a config update. Please let me know if you still see these 502 errors.
Change #1266962 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: update upstream_idle_timeout
Change #1266962 merged by Arnaudb:
[operations/puppet@production] gerrit: update upstream_idle_timeout
Thanks, unfortunately the errors are still occurring: https://integration.wikimedia.org/ci/job/quibble-composer-mysql-php81-selenium/25520/console
Thanks for the follow-up @SomeRandomDeveloper! Given the results of our recent tweaks on Envoy, it is a bit hard to debug further without more details.
Would it be possible to run that job with a bit more verbosity? Something like GIT_CURL_VERBOSE=1 GIT_TRACE_CURL=1 git $command -v.
Even if the job does not always fail, it could help if we managed to catch a more detailed output. It is not clear in the build log if the connection is cut before headers are sent or while transferring data.
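A minimal sketch of what such an invocation could look like, using git's standard tracing environment variables (the repository URL is just one of the repos already seen failing in this task):
# Clone with curl and pkt-line tracing enabled: GIT_TRACE_CURL logs the full HTTP exchange
# (so we can see whether the connection dies before or after the response headers arrive),
# and GIT_TRACE_PACKET logs the ref advertisement and flush packets.
GIT_CURL_VERBOSE=1 GIT_TRACE_CURL=1 GIT_TRACE_PACKET=1 \
  git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/CentralAuth /tmp/CentralAuth 2> /tmp/git-trace.log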
17:10:06 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CentralAuth/': GnuTLS recv error (-54): Error in the pull function.'
To be clear, this is basically just speculation based on the limited information I can find, but I am wondering whether this type of error might be unrelated to the addition of Envoy to Gerrit's stack.
Searching Gerrit for the text GnuTLS recv error (-54) (to try and find instances where this CI error has been pasted into a Gerrit comment), it seems there may have been a single early occurrence of this issue on 2026-02-18, with occurrences becoming regular from 2026-03-16 onwards. IIUC from T420909#11752446, Envoy was added to Gerrit's stack on 2026-03-26.
(side-note: it'd be really nice if there was a way to search within previous CI logs!)
The git commands are run from the zuul-cloner docker image, which is built from integration/zuul.git (our very dated, decade-plus-old fork of upstream).
It's possible to make a verbose patch there, land it, build a new docker image and hope it works (the first new build in five years), switch all of CI to this new image in verbose mode (and hope it's not too slow), and then wait and hope it runs into this issue so we get more debugging output, but it is very tedious.
Perhaps @hashar might have some ideas on how to debug more easily?
Change #1262020 abandoned by Hashar:
[operations/puppet@production] gerrit: adjust idleTimeout on Jetty
Reason:
This made less sense without also adjusting the timeout on the Apache side. Anyway, it is no longer needed after Ia24797736dc66e1ad0176f96932639e38ea67141, which aligns the Apache mod_proxy idle timeout with the idle timeout from Jetty (unchanged).
It is not an issue with the software (git) being used on the client side (Jenkins) but with the infrastructure. Connections get aborted based on idle timeout, maximum duration or timeout to first response byte and there are four systems in the chain: ATS, Envoy, Apache, Gerrit.
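One way to narrow down which hop is cutting the connection, without touching CI, is to time requests from the outside. A rough sketch, assuming a plain smart-HTTP request through the public path (ATS -> Envoy -> Apache -> Gerrit) is representative enough:
# Time to first response byte vs. total duration for a Gerrit smart-HTTP request;
# a failure near ~120s/300s would point at a response timeout, while a failure on an
# idle connection would point at one of the idle timeouts in the chain.
curl -so /dev/null -w 'http=%{http_code} ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  'https://gerrit.wikimedia.org/r/mediawiki/core/info/refs?service=git-upload-pack'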
Change #1262020 restored by Arnaudb:
[operations/puppet@production] gerrit: adjust idleTimeout on Jetty
We're wondering if it is possible to have a more verbose output from git, do you know if that is doable with our current stack @hashar?
I've tried to dig in that direction as well, and I found that the failure rate for quibble-composer-mysql-php81-selenium has not significantly increased since we moved Gerrit behind Envoy, even behind the CDN. The same goes for quibble-with-gated-extensions-vendor-mysql-php83. I have not checked the other failing jobs added to this ticket, but that does not necessarily mean the problem is not new. So I've investigated a little bit more in the logs. We added Envoy to Gerrit's stack on 2026-03-26:
root@contint1002:/srv/jenkins/builds# fdfind --type f --change-older-than 2026-03-25 '^log$'|wc -l
14803
root@contint1002:/srv/jenkins/builds# fdfind --type f --change-newer-than 2026-03-25 '^log$'|wc -l
49924
In the 14803 files changed before 2026-03-25 there is a single instance of GnuTLS recv error:
for i in $(fdfind --type f --change-older-than 2026-03-25 '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done
mwext-phan-php83/37013/log
root@contint1002:/srv/jenkins/builds# rg "GnuTLS recv error" mwext-phan-php83/37013/log
212: stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CentralAuth/': GnuTLS recv error (-54): Error in the pull function.'
root@contint1002:/srv/jenkins/builds# ls -l mwext-phan-php83/37013/log
-rw-rw-r-- 1 jenkins jenkins 23754 Mar 24 12:32 mwext-phan-php83/37013/log
I'll follow up with a suggestion for a config change, or another lead if that one does not pan out.
Change #1268557 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: disable connection re-use
I forgot to add on my previous comment:
# for i in $(fdfind --type f --change-newer-than 2026-03-25 '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
114
There is a clear increase of these errors in the logs on contint after we introduced Envoy and enabled connection reuse.
That looks like a plausible explanation for this issue. We tried to align the timers, but some connections are still cut too early or reused after being cut. The recent change should confirm or rule out that idea by simply reproducing the previous situation with no connection reuse. Given Envoy was introduced to address the fact that connection reuse was overwhelming httpd, we might want to take httpd or Envoy out of the chain to simplify debugging, if that change confirms connection reuse is the source of these errors.
Change #1268557 merged by Arnaudb:
[operations/puppet@production] gerrit: disable connection re-use
Connection reuse has been disabled again. Please let me know if the 502 errors are still happening.
Change #1268932 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: shorten Envoy upstream idle timeout to 100s
Thanks for the report @ArthurTaylor! It seems connection reuse at ATS level is not the source of our problem.
I'll try to lower Envoy's upstream idle timeout from 120s to 100s. Right now Envoy keeps upstream idle connections for 120s, while httpd keeps them for 125s. That only leaves a 5s margin between the proxy and the backend on that hop. If the remaining GnuTLS recv error (-54) failures are caused by Envoy reusing an httpd connection too close to httpd's keepalive limit, this should reduce or eliminate that race. If that does not pan out I'll try to disable connection reuse on Envoy with profile::tlsproxy::envoy::max_requests: 1, forcing Envoy to recycle connections after a single use.
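For reference, a quick way to double-check the effective values on both sides of that hop (the admin port is the one used elsewhere in this task; the Apache config path is an assumption based on a standard Debian layout):
# Idle timeout Envoy applies to its upstream (httpd) connections, from the live config dump.
curl -s http://localhost:9631/config_dump | grep -n 'idle_timeout'
# Keep-alive window httpd grants before closing an idle connection.
grep -ri 'KeepAliveTimeout' /etc/apache2/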
Change #1268932 merged by Arnaudb:
[operations/puppet@production] gerrit: shorten Envoy upstream idle timeout to 100s
Definitely still happening -- just saw it on https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-selenium-php83/25027/console.
Thanks @DLynch @ArthurTaylor for reporting these builds.
The error rate seems steady:
# CI builds with recv error for the last 24h
root@contint1002:/srv/jenkins/builds# for i in $(fdfind --type file --changed-within 1d '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
43
# CI builds with recv error since idle timeout has been reduced to 100
root@contint1002:/srv/jenkins/builds# for i in $(fdfind --type file --changed-within 15h '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
19
I'll test the change manually first, to prevent any breakage that would need a revert to be fixed. If that does not break anything, I'll leave connection reuse disabled for a while and monitor the resulting error rate.
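As a follow-up on the counts above, a per-job breakdown might also help spot whether specific jobs are over-represented; a hypothetical variant of the same loop:
# Count GnuTLS recv errors per job directory since Envoy was introduced.
for i in $(fdfind --type f --change-newer-than 2026-03-25 '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done | cut -d/ -f1 | sort | uniq -c | sort -rn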
Change #1269364 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: disable connection reuse on Envoy
A first manual test on gerrit-spare shows no breakage, I will now try to apply that change manually on the primary Gerrit instance before merging.
Transactions: 408 hits
Availability: 100.00 %
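(For context: that output format matches a siege report; a purely illustrative invocation, with a hypothetical host name and URL, would be something like the following.)
# Hypothetical smoke test: 10 concurrent clients for one minute against the spare instance.
siege -c 10 -t 1M 'https://gerrit-spare.example.org/r/mediawiki/core/info/refs?service=git-upload-pack'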
Change #1269364 merged by Arnaudb:
[operations/puppet@production] gerrit: disable connection reuse on Envoy
Envoy now stops reusing connections to httpd on our Gerrit primary instance. I'll monitor the impact on the GnuTLS recv error rate.
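A quick sanity check from Envoy's side that reuse is really off (same admin port as above): with reuse disabled, each upstream connection should serve a single request, so the two counters below should grow at roughly the same rate.
# Compare upstream connections opened vs. upstream requests issued, per cluster.
curl -s http://localhost:9631/stats | grep -E 'upstream_cx_total|upstream_rq_total'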
No impact from that change over the past few hours:
root@contint1002:/srv/jenkins/builds# for i in $(fdfind --type file --changed-within 2h '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
7
I will keep on digging.
Change #1269479 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: disable connection reuse on the httpd → jetty layer
Change #1270951 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] gerrit: access logging with Envoy
Change #1270951 abandoned by Arnaudb:
[operations/puppet@production] gerrit: access logging with Envoy
Verbose logging has been enabled on Envoy on gerrit2003:
curl -i -X POST 'http://localhost:9631/logging?paths=http:debug,router:debug,connection:debug'
The revert curl command is:
curl -i -X POST 'http://localhost:9631/logging?level=info'
Meanwhile I'm monitoring CI jobs throwing GnuTLS recv error on contint1002 to catch the event on Envoy as soon as it happens.
[edit] I managed to get an instance of the error with https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/27310/console but that might not be enough; I'm still monitoring the builds for errors.
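For the record, that monitoring can be a simple polling loop over the freshest build logs on contint1002, a sketch using the same fdfind/rg combination as earlier in this task:
# Every 5 minutes, print newly written build logs containing the GnuTLS error, with a timestamp.
while sleep 300 ; do
  for i in $(fdfind --type f --changed-within 10min '^log$') ; do
    rg "GnuTLS recv error" -zli "${i}" -q && echo "$(date -Is) ${i}"
  done
done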
quibble-with-gated-extensions-vendor-mysql-php83/27335 also failed during the clone phase with the same error. The failure happens in prepareRepo() / repo.prune() / git remote prune --dry-run origin:
What git sees:
- curl 56 GnuTLS recv error (-54)
- 3 bytes of length header were received
- fatal: expected flush after ref listing, which suggests a truncated Git response.
What Envoy sees on the same time window:
- 35/36 complete cleanly (upstream headers complete -> Codec completed encoding stream -> remote close -> SSL shutdown: rc=1).
- The outlier:
- timestamp: 2026-04-15T10:21:33Z
- ConnectionId=10512921
- StreamId=8694626977253921195
- x-request-id=982a33b7-afd8-4661-8c57-215e9263348d
- POST /r/mediawiki/extensions/CommunityConfiguration/git-upload-pack
- For that part, Envoy logs show:
- request body fully received
- upstream response headers 200
- then onAboveWriteBufferHighWatermark
- then Disabling upstream stream due to downstream stream watermark
But no reset whatsoever.
At the next layer, on httpd:
- Access logs contain the same x-request-id=982a33b7-afd8-4661-8c57-215e9263348d with:
- HTTP 200
- content-type=application/x-git-upload-pack-result
- http.response.bytes=3338696
- From httpd's point of view, this looks like a successful response.
- This does not necessarily mean that the client received the full response
The httpd log suggests the response was generated and sent. The Envoy log suggests that the same response then hit downstream write-buffer pressure and does not show a normal completion. So, somewhere in the Apache -> Envoy -> downstream path, something goes wrong, but there is no "active reset" of the connection. I'll investigate what Envoy shows with watermark/backpressure on responses.
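A place to start could be Envoy's flow-control counters on the admin interface; the stat names below are the standard ones, while the exact listener/cluster prefixes in the output will depend on our config:
# How often Envoy paused reading because a buffer hit its high watermark,
# whether it resumed afterwards, and whether upstream streams got backed up.
curl -s http://localhost:9631/stats | grep -E 'flow_control_(paused|resumed|backed_up)'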
I have tried to increase per_connection_buffer_limit_bytes, first to 4MB, then to 16MB, to see if the stream watermark messages were a symptom of the 502s. The message rate has gone down, but 502s are still happening (https://integration.wikimedia.org/ci/view/All/job/quibble-vendor-mysql-php83/lastFailedBuild/console). I'll keep on debugging.
I have tried to limit max_concurrent_streams to 50; still inconclusive for the connection interruptions in CI.