
gerrit: Add Envoy in Gerrit's stack
Closed, Resolved, Public

Description

Following up on

It appears that introducing Envoy in Gerrit's stack could help align our configs with the most common config behind the CDN.

We might also want to remove httpd from that stack after this is done, as it will become less relevant.

Event Timeline

Change #1258976 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: add Envoy TLS termination for the CDN path

https://gerrit.wikimedia.org/r/1258976

ABran-WMF changed the task status from Open to In Progress.Mar 23 2026, 11:26 AM
ABran-WMF triaged this task as Medium priority.
ABran-WMF moved this task from Incoming to Backlog on the collaboration-services board.

Agreed, envoy is the standard around here to terminate TLS and we do need TLS termination between ATS and the service.

Instead we should question if httpd can be removed. +1

Change #1258976 merged by Arnaudb:

[operations/puppet@production] gerrit: add Envoy TLS termination for the CDN path

https://gerrit.wikimedia.org/r/1258976

Merging the change on gerrit-spare shows that it works:

arnaudb@gerrit2002:~ $ curl https://localhost:8443 -k ; echo
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://localhost:8443/r/">here</a>.</p>
</body></html>

arnaudb@gerrit2002:~ $ curl https://localhost:8443/r/ -k ; echo
Not Found

I'll apply the change on the 2 other hosts as the puppet-agent was disabled for this.
Then, I'll move on to update the backend mapping for the spare instance.

Change #1259869 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: use Envoy on gerrit-spare

https://gerrit.wikimedia.org/r/1259869

Change #1259869 merged by Arnaudb:

[operations/puppet@production] gerrit: use Envoy on gerrit-spare

https://gerrit.wikimedia.org/r/1259869

Change #1259902 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: use Envoy on gerrit-spare

https://gerrit.wikimedia.org/r/1259902

The first attempt at mapping gerrit-spare to its Envoy endpoint failed because the following block:

params:
    - '@plugin=/usr/lib/trafficserver/modules/conf_remap.so'
    - '@pparam=proxy.config.ssl.client.CA.cert.filename=/etc/ssl/certs/ISRG_Root_X1.pem'  # TODO

remained active. I'll enable the backend again without these lines to harmonize it with the other similar backends like lists.wm.o.

Change #1259902 merged by Arnaudb:

[operations/puppet@production] gerrit: use Envoy on gerrit-spare

https://gerrit.wikimedia.org/r/1259902

gerrit-spare now uses Envoy to expose its service to the CDN.

Change #1259944 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: use Envoy on gerrit-replica

https://gerrit.wikimedia.org/r/1259944

Change #1259945 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: use Envoy on gerrit

https://gerrit.wikimedia.org/r/1259945

Change #1259944 merged by Arnaudb:

[operations/puppet@production] gerrit: use Envoy on gerrit-replica

https://gerrit.wikimedia.org/r/1259944

gerrit-spare now uses Envoy to expose its service to the CDN.

so does gerrit-replica

ABran-WMF renamed this task from gerrit: Add envoy in Gerrit's stack to gerrit: Add Envoy in Gerrit's stack.Mar 25 2026, 6:28 AM

Change #1259945 merged by Arnaudb:

[operations/puppet@production] gerrit: use Envoy on gerrit

https://gerrit.wikimedia.org/r/1259945

ABran-WMF updated the task description.

gerrit-spare now uses Envoy to expose its service to the CDN.

so does gerrit-replica

so does gerrit, with connection reuse enabled in the same migration.

Reopening since the merged issue is actively ongoing.

We had reports of 502 issues on T420865 (which I have marked as a dupe of this one).

After Gerrit was moved to Envoy I did notice Apache MPM event having a lot of keepalive connections:

gerrit2003_keepalive.png (397×816 px, 69 KB)

The reason is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1259945 (titled "use Envoy on gerrit") which introduced Envoy. Apache kept its 122 seconds timeout.

Later in the afternoon I discovered the Envoy Telemetry dashboard and wondered why connections had a maximum duration of 5 minutes, and what the panel "Destroyed connections with active requests" could mean (that is, Envoy finding that requests it is sending are not honored because the connection got terminated by the remote):

gerrit2003_envoy_connections_length.png (410×912 px, 100 KB)
gerrit2003_envoy_destroyed_connections_with_active_requests.png (426×908 px, 70 KB)

On the panel showing the request rate per code, ticking the metric local_port_443/502 shows the 502s:
https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=now-12h&to=now&timezone=browser&var-datasource=000000026&var-origin=$__all&var-origin_instance=gerrit2003:9631&var-destination=$__all&viewPanel=panel-4

gerrit2003_envoy_upstream_http_codes.png (426×908 px, 38 KB)

I am convinced the issue is Envoy having a 300-second timeout (probably the default) while Apache currently has a 122-second timeout (set when keepalive connections with ATS were attempted).

This can be fixed by lowering Envoy downstream TTL from the default 300 seconds down to 120 seconds (or lower).

Envoy configuration for Gerrit is in Puppet at hieradata/role/common/gerrit.yaml.

It has configuration bits for ATS/CDN:

profile::tlsproxy::envoy::upstream_tls: true
profile::tlsproxy::envoy::upstream_response_timeout: 120.0

We need the counterpart for downstream (Apache) which currently has a timeout of 122. Maybe:

profile::tlsproxy::envoy::downstream_idle_timeout: 110

(I don't know whether that is the proper setting, I only discovered Envoy this morning :-] )

Looking directly at the file system, to bypass any possible issues with Hiera or Puppet, I see this:

envoy:

clusters.d/00-cluster_local_port_443.yaml:connect_timeout: 1.0s
envoy.yaml:  - connect_timeout: 1.0s

listeners.d/00-tls_terminator_8443.yaml:              timeout: 120.0s

httpd (apache):

50-gerrit-wikimedia-org.conf:    # Because Gerrit's Jetty has a 30s timeout (httpd.idleTimeout = 30s):
50-gerrit-wikimedia-org.conf:    # MUST be shorter than Gerrit `httpd.idleTimeout`
50-gerrit-wikimedia-org.conf:    ProxyTimeout 25

conf-available/10-connection-reuse.conf:KeepAliveTimeout 122

jetty (gerrit):

# mod_proxy ProxyTimeout. See modules/profile/templates/gerrit/apache.erb
idleTimeout = 30 sec
connectTimeout = 5 s
readTimeout = 5 s
connectTimeout = 30 sec
idleTimeout = 3600 s

So I can confirm the number 122 for the httpd KeepAliveTimeout.

Can also confirm the number 120 for the envoy route timeout.

But we have to differentiate between request timeouts and idle timeouts, and the two settings above seem to be different things to me: the Envoy route timeout is a request timeout, not an idle timeout.

The request timeouts should be the longest closest to the user. So envoy the highest number, httpd in the middle and gerrit lowest.

The idle timeouts should be the other way around. The inner-most service should be longer and the one closest to the user the shortest.
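
The ordering rule described above can be written down as a small check. This is only a sketch: the function name `check_timeout_ordering` and the sample numbers are hypothetical, chosen for illustration, and are not the values deployed on the Gerrit hosts.

```python
from typing import List, Tuple


def check_timeout_ordering(layers: List[Tuple[str, float, float]]) -> List[str]:
    """Check a proxy chain ordered from closest-to-user to inner-most.

    Each entry is (name, request_timeout_s, idle_timeout_s).
    Returns a list of violations; an empty list means the ordering holds.
    """
    problems = []
    for (outer, outer_req, outer_idle), (inner, inner_req, inner_idle) in zip(
        layers, layers[1:]
    ):
        # Request timeouts: the layer closer to the user waits at least as long.
        if outer_req < inner_req:
            problems.append(
                f"request timeout: {outer} ({outer_req}s) < {inner} ({inner_req}s)"
            )
        # Idle timeouts: the layer closer to the user gives up no later.
        if outer_idle > inner_idle:
            problems.append(
                f"idle timeout: {outer} ({outer_idle}s) > {inner} ({inner_idle}s)"
            )
    return problems


# Hypothetical values, for illustration only.
good = [("envoy", 120.0, 60.0), ("httpd", 100.0, 90.0), ("gerrit", 80.0, 120.0)]
bad = [("envoy", 120.0, 300.0), ("httpd", 100.0, 122.0), ("gerrit", 80.0, 30.0)]

for problem in check_timeout_ordering(bad):
    print(problem)
```

With the `bad` chain, the request timeouts are fine but both idle-timeout pairs are flagged, because each outer layer would keep an idle connection longer than the layer behind it.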

Isn't the issue here that the Gerrit httpd config has idleTimeout = 30 sec, so if a Gerrit operation takes more than 30 seconds the connection gets killed while Envoy and/or Apache are happy to wait > 100 seconds for it?

So, something in Gerrit takes 31 seconds; Envoy is patient for 110 or 120 or whatever, and httpd waits 120 or 122 to reuse connections, but regardless the Gerrit idleTimeout is the bottleneck at 30?

If that hypothesis was true the issues started with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1241048/2/modules/gerrit/templates/gerrit.config.erb where idleTimeout in jetty was added and set to 30 sec.

That is much lower than the historic default (infinite) or the current default (5m).

It appears that the idleTimeout in Jetty only resets if bytes are moved either in or out, making it act as both a request timeout and an idle timeout.

from https://jetty.org/docs/jetty/12.1/operations-guide/modules/standard.html#http

The amount of time a connection can be idle (i.e. no bytes received and no bytes sent) until the server decides to close it to save resources — default 30 seconds.

@Dzahn the Jetty timeout is independent of this problem. It applies to the Apache -> mod_proxy -> Jetty chain. We had that issue several years ago, and while investigating the whole chain of timeouts I found that the errors still occurred, though rarely (T246763#11637703 and follow-ups). I think that has been solved by setting the mod_proxy timeout to a lower value than Apache (I haven't verified the outcome).

The issue here is in the Envoy -> Apache chain. We could have raised the Apache KeepAliveTimeout above Envoy's (e.g. to 320 seconds), but that would leave many more open sockets parked by Apache in keepalive mode. Instead I chose to set the Envoy timeout lower than Apache's (e.g. 110 seconds instead of the default of 300 seconds).

Change #1261932 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: tweak downstream_idle_timeout on Envoy

https://gerrit.wikimedia.org/r/1261932

Change #1261932 merged by Arnaudb:

[operations/puppet@production] gerrit: tweak downstream_idle_timeout on Envoy

https://gerrit.wikimedia.org/r/1261932

profile::tlsproxy::envoy::upstream_tls: true
profile::tlsproxy::envoy::upstream_response_timeout: 120.0

We need the counterpart for downstream (Apache) which currently has a timeout of 122. Maybe:

profile::tlsproxy::envoy::downstream_idle_timeout: 110

Done with that change (and that one), let's see if it fixes the issue.

Change #1261933 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: Envoy downstream timeout fix

https://gerrit.wikimedia.org/r/1261933

Change #1261933 merged by Arnaudb:

[operations/puppet@production] gerrit: Envoy downstream timeout fix

https://gerrit.wikimedia.org/r/1261933

Change #1261937 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: tweak envoy::idle_timeout

https://gerrit.wikimedia.org/r/1261937

Change #1261937 merged by Arnaudb:

[operations/puppet@production] gerrit: tweak envoy::idle_timeout

https://gerrit.wikimedia.org/r/1261937

@ABran-WMF I think I have mixed up downstream and upstream. Envoy terminology states:

Downstream: A downstream host connects to Envoy, sends requests, and receives responses.
Upstream: An upstream host receives connections and requests from Envoy and returns responses.

Which means that in my comment above at T420909#11756101 I mixed them up :-\

Thus ATS is downstream and Apache is upstream (whose timeout is the one that needs adjustment), and in our context:

  • downstream idle timeout with ATS needs to be increased to be above the 120-second ATS timeout
  • upstream idle timeout with Apache needs to be set to a lower value than the Apache KeepAliveTimeout (122)

- profile::tlsproxy::envoy::downstream_idle_timeout: 110
+ profile::tlsproxy::envoy::downstream_idle_timeout: 122
+ profile::tlsproxy::envoy::downstream_idle_timeout: 120
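
The two constraints above follow the same per-hop rule: the side that reuses a connection must consider it idle before the side that accepts it closes it. A minimal sketch of that rule, using the numbers quoted in this thread (120 for ATS, 122 for the Apache KeepAliveTimeout, 300 as the suspected Envoy default, 110 as the proposed value). The `Hop` class and `is_safe` function are hypothetical names for illustration, not part of Envoy or Puppet.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    client: str           # side that opens and reuses the connection
    server: str           # side that accepts the connection
    client_idle_s: float  # how long the client keeps an idle connection pooled
    server_idle_s: float  # how long the server keeps an idle connection open


def is_safe(hop: Hop) -> bool:
    # The client must drop an idle connection before the server closes it,
    # otherwise a request can be sent on a connection that is already gone.
    return hop.client_idle_s < hop.server_idle_s


hops = [
    # ATS is downstream of Envoy: Envoy's downstream idle timeout must exceed 120.
    Hop("ats", "envoy", client_idle_s=120, server_idle_s=122),
    # Apache is upstream of Envoy: with the suspected 300 s default this is unsafe.
    Hop("envoy", "apache", client_idle_s=300, server_idle_s=122),
    # Same hop with the proposed 110 s upstream idle timeout.
    Hop("envoy", "apache", client_idle_s=110, server_idle_s=122),
]

for hop in hops:
    status = "ok" if is_safe(hop) else "UNSAFE"
    print(f"{hop.client} -> {hop.server}: {status}")
```

Only the middle entry is flagged, which matches the diagnosis that Envoy was reusing connections Apache had already closed.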

Change #1261957 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: align ATS/Envoy/Apache timeouts

https://gerrit.wikimedia.org/r/1261957

Change #1261957 merged by Arnaudb:

[operations/puppet@production] gerrit: align ATS/Envoy/Apache timeouts

https://gerrit.wikimedia.org/r/1261957

Change #1262007 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: discard ttl=20 on httpd

https://gerrit.wikimedia.org/r/1262007

Change #1262007 merged by Arnaudb:

[operations/puppet@production] gerrit: discard ttl=20 on httpd

https://gerrit.wikimedia.org/r/1262007

Change #1262020 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: adjust idleTimeout on Jetty

https://gerrit.wikimedia.org/r/1262020

We tweaked several knobs on httpd and Envoy and still have the same underlying issue. I think aligning Jetty with the rest of the timers could yield better results.

Change #1266950 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: fix Envoy idle timeout handling for slow HTTPS git request

https://gerrit.wikimedia.org/r/1266950

Change #1266950 merged by Arnaudb:

[operations/puppet@production] gerrit: fix Envoy idle timeout handling for slow HTTPS git requests

https://gerrit.wikimedia.org/r/1266950