ATS causes git fetches from Gerrit to fail with 502 responses
Closed, Resolved (Public)

Description

@Paladox reported 502 errors from Gerrit on IRC, in T414719#11590919 and again in T414719#11612925.

We should investigate this and try to reduce the user-facing errors.

Relevant logstash query: https://logstash.wikimedia.org/goto/78f43ba4e27abb0fa03ad86fcae79dfc

Previously, Gerrit HTTP 502 errors in CI from 2020-2022: T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos

Most of those 5xx are with the user-agent git and POST requests to /git-upload-pack: https://logstash.wikimedia.org/goto/0e0744620687562084fd4c5837b37c96

Real browsers generated a handful of 5xx: https://logstash.wikimedia.org/goto/de750d16aa64457242d13be5c5aee470

I reproduced it from integration-agent-docker-1047.integration.eqiad1.wikimedia.cloud by doing:

  • a clone of mediawiki/extensions.git
  • update the 900+ git submodules over 12 parallel processes
git clone --depth=1 https://gerrit.wikimedia.org/r/mediawiki/extensions.git
git submodule update --init --depth=3 --jobs=12

And eventually some of them fail with:

error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502
fatal: expected flush after ref listing
fatal: clone of 'https://gerrit.wikimedia.org/r/mediawiki/extensions/AddHTMLMetaAndTitle' into submodule path 'AddHTMLMetaAndTitle' failed

:)

hashar triaged this task as Unbreak Now! priority. Feb 16 2026, 11:12 AM

The Jenkins host is connecting via gerrit-lb.eqiad.wikimedia.org.

20260216.11h36m57s CONNECT: attempt fail [BAD_INCOMING_RESPONSE] to 208.80.153.116:443 for host='gerrit.wikimedia.org' connection_result=Input/output error [5] error=Unknown error 19999 [-19999] attempts=0 url='https://gerrit.discovery.wmnet/r/mediawiki/extensions/AtMentions/info/refs?service=git-upload-pack'

Reducing the parallelization of

git submodule update --init --depth=3 --jobs=12

to --jobs=2 significantly reduces the 502 rate.

From running git submodule update with GIT_CURL_VERBOSE=1 I got one with the curl headers:

11:57:06.636868 http.c:664              == Info: Couldn't find host gerrit.wikimedia.org in the .netrc file; using defaults
11:57:06.636897 http.c:664              == Info: Found bundle for host gerrit.wikimedia.org: 0x56473e985ca0 [can multiplex]
11:57:06.636917 http.c:664              == Info: Re-using existing connection! (#0) with host gerrit.wikimedia.org
11:57:06.636929 http.c:664              == Info: Connected to gerrit.wikimedia.org (208.80.154.225) port 443 (#0)
11:57:06.636974 http.c:664              == Info: Using Stream ID: 5 (easy handle 0x56473e97f290)
11:57:06.637095 http.c:611              => Send header, 0000000297 bytes (0x00000129)
11:57:06.637139 http.c:623              => Send header: POST /r/mediawiki/extensions/QuizGame/git-upload-pack HTTP/2
11:57:06.637146 http.c:623              => Send header: Host: gerrit.wikimedia.org
11:57:06.637151 http.c:623              => Send header: user-agent: git/2.34.1
11:57:06.637156 http.c:623              => Send header: accept-encoding: deflate, gzip, br
11:57:06.637161 http.c:623              => Send header: content-type: application/x-git-upload-pack-request
11:57:06.637166 http.c:623              => Send header: accept: application/x-git-upload-pack-result
11:57:06.637172 http.c:623              => Send header: git-protocol: version=2
11:57:06.637178 http.c:623              => Send header: content-length: 222
11:57:06.637183 http.c:623              => Send header:
11:57:06.637229 http.c:664              == Info: We are completely uploaded and fine
11:57:06.670058 http.c:611              <= Recv header, 0000000013 bytes (0x0000000d)
11:57:06.670070 http.c:623              <= Recv header: HTTP/2 502
11:57:06.670075 http.c:611              <= Recv header, 0000000037 bytes (0x00000025)
11:57:06.670079 http.c:623              <= Recv header: date: Mon, 16 Feb 2026 11:57:06 GMT
11:57:06.670082 http.c:611              <= Recv header, 0000000020 bytes (0x00000014)
11:57:06.670084 http.c:623              <= Recv header: server: ATS/9.2.11
11:57:06.670087 http.c:611              <= Recv header, 0000000025 bytes (0x00000019)
11:57:06.670089 http.c:623              <= Recv header: cache-control: no-store
11:57:06.670092 http.c:611              <= Recv header, 0000000025 bytes (0x00000019)
11:57:06.670094 http.c:623              <= Recv header: content-type: text/html
11:57:06.670097 http.c:611              <= Recv header, 0000000022 bytes (0x00000016)
11:57:06.670100 http.c:623              <= Recv header: content-language: en
11:57:06.670128 http.c:611              <= Recv header, 0000000024 bytes (0x00000018)
11:57:06.670130 http.c:623              <= Recv header: content-encoding: gzip
11:57:06.670133 http.c:611              <= Recv header, 0000000008 bytes (0x00000008)
11:57:06.670135 http.c:623              <= Recv header: age: 0
11:57:06.670138 http.c:611              <= Recv header, 0000000023 bytes (0x00000017)
11:57:06.670140 http.c:623              <= Recv header: vary: Accept-Encoding
11:57:06.670143 http.c:611              <= Recv header, 0000000035 bytes (0x00000023)
11:57:06.670145 http.c:623              <= Recv header: x-cache: cp1110 miss, cp1110 pass
11:57:06.670148 http.c:611              <= Recv header, 0000000022 bytes (0x00000016)
11:57:06.670150 http.c:623              <= Recv header: x-cache-status: pass
11:57:06.670153 http.c:611              <= Recv header, 0000000054 bytes (0x00000036)
11:57:06.670155 http.c:623              <= Recv header: server-timing: cache;desc="pass", host;desc="cp1110"
11:57:06.670158 http.c:611              <= Recv header, 0000000074 bytes (0x0000004a)
11:57:06.670161 http.c:623              <= Recv header: strict-transport-security: max-age=106384710; includeSubDomains; preload
11:57:06.670164 http.c:611              <= Recv header, 0000000216 bytes (0x000000d8)
11:57:06.670168 http.c:623              <= Recv header: report-to: { "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
11:57:06.670171 http.c:611              <= Recv header, 0000000101 bytes (0x00000065)
11:57:06.670173 http.c:623              <= Recv header: nel: { "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}
11:57:06.670177 http.c:611              <= Recv header, 0000000026 bytes (0x0000001a)
11:57:06.670179 http.c:623              <= Recv header: x-client-ip: 185.15.56.1
11:57:06.670181 http.c:611              <= Recv header, 0000000190 bytes (0x000000be)
11:57:06.670192 http.c:623              <= Recv header: set-cookie: WMF-Uniq=XST-2JxMvU_3G6WPx4gisgMJAAAAAFvdgkielnBq2tGNmF2yauLULxX-OtfGdEKQ;Domain=gerrit.wikimedia.org;Path=/;HttpOnly;secure;SameSite=None;Expires=Tue, 16 Feb 2027 00:00:00 GMT
11:57:06.670196 http.c:611              <= Recv header, 0000000052 bytes (0x00000034)
11:57:06.670199 http.c:623              <= Recv header: x-request-id: 79632a2c-bd42-4768-bc29-4e07e45032d6
11:57:06.670202 http.c:611              <= Recv header, 0000000002 bytes (0x00000002)
11:57:06.670210 http.c:623              <= Recv header:
11:57:06.670288 http.c:664              == Info: Connection #0 to host gerrit.wikimedia.org left intact
error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502
fatal: error reading section header 'shallow-info'
fatal: clone of 'https://gerrit.wikimedia.org/r/mediawiki/extensions/QuizGame' into submodule path '/home/hashar/extensions/QuizGame' failed
Failed to clone 'QuizGame'. Retry scheduled

I dug into the failed QuizGame request from T417536#11619984. Git emitted:

11:57:06.431296 http.c:623              => Send header: GET /r/mediawiki/extensions/QuizGame/info/refs?service=git-upload-pack HTTP/2
11:57:06.594999 http.c:623              => Send header: POST /r/mediawiki/extensions/QuizGame/git-upload-pack HTTP/2
11:57:06.637139 http.c:623              => Send header: POST /r/mediawiki/extensions/QuizGame/git-upload-pack HTTP/2

So that is one GET followed by two POSTs.
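A minimal sketch of how the request sequence can be pulled out of a GIT_CURL_VERBOSE=1 trace, assuming the trace has been saved as text. The function name and the regular expression are illustrative and not part of any existing tool; the sample lines are the three shown above.

```python
import re

# Matches the "=> Send header: METHOD path HTTP/x" request lines of the trace.
REQUEST_LINE = re.compile(r"=> Send header: (GET|POST) (\S+) HTTP/")


def requests_in_trace(trace):
    """Return (method, path) for each request line, in the order sent."""
    return [(m.group(1), m.group(2)) for m in REQUEST_LINE.finditer(trace)]


trace = """\
11:57:06.431296 http.c:623              => Send header: GET /r/mediawiki/extensions/QuizGame/info/refs?service=git-upload-pack HTTP/2
11:57:06.594999 http.c:623              => Send header: POST /r/mediawiki/extensions/QuizGame/git-upload-pack HTTP/2
11:57:06.637139 http.c:623              => Send header: POST /r/mediawiki/extensions/QuizGame/git-upload-pack HTTP/2
"""

methods = [method for method, _path in requests_in_trace(trace)]
print(methods)
```

Comparing this list with the requests logged by Apache shows which request never reached the backend.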

In the Gerrit Apache log ( https://logstash.wikimedia.org/app/dashboards#/view/825c5c80-8aef-11eb-8ab2-63c7f3b019fc ) I see:

GET /r/mediawiki/extensions/QuizGame/info/refs?service=git-upload-pack
"rsyslog.timereported":"2026-02-16T11:57:06.574811+00:00"
POST /r/mediawiki/extensions/QuizGame/git-upload-pack
"rsyslog.timereported":"2026-02-16T11:57:06.612005+00:00"

The second POST does not appear in the Apache logs, which suggests ATS/Varnish rejected it after failing to reach Gerrit.

Also from the curl trace:

11:57:06.637229 http.c:664              == Info: We are completely uploaded and fine
11:57:06.670058 http.c:611              <= Recv header, 0000000013 bytes (0x0000000d)

So the request is rejected after about 33 ms, if I read this correctly.
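The delay can be checked from the two trace timestamps. This is only the arithmetic, using the "completely uploaded" and first "Recv header" times quoted above:

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps copied from the curl trace above.
sent = datetime.strptime("11:57:06.637229", FMT)
received = datetime.strptime("11:57:06.670058", FMT)

elapsed_ms = (received - sent).total_seconds() * 1000
print(f"{elapsed_ms:.1f} ms")  # prints "32.8 ms"
```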

@Vgutierrez could it be that ATS/Varnish has a limit on the number of parallel connections it makes to a single backend, and that we reach it from time to time? Though I imagine you would have seen such errors before ;)

Just noting I've seen this issue triggered 4 times on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/1239629 already, preventing the merge 3 times.

I can not reproduce the issue when I bypass the ATS/Varnish cache by cloning directly from the Gerrit host.

I used a WMCS instance to clone mediawiki/extensions, which has 942 git submodules, and then fetched them in parallel across 16 processes. (I used integration-agent-docker-1047.integration.eqiad1.wikimedia.cloud with IP 172.17.3.56, but any instance would do.)

git clone https://gerrit.wikimedia.org/r/mediawiki/extensions.git --depth=1
git -C extensions -c http.sslVerify=false -c url."https://gerrit.discovery.wmnet/r/".insteadOf="https://gerrit.wikimedia.org/r/" submodule update --depth=1 --init --jobs=16
Submodule  ... registered for path ...
Submodule  ... registered for path ...
...
Cloning into ...
Cloning into ...
...
Submodule path ...: checked out ...
<Control + C later to bypass the remaining of the checkout phase>

I have repeated it multiple times and never triggered the issue.

If I clone through ATS/Varnish, it triggers on some repository every time I have tried:

git clone https://gerrit.wikimedia.org/r/mediawiki/extensions.git --depth=1
git -C extensions submodule update --depth=1 --init --jobs=16

Cloning into '/home/hashar/extensions/HostStats'...
error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502
fatal: error reading section header 'shallow-info'
fatal: clone of 'https://gerrit.wikimedia.org/r/mediawiki/extensions/HostStats' into submodule path '/home/hashar/extensions/HostStats' failed
Failed to clone 'HostStats'. Retry scheduled

Cloning into '/home/hashar/extensions/SubpageFun'...
error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502
fatal: expected flush after ref listing
fatal: clone of 'https://gerrit.wikimedia.org/r/mediawiki/extensions/SubpageFun' into submodule path '/home/hashar/extensions/SubpageFun' failed
Failed to clone 'SubpageFun'. Retry scheduled

The 502s started happening last week after Gerrit was moved behind ATS/Varnish. They seem to be happening more often as a result of switching Gerrit from eqiad to codfw: the added latency between the datacenters might cause the underlying issue to occur more frequently.

Given this is causing some havoc on the CI jobs, maybe we can consider rolling back the switch behind CDN (T411895) and move gerrit.wikimedia.org back to directly connect to the services IP?

The service is now on gerrit2003 and the service IPs would be 2620:0:860:4:208:80:153:116 and 208.80.153.116.

After having timed a retry when nothing else was pulling from gerrit, it seems that the failure is consistent in CheckUser and so it's blocking all merges in that extension

I can confirm there were occasional 502 since the switch behind the CDN last week in T411895.

I'd like to avoid switching all of that traffic back just because a fraction of that traffic is hitting some edge case errors. Also this makes further troubleshooting harder.

Is it possible to change the Zuul config and use fewer parallel jobs? In our meeting we found that less aggressive settings do not produce 502s. Maybe it's also possible to properly retry if one git operation fails?
For troubleshooting, it would still be possible to change the /etc/hosts entry to the old address.

I agree it is potentially rare, though I haven't looked at the build logs, but it definitely has a snowballing effect. Each change made to MediaWiki repositories triggers multiple jobs, and each of them clones several repositories. There are thus a lot of git operations happening, and if any of them fails the whole set of jobs has to run again. Developers can retry, but that is not a great experience given that some jobs take 22 minutes and CI only reports once it has completed.

I think the troubleshooting can still be done by tricking git into reaching the load balancer.

I thought about using url."https://gerrit-lb.eqiad.wikimedia.org/r/".insteadOf="https://gerrit.wikimedia.org/r/" but the gerrit-lb entry yields a 400. However, git has a way to override curl's hostname resolution with http.curloptResolve, which can make gerrit.wikimedia.org resolve to the IP of the load balancer:

git clone https://gerrit.wikimedia.org/r/mediawiki/extensions.git --depth=1
git -C extensions -c http.curloptResolve='gerrit.wikimedia.org:443:208.80.154.225' submodule update --depth=1 --init --jobs=16

Is it possible to change the Zuul config and use fewer parallel jobs? In our meeting we found that less aggressive settings do not produce 502s.

Not really, Zuul has a lot of parallelism itself and we can't quite change that. The Quibble jobs can be made to have less parallelism at the price of taking a lot more time to clone/fetch the repo (iirc the total cumulative time can reach more than 15 minutes T374717).

Maybe it's also possible to properly retry if one git operation fails?

Maybe, but I don't think git has such an option. I can dig into its source code though.
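Since git itself may not offer this, a retry would have to live in the caller. Below is a minimal sketch of such a wrapper, assuming the caller shells out to git and that a transient failure can be recognised by the "HTTP 502" text in stderr (as in the errors quoted in this task). The function name, the attempt count and the backoff values are hypothetical; this is not how Quibble or Zuul currently behave.

```python
import subprocess
import time


def fetch_with_retries(cmd, attempts=3, delay=1.0,
                       run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying only when it fails with an HTTP 502 in stderr.

    Returns the result of the last run. `run` and `sleep` are injectable
    so the logic can be exercised without touching the network.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        transient = result.returncode != 0 and "HTTP 502" in result.stderr
        if not transient or attempt == attempts:
            return result
        # Exponential backoff: delay, 2*delay, 4*delay, ...
        sleep(delay * 2 ** (attempt - 1))
    return result


# Example call (not executed here):
# fetch_with_retries(["git", "fetch", "--force", "--tags", "origin"])
```

Failures other than a 502 are returned immediately, so a genuine error such as a missing ref is not hidden behind retries.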

I am seeing the same error on https://gerrit.wikimedia.org/r/c/mediawiki/tools/api-testing/+/1235006. No submodules involved.

Zuul clone 
16:23:42 INFO:quibble.commands:>>> Start: Zuul clone {"cache_dir": "/srv/git", "projects": ["mediawiki/core", "mediawiki/skins/Vector", "mediawiki/tools/api-testing", "mediawiki/vendor"], "workers": 8, "workspace": "/workspace/src", "zuul_branch": "master", "zuul_project": "mediawiki/tools/api-testing", "zuul_ref": "refs/zuul/master/Z1767d32d3de74390a6a6dc9efe9c7da6", "zuul_url": "git://contint1002.wikimedia.org"}
16:23:42 INFO:zuul.CloneMapper:Workspace path set to: /workspace/src
16:23:42 INFO:zuul.CloneMapper:Mapping projects to workspace...
16:23:42 INFO:zuul.CloneMapper:  mediawiki/core -> /workspace/src
16:23:42 INFO:zuul.CloneMapper:  mediawiki/skins/Vector -> /workspace/src/skins/Vector
16:23:42 INFO:zuul.CloneMapper:  mediawiki/tools/api-testing -> /workspace/src/mediawiki/tools/api-testing
16:23:42 INFO:zuul.CloneMapper:  mediawiki/vendor -> /workspace/src/vendor
16:23:42 INFO:zuul.CloneMapper:Expansion completed.
16:23:42 INFO:quibble.zuul.clone:Preparing 4 repositories with 8 workers
16:23:42 INFO:quibble.zuul.clone:Cloning mediawiki/core first
16:23:42 INFO:zuul.Cloner:Creating repo mediawiki/core from cache /srv/git/mediawiki/core.git
16:23:42 2026-02-16 15:23:42,979 INFO spawned: 'apache' with pid 58
16:23:42 2026-02-16 15:23:42,985 INFO spawned: 'memcached' with pid 59
16:23:42 2026-02-16 15:23:42,990 INFO spawned: 'php-fpm' with pid 60
16:23:43 2026-02-16 15:23:43,033 INFO success: php-fpm entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
16:23:44 2026-02-16 15:23:44,056 INFO success: apache entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
16:23:44 2026-02-16 15:23:44,056 INFO success: memcached entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
16:23:48 INFO:zuul.Cloner:Updating origin remote in repo mediawiki/core to https://gerrit.wikimedia.org/r/mediawiki/core
16:23:52 INFO:zuul.Cloner:upstream repo has branch master
16:23:52 INFO:zuul.Cloner:Falling back to branch master
16:23:53 INFO:zuul.Cloner:Prepared mediawiki/core repo with branch master at commit 39b252fe4f0ea818698ef6dcde81c99b15377e90
16:23:53 INFO:zuul.Cloner.mediawiki/skins/Vector:Creating repo mediawiki/skins/Vector from cache /srv/git/mediawiki/skins/Vector.git
16:23:53 INFO:zuul.Cloner.mediawiki/vendor:Creating repo mediawiki/vendor from cache /srv/git/mediawiki/vendor.git
16:23:53 INFO:zuul.Cloner.mediawiki/tools/api-testing:Creating repo mediawiki/tools/api-testing from upstream https://gerrit.wikimedia.org/r/mediawiki/tools/api-testing
16:23:53 INFO:zuul.Cloner.mediawiki/skins/Vector:Updating origin remote in repo mediawiki/skins/Vector to https://gerrit.wikimedia.org/r/mediawiki/skins/Vector
16:23:55 INFO:zuul.Cloner.mediawiki/skins/Vector:upstream repo has branch master
16:23:56 INFO:zuul.Cloner.mediawiki/skins/Vector:Falling back to branch master
16:23:56 INFO:zuul.Cloner.mediawiki/skins/Vector:Prepared mediawiki/skins/Vector repo with branch master at commit d9ae190207624d534bfefa0ee438a7da28d417a3
16:23:56 INFO:zuul.Cloner.mediawiki/tools/api-testing:upstream repo has branch master
16:23:57 INFO:zuul.Cloner.mediawiki/tools/api-testing:Prepared mediawiki/tools/api-testing repo with commit dabfb48dc26f4ee282b2eac7da9a6df0c73202a7
16:23:57 INFO:zuul.Cloner.mediawiki/vendor:Updating origin remote in repo mediawiki/vendor to https://gerrit.wikimedia.org/r/mediawiki/vendor
16:23:57 INFO:quibble.commands:<<< Finish: Zuul clone {"cache_dir": "/srv/git", "projects": ["mediawiki/core", "mediawiki/skins/Vector", "mediawiki/tools/api-testing", "mediawiki/vendor"], "workers": 8, "workspace": "/workspace/src", "zuul_branch": "master", "zuul_project": "mediawiki/tools/api-testing", "zuul_ref": "refs/zuul/master/Z1767d32d3de74390a6a6dc9efe9c7da6", "zuul_url": "git://contint1002.wikimedia.org"}, in 15.151 s
16:23:57 Traceback (most recent call last):
16:23:57   File "/usr/local/bin/quibble", line 7, in <module>
16:23:57     sys.exit(main())
16:23:57   File "/usr/local/lib/python3.9/dist-packages/quibble/cmd.py", line 1011, in main
16:23:57     cmd.execute(
16:23:57   File "/usr/local/lib/python3.9/dist-packages/quibble/cmd.py", line 633, in execute
16:23:57     quibble.commands.execute_command(command)
16:23:57   File "/usr/local/lib/python3.9/dist-packages/quibble/commands.py", line 33, in execute_command
16:23:57     command.execute()
16:23:57   File "/usr/local/lib/python3.9/dist-packages/quibble/commands.py", line 166, in execute
16:23:57     quibble.zuul.clone(
16:23:57   File "/usr/local/lib/python3.9/dist-packages/quibble/zuul.py", line 109, in clone
16:23:57     future.result()
16:23:57   File "/usr/lib/python3.9/concurrent/futures/_base.py", line 433, in result
16:23:57     return self.__get_result()
16:23:57   File "/usr/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result
16:23:57     raise self._exception
16:23:57   File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
16:23:57     result = self.fn(*self.args, **self.kwargs)
16:23:57   File "/usr/local/lib/python3.9/dist-packages/quibble/zuul.py", line 126, in _clone_worker
16:23:57     raise e
16:23:57   File "/usr/local/lib/python3.9/dist-packages/quibble/zuul.py", line 122, in _clone_worker
16:23:57     project_cloner.prepareRepo(project, dest)
16:23:57   File "/usr/local/lib/python3.9/dist-packages/zuul/lib/cloner.py", line 167, in prepareRepo
16:23:57     repo.reset()
16:23:57   File "/usr/local/lib/python3.9/dist-packages/zuul/merger/merger.py", line 99, in reset
16:23:57     self.update()
16:23:57   File "/usr/local/lib/python3.9/dist-packages/zuul/merger/merger.py", line 205, in update
16:23:57     origin.fetch(tags=True, force=True)
16:23:57   File "/usr/lib/python3/dist-packages/git/remote.py", line 831, in fetch
16:23:57     res = self._get_fetch_info_from_stderr(proc, progress)
16:23:57   File "/usr/lib/python3/dist-packages/git/remote.py", line 698, in _get_fetch_info_from_stderr
16:23:57     proc.wait(stderr=stderr_text)
16:23:57   File "/usr/lib/python3/dist-packages/git/cmd.py", line 455, in wait
16:23:57     raise GitCommandError(self.args, status, errstr)
16:23:57 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
16:23:57   cmdline: git fetch --force --tags -v -- origin
16:23:57   stderr: 'error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502
16:23:57 fatal: expected flush after ref listing'
16:23:59 Build step 'Execute shell' marked build as failure
16:23:59 [PostBuildScript] - [INFO] Executing post build scripts.
16:23:59 [api-testing-mysql-php83] $ /bin/bash -xe /tmp/jenkins7898173523843691549.sh
16:23:59 + find log/ -name 'mw-debug-*.log' -exec gzip '{}' +

I managed to trigger this while capturing the traffic between ATS and gerrit2003. In my run it failed fetching https://gerrit.wikimedia.org/r/mediawiki/extensions/RelatedArticles. This is the content of the offending request:

POST /r/mediawiki/extensions/RelatedArticles/git-upload-pack HTTP/1.1
user-agent: git/2.30.2
content-type: application/x-git-upload-pack-request
accept: application/x-git-upload-pack-result
git-protocol: version=2
content-length: 222
x-client-ip: 2a02:ec80:700:ed1a::2
x-client-port: 23143
x-forwarded-proto: https
x-connection-properties: H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_128_GCM_SHA256; EC=UNKNOWN; KA=ECDSA;
[REDACTED HTTP HEADERS, NOT RELEVANT]
X-Forwarded-For: 2a02:ec80:700:ed1a::2, 10.140.0.3
via-nginx: 1
Host: gerrit.wikimedia.org
X-CDIS: pass
X-Varnish: 640677633

0011command=fetch0014agent=git/2.30.20001000dthin-pack000fno-progress000finclude-tag000dofs-delta000cdeepen 10032want 9ba3285e92bf25d10358b00f1bac91fa7bee343a
0032want 9ba3285e92bf25d10358b00f1bac91fa7bee343a
0009done
0000

Gerrit's answer to this request is a TLS Close Notify alert that closes the connection (packet 97936 contains the request I've pasted above):

No.	Time	Source	Destination	Protocol	Length	Info
97936	52.474042	10.140.0.3	208.80.153.116	Git	310	Git Smart Protocol
98238	52.567512	208.80.153.116	10.140.0.3	TLSv1.3	90	Alert (Level: Warning, Description: Close Notify)
98239	52.567544	10.140.0.3	208.80.153.116	TCP	66	17515 → 443 [ACK] Seq=32933 Ack=2148210 Win=947200 Len=0 TSval=4121927970 TSecr=1126329970
98240	52.567549	208.80.153.116	10.140.0.3	TCP	66	443 → 17515 [FIN, ACK] Seq=2148210 Ack=31897 Win=42496 Len=0 TSval=1126329970 TSecr=4121922969
98241	52.567906	10.140.0.3	208.80.153.116	TCP	66	17515 → 443 [FIN, ACK] Seq=32933 Ack=2148211 Win=947200 Len=0 TSval=4121927970 TSecr=1126329970
98346	52.607007	208.80.153.116	10.140.0.3	TCP	66	443 → 17515 [ACK] Seq=2148211 Ack=32933 Win=42496 Len=0 TSval=1126330009 TSecr=4121927875
98527	52.700782	208.80.153.116	10.140.0.3	TCP	66	443 → 17515 [ACK] Seq=2148211 Ack=32934 Win=42496 Len=0 TSval=1126330103 TSecr=4121927970

Just to support Jelto's comment above, I would rather avoid rolling back the entire CDN change unless we get to the point where this is consistently blocking critical workflows. This way we'll get some more time with real life conditions to troubleshoot the problem.

I'm getting a *lot* of these trying to prepare Parsoid's weekly release for the train. If you need to reproduce it, any one of half a dozen of my patches will do.

The frequency is really making jenkins completely unusable for me today. Only one in five or so of my jobs will complete without a spurious failure. I'm trying to get our weekly release on the train, and this is *really* annoying.

Four failures during merge on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1193553
Two failures before successful merge of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1193910/13
Two failures before successful merge of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1239768/3
Two failures before successful merge of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1203255

Nine failures on ParserOutput: move flags into $mFlags array (1193553)

Eight failures on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1226332/7

etc

Last night I got pinged because CI was not reacting, which matches the report made on Sunday at T417497. Very early this morning I dug into the Zuul and Gerrit logs, and sure enough it keeps disconnecting and thus missing events.

We need to roll back the move behind the TCP proxy, which is causing CI jobs to fail even on a single clone and CI to miss events.

Change #1239872 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: bump Jetty threads

https://gerrit.wikimedia.org/r/1239872

Change #1239878 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/dns@master] wikimedia: revert gerrit behind the CDN

https://gerrit.wikimedia.org/r/1239878

Change #1239881 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gerrit: disable nftables throttling

https://gerrit.wikimedia.org/r/1239881

Change #1239881 merged by Jelto:

[operations/puppet@production] gerrit: disable nftables throttling

https://gerrit.wikimedia.org/r/1239881

Change #1239888 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Disable connection re-use for gerrit

https://gerrit.wikimedia.org/r/1239888

Change #1239888 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Disable connection re-use for gerrit

https://gerrit.wikimedia.org/r/1239888

@Vgutierrez suggested this issue might be related to the aggressive connection reuse by ATS. Gerrit might not be handling this properly and might be closing connections unexpectedly.

For troubleshooting, connection re-use was temporarily disabled for the Gerrit backend in ATS in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1239888 (thanks @Vgutierrez!). After deploying this on all cp nodes we were not able to reproduce 502 errors, neither from cp servers nor from WMCS. From my home connection I get 429s at some point, which is kind of expected.

I'll monitor the CI jobs and metrics for some more time and adjust the severity from UBN.

So if this solves the issue of broken CI, it gives us more time to find the correct setting in Gerrit's Apache, Jetty or somewhere else in Gerrit.

Jelto lowered the priority of this task from Unbreak Now! to High. Feb 17 2026, 11:21 AM

No more 502s appeared so far, I'll reduce the severity.

Feel free to bump it to UBN again if we missed something. Finding failed jobs on Gerrit is a bit tricky for me.

That is great thank you :-]

I am wondering though why reusing connections leads to errors. Apache has the default Debian configuration as far as I can tell:

# KeepAlive: Whether or not to allow persistent connections (more than
# one request per connection). Set to "Off" to deactivate.
#
KeepAlive On

# MaxKeepAliveRequests: The maximum number of requests to allow
# during a persistent connection. Set to 0 to allow an unlimited amount.
# We recommend you leave this number high, for maximum performance.
#
MaxKeepAliveRequests 100

# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
#
KeepAliveTimeout 5

Can it be that ATS has a much longer timeout than the Apache 5 seconds timeout? If so, I imagine:

  • ATS picks up a connection that it considers not to have timed out (idle for less than whatever timeout it has) but that is about to expire on the Apache side (it has been idle for almost 5 seconds).
  • ATS sends the headers and at the same time, because the connection has been idle for 5 seconds, Apache terminates it (TCP FIN?).
  • ATS gets the unexpected termination, gives up and emits a 502 response.
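The hypothesis in the list above can be written down as a small model. This is a deliberately coarse sketch: it only compares idle time against the two timeouts and ignores that the proxy may notice the close in time and drop the connection from its pool. The function name is hypothetical, the 5 second value is the Apache KeepAliveTimeout shown above, and the 120 second proxy value is used purely for illustration.

```python
def reuse_is_risky(idle_seconds, proxy_timeout, origin_timeout):
    """True when the proxy still considers an idle connection reusable
    while the origin has already reached its own keep-alive timeout."""
    proxy_keeps_it = idle_seconds < proxy_timeout
    origin_dropped_it = idle_seconds >= origin_timeout
    return proxy_keeps_it and origin_dropped_it


# Proxy keep-alive of 120 s (illustrative) against Apache's 5 s.
for idle in (2, 5, 30):
    print(idle, reuse_is_risky(idle, proxy_timeout=120, origin_timeout=5))
```

With equal timeouts on both sides the function never returns True, which is the intuition behind aligning the two values.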

Jelto asked how we can find out whether CI jobs are still failing. Jenkins stores the console output on contint.wikimedia.org under /srv/jenkins/builds/ as files named log (one per build).

I went to grep through the last four hours of builds (which is after Valentin disabled the connection reuse). There are 2350 of them and the grep yields nothing:

contint1002$ find /srv/jenkins/builds/  -type f -name log -cmin -240 -exec grep -l 'HTTP 502 curl 22' {} \+
contint1002$
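For anyone who prefers to script this check, here is a rough Python equivalent of the find/grep above. It is a sketch only: the function name and defaults are made up for illustration, it reads each log fully into memory, and it uses the file change time to mirror -cmin.

```python
import os
import time


def builds_with_marker(root, marker="HTTP 502 curl 22",
                       max_age_minutes=240, now=None):
    """Return paths of recent files named "log" under root that contain marker."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    needle = marker.encode()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "log" not in filenames:
            continue
        path = os.path.join(dirpath, "log")
        # Skip builds whose log was last changed before the cutoff.
        if os.stat(path).st_ctime < cutoff:
            continue
        with open(path, "rb") as handle:
            if needle in handle.read():
                hits.append(path)
    return sorted(hits)


# Example call (not executed here):
# print(builds_with_marker("/srv/jenkins/builds/"))
```

An empty list corresponds to the empty grep output shown above.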

Can it be that ATS has a much longer timeout than the Apache 5 seconds timeout?

Yes. KeepAliveTimeout needs to be synced with the ATS value (120s), and MaxKeepAliveRequests probably needs to be set to 0. You probably want to do this after moving away from public IPs for the Gerrit backends, or at least after closing port 443 to the Internet.

Change #1240197 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: adapt httpd config to ATS

https://gerrit.wikimedia.org/r/1240197

Change #1239878 abandoned by Jelto:

[operations/dns@master] wikimedia: revert gerrit behind the CDN

Reason:

not needed anymore, temporary workaround was found

https://gerrit.wikimedia.org/r/1239878

Change #1240294 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: add gerrit-replica to LVS

https://gerrit.wikimedia.org/r/1240294

Change #1240603 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: add gerrit-replica backend to LVS

https://gerrit.wikimedia.org/r/1240603

hashar renamed this task from Investigate gerrit 5xx responses to ATS causes git fetches from Gerrit to fail with 502 responses. Feb 19 2026, 2:41 PM
hashar assigned this task to Vgutierrez.

I can confirm the issue has been fixed. There has been no 502 issue over the last two days:

contint1002$ find /srv/jenkins/builds/  -type f -name log -ctime -2 -exec grep -l 'HTTP 502 curl 22' {} \+
contint1002$

I am thus inclined to mark this task Resolved.

@Vgutierrez mentioned that the Apache KeepAliveTimeout / MaxKeepAliveRequests should be tuned to match ATS, which I guess can be addressed in a follow-up task.

Summary

The issue has been solved by disabling connection reuse for Gerrit in ATS. I wrote an explanation of the problem (T417536#11622959), and Valentin mentioned that a follow-up is to change the Apache 2 timeout (T417536#11624330).

I have filed the follow-up action as T417998: ATS: align ATS and Gerrit Apache timeouts to reenable connection re-use.

Change #1239872 abandoned by Hashar:

[operations/puppet@production] gerrit: bump Jetty threads

Reason:

The 502 got resolved, that was due to a timeout race condition. The Jetty config can probably be tuned in sync with the Apache frontend and maybe we find a way to have some more monitoring but AFAIK it is not having any troubles.

https://gerrit.wikimedia.org/r/1239872