Page MenuHomePhabricator

Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos
Open, HighPublic

Description

Started seeing this today, example URLs from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/570730#message-756e21dd45f85c38d9b21ff3e774d66ded724927:

12:45:36 INFO:zuul.Cloner.mediawiki/extensions/AbuseFilter:Updating origin remote in repo mediawiki/extensions/AbuseFilter to https://gerrit.wikimedia.org/r/mediawiki/extensions/AbuseFilter
12:45:37 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof
12:45:37 Traceback (most recent call last):
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 51, in __init__
12:45:37     self._ensure_cloned()
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
12:45:37     git.Repo.clone_from(self.remote_url, self.local_path)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 925, in clone_from
12:45:37     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 880, in _clone
12:45:37     finalize_process(proc, stderr=stderr)
12:45:37   File "/usr/lib/python3/dist-packages/git/util.py", line 341, in finalize_process
12:45:37     proc.wait(**kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/cmd.py", line 291, in wait
12:45:37     raise GitCommandError(self.args, status, errstr)
12:45:37 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
12:45:37   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof /workspace/src/extensions/AntiSpoof
12:45:37   stderr: 'Cloning into '/workspace/src/extensions/AntiSpoof'...
12:45:37 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof/': The requested URL returned error: 502
12:45:37 '
12:45:37 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/CiteThisPage
12:45:37 Traceback (most recent call last):
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 51, in __init__
12:45:37     self._ensure_cloned()
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
12:45:37     git.Repo.clone_from(self.remote_url, self.local_path)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 925, in clone_from
12:45:37     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 880, in _clone
12:45:37     finalize_process(proc, stderr=stderr)
12:45:37   File "/usr/lib/python3/dist-packages/git/util.py", line 341, in finalize_process
12:45:37     proc.wait(**kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/cmd.py", line 291, in wait
12:45:37     raise GitCommandError(self.args, status, errstr)
12:45:37 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
12:45:37   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/extensions/CiteThisPage /workspace/src/extensions/CiteThisPage
12:45:37   stderr: 'Cloning into '/workspace/src/extensions/CiteThisPage'...
12:45:37 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CiteThisPage/': The requested URL returned error: 502
12:45:37 '
12:45:37 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser
12:45:37 Traceback (most recent call last):
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 51, in __init__
12:45:37     self._ensure_cloned()
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
12:45:37     git.Repo.clone_from(self.remote_url, self.local_path)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 925, in clone_from
12:45:37     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 880, in _clone
12:45:37     finalize_process(proc, stderr=stderr)
12:45:37   File "/usr/lib/python3/dist-packages/git/util.py", line 341, in finalize_process
12:45:37     proc.wait(**kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/cmd.py", line 291, in wait
12:45:37     raise GitCommandError(self.args, status, errstr)
12:45:37 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
12:45:37   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser /workspace/src/extensions/CheckUser
12:45:37   stderr: 'Cloning into '/workspace/src/extensions/CheckUser'...
12:45:37 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser/': The requested URL returned error: 502
12:45:37 '

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
reedy@deploy1001:/srv/mediawiki-staging$ git pull
fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/mediawiki-config/': The requested URL returned error: 502
reedy@deploy1001:/srv/mediawiki-staging$ git pull
fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/mediawiki-config/': The requested URL returned error: 502

I see this in the apache logs:

$ zgrep -P 'proxy-server/502.*upload-pack' /var/log/apache2/gerrit.wikimedia.org.https.access.log.1 | grep -i mediawiki-config
2020-04-19T16:17:22     204716  <IPv6>      proxy-server/502        341     GET     http://gerrit.wikimedia.org/r/operations/mediawiki-config/info/refs?service=git-upload-pack    -       text/html       -       -       git/2.11.0      en-US, *;q=0.9  -       -       -       <IPv6> 

I see this in the apache error logs:

$ zgrep -P 'Apr\s+19\s+16:17' /var/log/apache2/gerrit.wikimedia.org.https.error.log.1
[Sun Apr 19 16:17:22.401452 2020] [proxy_http:error] [pid 30708:tid 140482634032896] (20014)Internal error (specific information not available): [client <IPv6>] AH01102: error reading status line from remote server 127.0.0.1:8080
[Sun Apr 19 16:17:22.401502 2020] [proxy:error] [pid 30708:tid 140482634032896] [client <IPv6>] AH00898: Error reading from remote server returned by /r/operations/mediawiki-config/info/refs

I don't see anything that minute from gerrit:

$ zgrep -P '2020-04-19\s+16:17' /var/log/gerrit/gerrit.log.2020-04-19
$ 

Most errors reported in apache status line errors:

$ grep 'proxy_http:error' /var/log/apache2/gerrit.wikimedia.org.https.error.log | grep -vc 'status line'
5
$ grep 'proxy_http:error' /var/log/apache2/gerrit.wikimedia.org.https.error.log | grep -c 'status line'
59

Change 591304 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: add Keepalive=on to ProxyPass config lines

https://gerrit.wikimedia.org/r/591304

I found https://serverfault.com/questions/185894/proxy-error-502-reason-error-reading-from-remote-server-with-apache-2-2-3-de where the reporter is also using httpd to (reverse)-proxy to jetty and he ran into the same 502s - error reading from remote server".

He also mentions trying the exact same changes as suggested by Tyler above (SetEnv proxy-initial-not-pooled, keepalive Off ) but says they did not help.

Then he goes on saying he solved the issue by adding "keepalive=On" to the ProxyPass lines.

Since, as described above, we want to prevent that the backend server closes the pooled connection early, before data sent by the proxy reached the backend, it does seem to make sense to me that we want Keepalive to be turned ON and not off.

That made me create the patch above. Does it make sense?

Change 591304 merged by Dzahn:
[operations/puppet@production] gerrit: add Keepalive=on to ProxyPass config lines

https://gerrit.wikimedia.org/r/591304

Deployed the change above and restarted httpd on gerrit1001.

Deployed the change above and restarted httpd on gerrit1001.

Awesome! Thanks, I'll monitor traffic.

thcipriani reassigned this task from thcipriani to Dzahn.

502s seem to have dropped off quite a bit. Optimistically closing once again.

$ zgrep -cP 'proxy-server/502.*upload-pack' /var/log/apache2/gerrit.wikimedia.org.https.access.log*
/var/log/apache2/gerrit.wikimedia.org.https.access.log:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.1:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.2.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.3.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.4.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.5.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.6.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.7.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.8.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.9.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.10.gz:63
/var/log/apache2/gerrit.wikimedia.org.https.access.log.11.gz:51
/var/log/apache2/gerrit.wikimedia.org.https.access.log.12.gz:9
/var/log/apache2/gerrit.wikimedia.org.https.access.log.13.gz:23
/var/log/apache2/gerrit.wikimedia.org.https.access.log.14.gz:28
/var/log/apache2/gerrit.wikimedia.org.https.access.log.15.gz:33
/var/log/apache2/gerrit.wikimedia.org.https.access.log.16.gz:26
/var/log/apache2/gerrit.wikimedia.org.https.access.log.17.gz:32
/var/log/apache2/gerrit.wikimedia.org.https.access.log.18.gz:17
/var/log/apache2/gerrit.wikimedia.org.https.access.log.19.gz:4
/var/log/apache2/gerrit.wikimedia.org.https.access.log.20.gz:8
/var/log/apache2/gerrit.wikimedia.org.https.access.log.21.gz:12
/var/log/apache2/gerrit.wikimedia.org.https.access.log.22.gz:5
/var/log/apache2/gerrit.wikimedia.org.https.access.log.23.gz:3
/var/log/apache2/gerrit.wikimedia.org.https.access.log.24.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.25.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.26.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.27.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.28.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.29.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.30.gz:1

Change 579601 abandoned by Thcipriani:
Gerrit: apache proxy not pooled

Reason:
superseded by https://gerrit.wikimedia.org/r/591304

https://gerrit.wikimedia.org/r/579601

Now they’re HTTP 500 errors (presumably due to T254158: Gerrit 3.2 upgrade): Jenkins build

12:29:30 + git fetch --quiet --update-head-ok --depth 2 git://contint2001.wikimedia.org/mediawiki/extensions/Wikibase +refs/zuul/master/Z7f6f22c8417c4636b6c2caec7666f5c1:refs/zuul/master/Z7f6f22c8417c4636b6c2caec7666f5c1
12:29:31 + git checkout -B master FETCH_HEAD
12:29:31 Reset branch 'master'
12:29:31 + set +x
12:29:31 + git submodule --quiet update --init --recursive
12:29:31 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-data-model.git/': The requested URL returned error: 500
12:29:31 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-data-model.git' into submodule path '/src/view/lib/wikibase-data-model' failed
12:29:31 Failed to clone 'view/lib/wikibase-data-model'. Retry scheduled
12:29:31 fatal: unable to access 'https://phabricator.wikimedia.org/source/datavalues-javascript.git/': The requested URL returned error: 500
12:29:31 fatal: clone of 'https://phabricator.wikimedia.org/source/datavalues-javascript.git' into submodule path '/src/view/lib/wikibase-data-values' failed
12:29:31 Failed to clone 'view/lib/wikibase-data-values'. Retry scheduled
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-serialization.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-serialization.git' into submodule path '/src/view/lib/wikibase-serialization' failed
12:29:32 Failed to clone 'view/lib/wikibase-serialization'. Retry scheduled
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-termbox/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-termbox' into submodule path '/src/view/lib/wikibase-termbox' failed
12:29:32 Failed to clone 'view/lib/wikibase-termbox'. Retry scheduled
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/datavalues-javascript.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/datavalues-javascript.git' into submodule path '/src/view/lib/wikibase-data-values' failed
12:29:32 Failed to clone 'view/lib/wikibase-data-values' a second time, aborting
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-data-model.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-data-model.git' into submodule path '/src/view/lib/wikibase-data-model' failed
12:29:32 Failed to clone 'view/lib/wikibase-data-model' a second time, aborting
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-serialization.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-serialization.git' into submodule path '/src/view/lib/wikibase-serialization' failed
12:29:32 Failed to clone 'view/lib/wikibase-serialization' a second time, aborting
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-termbox/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-termbox' into submodule path '/src/view/lib/wikibase-termbox' failed
12:29:32 Failed to clone 'view/lib/wikibase-termbox' a second time, aborting
12:29:32 Build step 'Execute shell' marked build as failure
12:29:32 Archiving artifacts

@Lucas_Werkmeister_WMDE those errors come from cloning from Phabricator. I would recommend to instead use the canonical Gerrit URL.

Still seen:
https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/53486/console

02:18:34 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/FileImporter/': The requested URL returned error: 502

Still seen:
https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/78885/console

15:40:27   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/skins/Vector/': The requested URL returned error: 502'
hashar added a subscriber: zeljkofilipin.

@zeljkofilipin reported a similar one using:

GIT_TRACE_CURL=1 GIT_TRACE_CURL_NO_DATA=1 git clone --depth=1 https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki

1~$ GIT_TRACE_CURL=1 GIT_TRACE_CURL_NO_DATA=1 git clone --depth=1 https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki
2Cloning into 'mediawiki'...
318:11:59.377481 http.c:731 == Info: Couldn't find host gerrit.wikimedia.org in the .netrc file; using defaults
418:11:59.535149 http.c:731 == Info: Trying 208.80.154.137...
518:11:59.535178 http.c:731 == Info: TCP_NODELAY set
618:11:59.679604 http.c:731 == Info: Connected to gerrit.wikimedia.org (208.80.154.137) port 443 (#0)
718:11:59.679739 http.c:731 == Info: ALPN, offering h2
818:11:59.679748 http.c:731 == Info: ALPN, offering http/1.1
918:11:59.687143 http.c:731 == Info: successfully set certificate verify locations:
1018:11:59.687156 http.c:731 == Info: CAfile: /etc/ssl/cert.pem
11 CApath: none
1218:11:59.687245 http.c:731 == Info: TLSv1.2 (OUT), TLS handshake, Client hello (1):
1318:11:59.828884 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Server hello (2):
1418:11:59.829086 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Certificate (11):
1518:11:59.849901 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Server key exchange (12):
1618:11:59.850401 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Server finished (14):
1718:11:59.850696 http.c:731 == Info: TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
1818:11:59.850705 http.c:731 == Info: TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
1918:11:59.850742 http.c:731 == Info: TLSv1.2 (OUT), TLS handshake, Finished (20):
2018:11:59.994151 http.c:731 == Info: TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
2118:11:59.994282 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Finished (20):
2218:11:59.994311 http.c:731 == Info: SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
2318:11:59.994325 http.c:731 == Info: ALPN, server accepted to use http/1.1
2418:11:59.994331 http.c:731 == Info: Server certificate:
2518:11:59.994342 http.c:731 == Info: subject: CN=gerrit.wikimedia.org
2618:11:59.994348 http.c:731 == Info: start date: Aug 31 17:00:16 2021 GMT
2718:11:59.994353 http.c:731 == Info: expire date: Nov 29 17:00:15 2021 GMT
2818:11:59.994369 http.c:731 == Info: subjectAltName: host "gerrit.wikimedia.org" matched cert's "gerrit.wikimedia.org"
2918:11:59.994379 http.c:731 == Info: issuer: C=US; O=Let's Encrypt; CN=R3
3018:11:59.994383 http.c:731 == Info: SSL certificate verify ok.
3118:11:59.994480 http.c:678 => Send header, 0000000228 bytes (0x000000e4)
3218:11:59.994489 http.c:690 => Send header: GET /r/mediawiki/core.git/info/refs?service=git-upload-pack HTTP/1.1
3318:11:59.994493 http.c:690 => Send header: Host: gerrit.wikimedia.org
3418:11:59.994496 http.c:690 => Send header: User-Agent: git/2.24.3 (Apple Git-128)
3518:11:59.994500 http.c:690 => Send header: Accept: */*
3618:11:59.994503 http.c:690 => Send header: Accept-Encoding: deflate, gzip
3718:11:59.994506 http.c:690 => Send header: Pragma: no-cache
3818:11:59.994509 http.c:690 => Send header: Git-Protocol: version=2
3918:11:59.994512 http.c:690 => Send header:
4018:12:00.608040 http.c:678 <= Recv header, 0000000017 bytes (0x00000011)
4118:12:00.608081 http.c:690 <= Recv header: HTTP/1.1 200 OK
4218:12:00.608088 http.c:678 <= Recv header, 0000000037 bytes (0x00000025)
4318:12:00.608092 http.c:690 <= Recv header: Date: Fri, 08 Oct 2021 16:12:00 GMT
4418:12:00.608098 http.c:678 <= Recv header, 0000000016 bytes (0x00000010)
4518:12:00.608102 http.c:690 <= Recv header: Server: Apache
4618:12:00.608106 http.c:678 <= Recv header, 0000000074 bytes (0x0000004a)
4718:12:00.608118 http.c:690 <= Recv header: Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
4818:12:00.608123 http.c:678 <= Recv header, 0000000023 bytes (0x00000017)
4918:12:00.608127 http.c:690 <= Recv header: X-Frame-Options: DENY
5018:12:00.608131 http.c:678 <= Recv header, 0000000040 bytes (0x00000028)
5118:12:00.608134 http.c:690 <= Recv header: Expires: Tue, 01 Jan 1980 00:00:00 GMT
5218:12:00.608138 http.c:678 <= Recv header, 0000000018 bytes (0x00000012)
5318:12:00.608142 http.c:690 <= Recv header: Pragma: no-cache
5418:12:00.608146 http.c:678 <= Recv header, 0000000053 bytes (0x00000035)
5518:12:00.608150 http.c:690 <= Recv header: Cache-Control: no-cache, max-age=0, must-revalidate
5618:12:00.608154 http.c:678 <= Recv header, 0000000059 bytes (0x0000003b)
5718:12:00.608158 http.c:690 <= Recv header: Content-Type: application/x-git-upload-pack-advertisement
5818:12:00.608163 http.c:678 <= Recv header, 0000000020 bytes (0x00000014)
5918:12:00.608166 http.c:690 <= Recv header: Content-Length: 62
6018:12:00.608170 http.c:678 <= Recv header, 0000000043 bytes (0x0000002b)
6118:12:00.608174 http.c:690 <= Recv header: Backend-Timing: D=1688 t=1633709520077989
6218:12:00.608178 http.c:678 <= Recv header, 0000000002 bytes (0x00000002)
6318:12:00.608182 http.c:690 <= Recv header:
6418:12:00.608197 http.c:731 == Info: Connection #0 to host gerrit.wikimedia.org left intact
6518:12:00.608388 http.c:731 == Info: Couldn't find host gerrit.wikimedia.org in the .netrc file; using defaults
6618:12:00.608424 http.c:731 == Info: Found bundle for host gerrit.wikimedia.org: 0x7ff7d6d04a80 [can pipeline]
6718:12:00.608429 http.c:731 == Info: Could pipeline, but not asked to!
6818:12:00.608444 http.c:731 == Info: Re-using existing connection! (#0) with host gerrit.wikimedia.org
6918:12:00.608451 http.c:731 == Info: Connected to gerrit.wikimedia.org (208.80.154.137) port 443 (#0)
7018:12:00.608516 http.c:678 => Send header, 0000000300 bytes (0x0000012c)
7118:12:00.608524 http.c:690 => Send header: POST /r/mediawiki/core.git/git-upload-pack HTTP/1.1
7218:12:00.608530 http.c:690 => Send header: Host: gerrit.wikimedia.org
7318:12:00.608534 http.c:690 => Send header: User-Agent: git/2.24.3 (Apple Git-128)
7418:12:00.608537 http.c:690 => Send header: Accept-Encoding: deflate, gzip
7518:12:00.608541 http.c:690 => Send header: Content-Type: application/x-git-upload-pack-request
7618:12:00.608545 http.c:690 => Send header: Accept: application/x-git-upload-pack-result
7718:12:00.608548 http.c:690 => Send header: Git-Protocol: version=2
7818:12:00.608551 http.c:690 => Send header: Content-Length: 122
7918:12:00.608554 http.c:690 => Send header:
8018:12:00.608560 http.c:731 == Info: upload completely sent off: 122 out of 122 bytes
8118:12:00.897137 http.c:678 <= Recv header, 0000000017 bytes (0x00000011)
8218:12:00.897157 http.c:690 <= Recv header: HTTP/1.1 200 OK
8318:12:00.897163 http.c:678 <= Recv header, 0000000037 bytes (0x00000025)
8418:12:00.897167 http.c:690 <= Recv header: Date: Fri, 08 Oct 2021 16:12:00 GMT
8518:12:00.897172 http.c:678 <= Recv header, 0000000016 bytes (0x00000010)
8618:12:00.897176 http.c:690 <= Recv header: Server: Apache
8718:12:00.897180 http.c:678 <= Recv header, 0000000074 bytes (0x0000004a)
8818:12:00.897184 http.c:690 <= Recv header: Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
8918:12:00.897188 http.c:678 <= Recv header, 0000000023 bytes (0x00000017)
9018:12:00.897192 http.c:690 <= Recv header: X-Frame-Options: DENY
9118:12:00.897206 http.c:678 <= Recv header, 0000000040 bytes (0x00000028)
9218:12:00.897210 http.c:690 <= Recv header: Expires: Tue, 01 Jan 1980 00:00:00 GMT
9318:12:00.897224 http.c:678 <= Recv header, 0000000018 bytes (0x00000012)
9418:12:00.897237 http.c:690 <= Recv header: Pragma: no-cache
9518:12:00.897243 http.c:678 <= Recv header, 0000000053 bytes (0x00000035)
9618:12:00.897247 http.c:690 <= Recv header: Cache-Control: no-cache, max-age=0, must-revalidate
9718:12:00.897252 http.c:678 <= Recv header, 0000000052 bytes (0x00000034)
9818:12:00.897256 http.c:690 <= Recv header: Content-Type: application/x-git-upload-pack-result
9918:12:00.897260 http.c:678 <= Recv header, 0000000045 bytes (0x0000002d)
10018:12:00.897264 http.c:690 <= Recv header: Backend-Timing: D=143034 t=1633709520688117
10118:12:00.897269 http.c:678 <= Recv header, 0000000028 bytes (0x0000001c)
10218:12:00.897281 http.c:690 <= Recv header: Transfer-Encoding: chunked
10318:12:00.897285 http.c:678 <= Recv header, 0000000002 bytes (0x00000002)
10418:12:00.897288 http.c:690 <= Recv header:
10518:12:01.181998 http.c:731 == Info: Connection #0 to host gerrit.wikimedia.org left intact
10618:12:01.186693 http.c:731 == Info: Couldn't find host gerrit.wikimedia.org in the .netrc file; using defaults
10718:12:01.186712 http.c:731 == Info: Found bundle for host gerrit.wikimedia.org: 0x7ff7d6d04a80 [can pipeline]
10818:12:01.186717 http.c:731 == Info: Could pipeline, but not asked to!
10918:12:01.186735 http.c:731 == Info: Re-using existing connection! (#0) with host gerrit.wikimedia.org
11018:12:01.186743 http.c:731 == Info: Connected to gerrit.wikimedia.org (208.80.154.137) port 443 (#0)
11118:12:01.186796 http.c:678 => Send header, 0000000300 bytes (0x0000012c)
11218:12:01.186805 http.c:690 => Send header: POST /r/mediawiki/core.git/git-upload-pack HTTP/1.1
11318:12:01.186809 http.c:690 => Send header: Host: gerrit.wikimedia.org
11418:12:01.186812 http.c:690 => Send header: User-Agent: git/2.24.3 (Apple Git-128)
11518:12:01.186824 http.c:690 => Send header: Accept-Encoding: deflate, gzip
11618:12:01.186828 http.c:690 => Send header: Content-Type: application/x-git-upload-pack-request
11718:12:01.186831 http.c:690 => Send header: Accept: application/x-git-upload-pack-result
11818:12:01.186835 http.c:690 => Send header: Git-Protocol: version=2
11918:12:01.186838 http.c:690 => Send header: Content-Length: 187
12018:12:01.186841 http.c:690 => Send header:
12118:12:01.186848 http.c:731 == Info: upload completely sent off: 187 out of 187 bytes
12218:12:01.825005 http.c:678 <= Recv header, 0000000017 bytes (0x00000011)
12318:12:01.825022 http.c:690 <= Recv header: HTTP/1.1 200 OK
12418:12:01.825029 http.c:678 <= Recv header, 0000000037 bytes (0x00000025)
12518:12:01.825033 http.c:690 <= Recv header: Date: Fri, 08 Oct 2021 16:12:01 GMT
12618:12:01.825038 http.c:678 <= Recv header, 0000000016 bytes (0x00000010)
12718:12:01.825042 http.c:690 <= Recv header: Server: Apache
12818:12:01.825046 http.c:678 <= Recv header, 0000000074 bytes (0x0000004a)
12918:12:01.825056 http.c:690 <= Recv header: Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
13018:12:01.825069 http.c:678 <= Recv header, 0000000023 bytes (0x00000017)
13118:12:01.825074 http.c:690 <= Recv header: X-Frame-Options: DENY
13218:12:01.825078 http.c:678 <= Recv header, 0000000040 bytes (0x00000028)
13318:12:01.825082 http.c:690 <= Recv header: Expires: Tue, 01 Jan 1980 00:00:00 GMT
13418:12:01.825087 http.c:678 <= Recv header, 0000000018 bytes (0x00000012)
13518:12:01.825090 http.c:690 <= Recv header: Pragma: no-cache
13618:12:01.825095 http.c:678 <= Recv header, 0000000053 bytes (0x00000035)
13718:12:01.825099 http.c:690 <= Recv header: Cache-Control: no-cache, max-age=0, must-revalidate
13818:12:01.825104 http.c:678 <= Recv header, 0000000052 bytes (0x00000034)
13918:12:01.825107 http.c:690 <= Recv header: Content-Type: application/x-git-upload-pack-result
14018:12:01.825112 http.c:678 <= Recv header, 0000000045 bytes (0x0000002d)
14118:12:01.825115 http.c:690 <= Recv header: Backend-Timing: D=495370 t=1633709521266171
14218:12:01.825120 http.c:678 <= Recv header, 0000000028 bytes (0x0000001c)
14318:12:01.825124 http.c:690 <= Recv header: Transfer-Encoding: chunked
14418:12:01.825128 http.c:678 <= Recv header, 0000000002 bytes (0x00000002)
14518:12:01.825140 http.c:690 <= Recv header:
146remote: Counting objects: 8957, done
147remote: Finding sources: 100% (8957/8957)
148remote: Getting sizes: 100% (7335/7335)
149remote: Compressing objects: 99% (93223/93224)
15018:17:13.944725 http.c:731 == Info: TLSv1.2 (IN), TLS alert, close notify (256):
15118:17:13.944745 http.c:731 == Info: transfer closed with outstanding read data remaining
15218:17:13.944760 http.c:731 == Info: Closing connection 0
15318:17:13.944787 http.c:731 == Info: TLSv1.2 (OUT), TLS alert, close notify (256):
154error: RPC failed; curl 18 transfer closed with outstanding read data remaining
155fatal: the remote end hung up unexpectedly
156fatal: early EOF
157fatal: index-pack failed

Which on the backend side is:

2021-10-08T15:40:09.554Z] [HTTP POST /r/mediawiki/core.git/git-upload-pack (N/A from x.x.x.x)] WARN  org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /srv/gerrit/git/mediawiki/core.git
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234)
        at org.eclipse.jetty.server.HttpOutput.channelWrite(HttpOutput.java:269)
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:861)
        at net.bull.javamelody.internal.web.FilterServletOutputStream.write(FilterServletOutputStream.java:88)
        at net.bull.javamelody.internal.web.CounterResponseStream.write(CounterResponseStream.java:81)
        at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)
        at org.eclipse.jgit.util.TemporaryBuffer.write(TemporaryBuffer.java:128)
        at org.eclipse.jgit.transport.UploadPack$ResponseBufferedOutputStream.write(UploadPack.java:2371)
        at org.eclipse.jgit.transport.SideBandOutputStream.writeBuffer(SideBandOutputStream.java:141)
        at org.eclipse.jgit.transport.SideBandOutputStream.write(SideBandOutputStream.java:120)
        at org.eclipse.jgit.internal.storage.pack.PackOutputStream.write(PackOutputStream.java:99)
        at java.base/java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
        at java.base/java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
        at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
        at org.eclipse.jgit.lib.ObjectLoader.copyTo(ObjectLoader.java:237)
...

The related Gerrit setting is https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#httpd.idleTimeout

httpd.idleTimeout

Maximum idle time for a connection, which roughly translates to the TCP socket SO_TIMEOUT.

This value is interpreted as the maximum time between some progress being made on the connection. So if a single byte is read or written, then the timeout is reset.

The max idle time is applied:

  • When waiting for a new message to be received on a connection
  • When waiting for a new message to be sent on a connection

By default, 30 seconds.

The issue was due to the proxy config and the other error Zeljko is encountered does not seem related. I will file another task.

15:57:05 INFO:zuul.Cloner.mediawiki/extensions/CirrusSearch:Creating repo mediawiki/extensions/CirrusSearch from upstream https://gerrit.wikimedia.org/r/mediawiki/extensions/CirrusSearch
...
15:57:11 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
15:57:11   cmdline: git fetch --tags -v origin
15:57:11   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CirrusSearch/': The requested URL returned error: 502'

From apache around that time:

/var/log/apache2/gerrit.wikimedia.org.https.error.log
[Thu Dec 16 15:57:09.680632 2021] [proxy_http:error] [pid 8030:tid 139643982952192] (20014)Internal error (specific information not available): [client 172.16.6.197:54922] AH01102: error reading status line from remote server 127.0.0.1:80
80
[Thu Dec 16 15:57:09.680713 2021] [proxy:error] [pid 8030:tid 139643982952192] [client 172.16.6.197:54922] AH00898: Error reading from remote server returned by /r/mediawiki/extensions/CirrusSearch/info/refs

That is the IP of the instance that ran the build and the error does refer to CirrusSearch which failed to clone. The same happened for AbuseFilter at 16:10:33.

There a couple errors before that:

[2021-12-16T15:50:18.859Z] [HTTP POST /r/mediawiki/vendor/git-upload-pack (N/A from 172.16.3.126)] WARN org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /srv/gerrit/git/mediawiki/vendor.git
org.eclipse.jetty.io.EofException
[2021-12-16T15:50:19.197Z] [HTTP POST /r/mediawiki/vendor/git-upload-pack (N/A from 172.16.7.210)] WARN org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /srv/gerrit/git/mediawiki/vendor.git
org.eclipse.jetty.io.EofException

But I think that is due to build being aborted.

Maybe there is an issue with the Jetty pool but I don't see anything on https://grafana.wikimedia.org/d/fe848RoMz/http-jetty it had plenty of idle slot.

For Apache the pool is at https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?orgId=1&refresh=1m&var-host=gerrit1001&var-port=9117 and I dont see much either there

I don't have any idea as to how to debug Apache mod_proxy and the AH00898 error :-\

https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-docker-publish/52844/console is another instance.

05:07:50   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof/': The requested URL returned error: 502'
10:32:51   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/vendor/': The requested URL returned error: 502'

https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/172239/console

Timo mentioned this on the recent T417536: ATS causes git fetches from Gerrit to fail with 502 responses. After moving Gerrit behind the ATS edge cache, git clones over https had some 502 errors. This was worked around by disabling connection reuse on the ATS cache (puppet #1239888).

This task be the same issue between Apache mod_proxy and the Gerrit Jetty web server. If mod_proxy timeout is larger than the Jetty one, Apache could start emitting a request to the backend which is being closed by Jetty. That has supposedly disappeared after turning KeepAlive on in mod_proxy (puppet 591304), I think that is because it makes Apache to refresh the connection on the Jetty side and prevent Jetty from terminating it.

In both case it seems to be a timeout race condition between the two servers :-\


Apache ProxyPass documentation mentions the keep alive probes are based on the OS. On gerrit2003 that would be:

$ grep . /proc/sys/net/ipv4/tcp_keepaliv*
/proc/sys/net/ipv4/tcp_keepalive_intvl:1
/proc/sys/net/ipv4/tcp_keepalive_probes:2
/proc/sys/net/ipv4/tcp_keepalive_time:300

That is wait for 300 seconds before sending the first probe, resent it every 1 second and if no response after 2 attempts then mark the connection broken.

Then 300 seconds sounds like a lot and I imagine Jetty terminates it earlier than that. Gerrit httpd.idleTimeout is set at 30 seconds. Thus I do not know how setting Keepalive=on in Apache mod_proxy managed to fix it up :-\

hashar reopened this task as Open.EditedFeb 20 2026, 11:23 PM

I am reopening this 6 years old task because Apache still sometime logs the errors:

  • [proxy_http:error] AH01102: error reading status line from remote server 127.0.0.1:8080
  • [proxy:error] AH00898: Error reading from remote server returned by <path>
  • Apache2 yielding a 502 to the client.
Recent context

We had a similar issue the last two weeks after moving Gerrit behind ATS. The connection reuse causes the frontend ATS to use a connection it considers fine (long timeout of 120s) when the backend side (Apache Gerrit) is about to terminate it due to a shorter timeout (5s). ATS can then send a packet over a connection the backend has closed by the time the packet arrives. A FIN is sent back and the frontend got a connection lost and yields a 502 to the client.

That was worked around by disabling connection reuse on the frontend ATS. Reenabling connection reuses requires raising the timeout on the backend Apache to a value higher than the frontend ATS (120s), that will be done via T417998.

History

Daniel made Apache mod_proxy to use keepalive which apparently mostly solved the issue. I believe the remaining errors are due to a similar race condition of timeout in the chain of Apachemod_proxyGerrit's Jetty.

In my previous comment, I expressed doubt about turning on keepalive on mod_proxy to have solved the issue, so I will dig into the timeout values.

Timeouts on the chain

Apache has KeepAliveTimeout 5 seconds

mod_proxy has retry=0 Keepalive=On. The documentation lists the defaults values:

enablereuseOnEnable reusing connections
keepaliveOnKEEP_ALIVE at frequency set by OS (see my previous comment)
timeoutProxyTimeoutSocket timeout in seconds. The number of seconds Apache httpd waits for data sent by / to the backend.

ProxyTimeout defaults to the value of TimeOut which is 300 seconds from the stock Debian config.

Gerrit's Jetty has the default 30 seconds idle timeout (httpd.idleTimeout = 30)

Hypothesis

mod_proxy can send a packet over a connection for as long as 300 seconds (Debian TimeOut value).

Because Gerrit's Jetty has a shorter 30 seconds timeout:

  • by the tiny short time it takes for mod_proxy to send the packet to Jetty, the connection might have reached its timeout on the Jetty side
  • Jetty receives a packet for a connection it has just closed
  • Jetty sends a FIN
  • mod_proxy_http rightfully considers the backend to not be readable and logs AH01102: error reading status line from remote server 127.0.0.1:8080
  • mod_proxy logs AH00898: Error reading from remote server returned by <path>
  • Apache sends back a 502 bad gateway to the client.
Action

Tune the timeouts! The mod_proxy timeout (300s) MUST have a shorter timeout than Gerrit Jetty's (30s).

mod_proxy must be set to a lower value in order to prevent it from reusing a connection Jetty's is about to terminate. That is a similar fix which is proposed for ATS/Apache2 (T417998).

Change #1241048 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: ProxyTimeout shorter than Jetty's idle timeout

https://gerrit.wikimedia.org/r/1241048

@hashar Thanks. I agree the frontend (apache) should have a lower timeout than Jetty. Also +1'ing the suggested code change.

Dzahn lowered the priority of this task from High to Medium.Feb 25 2026, 10:34 PM

Lowering prio a little because this was closed for 6 years and not directly related to the recent CDN move.

Change #1241048 merged by Arnaudb:

[operations/puppet@production] gerrit: ProxyTimeout shorter than Jetty's idle timeout

https://gerrit.wikimedia.org/r/1241048

The change has been merged, I'll apply it on spare, then replica and then on the primary

I've test the spare instance in "primary mode" (i.e. without the --replica flag and no replication), with siege and a random set of 30ish URLs: siege -i -f siege.txt -c5 -t1M -d0.1 --no-parser -A 'test: arnaudb' with a stable performance output, I'll apply the change to the replica instance. I've also tried to test with a higher level of concurrency but I quickly end up throttled by the CDN with 429 errors.

the change has been applied to all gerrit instances

Dzahn removed Dzahn as the assignee of this task.Mar 17 2026, 4:08 PM

Thanks for resolving this :)

Just to check, is this definitely now resolved again? Asking as 3 jobs at https://gerrit.wikimedia.org/r/1254998#message-168784c5816abc8f5e649b0120a2eb6755979c92 failed with stderr: 'error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502 earlier. (Unless that's a different problem?)

edit: maybe Gerrit was just having some problems (ie., not just for CI) at the time that those jobs were running?
edit 2: T420523: ProbeDown - Service gerrit2003:443 has failed probes (http_gerrit_tls_ip6)?

Just to check, is this definitely now resolved again?

I think this is not resolved, the errors you highlighted happened at time where Gerrit had no apparent problem. Reopening the task!

ABran-WMF claimed this task.

Hey @neriah, thanks for raising that issue. With {T420595} and  T420909: gerrit: Add Envoy in Gerrit's stack the situation should be improved.
I'm marking this task as resolved, please let us know if it does not, or it does not sufficiently, and feel free to reopen this task.

We might also beneficiate from a retry policy on these errors, HTH for that as well!

Starting March 19, 2026, 4 out of 8 runs of the daily https://releases-jenkins.wikimedia.org/job/MediaWiki%20branch%20and%20publish%20WMF%20single-version%20image/ job have failed with "error: RPC failed; HTTP 502 curl 22 The requested URL returned error: 502" while cloning from Gerrit.

This job runs at 01:00 UTC, in case that helps with debugging.

Note: If you get a 404 when visiting the job URL, click the "Sign In" button on the top left of the page, then try again.

Note: If you get a 404 when visiting the job URL, click the "Sign In" button on the top left of the page, then try again.

thanks for the link @dancy I've tried to sign in but still have a 404

Note: If you get a 404 when visiting the job URL, click the "Sign In" button on the top left of the page, then try again.

thanks for the link @dancy I've tried to sign in but still have a 404

Try now please.

Try now please.

sorry I missed your message on Friday, could I trigger the build to fetch the logs before they expire and test hypothesis?

Try now please.

sorry I missed your message on Friday, could I trigger the build to fetch the logs before they expire and test hypothesis?

Yes, you should have access to do that now.

Yes, you should have access to do that now.

Thanks! I'll follow up soon

follow up done in T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026 and T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs:
Envoy upstream timeout has been updated to allow longer git fetch commands, let me know if that issue is still present after that.

Change #1266231 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] gerrit: replace ProxyTimeout by ProxyPass ttl

https://gerrit.wikimedia.org/r/1266231

Change #1266231 merged by Arnaudb:

[operations/puppet@production] gerrit: replace ProxyTimeout by ProxyPass ttl

https://gerrit.wikimedia.org/r/1266231

hashar reopened this task as Open.EditedApr 1 2026, 1:50 PM

Summary

I reopened this task on Feb 21st with a thorough analysis of the problem (T246763#11637703).

I have made a patch to set Proxytimeout 25 in Apache (operations/puppet 1241048 ) which was deployed on March 17th and the task was then marked resolved (it did get checked with siege but the issue is rare enough that logs should be checked instead).

We have multiple proxies between end users and the Gerrit backend:
CDN (Apache Traffic Server)EnvoyApachemod_proxyJetty

Some of their keepalive/idle/connection timeout have been changed over the last couple weeks and might have caused some of the issue reported on this task. I haven't looked at them closely to check whether they were related. Notably Envoy seems to cut any request that takes more than 2 minutes which would match any long clone, that is to be resolved in another task.

The ProxyTimeout 25 I have introduced does cause Apache to stop waiting for Gerrit if it takes it more than 30 seconds to reply. That in turns caused some long requests (such as generating a git blame) to be aborted, that was filed as T421904. Removing the ProxyTimeout 25 resolves that.

And to solve this issue T246763 and prevent Apache mod_proxy to reuse an idling connection that the backend Gerrit Jetty might decided to close, I have set ttl=25 to the ProxyPass directive:

ttl can be used to avoid using a connection which is subject to closing because of the backend server's keep-alive timeout.

And that should fix this specific task which is a rare race condition between Apache mod proxy and Gerrit Jetty.

As mentioned in T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs:

[...] Please let me know if you still see these 502 errors.

I think we should update a single ticket for that issue, I suggest using T421827 for that matter. Closing that one, feel free to reopen if needed.

hashar claimed this task.

@ABran-WMF this T246763 is the canonical issue for the mod_proxyJetty path. I have reopened it because although it happens less frequently, it still hasn't been complete fixed. I gave some hypothesis when reopening at T246763#11637703

The last iteration is that in my first fix, I have misunderstood ProxyTimeout 25 which does cut a connection if the response did not arrives after a 25 seconds, when instead I wanted the connection to be shut if it has been unused (idling). The ProxyPass ttl option would do that (T246763#11778054). But I haven't verified and it needs some days of traffic for that.

Thus for this specific problem I would like to verify it is solved before claiming the task to have been resolved.