Page MenuHomePhabricator

Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos
Closed, ResolvedPublic

Description

Started seeing this today, example URLs from https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/570730#message-756e21dd45f85c38d9b21ff3e774d66ded724927:

12:45:36 INFO:zuul.Cloner.mediawiki/extensions/AbuseFilter:Updating origin remote in repo mediawiki/extensions/AbuseFilter to https://gerrit.wikimedia.org/r/mediawiki/extensions/AbuseFilter
12:45:37 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof
12:45:37 Traceback (most recent call last):
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 51, in __init__
12:45:37     self._ensure_cloned()
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
12:45:37     git.Repo.clone_from(self.remote_url, self.local_path)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 925, in clone_from
12:45:37     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 880, in _clone
12:45:37     finalize_process(proc, stderr=stderr)
12:45:37   File "/usr/lib/python3/dist-packages/git/util.py", line 341, in finalize_process
12:45:37     proc.wait(**kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/cmd.py", line 291, in wait
12:45:37     raise GitCommandError(self.args, status, errstr)
12:45:37 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
12:45:37   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof /workspace/src/extensions/AntiSpoof
12:45:37   stderr: 'Cloning into '/workspace/src/extensions/AntiSpoof'...
12:45:37 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof/': The requested URL returned error: 502
12:45:37 '
12:45:37 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/CiteThisPage
12:45:37 Traceback (most recent call last):
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 51, in __init__
12:45:37     self._ensure_cloned()
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
12:45:37     git.Repo.clone_from(self.remote_url, self.local_path)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 925, in clone_from
12:45:37     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 880, in _clone
12:45:37     finalize_process(proc, stderr=stderr)
12:45:37   File "/usr/lib/python3/dist-packages/git/util.py", line 341, in finalize_process
12:45:37     proc.wait(**kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/cmd.py", line 291, in wait
12:45:37     raise GitCommandError(self.args, status, errstr)
12:45:37 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
12:45:37   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/extensions/CiteThisPage /workspace/src/extensions/CiteThisPage
12:45:37   stderr: 'Cloning into '/workspace/src/extensions/CiteThisPage'...
12:45:37 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CiteThisPage/': The requested URL returned error: 502
12:45:37 '
12:45:37 ERROR:zuul.Repo:Unable to initialize repo for https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser
12:45:37 Traceback (most recent call last):
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 51, in __init__
12:45:37     self._ensure_cloned()
12:45:37   File "/usr/local/lib/python3.5/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
12:45:37     git.Repo.clone_from(self.remote_url, self.local_path)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 925, in clone_from
12:45:37     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 880, in _clone
12:45:37     finalize_process(proc, stderr=stderr)
12:45:37   File "/usr/lib/python3/dist-packages/git/util.py", line 341, in finalize_process
12:45:37     proc.wait(**kwargs)
12:45:37   File "/usr/lib/python3/dist-packages/git/cmd.py", line 291, in wait
12:45:37     raise GitCommandError(self.args, status, errstr)
12:45:37 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
12:45:37   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser /workspace/src/extensions/CheckUser
12:45:37   stderr: 'Cloning into '/workspace/src/extensions/CheckUser'...
12:45:37 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CheckUser/': The requested URL returned error: 502
12:45:37 '

Event Timeline

kostajh triaged this task as High priority.Mar 3 2020, 12:01 PM
kostajh created this task.

This has been happening occasionally for a while now, but it does seem to become more common lately.

Krinkle renamed this task from 502 errors when cloning repos for wmf-quibble-vendor-mysql-php72-docker and mwext-php72-phan-docker jobs to Jenkins job failing intermittently due to Gerrit HTTP 502 errors when cloning repos.Mar 10 2020, 1:44 AM
Krinkle subscribed.

Seen a few times a day or every other day it seems.

FWIW, I’ve just encountered this for the third time in the space of an hour.

Lucas_Werkmeister_WMDE raised the priority of this task from High to Unbreak Now!.Mar 10 2020, 12:19 PM

So far today, I’ve seen more gate-and-submit failures (all due to this bug) than succeeding builds. I think this is common enough to warrant UBN! status now.

Strangely, I'm not seeing the graphs in gerrit monitoring reporting a rise in error rate (this counts 4xx as well as 5xx so maybe it's lost in the noise). Logs are also not showing a massive increase in the error rate @Lucas_Werkmeister_WMDE did you run into this on one repo in particular? One job in particular?

Looking specifically for upload-pack errors that happened over https in the past 30 days (this is what zuul would be using)

$ zgrep -cP 'proxy-server/502.*upload-pack' gerrit.wikimedia.org.https.access.log*
gerrit.wikimedia.org.https.access.log:53
gerrit.wikimedia.org.https.access.log.1:51
gerrit.wikimedia.org.https.access.log.2.gz:20
gerrit.wikimedia.org.https.access.log.3.gz:22
gerrit.wikimedia.org.https.access.log.4.gz:93
gerrit.wikimedia.org.https.access.log.5.gz:54
gerrit.wikimedia.org.https.access.log.6.gz:63
gerrit.wikimedia.org.https.access.log.7.gz:103
gerrit.wikimedia.org.https.access.log.8.gz:71
gerrit.wikimedia.org.https.access.log.9.gz:8
gerrit.wikimedia.org.https.access.log.10.gz:15
gerrit.wikimedia.org.https.access.log.11.gz:32
gerrit.wikimedia.org.https.access.log.12.gz:24
gerrit.wikimedia.org.https.access.log.13.gz:24
gerrit.wikimedia.org.https.access.log.14.gz:38
gerrit.wikimedia.org.https.access.log.15.gz:53
gerrit.wikimedia.org.https.access.log.16.gz:1
gerrit.wikimedia.org.https.access.log.17.gz:9
gerrit.wikimedia.org.https.access.log.18.gz:22
gerrit.wikimedia.org.https.access.log.19.gz:6
gerrit.wikimedia.org.https.access.log.20.gz:8
gerrit.wikimedia.org.https.access.log.21.gz:2
gerrit.wikimedia.org.https.access.log.22.gz:0
gerrit.wikimedia.org.https.access.log.23.gz:0
gerrit.wikimedia.org.https.access.log.24.gz:1
gerrit.wikimedia.org.https.access.log.25.gz:1
gerrit.wikimedia.org.https.access.log.26.gz:1
gerrit.wikimedia.org.https.access.log.27.gz:1
gerrit.wikimedia.org.https.access.log.28.gz:1
gerrit.wikimedia.org.https.access.log.29.gz:1
gerrit.wikimedia.org.https.access.log.30.gz:0

Unsure if noisy or useful yet. Counts seem to have increased the past week+.

thcipriani lowered the priority of this task from Unbreak Now! to High.Mar 10 2020, 6:56 PM

Lowering priority since this hasn't happened once in the hour I've spent investigating according to the access logs.

Oddly though, I'm having difficulty correlating 502s in the access logs to any errors happening on the gerrit side. That is, I see a 502 in the access log, I dig through /var/log/gerrit/gerrit.log for the corresponding backend error and I can't find anything.

Of all the upload-pack errors in gerrit logs I see there are 2 causes:

Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms

And

Caused by: java.io.IOException: Connection reset by peer

That is, slow requests; however, if you look at the timestamps in the task description zuul tries to clone a couple repos and 1 second later errors out, so these aren't caused by request timeouts or slow connections.

We noticed that /var/log/syslog on gerrit1001 was empty and then that there were 2 rsyslogd processes. We killed one, stopped the other and let puppet restart it. Then logging appeared to start again.

So now we might see more if it happens again.

Still seen 30min ago at change 578402 (Jenkins job)

(UTC) 22:35:26   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/JsonConfig/': The requested URL returned error: 502'

OK, looking at Jenkins this happened today at 11:20 UTC

[thcipriani@contint1001 wmf-quibble-selenium-php72-docker]$ stat 25459/log 
...
Modify: 2020-03-11 11:20:41.431684891 +0000
...

I can correlate that to the apache logs on gerrit1001:

$ grep 'T11:20' gerrit.wikimedia.org.https.access.log | grep upload-pack | grep 'mediawiki/core' | grep -i 'proxy-server/5'
2020-03-11T11:20:34     3070820 X.X.X.X   proxy-server/502        341     GET     http://gerrit.wikimedia.org/r/mediawiki/core/info/refs?service=git-upload-pack -       text/html       -       -       git/2.11.0      C, *;q=0.9      -       -       -       X.X.X.X    -       -

Trying to find that in gerrit logs...

$ grep '11 11:' /var/log/gerrit/gerrit.log
[2020-03-11 11:09:13,034] [HTTP-829592] WARN  /r : Internal error during upload-pack from /srv/gerrit/git/operations/mediawiki-config.git
[2020-03-11 11:30:06,507] [HTTP-831439] WARN  /r : Internal error during upload-pack from /srv/gerrit/git/mediawiki/vagrant.git
[2020-03-11 11:38:30,689] [ReceiveCommits-32] WARN  com.google.gerrit.server.git.receive.ReceiveCommits : [mediawiki/core-1583926709402-9dc5fd94]Skipping change updates on mediawiki/core because ref update failed: REJECTED_OTHER_REASON no new changes
[2020-03-11 11:38:51,706] [HTTP-830856] WARN  /r : Internal error during upload-pack from /srv/gerrit/git/operations/mediawiki-config.git

Nothing at 11:20 and nothing about mediawiki/core

And in the logs the only upload-pack problems I see are with timeouts or client issues:

root@gerrit1001:/var/log/apache2# grep -A100 -P 'HTTP.*upload-pack.*mediawiki/core' /var/log/gerrit/gerrit.log | grep -i 'caused by'
Caused by: java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
Caused by: java.io.IOException: Connection reset by peer
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms

However, in the jenkins log gerrit took 5 seconds (at most) to return a 502; the apache log says it took 3070820 microseconds (3 seconds) -- it didn't hit a timeout.

My suspicion here is that apache is failing somehow but I'm not sure how to verify that just yet.

This hasn't happened today afaict:

$ grep -cP 'proxy-server/502.*upload-pack' /var/log/apache2/gerrit.wikimedia.org.https.access.log
0

Interestingly, I did find the problem I saw yesterday in an apache error log with a little more info:

$ grep -iP 'error.*refs' /var/log/apache2/gerrit.wikimedia.org.https.error.log.1 | grep -i core | grep '11 11:20'
[Wed Mar 11 11:20:37.702417 2020] [proxy:error] [pid 1597:tid 140482122340096] [client 172.16.X.X:60058] AH00898: Error reading from remote server returned by /r/mediawiki/core/info/refs

^ this is the same as the error above that caused a test failure:

$ grep 'T11:20' gerrit.wikimedia.org.https.access.log | grep upload-pack | grep 'mediawiki/core' | grep -i 'proxy-server/5'
2020-03-11T11:20:34     3070820 X.X.X.X   proxy-server/502        341     GET     http://gerrit.wikimedia.org/r/mediawiki/core/info/refs?service=git-upload-pack -       text/html       -       -       git/2.11.0      C, *;q=0.9      -       -       -       172.16.X.X    -    

Some googling seems to have turned up some mod_proxy environment variables related to keepalive/connection pools: https://httpd.apache.org/docs/current/mod/mod_proxy_http.html

Seems like there is a race condition with connection pools in mod_proxy, "the "proxy: error reading status line from remote server" error message caused by the race condition that the backend server closed the pooled connection after the connection check by the proxy and before data sent by the proxy reached the backend."

Although I don't understand why this problem started this past week, I could believe that this is at least a problem (if not the problem). http keepalives and mod_proxy connection pools are at least making the interaction between apache and jetty more complicated to reason about.

Change 579601 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Gerrit: apache proxy not pooled

https://gerrit.wikimedia.org/r/579601

I have another task about Gerrit being unable to reach the MariaDB SQL server. Apparently due to the connection being lost / arbitrarily closed (T247591: Investigate Gerrit troubles to reach the MariaDB database). The two issues might be sharing a common low level cause. We would have to compare the occurrences of each issues.

This stopped abruptly 21 days ago:

$ zgrep -cP 'proxy-server/502.*upload-pack' /var/log/apache2/gerrit.wikimedia.org.https.access.log*
/var/log/apache2/gerrit.wikimedia.org.https.access.log:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.1:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.2.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.3.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.4.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.5.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.6.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.7.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.8.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.9.gz:3
/var/log/apache2/gerrit.wikimedia.org.https.access.log.10.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.11.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.12.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.13.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.14.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.15.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.16.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.17.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.18.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.19.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.20.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.21.gz:46
/var/log/apache2/gerrit.wikimedia.org.https.access.log.22.gz:65
/var/log/apache2/gerrit.wikimedia.org.https.access.log.23.gz:51
/var/log/apache2/gerrit.wikimedia.org.https.access.log.24.gz:20
/var/log/apache2/gerrit.wikimedia.org.https.access.log.25.gz:22
/var/log/apache2/gerrit.wikimedia.org.https.access.log.26.gz:93
/var/log/apache2/gerrit.wikimedia.org.https.access.log.27.gz:54
/var/log/apache2/gerrit.wikimedia.org.https.access.log.28.gz:63
/var/log/apache2/gerrit.wikimedia.org.https.access.log.29.gz:103
/var/log/apache2/gerrit.wikimedia.org.https.access.log.30.gz:71

21 days ago was:

$ date --utc --date='21 days ago'
Wed 11 Mar 2020 03:21:36 PM UTC

Based on the comment at T247591#5975312 and the date I'd say this is correlated with the DB connection issues.

This may have had several causes. I noticed the the problem slowed down considerably when I updated the CI git caches (leading me to write https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/579602/ which is now active on the integration cluster).

Boldly (and somewhat unsatisfyingly) closing this task as this problem is no longer happening.

Change 579601 abandoned by Thcipriani:
Gerrit: apache proxy not pooled

Reason:
T246763 resolved.

https://gerrit.wikimedia.org/r/579601

Won't re-open just yet, but I did encounter this again yesterday at https://gerrit.wikimedia.org/r/588432 / Jenkins build

fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/TimedMediaHandler/': The requested URL returned error: 502
Finished: FAILURE

In T247591, I had the DBA team to look at the database serving Gerrit (m2-master). We have ruled it out, or at least it is not more occurring. The 502 remains a mystery, that might be some pool overflowing :-\

Won't re-open just yet, but I did encounter this again yesterday at https://gerrit.wikimedia.org/r/588432 / Jenkins build

fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/TimedMediaHandler/': The requested URL returned error: 502
Finished: FAILURE

That one had two other requests rejected with the exact same messages. From Apache logs:

[Mon Apr 13 17:32:29.757344 2020] [proxy_http:error]
(20014)Internal error (specific information not available): [client 172.16.7.210:54282] AH01102: error reading status line from remote server 127.0.0.1:8080
[client 172.16.7.210:54282] AH00898: Error reading from remote server returned by /r/mediawiki/extensions/TimedMediaHandler/info/refs

The client IP is the instance the job ran on.

I looked in April 13th Apache log for any errors, they happen in short bursts. Filtering for AH00898 from above:

10:58:35
11:31:04
11:56:15
12:01:01
15:24:03
15:24:03
15:24:03
16:32:45
16:32:45
16:32:45
16:33:33
17:08:03
17:08:04
17:08:48
17:12:18
17:12:18
17:12:18
17:22:06
17:32:29
17:32:29
17:32:29
18:36:37
18:36:37
18:36:37
18:54:09
19:00:38
19:00:38
19:00:38
19:00:38
19:15:01
19:19:35
19:19:35
19:28:43
19:55:47
20:10:34
20:10:34
20:56:36
23:10:36

My guess is that something is spiking every once in a while and cause something on the Gerrit side to stop responding.

In Gerrit we have:

gerrit.config
[httpd]
    listenUrl = proxy-https://127.0.0.1:8080/r/
    maxQueued = 500
    minThreads = 10
    maxThreads = 60
    maxWait = 2 min

Maybe there is not enough thread or the queue get filled. We would need to find out whether there is a log bucket available in Gerrit to diagnose those.

this started happening again as of a few days ago:

$ zgrep -cP 'proxy-server/502.*upload-pack' /var/log/apache2/gerrit.wikimedia.org.https.access.log*
/var/log/apache2/gerrit.wikimedia.org.https.access.log:20
/var/log/apache2/gerrit.wikimedia.org.https.access.log.1:32
/var/log/apache2/gerrit.wikimedia.org.https.access.log.2.gz:17
/var/log/apache2/gerrit.wikimedia.org.https.access.log.3.gz:4
/var/log/apache2/gerrit.wikimedia.org.https.access.log.4.gz:8
/var/log/apache2/gerrit.wikimedia.org.https.access.log.5.gz:12
/var/log/apache2/gerrit.wikimedia.org.https.access.log.6.gz:5
/var/log/apache2/gerrit.wikimedia.org.https.access.log.7.gz:3
/var/log/apache2/gerrit.wikimedia.org.https.access.log.8.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.9.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.10.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.11.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.12.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.13.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.14.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.15.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.16.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.17.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.18.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.19.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.20.gz:2

That one had two other requests rejected with the exact same messages. From Apache logs:

[Mon Apr 13 17:32:29.757344 2020] [proxy_http:error]
(20014)Internal error (specific information not available): [client 172.16.7.210:54282] AH01102: error reading status line from remote server 127.0.0.1:8080
[client 172.16.7.210:54282] AH00898: Error reading from remote server returned by /r/mediawiki/extensions/TimedMediaHandler/info/refs

This specific error is in the mod_proxy docs: https://httpd.apache.org/docs/current/mod/mod_proxy_http.html

If this variable is set, no pooled connection will be reused if the client request is the initial request on the frontend connection. This avoids the "proxy: error reading status line from remote server" error message caused by the race condition that the backend server closed the pooled connection after the connection check by the proxy and before data sent by the proxy reached the backend. It has to be kept in mind that setting this variable downgrades performance, especially with HTTP/1.0 clients.

The reason I suspected apache earlier is because there were no corresponding errors in the gerrit log related to these 502s.

Change 579601 restored by Thcipriani:
Gerrit: apache proxy not pooled

https://gerrit.wikimedia.org/r/579601

Reedy renamed this task from Jenkins job failing intermittently due to Gerrit HTTP 502 errors when cloning repos to Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos.Apr 19 2020, 4:23 PM
Reedy subscribed.
reedy@deploy1001:/srv/mediawiki-staging$ git pull
fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/mediawiki-config/': The requested URL returned error: 502
reedy@deploy1001:/srv/mediawiki-staging$ git pull
fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/mediawiki-config/': The requested URL returned error: 502

I see this in the apache logs:

$ zgrep -P 'proxy-server/502.*upload-pack' /var/log/apache2/gerrit.wikimedia.org.https.access.log.1 | grep -i mediawiki-config
2020-04-19T16:17:22     204716  <IPv6>      proxy-server/502        341     GET     http://gerrit.wikimedia.org/r/operations/mediawiki-config/info/refs?service=git-upload-pack    -       text/html       -       -       git/2.11.0      en-US, *;q=0.9  -       -       -       <IPv6> 

I see this in the apache error logs:

$ zgrep -P 'Apr\s+19\s+16:17' /var/log/apache2/gerrit.wikimedia.org.https.error.log.1
[Sun Apr 19 16:17:22.401452 2020] [proxy_http:error] [pid 30708:tid 140482634032896] (20014)Internal error (specific information not available): [client <IPv6>] AH01102: error reading status line from remote server 127.0.0.1:8080
[Sun Apr 19 16:17:22.401502 2020] [proxy:error] [pid 30708:tid 140482634032896] [client <IPv6>] AH00898: Error reading from remote server returned by /r/operations/mediawiki-config/info/refs

I don't see anything that minute from gerrit:

$ zgrep -P '2020-04-19\s+16:17' /var/log/gerrit/gerrit.log.2020-04-19
$ 

Most errors reported in apache status line errors:

$ grep 'proxy_http:error' /var/log/apache2/gerrit.wikimedia.org.https.error.log | grep -vc 'status line'
5
$ grep 'proxy_http:error' /var/log/apache2/gerrit.wikimedia.org.https.error.log | grep -c 'status line'
59

Change 591304 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: add Keepalive=on to ProxyPass config lines

https://gerrit.wikimedia.org/r/591304

I found https://serverfault.com/questions/185894/proxy-error-502-reason-error-reading-from-remote-server-with-apache-2-2-3-de where the reporter is also using httpd to (reverse)-proxy to jetty and he ran into the same 502s - error reading from remote server".

He also mentions trying the exact same changes as suggested by Tyler above (SetEnv proxy-initial-not-pooled, keepalive Off ) but says they did not help.

Then he goes on saying he solved the issue by adding "keepalive=On" to the ProxyPass lines.

Since, as described above, we want to prevent that the backend server closes the pooled connection early, before data sent by the proxy reached the backend, it does seem to make sense to me that we want Keepalive to be turned ON and not off.

That made me create the patch above. Does it make sense?

Change 591304 merged by Dzahn:
[operations/puppet@production] gerrit: add Keepalive=on to ProxyPass config lines

https://gerrit.wikimedia.org/r/591304

Deployed the change above and restarted httpd on gerrit1001.

Deployed the change above and restarted httpd on gerrit1001.

Awesome! Thanks, I'll monitor traffic.

thcipriani reassigned this task from thcipriani to Dzahn.

502s seem to have dropped off quite a bit. Optimistically closing once again.

$ zgrep -cP 'proxy-server/502.*upload-pack' /var/log/apache2/gerrit.wikimedia.org.https.access.log*
/var/log/apache2/gerrit.wikimedia.org.https.access.log:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.1:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.2.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.3.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.4.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.5.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.6.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.7.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.8.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.9.gz:0
/var/log/apache2/gerrit.wikimedia.org.https.access.log.10.gz:63
/var/log/apache2/gerrit.wikimedia.org.https.access.log.11.gz:51
/var/log/apache2/gerrit.wikimedia.org.https.access.log.12.gz:9
/var/log/apache2/gerrit.wikimedia.org.https.access.log.13.gz:23
/var/log/apache2/gerrit.wikimedia.org.https.access.log.14.gz:28
/var/log/apache2/gerrit.wikimedia.org.https.access.log.15.gz:33
/var/log/apache2/gerrit.wikimedia.org.https.access.log.16.gz:26
/var/log/apache2/gerrit.wikimedia.org.https.access.log.17.gz:32
/var/log/apache2/gerrit.wikimedia.org.https.access.log.18.gz:17
/var/log/apache2/gerrit.wikimedia.org.https.access.log.19.gz:4
/var/log/apache2/gerrit.wikimedia.org.https.access.log.20.gz:8
/var/log/apache2/gerrit.wikimedia.org.https.access.log.21.gz:12
/var/log/apache2/gerrit.wikimedia.org.https.access.log.22.gz:5
/var/log/apache2/gerrit.wikimedia.org.https.access.log.23.gz:3
/var/log/apache2/gerrit.wikimedia.org.https.access.log.24.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.25.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.26.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.27.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.28.gz:1
/var/log/apache2/gerrit.wikimedia.org.https.access.log.29.gz:2
/var/log/apache2/gerrit.wikimedia.org.https.access.log.30.gz:1

Change 579601 abandoned by Thcipriani:
Gerrit: apache proxy not pooled

Reason:
superseded by https://gerrit.wikimedia.org/r/591304

https://gerrit.wikimedia.org/r/579601

Now they’re HTTP 500 errors (presumably due to T254158: Gerrit 3.2 upgrade): Jenkins build

12:29:30 + git fetch --quiet --update-head-ok --depth 2 git://contint2001.wikimedia.org/mediawiki/extensions/Wikibase +refs/zuul/master/Z7f6f22c8417c4636b6c2caec7666f5c1:refs/zuul/master/Z7f6f22c8417c4636b6c2caec7666f5c1
12:29:31 + git checkout -B master FETCH_HEAD
12:29:31 Reset branch 'master'
12:29:31 + set +x
12:29:31 + git submodule --quiet update --init --recursive
12:29:31 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-data-model.git/': The requested URL returned error: 500
12:29:31 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-data-model.git' into submodule path '/src/view/lib/wikibase-data-model' failed
12:29:31 Failed to clone 'view/lib/wikibase-data-model'. Retry scheduled
12:29:31 fatal: unable to access 'https://phabricator.wikimedia.org/source/datavalues-javascript.git/': The requested URL returned error: 500
12:29:31 fatal: clone of 'https://phabricator.wikimedia.org/source/datavalues-javascript.git' into submodule path '/src/view/lib/wikibase-data-values' failed
12:29:31 Failed to clone 'view/lib/wikibase-data-values'. Retry scheduled
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-serialization.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-serialization.git' into submodule path '/src/view/lib/wikibase-serialization' failed
12:29:32 Failed to clone 'view/lib/wikibase-serialization'. Retry scheduled
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-termbox/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-termbox' into submodule path '/src/view/lib/wikibase-termbox' failed
12:29:32 Failed to clone 'view/lib/wikibase-termbox'. Retry scheduled
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/datavalues-javascript.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/datavalues-javascript.git' into submodule path '/src/view/lib/wikibase-data-values' failed
12:29:32 Failed to clone 'view/lib/wikibase-data-values' a second time, aborting
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-data-model.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-data-model.git' into submodule path '/src/view/lib/wikibase-data-model' failed
12:29:32 Failed to clone 'view/lib/wikibase-data-model' a second time, aborting
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-serialization.git/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-serialization.git' into submodule path '/src/view/lib/wikibase-serialization' failed
12:29:32 Failed to clone 'view/lib/wikibase-serialization' a second time, aborting
12:29:32 fatal: unable to access 'https://phabricator.wikimedia.org/source/wikibase-termbox/': The requested URL returned error: 500
12:29:32 fatal: clone of 'https://phabricator.wikimedia.org/source/wikibase-termbox' into submodule path '/src/view/lib/wikibase-termbox' failed
12:29:32 Failed to clone 'view/lib/wikibase-termbox' a second time, aborting
12:29:32 Build step 'Execute shell' marked build as failure
12:29:32 Archiving artifacts

@Lucas_Werkmeister_WMDE those errors come from cloning from Phabricator. I would recommend to instead use the canonical Gerrit URL.

Still seen:
https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/53486/console

02:18:34 fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/FileImporter/': The requested URL returned error: 502

Still seen:
https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/78885/console

15:40:27   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/skins/Vector/': The requested URL returned error: 502'
hashar added a subscriber: zeljkofilipin.

@zeljkofilipin reported a similar one using:

GIT_TRACE_CURL=1 GIT_TRACE_CURL_NO_DATA=1 git clone --depth=1 https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki

1~$ GIT_TRACE_CURL=1 GIT_TRACE_CURL_NO_DATA=1 git clone --depth=1 https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki
2Cloning into 'mediawiki'...
318:11:59.377481 http.c:731 == Info: Couldn't find host gerrit.wikimedia.org in the .netrc file; using defaults
418:11:59.535149 http.c:731 == Info: Trying 208.80.154.137...
518:11:59.535178 http.c:731 == Info: TCP_NODELAY set
618:11:59.679604 http.c:731 == Info: Connected to gerrit.wikimedia.org (208.80.154.137) port 443 (#0)
718:11:59.679739 http.c:731 == Info: ALPN, offering h2
818:11:59.679748 http.c:731 == Info: ALPN, offering http/1.1
918:11:59.687143 http.c:731 == Info: successfully set certificate verify locations:
1018:11:59.687156 http.c:731 == Info: CAfile: /etc/ssl/cert.pem
11 CApath: none
1218:11:59.687245 http.c:731 == Info: TLSv1.2 (OUT), TLS handshake, Client hello (1):
1318:11:59.828884 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Server hello (2):
1418:11:59.829086 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Certificate (11):
1518:11:59.849901 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Server key exchange (12):
1618:11:59.850401 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Server finished (14):
1718:11:59.850696 http.c:731 == Info: TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
1818:11:59.850705 http.c:731 == Info: TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
1918:11:59.850742 http.c:731 == Info: TLSv1.2 (OUT), TLS handshake, Finished (20):
2018:11:59.994151 http.c:731 == Info: TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
2118:11:59.994282 http.c:731 == Info: TLSv1.2 (IN), TLS handshake, Finished (20):
2218:11:59.994311 http.c:731 == Info: SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
2318:11:59.994325 http.c:731 == Info: ALPN, server accepted to use http/1.1
2418:11:59.994331 http.c:731 == Info: Server certificate:
2518:11:59.994342 http.c:731 == Info: subject: CN=gerrit.wikimedia.org
2618:11:59.994348 http.c:731 == Info: start date: Aug 31 17:00:16 2021 GMT
2718:11:59.994353 http.c:731 == Info: expire date: Nov 29 17:00:15 2021 GMT
2818:11:59.994369 http.c:731 == Info: subjectAltName: host "gerrit.wikimedia.org" matched cert's "gerrit.wikimedia.org"
2918:11:59.994379 http.c:731 == Info: issuer: C=US; O=Let's Encrypt; CN=R3
3018:11:59.994383 http.c:731 == Info: SSL certificate verify ok.
3118:11:59.994480 http.c:678 => Send header, 0000000228 bytes (0x000000e4)
3218:11:59.994489 http.c:690 => Send header: GET /r/mediawiki/core.git/info/refs?service=git-upload-pack HTTP/1.1
3318:11:59.994493 http.c:690 => Send header: Host: gerrit.wikimedia.org
3418:11:59.994496 http.c:690 => Send header: User-Agent: git/2.24.3 (Apple Git-128)
3518:11:59.994500 http.c:690 => Send header: Accept: */*
3618:11:59.994503 http.c:690 => Send header: Accept-Encoding: deflate, gzip
3718:11:59.994506 http.c:690 => Send header: Pragma: no-cache
3818:11:59.994509 http.c:690 => Send header: Git-Protocol: version=2
3918:11:59.994512 http.c:690 => Send header:
4018:12:00.608040 http.c:678 <= Recv header, 0000000017 bytes (0x00000011)
4118:12:00.608081 http.c:690 <= Recv header: HTTP/1.1 200 OK
4218:12:00.608088 http.c:678 <= Recv header, 0000000037 bytes (0x00000025)
4318:12:00.608092 http.c:690 <= Recv header: Date: Fri, 08 Oct 2021 16:12:00 GMT
4418:12:00.608098 http.c:678 <= Recv header, 0000000016 bytes (0x00000010)
4518:12:00.608102 http.c:690 <= Recv header: Server: Apache
4618:12:00.608106 http.c:678 <= Recv header, 0000000074 bytes (0x0000004a)
4718:12:00.608118 http.c:690 <= Recv header: Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
4818:12:00.608123 http.c:678 <= Recv header, 0000000023 bytes (0x00000017)
4918:12:00.608127 http.c:690 <= Recv header: X-Frame-Options: DENY
5018:12:00.608131 http.c:678 <= Recv header, 0000000040 bytes (0x00000028)
5118:12:00.608134 http.c:690 <= Recv header: Expires: Tue, 01 Jan 1980 00:00:00 GMT
5218:12:00.608138 http.c:678 <= Recv header, 0000000018 bytes (0x00000012)
5318:12:00.608142 http.c:690 <= Recv header: Pragma: no-cache
5418:12:00.608146 http.c:678 <= Recv header, 0000000053 bytes (0x00000035)
5518:12:00.608150 http.c:690 <= Recv header: Cache-Control: no-cache, max-age=0, must-revalidate
5618:12:00.608154 http.c:678 <= Recv header, 0000000059 bytes (0x0000003b)
5718:12:00.608158 http.c:690 <= Recv header: Content-Type: application/x-git-upload-pack-advertisement
5818:12:00.608163 http.c:678 <= Recv header, 0000000020 bytes (0x00000014)
5918:12:00.608166 http.c:690 <= Recv header: Content-Length: 62
6018:12:00.608170 http.c:678 <= Recv header, 0000000043 bytes (0x0000002b)
6118:12:00.608174 http.c:690 <= Recv header: Backend-Timing: D=1688 t=1633709520077989
6218:12:00.608178 http.c:678 <= Recv header, 0000000002 bytes (0x00000002)
6318:12:00.608182 http.c:690 <= Recv header:
6418:12:00.608197 http.c:731 == Info: Connection #0 to host gerrit.wikimedia.org left intact
6518:12:00.608388 http.c:731 == Info: Couldn't find host gerrit.wikimedia.org in the .netrc file; using defaults
6618:12:00.608424 http.c:731 == Info: Found bundle for host gerrit.wikimedia.org: 0x7ff7d6d04a80 [can pipeline]
6718:12:00.608429 http.c:731 == Info: Could pipeline, but not asked to!
6818:12:00.608444 http.c:731 == Info: Re-using existing connection! (#0) with host gerrit.wikimedia.org
6918:12:00.608451 http.c:731 == Info: Connected to gerrit.wikimedia.org (208.80.154.137) port 443 (#0)
7018:12:00.608516 http.c:678 => Send header, 0000000300 bytes (0x0000012c)
7118:12:00.608524 http.c:690 => Send header: POST /r/mediawiki/core.git/git-upload-pack HTTP/1.1
7218:12:00.608530 http.c:690 => Send header: Host: gerrit.wikimedia.org
7318:12:00.608534 http.c:690 => Send header: User-Agent: git/2.24.3 (Apple Git-128)
7418:12:00.608537 http.c:690 => Send header: Accept-Encoding: deflate, gzip
7518:12:00.608541 http.c:690 => Send header: Content-Type: application/x-git-upload-pack-request
7618:12:00.608545 http.c:690 => Send header: Accept: application/x-git-upload-pack-result
7718:12:00.608548 http.c:690 => Send header: Git-Protocol: version=2
7818:12:00.608551 http.c:690 => Send header: Content-Length: 122
7918:12:00.608554 http.c:690 => Send header:
8018:12:00.608560 http.c:731 == Info: upload completely sent off: 122 out of 122 bytes
8118:12:00.897137 http.c:678 <= Recv header, 0000000017 bytes (0x00000011)
8218:12:00.897157 http.c:690 <= Recv header: HTTP/1.1 200 OK
8318:12:00.897163 http.c:678 <= Recv header, 0000000037 bytes (0x00000025)
8418:12:00.897167 http.c:690 <= Recv header: Date: Fri, 08 Oct 2021 16:12:00 GMT
8518:12:00.897172 http.c:678 <= Recv header, 0000000016 bytes (0x00000010)
8618:12:00.897176 http.c:690 <= Recv header: Server: Apache
8718:12:00.897180 http.c:678 <= Recv header, 0000000074 bytes (0x0000004a)
8818:12:00.897184 http.c:690 <= Recv header: Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
8918:12:00.897188 http.c:678 <= Recv header, 0000000023 bytes (0x00000017)
9018:12:00.897192 http.c:690 <= Recv header: X-Frame-Options: DENY
9118:12:00.897206 http.c:678 <= Recv header, 0000000040 bytes (0x00000028)
9218:12:00.897210 http.c:690 <= Recv header: Expires: Tue, 01 Jan 1980 00:00:00 GMT
9318:12:00.897224 http.c:678 <= Recv header, 0000000018 bytes (0x00000012)
9418:12:00.897237 http.c:690 <= Recv header: Pragma: no-cache
9518:12:00.897243 http.c:678 <= Recv header, 0000000053 bytes (0x00000035)
9618:12:00.897247 http.c:690 <= Recv header: Cache-Control: no-cache, max-age=0, must-revalidate
9718:12:00.897252 http.c:678 <= Recv header, 0000000052 bytes (0x00000034)
9818:12:00.897256 http.c:690 <= Recv header: Content-Type: application/x-git-upload-pack-result
9918:12:00.897260 http.c:678 <= Recv header, 0000000045 bytes (0x0000002d)
10018:12:00.897264 http.c:690 <= Recv header: Backend-Timing: D=143034 t=1633709520688117
10118:12:00.897269 http.c:678 <= Recv header, 0000000028 bytes (0x0000001c)
10218:12:00.897281 http.c:690 <= Recv header: Transfer-Encoding: chunked
10318:12:00.897285 http.c:678 <= Recv header, 0000000002 bytes (0x00000002)
10418:12:00.897288 http.c:690 <= Recv header:
10518:12:01.181998 http.c:731 == Info: Connection #0 to host gerrit.wikimedia.org left intact
10618:12:01.186693 http.c:731 == Info: Couldn't find host gerrit.wikimedia.org in the .netrc file; using defaults
10718:12:01.186712 http.c:731 == Info: Found bundle for host gerrit.wikimedia.org: 0x7ff7d6d04a80 [can pipeline]
10818:12:01.186717 http.c:731 == Info: Could pipeline, but not asked to!
10918:12:01.186735 http.c:731 == Info: Re-using existing connection! (#0) with host gerrit.wikimedia.org
11018:12:01.186743 http.c:731 == Info: Connected to gerrit.wikimedia.org (208.80.154.137) port 443 (#0)
11118:12:01.186796 http.c:678 => Send header, 0000000300 bytes (0x0000012c)
11218:12:01.186805 http.c:690 => Send header: POST /r/mediawiki/core.git/git-upload-pack HTTP/1.1
11318:12:01.186809 http.c:690 => Send header: Host: gerrit.wikimedia.org
11418:12:01.186812 http.c:690 => Send header: User-Agent: git/2.24.3 (Apple Git-128)
11518:12:01.186824 http.c:690 => Send header: Accept-Encoding: deflate, gzip
11618:12:01.186828 http.c:690 => Send header: Content-Type: application/x-git-upload-pack-request
11718:12:01.186831 http.c:690 => Send header: Accept: application/x-git-upload-pack-result
11818:12:01.186835 http.c:690 => Send header: Git-Protocol: version=2
11918:12:01.186838 http.c:690 => Send header: Content-Length: 187
12018:12:01.186841 http.c:690 => Send header:
12118:12:01.186848 http.c:731 == Info: upload completely sent off: 187 out of 187 bytes
12218:12:01.825005 http.c:678 <= Recv header, 0000000017 bytes (0x00000011)
12318:12:01.825022 http.c:690 <= Recv header: HTTP/1.1 200 OK
12418:12:01.825029 http.c:678 <= Recv header, 0000000037 bytes (0x00000025)
12518:12:01.825033 http.c:690 <= Recv header: Date: Fri, 08 Oct 2021 16:12:01 GMT
12618:12:01.825038 http.c:678 <= Recv header, 0000000016 bytes (0x00000010)
12718:12:01.825042 http.c:690 <= Recv header: Server: Apache
12818:12:01.825046 http.c:678 <= Recv header, 0000000074 bytes (0x0000004a)
12918:12:01.825056 http.c:690 <= Recv header: Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
13018:12:01.825069 http.c:678 <= Recv header, 0000000023 bytes (0x00000017)
13118:12:01.825074 http.c:690 <= Recv header: X-Frame-Options: DENY
13218:12:01.825078 http.c:678 <= Recv header, 0000000040 bytes (0x00000028)
13318:12:01.825082 http.c:690 <= Recv header: Expires: Tue, 01 Jan 1980 00:00:00 GMT
13418:12:01.825087 http.c:678 <= Recv header, 0000000018 bytes (0x00000012)
13518:12:01.825090 http.c:690 <= Recv header: Pragma: no-cache
13618:12:01.825095 http.c:678 <= Recv header, 0000000053 bytes (0x00000035)
13718:12:01.825099 http.c:690 <= Recv header: Cache-Control: no-cache, max-age=0, must-revalidate
13818:12:01.825104 http.c:678 <= Recv header, 0000000052 bytes (0x00000034)
13918:12:01.825107 http.c:690 <= Recv header: Content-Type: application/x-git-upload-pack-result
14018:12:01.825112 http.c:678 <= Recv header, 0000000045 bytes (0x0000002d)
14118:12:01.825115 http.c:690 <= Recv header: Backend-Timing: D=495370 t=1633709521266171
14218:12:01.825120 http.c:678 <= Recv header, 0000000028 bytes (0x0000001c)
14318:12:01.825124 http.c:690 <= Recv header: Transfer-Encoding: chunked
14418:12:01.825128 http.c:678 <= Recv header, 0000000002 bytes (0x00000002)
14518:12:01.825140 http.c:690 <= Recv header:
146remote: Counting objects: 8957, done
147remote: Finding sources: 100% (8957/8957)
148remote: Getting sizes: 100% (7335/7335)
149remote: Compressing objects: 99% (93223/93224)
15018:17:13.944725 http.c:731 == Info: TLSv1.2 (IN), TLS alert, close notify (256):
15118:17:13.944745 http.c:731 == Info: transfer closed with outstanding read data remaining
15218:17:13.944760 http.c:731 == Info: Closing connection 0
15318:17:13.944787 http.c:731 == Info: TLSv1.2 (OUT), TLS alert, close notify (256):
154error: RPC failed; curl 18 transfer closed with outstanding read data remaining
155fatal: the remote end hung up unexpectedly
156fatal: early EOF
157fatal: index-pack failed

Which on the backend side is:

2021-10-08T15:40:09.554Z] [HTTP POST /r/mediawiki/core.git/git-upload-pack (N/A from x.x.x.x)] WARN  org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /srv/gerrit/git/mediawiki/core.git
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
        at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234)
        at org.eclipse.jetty.server.HttpOutput.channelWrite(HttpOutput.java:269)
        at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:861)
        at net.bull.javamelody.internal.web.FilterServletOutputStream.write(FilterServletOutputStream.java:88)
        at net.bull.javamelody.internal.web.CounterResponseStream.write(CounterResponseStream.java:81)
        at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:123)
        at org.eclipse.jgit.util.TemporaryBuffer.write(TemporaryBuffer.java:128)
        at org.eclipse.jgit.transport.UploadPack$ResponseBufferedOutputStream.write(UploadPack.java:2371)
        at org.eclipse.jgit.transport.SideBandOutputStream.writeBuffer(SideBandOutputStream.java:141)
        at org.eclipse.jgit.transport.SideBandOutputStream.write(SideBandOutputStream.java:120)
        at org.eclipse.jgit.internal.storage.pack.PackOutputStream.write(PackOutputStream.java:99)
        at java.base/java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
        at java.base/java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
        at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
        at org.eclipse.jgit.lib.ObjectLoader.copyTo(ObjectLoader.java:237)
...

The related Gerrit setting is https://gerrit.wikimedia.org/r/Documentation/config-gerrit.html#httpd.idleTimeout

httpd.idleTimeout

Maximum idle time for a connection, which roughly translates to the TCP socket SO_TIMEOUT.

This value is interpreted as the maximum time between some progress being made on the connection. So if a single byte is read or written, then the timeout is reset.

The max idle time is applied:

  • When waiting for a new message to be received on a connection
  • When waiting for a new message to be sent on a connection

By default, 30 seconds.

The issue was due to the proxy config and the other error Zeljko is encountered does not seem related. I will file another task.

15:57:05 INFO:zuul.Cloner.mediawiki/extensions/CirrusSearch:Creating repo mediawiki/extensions/CirrusSearch from upstream https://gerrit.wikimedia.org/r/mediawiki/extensions/CirrusSearch
...
15:57:11 git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
15:57:11   cmdline: git fetch --tags -v origin
15:57:11   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/CirrusSearch/': The requested URL returned error: 502'

From apache around that time:

/var/log/apache2/gerrit.wikimedia.org.https.error.log
[Thu Dec 16 15:57:09.680632 2021] [proxy_http:error] [pid 8030:tid 139643982952192] (20014)Internal error (specific information not available): [client 172.16.6.197:54922] AH01102: error reading status line from remote server 127.0.0.1:80
80
[Thu Dec 16 15:57:09.680713 2021] [proxy:error] [pid 8030:tid 139643982952192] [client 172.16.6.197:54922] AH00898: Error reading from remote server returned by /r/mediawiki/extensions/CirrusSearch/info/refs

That is the IP of the instance that ran the build and the error does refer to CirrusSearch which failed to clone. The same happened for AbuseFilter at 16:10:33.

There a couple errors before that:

[2021-12-16T15:50:18.859Z] [HTTP POST /r/mediawiki/vendor/git-upload-pack (N/A from 172.16.3.126)] WARN org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /srv/gerrit/git/mediawiki/vendor.git
org.eclipse.jetty.io.EofException
[2021-12-16T15:50:19.197Z] [HTTP POST /r/mediawiki/vendor/git-upload-pack (N/A from 172.16.7.210)] WARN org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /srv/gerrit/git/mediawiki/vendor.git
org.eclipse.jetty.io.EofException

But I think that is due to build being aborted.

Maybe there is an issue with the Jetty pool but I don't see anything on https://grafana.wikimedia.org/d/fe848RoMz/http-jetty it had plenty of idle slot.

For Apache the pool is at https://grafana.wikimedia.org/d/L0-l1o0Mz/apache?orgId=1&refresh=1m&var-host=gerrit1001&var-port=9117 and I dont see much either there

I don't have any idea as to how to debug Apache mod_proxy and the AH00898 error :-\

https://integration.wikimedia.org/ci/job/mwext-phpunit-coverage-docker-publish/52844/console is another instance.

05:07:50   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/AntiSpoof/': The requested URL returned error: 502'
10:32:51   stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/vendor/': The requested URL returned error: 502'

https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/172239/console