Page MenuHomePhabricator

Zuul jenkins-bot user holding open SSH sessions
Open, LowPublic

Description

Apparent root cause of T308943: CI fails with 'This change or one of its cross-repo dependencies was unable to be automatically merged' for a lot of repos and T309371: Gerrit: all patches are being reported as merge conflicts

Max connection count for user jenkins-bot exceeded, rejecting new connection. currentSessionCount = 4, maxSessionCount = 4

Gerrit itself itself shows 3 open connections:

$ ssh -p 29418 gerrit.wikimedia.org gerrit show-connections | grep jenkins
716fc32f   jenkins-bot     contint2001.wikimedi
944d965d   jenkins-bot     contint2001.wikimedi
f45092f9   jenkins-bot     contint2001.wikimedi

@hashar took a stack dump (kill -SIGUSR2 <pidof zuul>)

12022-07-12 08:59:01,193 DEBUG zuul.stack_dump: Thread: 140118451013376
2 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
3 self.__bootstrap_inner()
4 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
5 self.run()
6 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/transport.py", line 1780, in run
7 ptype, m = self.packetizer.read_message()
8 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/packet.py", line 391, in read_message
9 header = self.read_all(self.__block_size_in, check_rekey=True)
10 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/packet.py", line 254, in read_all
11 x = self.__socket.recv(n)
12Thread: 140117645707008
13 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
14 self.__bootstrap_inner()
15 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
16 self.run()
17 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/transport.py", line 1780, in run
18 ptype, m = self.packetizer.read_message()
19 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/packet.py", line 391, in read_message
20 header = self.read_all(self.__block_size_in, check_rekey=True)
21 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/packet.py", line 254, in read_all
22 x = self.__socket.recv(n)
23Thread: 140117939320576
24 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
25 self.__bootstrap_inner()
26 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
27 self.run()
28 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 210, in run
29 self._run()
30 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 186, in _run
31 self._listen(stdout, stderr)
32 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 166, in _listen
33 ret = poll.poll(self.poll_timeout)
34Thread: 140117930927872
35 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
36 self.__bootstrap_inner()
37 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
38 self.run()
39 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 134, in run
40 self._handleEvent()
41 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 60, in _handleEvent
42 time.sleep(max((ts + self.delay) - now, 0.0))
43Thread: 140117922535168
44 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
45 self.__bootstrap_inner()
46 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
47 self.run()
48 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/transport.py", line 1780, in run
49 ptype, m = self.packetizer.read_message()
50 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/packet.py", line 391, in read_message
51 header = self.read_all(self.__block_size_in, check_rekey=True)
52 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paramiko/packet.py", line 254, in read_all
53 x = self.__socket.recv(n)
54Thread: 140118476191488
55 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
56 self.__bootstrap_inner()
57 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
58 self.run()
59 File "/usr/lib/python2.7/threading.py", line 754, in run
60 self.__target(*self.__args, **self.__kwargs)
61 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
62 runnable = self.queue.get()
63 File "/usr/lib/python2.7/Queue.py", line 168, in get
64 self.not_empty.wait()
65 File "/usr/lib/python2.7/threading.py", line 340, in wait
66 waiter.acquire()
67Thread: 140119038207744
68 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
69 self.__bootstrap_inner()
70 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
71 self.run()
72 File "/usr/lib/python2.7/threading.py", line 754, in run
73 self.__target(*self.__args, **self.__kwargs)
74 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
75 runnable = self.queue.get()
76 File "/usr/lib/python2.7/Queue.py", line 168, in get
77 self.not_empty.wait()
78 File "/usr/lib/python2.7/threading.py", line 340, in wait
79 waiter.acquire()
80Thread: 140118425835264
81 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
82 self.__bootstrap_inner()
83 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
84 self.run()
85 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 968, in run
86 self.wake_event.wait()
87 File "/usr/lib/python2.7/threading.py", line 614, in wait
88 self.__cond.wait(timeout)
89 File "/usr/lib/python2.7/threading.py", line 340, in wait
90 waiter.acquire()
91Thread: 140118434227968
92 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
93 self.__bootstrap_inner()
94 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
95 self.run()
96 File "/usr/lib/python2.7/threading.py", line 754, in run
97 self.__target(*self.__args, **self.__kwargs)
98 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
99 runnable = self.queue.get()
100 File "/usr/lib/python2.7/Queue.py", line 168, in get
101 self.not_empty.wait()
102 File "/usr/lib/python2.7/threading.py", line 340, in wait
103 waiter.acquire()
104Thread: 140117637314304
105 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
106 self.__bootstrap_inner()
107 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
108 self.run()
109 File "/usr/lib/python2.7/threading.py", line 754, in run
110 self.__target(*self.__args, **self.__kwargs)
111 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
112 runnable = self.queue.get()
113 File "/usr/lib/python2.7/Queue.py", line 168, in get
114 self.not_empty.wait()
115 File "/usr/lib/python2.7/threading.py", line 340, in wait
116 waiter.acquire()
117Thread: 140117914142464
118 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
119 self.__bootstrap_inner()
120 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
121 self.run()
122 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/webapp.py", line 61, in run
123 self.server.serve_forever()
124 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 1084, in serve_forever
125 self.handle_request()
126 File "/usr/lib/python2.7/SocketServer.py", line 274, in handle_request
127 fd_sets = _eintr_retry(select.select, [self], [], [], timeout)
128 File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry
129 return func(*args)
130Thread: 140118459406080
131 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
132 self.__bootstrap_inner()
133 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
134 self.run()
135 File "/usr/lib/python2.7/threading.py", line 754, in run
136 self.__target(*self.__args, **self.__kwargs)
137 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
138 runnable = self.queue.get()
139 File "/usr/lib/python2.7/Queue.py", line 168, in get
140 self.not_empty.wait()
141 File "/usr/lib/python2.7/threading.py", line 340, in wait
142 waiter.acquire()
143Thread: 140117662492416
144 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
145 self.__bootstrap_inner()
146 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
147 self.run()
148 File "/usr/lib/python2.7/threading.py", line 754, in run
149 self.__target(*self.__args, **self.__kwargs)
150 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 769, in _doConnectLoop
151 self.connections_condition.wait()
152 File "/usr/lib/python2.7/threading.py", line 340, in wait
153 waiter.acquire()
154Thread: 140117670885120
155 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
156 self.__bootstrap_inner()
157 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
158 self.run()
159 File "/usr/lib/python2.7/threading.py", line 754, in run
160 self.__target(*self.__args, **self.__kwargs)
161 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 855, in _doPollLoop
162 self._pollLoop()
163 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 881, in _pollLoop
164 ret = poll.poll()
165Thread: 140119046600448
166 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
167 self.__bootstrap_inner()
168 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
169 self.run()
170 File "/usr/lib/python2.7/threading.py", line 754, in run
171 self.__target(*self.__args, **self.__kwargs)
172 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
173 runnable = self.queue.get()
174 File "/usr/lib/python2.7/Queue.py", line 168, in get
175 self.not_empty.wait()
176 File "/usr/lib/python2.7/threading.py", line 340, in wait
177 waiter.acquire()
178Thread: 140118467798784
179 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
180 self.__bootstrap_inner()
181 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
182 self.run()
183 File "/usr/lib/python2.7/threading.py", line 754, in run
184 self.__target(*self.__args, **self.__kwargs)
185 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
186 runnable = self.queue.get()
187 File "/usr/lib/python2.7/Queue.py", line 168, in get
188 self.not_empty.wait()
189 File "/usr/lib/python2.7/threading.py", line 340, in wait
190 waiter.acquire()
191Thread: 140117654099712
192 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
193 self.__bootstrap_inner()
194 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
195 self.run()
196 File "/usr/lib/python2.7/threading.py", line 754, in run
197 self.__target(*self.__args, **self.__kwargs)
198 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/rpclistener.py", line 68, in run
199 job = self.worker.getJob()
200 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 2092, in getJob
201 job = self.job_queue.get()
202 File "/usr/lib/python2.7/Queue.py", line 168, in get
203 self.not_empty.wait()
204 File "/usr/lib/python2.7/threading.py", line 340, in wait
205 waiter.acquire()
206Thread: 140118442620672
207 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
208 self.__bootstrap_inner()
209 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
210 self.run()
211 File "/usr/lib/python2.7/threading.py", line 754, in run
212 self.__target(*self.__args, **self.__kwargs)
213 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
214 runnable = self.queue.get()
215 File "/usr/lib/python2.7/Queue.py", line 168, in get
216 self.not_empty.wait()
217 File "/usr/lib/python2.7/threading.py", line 340, in wait
218 waiter.acquire()
219Thread: 140119187257152
220 File "/srv/deployment/zuul/venv/bin/zuul-server", line 10, in <module>
221 sys.exit(main())
222 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/cmd/server.py", line 239, in main
223 server.main()
224 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/cmd/server.py", line 211, in main
225 signal.pause()
226 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/cmd/__init__.py", line 42, in stack_dump_handler
227 log_str += "".join(traceback.format_stack(stack_frame))
228Thread: 140119054993152
229 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
230 self.__bootstrap_inner()
231 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
232 self.run()
233 File "/usr/lib/python2.7/threading.py", line 754, in run
234 self.__target(*self.__args, **self.__kwargs)
235 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
236 runnable = self.queue.get()
237 File "/usr/lib/python2.7/Queue.py", line 168, in get
238 self.not_empty.wait()
239 File "/usr/lib/python2.7/threading.py", line 340, in wait
240 waiter.acquire()
241Thread: 140119029815040
242 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
243 self.__bootstrap_inner()
244 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
245 self.run()
246 File "/usr/lib/python2.7/threading.py", line 754, in run
247 self.__target(*self.__args, **self.__kwargs)
248 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/paste/httpserver.py", line 866, in worker_thread_callback
249 runnable = self.queue.get()
250 File "/usr/lib/python2.7/Queue.py", line 168, in get
251 self.not_empty.wait()
252 File "/usr/lib/python2.7/threading.py", line 340, in wait
253 waiter.acquire()
254Thread: 140119080171264
255 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
256 self.__bootstrap_inner()
257 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
258 self.run()
259 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/launcher/gearman.py", line 47, in run
260 self.wake_event.wait(300)
261 File "/usr/lib/python2.7/threading.py", line 614, in wait
262 self.__cond.wait(timeout)
263 File "/usr/lib/python2.7/threading.py", line 359, in wait
264 _sleep(delay)
265Thread: 140119167018752
266 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
267 self.__bootstrap_inner()
268 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
269 self.run()
270 File "/usr/lib/python2.7/threading.py", line 754, in run
271 self.__target(*self.__args, **self.__kwargs)
272 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 855, in _doPollLoop
273 self._pollLoop()
274 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 881, in _pollLoop
275 ret = poll.poll()
276Thread: 140119063385856
277 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
278 self.__bootstrap_inner()
279 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
280 self.run()
281 File "/usr/lib/python2.7/threading.py", line 754, in run
282 self.__target(*self.__args, **self.__kwargs)
283 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 769, in _doConnectLoop
284 self.connections_condition.wait()
285 File "/usr/lib/python2.7/threading.py", line 340, in wait
286 waiter.acquire()
287Thread: 140119071778560
288 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
289 self.__bootstrap_inner()
290 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
291 self.run()
292 File "/usr/lib/python2.7/threading.py", line 754, in run
293 self.__target(*self.__args, **self.__kwargs)
294 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 855, in _doPollLoop
295 self._pollLoop()
296 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 881, in _pollLoop
297 ret = poll.poll()
298Thread: 140119158626048
299 File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
300 self.__bootstrap_inner()
301 File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
302 self.run()
303 File "/usr/lib/python2.7/threading.py", line 754, in run
304 self.__target(*self.__args, **self.__kwargs)
305 File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/gear/__init__.py", line 769, in _doConnectLoop
306 self.connections_condition.wait()
307 File "/usr/lib/python2.7/threading.py", line 340, in wait
308 waiter.acquire()
309

The three connections:

lsof -p 15751|grep gerrit
zuul-serv 15751 zuul    8u  IPv6          295859537       0t0       TCP contint2001.wikimedia.org:57134->gerrit.wikimedia.org:29418 (ESTABLISHED)
zuul-serv 15751 zuul   17u  IPv6          173274822       0t0       TCP contint2001.wikimedia.org:39456->gerrit.wikimedia.org:29418 (ESTABLISHED)
zuul-serv 15751 zuul   24u  IPv6          295859542       0t0       TCP contint2001.wikimedia.org:57136->gerrit.wikimedia.org:29418 (ESTABLISHED)

We should probably figure out why it's happening, and maybe see if setting up an icinga alert is possible?

Related Objects

Event Timeline

I did a bit of digging into what metrics are available in Prometheus for this, so an Alertmanager alert could maybe be set up — gerrit's SSH session metric looked promising (thanos link)

Mentioned in SAL (#wikimedia-operations) [2022-12-05T11:07:23Z] <hashar> Restarted Zuul to clear a stuck ssh connection with Gerrit - T309376

From the last occurrence, on the Gerrit server side:

17:02:08.916 exceptionCaught(ServerSessionImpl[jenkins-bot@/2620:0:860:1:208:80:153:15:44828])[state=Opened] IOException: Connection reset by peer
17:04:26.226 Max connection count for user jenkins-bot exceeded, rejecting new connection. currentSessionCount = 4, maxSessionCount = 4

From the sshd_log the last few queries received:

[2023-04-10T16:59:31.691Z] 60a5f7f7 [SSH gerrit query --format json --commit-message --current-patch-set message:I937e905d914747ae3724c94c58a5f1a271771937 (jenkins-bot)] jenkins-bot a/75 ger
rit.query.--format.json.--commit-message.--current-patch-set.message:I937e905d914747ae3724c94c58a5f1a271771937 4ms 5ms - 0 - 4ms 10ms 665248
[2023-04-10T17:01:33.545Z] 60a5f7f7 [SSH gerrit review --project mediawiki/core --message Performance checks OK! (same or better) [trimmed] --tag autogenerated:ci-patch-performance 891874,76 (jenkins-bot)] jenkins-bot a/75 gerrit.review.--project.mediawiki/core.--message.Performance checks OK! (same or better)
...
[2023-04-10T17:01:38.677Z] 60a5f7f7 [SSH gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:891874 (jenkins-bot)] jenkins-bot a/75 gerrit.query.--format.json.--all-approvals.--comments.--commit-message.--current-patch-set.--dependencies.--files.--patch-sets.--submit-records.change:891874 5ms 51ms - 0 - 48ms 40ms 40035104
[2023-04-10T17:02:08.916Z] 60a5f7f7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-4] jenkins-bot a/75 LOGOUT

So Gerrit lost the remote connection (Connection reset by peer and supposedly might have considered the user to be logged out) but somehow it is still considered logged in? I went to check the logging in and logging out of sessions:

$ diff --color -U 0 \
 <(grep 'jenkins-bot a/75 LOGIN' /var/log/gerrit/sshd_log.2023-04-10 |awk '{print $2}'|sort|uniq) \
 <(grep 'jenkins-bot a/75 LOGOUT' /var/log/gerrit/sshd_log.2023-04-10 |awk '{print $2}'|sort|uniq)
--- LOGIN  2023-04-11 08:15:45.247393537 +0000
+++ LOGOUT 2023-04-11 08:15:36.495335458 +0000
@@ -2389,0 +2390 @@
+4ae0a98f
@@ -3121,0 +3123 @@
+60a5f7f7
@@ -3682 +3683,0 @@
-70d9fc23
@@ -7733 +7733,0 @@
-f0808c15
$

The one with no LOGOUT is Zuul reconnecting to Gerrit after the restart. One thread is for gerrit stream-events the other is a long lived thread to emit gerrit query in order to retrieve metadata bout the changes.

For the one that had no LOGIN:

4ae0a98f is the stream-events thread which ran for 12.5 days:

[2023-04-10T20:00:34.678Z] 4ae0a98f [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-5] jenkins-bot a/75 LOGOUT
[2023-04-10T20:00:34.678Z] 4ae0a98f [SshCommandDestroy-0] jenkins-bot a/75 gerrit.stream-events 0ms 1084426203ms - 0 - 12435ms 8890ms 84540184

60a5f7f7

It logged in on March 29th or ~ 12 days ago at the same time as the stream-event event thread:

sshd_log.2023-03-29.gz:[2023-03-29T06:48:48.741Z] 60a5f7f7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-25] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15
sshd_log.2023-03-29.gz:[2023-03-29T06:46:48.356Z] 4ae0a98f [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-32] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15

It the long lived connection to issue gerrit query and the session finishes with:

[2023-04-10T16:59:31.691Z] 60a5f7f7 [SSH gerrit query --format json --commit-message --current-patch-set message:I937e905d914747ae3724c94c58a5f1a271771937 (jenkins-bot)] jenkins-bot a/75 gerrit.query.--format.json.--commit-message.--current-patch-set.message:I937e905d914747ae3724c94c58a5f1a271771937 4ms 5ms - 0 - 4ms 10ms 665248
[2023-04-10T17:01:33.545Z] 60a5f7f7 [SSH gerrit review --project mediawiki/core --message Performance checks OK! (same or better) [trimmed] --tag autogenerated:ci-patch-performance 891874,76 (jenkins-bot)] jenkins-bot a/75 gerrit.review.--project.mediawiki/core.--message.Performance checks OK! (same or better)
[2023-04-10T17:01:38.677Z] 60a5f7f7 [SSH gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:891874 (jenkins-bot)] jenkins-bot a/75 gerrit.query.--format.json.--all-approvals.--comments.--commit-message.--current-patch-set.--dependencies.--files.--patch-sets.--submit-records.change:891874 5ms 51ms - 0 - 48ms 40ms 40035104
[2023-04-10T17:02:08.916Z] 60a5f7f7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-4] jenkins-bot a/75 LOGOUT

Which matches the connection reset by peer on the server side, but somehow the Gerrit server does not properly remove the connection tracking and still consider it logged in? That sounds like it is the bug Gerrit is facing, maybe because it mishandle the IOException.

The logout happens 30 seconds after the last command, so maybe something is timing out?

On Zuul side (/var/log/zuul/debug.log):

2023-04-10 17:01:38,573 INFO zuul.source.Gerrit: Updating <Change 0x7fcce9de7a10 891874,76>

Which I guess has completed and is the Gerrit server side:

[2023-04-10T17:01:38.677Z] 60a5f7f7 [SSH gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:891874 (jenkins-bot)] jenkins-bot a/75 gerrit.query.--format.json.--all-approvals.--comments.--commit-message.--current-patch-set.--dependencies.--files.--patch-sets.--submit-records.change:891874 5ms 51ms - 0 - 48ms 40ms 40035104

(as a side note, the sshd_log fields are described at https://gerrit.wikimedia.org/r/Documentation/logs.html#_sshd_log ).

Then Zuul has:

2023-04-10 17:01:38,953 DEBUG zuul.source.Gerrit: Updating <Change 0x7fcce9de7a10 891874,76>: Running query message:I9c8335a9e9d6236209721d79f38d91b52310e786 to find changes needed-by

And that one fails 30 seconds later. At that point, Zuul still listens and receive events from Gerrit stream-events but I think the thread for queries is stall.

Then Zuul queries for another change:

2023-04-10 17:02:14,741 DEBUG zuul.source.Gerrit: Updating <Change 0x7fcd11674dd0 907480,2>: Getting git-dependent change 907395,1
2023-04-10 17:02:14,741 DEBUG zuul.source.Gerrit: Updating <Change 0x7fcd11674dd0 907480,2>: Running query message:I937e905d914747ae3724c94c58a5f1a271771937 to find changes needed-by

Which on Gerrit side is:

[2023-04-10T17:02:09.393Z] 66b813d7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-31] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15
[2023-04-10T17:02:09.511Z] 66b813d7 [SSH gerrit query --format json --commit-message --current-patch-set message:I9c8335a9e9d6236209721d79f38d91b52310e786 (jenkins-bot)] jenkins-bot a/75 gerrit.query.--format.json.--commit-message.--current-patch-set.message:I9c8335a9e9d6236209721d79f38d91b52310e786 5ms 6ms - 0 - 5ms 10ms 920864
[2023-04-10T20:00:34.678Z] 66b813d7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-23] jenkins-bot a/75 LOGOUT

So that one did a single connection and was then long lived until the restart. I think that is the "rogue" connection uselessly consuming a connection.

When the first Gerrit query connection thread got disconnected, Zuul connects TWO new threads:

[2023-04-10T17:02:08.916Z] 60a5f7f7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-4] jenkins-bot a/75 LOGOUT
[2023-04-10T17:02:09.393Z] 66b813d7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-31] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15
[2023-04-10T17:02:09.467Z] 46b38ffd [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-24] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15

The first 66b813d7 is the "rogue" thread, the second 46b38ffd is also long lived and is the one being used for all queries until it is logged out by the restart:

[2023-04-10T17:02:09.467Z] 46b38ffd [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-24] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15
[2023-04-10T17:02:09.599Z] 46b38ffd [SSH gerrit review --project mediawiki/extensions/WikiLambda --message Main test build succeeded. [trimmed] --tag autogenerated:ci-test --verified 2 907480,2 (jenkins-bot)] jenkins-bot a/75 gerrit.review.--project.mediawiki/extensions/WikiLambda.--message.Main test build succeeded.

<Followed by various gerrit.query requests>

[2023-04-10T20:00:34.678Z] 46b38ffd [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-2] jenkins-bot a/75 LOGOUT

The WikiLambda test report comes from an earlier request to report which took a bit of time to report due to a lack of connection and it being reestablished.

2023-04-10 17:01:45,645 DEBUG zuul.reporter.gerrit.Reporter: Report change <Change 0x7fcd11674dd0 907480,2>, params {'tag': 'autogenerated:ci-test', 'verified': 2}, message: Main test build succeeded.
...
2023-04-10 17:02:09,632 DEBUG zuul.IndependentPipelineManager: Removing change <Change 0x7fcd11674dd0 907480,2> from queue

So the issue would be in Zuul, most probably a race condition / lack of lock between threads (the second one that keeps being used does a gerrit review).

I imagine that on disconnect, two requests are in contention to reestablish the connection, the first stays established (it is intended to be longlived) but the variable holding it receives another connection which is then used from now on. That might be a bug somewhere in zuul/source/gerrit.py (basing that on the debug messages using the zuul.source.Gerrit log bucket).

Reporting a change comes from the reporting queue which I believe is managed by its own thread. Querying change is done from the scheduler. The low level implementation for sending a reviewing or querying metadata of changes boils down to zuul.connection.gerrit:

zuul/connection/gerrit.py
def _open(self):
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.WarningPolicy())
    client.connect(self.server,
                   username=self.user,
                   port=self.port,
                   key_filename=self.keyfile)
    self.client = client

def _ssh(self, command, stdin_data=None):
    if not self.client:
        self._open()

So I guess when the ssh connection resets, the client vanishes. The reporting thread and the change querying thread both invoke _ssh and since there is no self.client they both open a ssh connection which would explain the two connections being seen on Gerrit side:

[2023-04-10T17:02:08.916Z] 60a5f7f7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-4] jenkins-bot a/75 LOGOUT
[2023-04-10T17:02:09.393Z] 66b813d7 [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-31] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15
[2023-04-10T17:02:09.467Z] 46b38ffd [sshd-SshDaemon[2bd2fa2c](port=22)-nio2-thread-24] jenkins-bot a/75 LOGIN FROM 2620:0:860:1:208:80:153:15

Assuming the underlying connection is shared between the threads, when ssh fails both threads would reattempt the connection and one of the client is overridden cause it is not thread safe / lacks locking.

hashar renamed this task from gerrit-bot holding open SSH sessions to Zuul jenkins-bot user holding open SSH sessions.May 3 2023, 9:26 AM

If my theory about a race condition between the source and reporter stands true, we might workaround it by using two different connections. If I remember properly, the connections instances are stored in a dict with keys defined from:

/etc/zuul/zuul-server.conf
[connection gerrit]
driver=gerrit
server=gerrit.wikimedia.org
user=jenkins-bot
baseurl=https://gerrit.wikimedia.org/r
sshkey=/var/lib/zuul/.ssh/id_rsa
event_delay=5

Internally that would look like:

`
{
  "gerrit": <GerritConnection>
}

Then in Zuul, when establishing:

  • the source connection: the pipelines have source: gerrit which looks up the connection in the dict using the key gerrit and yield that GerritConnection.
  • the reporting: the pipelines have success: { gerrit: ... } which receives the same connection.

If we went to use two different keys (gerrit-stream-events and gerrit-reporter) that would shard them in the dict and prevent the race. The layout would need adjustment:

  pipelines:
    - name: test
-      source: gerrit
+      source: gerrit-stream-events
    success:
-     gerrit:
+     gerrit-reporter:
        verified: 2
        tag: autogenerated:ci-test

The Zuul merger uses /etc/zuul/zuul-merger.conf which is generated by Puppet from the same file. The connection to use is given by the merge item via the field connection_name which should be the source and hence gerrit-stream-events in the example above.

It is probably easier to avoid renaming the source connection and only rename the reporting one:

  • add a new gerrit-reporter connection using the same config
  • adjust the layout for success / failure to switch from gerrit to gerrit-reporter

Change 924884 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: add a gerrit-reporter gerrit connection

https://gerrit.wikimedia.org/r/924884

Change 924900 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] zuul: report using a different connection

https://gerrit.wikimedia.org/r/924900

Mentioned in SAL (#wikimedia-operations) [2023-06-05T13:25:54Z] <hashar> Restarted Zuul due to stall ssh connection # T309376

Change 924884 merged by Dzahn:

[operations/puppet@production] zuul: add a gerrit-reporter gerrit connection

https://gerrit.wikimedia.org/r/924884

The Zuul scheduler configuration now has a gerrit-reporter connection defined. Since it is not referred to from the config, it is not used yet.

Tomorrow I will deploy the layout change to switch Zuul to use that connection for reporting (that is https://gerrit.wikimedia.org/r/c/integration/config/+/924900/ ).

Mentioned in SAL (#wikimedia-releng) [2023-06-14T07:39:05Z] <hashar> Reloaded Zuul with no change to potentially account for the new [connection gerrit-reporter] added by https://gerrit.wikimedia.org/r/c/operations/puppet/+/924884 # T309376

Change 924900 merged by jenkins-bot:

[integration/config@master] zuul: report using a different connection

https://gerrit.wikimedia.org/r/924900

The layout change did not work upon deployment:

2023-06-14 07:46:49,387 ERROR zuul.Server: Reconfiguration failed:
Traceback (most recent call last):
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/cmd/server.py", line 67, in reconfigure_handler
    self.sched.reconfigure(self.config)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 716, in reconfigure
    event.wait()
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 141, in wait
    six.reraise(*self._exc_info)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1054, in process_management_queue
    self._doReconfigureEvent(event)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 816, in _doReconfigureEvent
    self.config.get('zuul', 'layout_config'), self.connections)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 407, in _parseConfig
    validator.validate(data, connections)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/layoutvalidator.py", line 329, in validate
    schema(data)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/voluptuous.py", line 215, in __call__
    return self._compiled([], data)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/voluptuous.py", line 428, in validate_dict
    return base_validate(path, iteritems(data), out)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/voluptuous.py", line 308, in validate_mapping
    raise MultipleInvalid(errors)
MultipleInvalid: extra keys not allowed @ data['pipelines'][0]['success']['gerrit-reporter']

I guess Zuul has to be restarted to take in account modifications made to zuul.conf which requires a hard restart.

Mentioned in SAL (#wikimedia-operations) [2023-06-14T08:51:12Z] <hashar> Restarting Zuul to apply config change for T309376

Change 929991 had a related patch set uploaded (by Hashar; author: Hashar):

[test/gerrit-ping@master] Dummy testing for T309376

https://gerrit.wikimedia.org/r/929991

Change 929970 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Revert "zuul: report using a different connection"

https://gerrit.wikimedia.org/r/929970

Change 929970 merged by Hashar:

[integration/config@master] Revert "zuul: report using a different connection"

https://gerrit.wikimedia.org/r/929970

I rolled back the layout config change since upon starting the Zuul scheduler debug log had a few of either:

2023-06-14 08:52:34,772 INFO zuul.Scheduler: Adding operations/puppet, <Change 0x7fde48760cd0 929989,2> to <Pipeline test>
2023-06-14 08:52:34,772 DEBUG zuul.IndependentPipelineManager: Considering adding change <Change 0x7fde48760cd0 929989,2>
2023-06-14 08:52:34,772 DEBUG zuul.IndependentPipelineManager: Considering adding change <Change 0x7fde48760cd0 929989,2>
2023-06-14 08:52:34,776 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 986, in run
    self.process_event_queue()
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1044, in process_event_queue
    pipeline.manager.addChange(change)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1409, in addChange
    with self.getChangeQueue(change, change_queue) as change_queue:
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1940, in getChangeQueue
    change_queue.addProject(change.project)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/model.py", line 338, in addProject
    names = [x.name for x in self.projects]
AttributeError: 'NoneType' object has no attribute 'name'
2023-06-14 08:53:17,452 INFO zuul.Scheduler: Adding test/gerrit-ping, <Change 0x7fde480a5d10 929991,1> to <Pipeline test>
2023-06-14 08:53:17,452 DEBUG zuul.IndependentPipelineManager: Considering adding change <Change 0x7fde480a5d10 929991,1>
2023-06-14 08:53:17,453 ERROR zuul.Scheduler: Exception in run handler: 
Traceback (most recent call last):
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 986, in run
    self.process_event_queue()
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1044, in process_event_queue
    pipeline.manager.addChange(change)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1409, in addChange
    with self.getChangeQueue(change, change_queue) as change_queue:
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1937, in getChangeQueue
    if change.project not in self.pipeline.getProjects():
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/model.py", line 117, in getProjects
    key=lambda p: p.name)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/model.py", line 117, in <lambda>
    key=lambda p: p.name)
AttributeError: 'NoneType' object has no attribute 'name'

Then after rolling back and restarting Zuul, the same stack trace occurs :/

Change 929971 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Revert "zuul: add a gerrit-reporter gerrit connection"

https://gerrit.wikimedia.org/r/929971

Mentioned in SAL (#wikimedia-operations) [2023-06-14T09:11:09Z] <hashar> zuul: rolled back config changes for T309376 and restarted Zuul. CI is back up.

Change 929971 merged by Clément Goubert:

[operations/puppet@production] Revert "zuul: add a gerrit-reporter gerrit connection"

https://gerrit.wikimedia.org/r/929971

After I reverted the layout.yaml change introducing gerrit-reporter, Zuul was still not happy about it. So I have also reverted the zuul.conf Puppet patch which introduced the [connection gerrit-reporter] and that fixed it after restarting Zuul.

I have copied the debug.log on contint2001.wikimedia.org to /var/log/zuul/debug-T309376.log

Looking at the processed configuration output for the test-prio pipeline, the source is set to gerrit and the reporters are set to gerrit-reporter connection

2023-06-14 08:52:13,198 INFO zuul.IndependentPipelineManager:
Configured Pipeline Manager test-prio
Source: <GerritSource connection: gerrit://gerrit>
Requirements:
Events:
  <EventFilter ...>
  <EventFilter ...>
Projects:
  ...
On start:
  []
On success:
  [<GerritReporter connection: gerrit://gerrit-reporter>]
On failure:
  [<GerritReporter connection: gerrit://gerrit-reporter>]
On merge-failure:
  [<GerritReporter connection: gerrit://gerrit-reporter>]
When disabled:
  []

The scheduler then settles, receives a trigger event, processes it and fail with the above trace

2023-06-14 08:52:34,772 INFO zuul.Scheduler: Adding operations/puppet, <Change 0x7fde48760cd0 929989,2> to <Pipeline test>
2023-06-14 08:52:34,772 DEBUG zuul.IndependentPipelineManager: Considering adding change <Change 0x7fde48760cd0 929989,2>
2023-06-14 08:52:34,776 ERROR zuul.Scheduler: Exception in run handler:
Traceback (most recent call last):
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 986, in run
    self.process_event_queue()
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1044, in process_event_queue
    pipeline.manager.addChange(change)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1409, in addChange
    with self.getChangeQueue(change, change_queue) as change_queue:
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/scheduler.py", line 1940, in getChangeQueue
    change_queue.addProject(change.project)
  File "/srv/deployment/zuul/venv/local/lib/python2.7/site-packages/zuul/model.py", line 338, in addProject
    names = [x.name for x in self.projects]
AttributeError: 'NoneType' object has no attribute 'name'
2023-06-14 08:52:34,825 DEBUG zuul.Scheduler: Run handler sleeping
2023-06-14 08:52:34,825 DEBUG zuul.Scheduler: Run handler awake

:-(

I have reproduced it locally, the job tree has a None key which is the cause of the issue:

'job_trees': {None: <zuul.model.JobTree object at 0x7fef54686750>,
              <Project myrepo>: <zuul.model.JobTree object at 0x7fef547ae9d0>},

That does not (or not always) happen if I set on the connection event_delay = 1 but does with event_delay = 10.

Mentioned in SAL (#wikimedia-sre) [2023-06-27T07:48:38Z] <hashar> Restart Zuul due to stuck connection | T340518 | T309376

Change 929991 abandoned by Hashar:

[test/gerrit-ping@master] Dummy testing for T309376

Reason:

https://gerrit.wikimedia.org/r/929991