Page MenuHomePhabricator

Investigate Zuul 2.1.0-151-g30a433b that stops processing Gerrit events
Closed, ResolvedPublic

Description

I tried to upgrade Zuul from upstream commit 66c8e52 to 30a433b , that would fix T129938 but also include a major refactoring of how Zuul interact with Gerrit. It worked fine but after a few dozen of minutes Zuul stopped receiving/processing events from Gerrit and thus I reverted back to 2.1.0-95-g66c8e52-wmf1precise1

I have picked up a few other patches and built 2.1.0-151-g30a433b-wmf2precise1 but it has the same issue.

Potential thread dump is P3204

Need to reproduce the trouble (I havent been able to do so on my local machine though).

Upstream shortlog, the serie of patches by Joshua Hesketh is the Zuul/Gerrit refactoring:

$ git shortlog  66c8e52..30a433b --no-merges
Alexander Evseev (1):
      Split pipeline description by double newlines on status page

Andreas Jaeger (2):
      Use git.openstack.org everywhere
      Remove argparse from requirements

Antoine Musso (1):
      Update merge status after merge:merge is submitted

Doug Wiegley (1):
      Bump pbr minimum version, to avoid testrepository requirement

Evgeny Antyshev (2):
      Connection names for legacy configs
      Don't require 'commit' attribute in merge event

JP Sullivan (1):
      Add vim swap files to .gitignore

James E. Blair (8):
      Pass ZUUL_TEST_ROOT through tox
      Try to make test_idle less racy
      Add job mutex support
      Fix default merge failure reports
      Cloner: use cache if dest exists
      Add job tags
      Detect dependency cycles introduced with new patchsets
      Add report URL to status.json

Jan Hruban (1):
      Tidy up tests/base.py

Joshua Hesketh (14):
      Refactor sources out of triggers
      Add base class for reporters
      Add base class for sources
      Add base class for triggers
      Configure triggers dynamically
      Add support for 'connection' concept
      Document the new connections in zuul
      Add in extra connections tests
      Remove ActionReporters
      Move Item formatting into Reporters
      Fix regression in change tracking
      Fix memory leak reloading triggers
      Cache is held and managed by connections
      Don't reload connections on HUP

Ondřej Nový (1):
      Deprecated tox -downloadcache option removed

Paul Belanger (4):
      Bump APScheduler to >=3.0
      Remove webob requirements cap
      Log 'Received unrecognized event type' as warning
      Pin paramiko < 2.0.0

Richard Hedlind (1):
      Add exception handler to updateBuildDescriptions

Sachi King (1):
      Fix test for new WebOb

Thomas Bechtold (1):
      Fix documentation example

Tobias Henkel (1):
      Cloner: Don't fall back on infrastructure failure

Event Timeline

@Paladox Yup that is what 66c8e52 to 30a433b is mostly about: switching to that new connection system. Something is broken there somehow though ;(

@hashar but doint we need to update the zuul.conf file with those new configs for connections.

This comment was removed by Paladox.

It is back compatible with the old way of defining connections (such as a [gerrit] section) https://github.com/openstack-infra/zuul/blob/30a433b202/zuul/lib/connections.py#L56-L64

It definitely spawn just fine and start handling events but get stuck at some point due to an unknown condition. Probably a thread going deadlock somehow.

And I think P3204 is the thread dump, possibly deadlocked with Crypto.Random.UserFriendlyRNG

1Thread: 140230061446912
2 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
3 self.__bootstrap_inner()
4 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
5 self.run()
6 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 1510, in run
7 Random.atfork()
8 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
9 _UserFriendlyRNG.reinit()
10 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
11 _get_singleton().reinit()
12 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 154, in reinit
13 self._lock.acquire()
14Thread: 140230069839616
15 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
16 self.__bootstrap_inner()
17 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
18 self.run()
19 File "/usr/lib/python2.7/threading.py", line 504, in run
20 self.__target(*self.__args, **self.__kwargs)
21 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/rpclistener.py", line 76, in run
22 f(job)
23 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/rpclistener.py", line 118, in handle_enqueue
24 pipeline.source.getChange(event, project)
25 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 150, in getChange
26 refresh=refresh)
27 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
28 self._updateChange(change, history)
29 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 238, in _updateChange
30 data = self.connection.query(change.number)
31 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 287, in query
32 out, err = self._ssh(cmd)
33 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 362, in _ssh
34 self._open()
35 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 357, in _open
36 key_filename=self.keyfile)
37 File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 301, in connect
38 t.start_client()
39 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 455, in start_client
40 Random.atfork()
41 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
42 _UserFriendlyRNG.reinit()
43 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
44 _get_singleton().reinit()
45 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 154, in reinit
46 self._lock.acquire()
47Thread: 140231151974144
48 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
49 self.__bootstrap_inner()
50 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
51 self.run()
52 File "/usr/lib/python2.7/threading.py", line 504, in run
53 self.__target(*self.__args, **self.__kwargs)
54 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
55 runnable = self.queue.get()
56 File "/usr/lib/python2.7/Queue.py", line 168, in get
57 self.not_empty.wait()
58 File "/usr/lib/python2.7/threading.py", line 243, in wait
59 waiter.acquire()
60Thread: 140230564747008
61 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
62 self.__bootstrap_inner()
63 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
64 self.run()
65 File "/usr/lib/python2.7/threading.py", line 504, in run
66 self.__target(*self.__args, **self.__kwargs)
67 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 821, in _doPollLoop
68 self._pollLoop()
69 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 847, in _pollLoop
70 ret = poll.poll()
71Thread: 140231772464896
72 File "/usr/bin/zuul-server", line 10, in <module>
73 sys.exit(main())
74 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/cmd/server.py", line 227, in main
75 server.main()
76 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/cmd/server.py", line 196, in main
77 signal.pause()
78 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/cmd/__init__.py", line 42, in stack_dump_handler
79 log_str += "".join(traceback.format_stack(stack_frame))
80Thread: 140231101617920
81 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
82 self.__bootstrap_inner()
83 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
84 self.run()
85 File "/usr/lib/python2.7/threading.py", line 504, in run
86 self.__target(*self.__args, **self.__kwargs)
87 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
88 runnable = self.queue.get()
89 File "/usr/lib/python2.7/Queue.py", line 168, in get
90 self.not_empty.wait()
91 File "/usr/lib/python2.7/threading.py", line 243, in wait
92 waiter.acquire()
93Thread: 140231110010624
94 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
95 self.__bootstrap_inner()
96 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
97 self.run()
98 File "/usr/lib/python2.7/threading.py", line 504, in run
99 self.__target(*self.__args, **self.__kwargs)
100 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
101 runnable = self.queue.get()
102 File "/usr/lib/python2.7/Queue.py", line 168, in get
103 self.not_empty.wait()
104 File "/usr/lib/python2.7/threading.py", line 243, in wait
105 waiter.acquire()
106Thread: 140231118403328
107 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
108 self.__bootstrap_inner()
109 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
110 self.run()
111 File "/usr/lib/python2.7/threading.py", line 504, in run
112 self.__target(*self.__args, **self.__kwargs)
113 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
114 runnable = self.queue.get()
115 File "/usr/lib/python2.7/Queue.py", line 168, in get
116 self.not_empty.wait()
117 File "/usr/lib/python2.7/threading.py", line 243, in wait
118 waiter.acquire()
119Thread: 140231675930368
120 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
121 self.__bootstrap_inner()
122 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
123 self.run()
124 File "/usr/lib/python2.7/threading.py", line 504, in run
125 self.__target(*self.__args, **self.__kwargs)
126 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 739, in _doConnectLoop
127 self.connections_condition.wait()
128 File "/usr/lib/python2.7/threading.py", line 243, in wait
129 waiter.acquire()
130Thread: 140230589925120
131 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
132 self.__bootstrap_inner()
133 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
134 self.run()
135 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 126, in run
136 self._handleEvent()
137 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 48, in _handleEvent
138 ts, data = self.connection.getEvent()
139 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 263, in getEvent
140 return self.event_queue.get()
141 File "/usr/lib/python2.7/Queue.py", line 168, in get
142 self.not_empty.wait()
143 File "/usr/lib/python2.7/threading.py", line 243, in wait
144 waiter.acquire()
145Thread: 140230598317824
146 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
147 self.__bootstrap_inner()
148 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
149 self.run()
150 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 201, in run
151 self._run()
152 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 173, in _run
153 key_filename=self.keyfile)
154 File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 301, in connect
155 t.start_client()
156 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 455, in start_client
157 Random.atfork()
158 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
159 _UserFriendlyRNG.reinit()
160 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
161 _get_singleton().reinit()
162 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 156, in reinit
163 return _UserFriendlyRNG.reinit(self)
164 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 94, in reinit
165 self._ec.reinit()
166 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 62, in reinit
167 block = self._osrng.read(32*32)
168 File "/usr/lib/python2.7/dist-packages/Crypto/Random/OSRNG/rng_base.py", line 76, in read
169 data = self._read(N)
170Thread: 140230606710528
171 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
172 self.__bootstrap_inner()
173 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
174 self.run()
175 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 947, in run
176 self.wake_event.wait()
177 File "/usr/lib/python2.7/threading.py", line 403, in wait
178 self.__cond.wait(timeout)
179 File "/usr/lib/python2.7/threading.py", line 243, in wait
180 waiter.acquire()
181Thread: 140230581532416
182 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
183 self.__bootstrap_inner()
184 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
185 self.run()
186 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 1510, in run
187 Random.atfork()
188 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
189 _UserFriendlyRNG.reinit()
190 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
191 _get_singleton().reinit()
192 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 154, in reinit
193 self._lock.acquire()
194Thread: 140230615103232
195 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
196 self.__bootstrap_inner()
197 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
198 self.run()
199 File "/usr/lib/python2.7/threading.py", line 504, in run
200 self.__target(*self.__args, **self.__kwargs)
201 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
202 runnable = self.queue.get()
203 File "/usr/lib/python2.7/Queue.py", line 168, in get
204 self.not_empty.wait()
205 File "/usr/lib/python2.7/threading.py", line 243, in wait
206 waiter.acquire()
207Thread: 140230573139712
208 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
209 self.__bootstrap_inner()
210 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
211 self.run()
212 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/webapp.py", line 58, in run
213 self.server.serve_forever()
214 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 1084, in serve_forever
215 self.handle_request()
216 File "/usr/lib/python2.7/SocketServer.py", line 265, in handle_request
217 fd_sets = select.select([self], [], [], timeout)
218Thread: 140231633966848
219 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
220 self.__bootstrap_inner()
221 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
222 self.run()
223 File "/usr/lib/python2.7/threading.py", line 504, in run
224 self.__target(*self.__args, **self.__kwargs)
225 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
226 runnable = self.queue.get()
227 File "/usr/lib/python2.7/Queue.py", line 168, in get
228 self.not_empty.wait()
229 File "/usr/lib/python2.7/threading.py", line 243, in wait
230 waiter.acquire()
231Thread: 140231126796032
232 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
233 self.__bootstrap_inner()
234 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
235 self.run()
236 File "/usr/lib/python2.7/threading.py", line 504, in run
237 self.__target(*self.__args, **self.__kwargs)
238 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
239 runnable = self.queue.get()
240 File "/usr/lib/python2.7/Queue.py", line 168, in get
241 self.not_empty.wait()
242 File "/usr/lib/python2.7/threading.py", line 243, in wait
243 waiter.acquire()
244Thread: 140230078232320
245 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
246 self.__bootstrap_inner()
247 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
248 self.run()
249 File "/usr/lib/python2.7/threading.py", line 504, in run
250 self.__target(*self.__args, **self.__kwargs)
251 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 739, in _doConnectLoop
252 self.connections_condition.wait()
253 File "/usr/lib/python2.7/threading.py", line 243, in wait
254 waiter.acquire()
255Thread: 140231135188736
256 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
257 self.__bootstrap_inner()
258 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
259 self.run()
260 File "/usr/lib/python2.7/threading.py", line 504, in run
261 self.__target(*self.__args, **self.__kwargs)
262 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
263 runnable = self.queue.get()
264 File "/usr/lib/python2.7/Queue.py", line 168, in get
265 self.not_empty.wait()
266 File "/usr/lib/python2.7/threading.py", line 243, in wait
267 waiter.acquire()
268Thread: 140231642359552
269 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
270 self.__bootstrap_inner()
271 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
272 self.run()
273 File "/usr/lib/python2.7/threading.py", line 504, in run
274 self.__target(*self.__args, **self.__kwargs)
275 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
276 runnable = self.queue.get()
277 File "/usr/lib/python2.7/Queue.py", line 168, in get
278 self.not_empty.wait()
279 File "/usr/lib/python2.7/threading.py", line 243, in wait
280 waiter.acquire()
281Thread: 140231650752256
282 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
283 self.__bootstrap_inner()
284 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
285 self.run()
286 File "/usr/lib/python2.7/threading.py", line 504, in run
287 self.__target(*self.__args, **self.__kwargs)
288 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 739, in _doConnectLoop
289 self.connections_condition.wait()
290 File "/usr/lib/python2.7/threading.py", line 243, in wait
291 waiter.acquire()
292Thread: 140231659144960
293 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
294 self.__bootstrap_inner()
295 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
296 self.run()
297 File "/usr/lib/python2.7/threading.py", line 504, in run
298 self.__target(*self.__args, **self.__kwargs)
299 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 821, in _doPollLoop
300 self._pollLoop()
301 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 847, in _pollLoop
302 ret = poll.poll()
303Thread: 140231667537664
304 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
305 self.__bootstrap_inner()
306 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
307 self.run()
308 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/launcher/gearman.py", line 46, in run
309 self.wake_event.wait(300)
310 File "/usr/lib/python2.7/threading.py", line 403, in wait
311 self.__cond.wait(timeout)
312 File "/usr/lib/python2.7/threading.py", line 262, in wait
313 _sleep(delay)
314Thread: 140231143581440
315 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
316 self.__bootstrap_inner()
317 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
318 self.run()
319 File "/usr/lib/python2.7/threading.py", line 504, in run
320 self.__target(*self.__args, **self.__kwargs)
321 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
322 runnable = self.queue.get()
323 File "/usr/lib/python2.7/Queue.py", line 168, in get
324 self.not_empty.wait()
325 File "/usr/lib/python2.7/threading.py", line 243, in wait
326 waiter.acquire()
327Thread: 140231684323072
328 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
329 self.__bootstrap_inner()
330 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
331 self.run()
332 File "/usr/lib/python2.7/threading.py", line 504, in run
333 self.__target(*self.__args, **self.__kwargs)
334 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 821, in _doPollLoop
335 self._pollLoop()
336 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 847, in _pollLoop
337 ret = poll.poll()
338

@hashar thanks for finding that. Do we update that package.

Or do you know how to fix it please.

Or could it be a bug in zuul.

@hashar could it be that gallium was the problem since it went faulty on wednesday but you turned this update a week before. Sounds like it may be gallium but I carnt be sure.

If I knew the cause or a commit that fix it, that would have been done already. This task is about investigating the issue which has a first step needs a way to reproduce.

@hashar Maybe update zuul to a newer commit. Meaning if the current commit we are on that we backported from upstream. we should try a newer commit, to see if they fixed it without adding in the commit msg about fixing something.

@hashar could it be gerrit 2.8 not being supported since upstream use a newer gerrit version so there not really testing against gerrit 2.8

@hashar we could report it upstream to see what they say please. To see if they can reproduce with gerrit 2.8 if they do then it is the gerrit version we are using.

@hashar could it have been caused by https://phabricator.wikimedia.org/rCIZU52f0ef1a94c1aec4f83ed0e7d2f7ad425e9ad821 or https://phabricator.wikimedia.org/rCIZUc908abf822a85a461c9563e17a0c8eca69c377b0 it might be because of the new connection like model like you said but I doint know why it is not giving us any logs to go one.

Also could it be related to projects not being able to clone because it carn't resolve gerrit host name. <!-- sound like our most likely problem

@hashar it seems it is deffintly related to

File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 362, in _ssh
  self._open()

Since it all links back to ^^

https://phabricator.wikimedia.org/rCIZU52f0ef1a94c1aec4f83ed0e7d2f7ad425e9ad821

Maybe try by building it again but reverting ^^ and try it. See if it will work.

https://phabricator.wikimedia.org/rCIZUc908abf822a85a461c9563e17a0c8eca69c377b0 is included in the current run.

Really there is no point in pasting random commits. Have to find time to attempt to reproduce the issue with extra logging and then debug it :(

Yeah they are all potential candidates. But I want to reproduce the issue on labs first, take trace and from there we can test out / cherry pick upstream commits and make sure it/they actually fix the issue.

@hashar zuul is dropping jenkins support in v3.

@hashar sorry for asking but could it be a gerrit bug since according to https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.8.5.html it says this

Upgrade SSHD to version 0.11.0.

Fixes Issue 2406: "git clone" hangs after 100% resolving deltas with git over SSH.

Fixes a number of other issues including a null pointer exception that could cause ssh commands to hang.

and ssh hangs for us.

@hashar could you try reinstalling it on the server as https://phabricator.wikimedia.org/rOPUP29188f7ea628ff8148544923957069bb62088a9b has been merged to see if we can get some information on why it stalls.

Also there's a chance it was a one time thing.

Please.

@hashar It seems to be a gerrit bug since according to https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.9.2.html they fixed a sshd bug that caused it to freeze on stream-events which zuul uses.

Ive been testing and using gerrit 2.12.2 and 2.12.3 does not freeze.

Gerrit is due to be updated this week on Wikimedia so we can test.

The version was downgraded back to 0.9 but then upgraded in 2.10.3.

https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.10.3.html

Using gerrit 2.12 and zuul version 2.1.0-151-g30a433b-wmf1jessie1 works no ssh problems.

Paladox raised the priority of this task from Low to Medium.Jul 12 2016, 9:21 PM

Changing to normal so we can do some debugging which we are doing tomarror.

Let's hope it was a one time thing if it happends to have the same problem I hope it will give us some idea of what is causing it :).

Mentioned in SAL [2016-07-13T12:36:07Z] <hashar> T137525 Upgrading Zuul 2.1.0-95-g66c8e52-wmf1precise1 ... zuul_2.1.0-151-g30a433b-wmf1precise1_amd64.deb

Mentioned in SAL [2016-07-13T12:36:08Z] <hashar> T137525 Upgrading Zuul 2.1.0-95-g66c8e52-wmf1precise1 ... zuul_2.1.0-151-g30a433b-wmf1precise1_amd64.deb

Mentioned in SAL [2016-07-13T12:52:59Z] <hashar> CI is processing with Zuul 2.1.0-151-g30a433b. It might stop processing events at anytime though due to T137525

$ sudo -u zuul /usr/share/python/zuul/bin/python /usr/bin/zuul-server -c /etc/zuul/zuul-server.conf -d
/usr/lib/python2.7/dist-packages/paramiko/client.py:95: UserWarning: Unknown ssh-rsa host key for [ytterbium.wikimedia.org]:29418: dce9687b991b27d0f9fdce6a2ebf92e1
  (key.get_name(), hostname, hexlify(key.get_fingerprint())))

It is connected to Gerrit apparently though:

Session     Start     Idle   User            Remote Host
--------------------------------------------------------------
f85690c4 12:47:06 00:00:09   jenkins-bot     <gallium IPv6 address>
d87aac55 12:48:05 00:00:03   jenkins-bot     <gallium IPv6 address>

The new Zuul has been running for 40 minutes now and seems to be all fine. Maybe it was some random rare deadlock ? I am keeping it running for now and will reassess later.

Revert would be to go back to previous good version. On gallium:

# /etc/init.d/zuul stop
# dpkg -i /root/zuul_2.1.0-95-g66c8e52-wmf1precise1_amd64.deb

Mentioned in SAL [2016-07-13T13:46:36Z] <hashar> T137525 Stopped zuul that ran in a terminal (with -d). Started it with the init script.

Zuul did not properly connect to Gerrit / stream-events:

+ ssh -p 29418 hashar@gerrit.wikimedia.org 'gerrit show-connections -w'
Session     Start     Idle   User            Remote Host
--------------------------------------------------------------
066c6c0c 13:44:45 00:05:17                   <gallium IPv6>

Mentioned in SAL [2016-07-13T13:58:24Z] <hashar> T137525 reverted Zuul back to zuul_2.1.0-95-g66c8e52-wmf1precise1_amd64.deb . It could not connect to Gerrit reliably

Mentioned in SAL [2016-07-13T13:58:26Z] <hashar> T137525 reverted Zuul back to zuul_2.1.0-95-g66c8e52-wmf1precise1_amd64.deb . It could not connect to Gerrit reliably

Thread dump of Zuul just after it started and with Gerrit showing a connection from gallium but without a username is P3415. We might want to diff it with the previous one ( P3204 ).

12016-07-13 13:56:48,562 DEBUG zuul.stack_dump: Thread: 139707115607808
2 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
3 self.__bootstrap_inner()
4 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
5 self.run()
6 File "/usr/lib/python2.7/threading.py", line 504, in run
7 self.__target(*self.__args, **self.__kwargs)
8 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/rpclistener.py", line 68, in run
9 job = self.worker.getJob()
10 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 1876, in getJob
11 job = self.job_queue.get()
12 File "/usr/lib/python2.7/Queue.py", line 168, in get
13 self.not_empty.wait()
14 File "/usr/lib/python2.7/threading.py", line 243, in wait
15 waiter.acquire()
16Thread: 139707124000512
17 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
18 self.__bootstrap_inner()
19 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
20 self.run()
21 File "/usr/lib/python2.7/threading.py", line 504, in run
22 self.__target(*self.__args, **self.__kwargs)
23 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 739, in _doConnectLoop
24 self.connections_condition.wait()
25 File "/usr/lib/python2.7/threading.py", line 243, in wait
26 waiter.acquire()
27Thread: 139708306814720
28 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
29 self.__bootstrap_inner()
30 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
31 self.run()
32 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/launcher/gearman.py", line 46, in run
33 self.wake_event.wait(300)
34 File "/usr/lib/python2.7/threading.py", line 403, in wait
35 self.__cond.wait(timeout)
36 File "/usr/lib/python2.7/threading.py", line 262, in wait
37 _sleep(delay)
38Thread: 139707652478720
39 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
40 self.__bootstrap_inner()
41 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
42 self.run()
43 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/scheduler.py", line 947, in run
44 self.wake_event.wait()
45 File "/usr/lib/python2.7/threading.py", line 403, in wait
46 self.__cond.wait(timeout)
47 File "/usr/lib/python2.7/threading.py", line 243, in wait
48 waiter.acquire()
49Thread: 139707677656832
50 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
51 self.__bootstrap_inner()
52 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
53 self.run()
54 File "/usr/lib/python2.7/threading.py", line 504, in run
55 self.__target(*self.__args, **self.__kwargs)
56 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
57 runnable = self.queue.get()
58 File "/usr/lib/python2.7/Queue.py", line 168, in get
59 self.not_empty.wait()
60 File "/usr/lib/python2.7/threading.py", line 243, in wait
61 waiter.acquire()
62Thread: 139708383725312
63 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
64 self.__bootstrap_inner()
65 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
66 self.run()
67 File "/usr/lib/python2.7/threading.py", line 504, in run
68 self.__target(*self.__args, **self.__kwargs)
69 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 739, in _doConnectLoop
70 self.connections_condition.wait()
71 File "/usr/lib/python2.7/threading.py", line 243, in wait
72 waiter.acquire()
73Thread: 139708256458496
74 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
75 self.__bootstrap_inner()
76 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
77 self.run()
78 File "/usr/lib/python2.7/threading.py", line 504, in run
79 self.__target(*self.__args, **self.__kwargs)
80 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
81 runnable = self.queue.get()
82 File "/usr/lib/python2.7/Queue.py", line 168, in get
83 self.not_empty.wait()
84 File "/usr/lib/python2.7/threading.py", line 243, in wait
85 waiter.acquire()
86Thread: 139708264851200
87 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
88 self.__bootstrap_inner()
89 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
90 self.run()
91 File "/usr/lib/python2.7/threading.py", line 504, in run
92 self.__target(*self.__args, **self.__kwargs)
93 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
94 runnable = self.queue.get()
95 File "/usr/lib/python2.7/Queue.py", line 168, in get
96 self.not_empty.wait()
97 File "/usr/lib/python2.7/threading.py", line 243, in wait
98 waiter.acquire()
99Thread: 139708273243904
100 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
101 self.__bootstrap_inner()
102 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
103 self.run()
104 File "/usr/lib/python2.7/threading.py", line 504, in run
105 self.__target(*self.__args, **self.__kwargs)
106 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
107 runnable = self.queue.get()
108 File "/usr/lib/python2.7/Queue.py", line 168, in get
109 self.not_empty.wait()
110 File "/usr/lib/python2.7/threading.py", line 243, in wait
111 waiter.acquire()
112Thread: 139707140785920
113 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
114 self.__bootstrap_inner()
115 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
116 self.run()
117 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/webapp.py", line 58, in run
118 self.server.serve_forever()
119 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 1084, in serve_forever
120 self.handle_request()
121 File "/usr/lib/python2.7/SocketServer.py", line 265, in handle_request
122 fd_sets = select.select([self], [], [], timeout)
123Thread: 139707686049536
124 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
125 self.__bootstrap_inner()
126 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
127 self.run()
128 File "/usr/lib/python2.7/threading.py", line 504, in run
129 self.__target(*self.__args, **self.__kwargs)
130 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
131 runnable = self.queue.get()
132 File "/usr/lib/python2.7/Queue.py", line 168, in get
133 self.not_empty.wait()
134 File "/usr/lib/python2.7/threading.py", line 243, in wait
135 waiter.acquire()
136Thread: 139707694442240
137 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
138 self.__bootstrap_inner()
139 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
140 self.run()
141 File "/usr/lib/python2.7/threading.py", line 504, in run
142 self.__target(*self.__args, **self.__kwargs)
143 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
144 runnable = self.queue.get()
145 File "/usr/lib/python2.7/Queue.py", line 168, in get
146 self.not_empty.wait()
147 File "/usr/lib/python2.7/threading.py", line 243, in wait
148 waiter.acquire()
149Thread: 139707132393216
150 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
151 self.__bootstrap_inner()
152 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
153 self.run()
154 File "/usr/lib/python2.7/threading.py", line 504, in run
155 self.__target(*self.__args, **self.__kwargs)
156 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 821, in _doPollLoop
157 self._pollLoop()
158 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 847, in _pollLoop
159 ret = poll.poll()
160Thread: 139707660871424
161 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
162 self.__bootstrap_inner()
163 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
164 self.run()
165 File "/usr/lib/python2.7/threading.py", line 504, in run
166 self.__target(*self.__args, **self.__kwargs)
167 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
168 runnable = self.queue.get()
169 File "/usr/lib/python2.7/Queue.py", line 168, in get
170 self.not_empty.wait()
171 File "/usr/lib/python2.7/threading.py", line 243, in wait
172 waiter.acquire()
173Thread: 139708290029312
174 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
175 self.__bootstrap_inner()
176 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
177 self.run()
178 File "/usr/lib/python2.7/threading.py", line 504, in run
179 self.__target(*self.__args, **self.__kwargs)
180 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 739, in _doConnectLoop
181 self.connections_condition.wait()
182 File "/usr/lib/python2.7/threading.py", line 243, in wait
183 waiter.acquire()
184Thread: 139707702834944
185 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
186 self.__bootstrap_inner()
187 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
188 self.run()
189 File "/usr/lib/python2.7/threading.py", line 504, in run
190 self.__target(*self.__args, **self.__kwargs)
191 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
192 runnable = self.queue.get()
193 File "/usr/lib/python2.7/Queue.py", line 168, in get
194 self.not_empty.wait()
195 File "/usr/lib/python2.7/threading.py", line 243, in wait
196 waiter.acquire()
197Thread: 139708281636608
198 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
199 self.__bootstrap_inner()
200 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
201 self.run()
202 File "/usr/lib/python2.7/threading.py", line 504, in run
203 self.__target(*self.__args, **self.__kwargs)
204 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
205 runnable = self.queue.get()
206 File "/usr/lib/python2.7/Queue.py", line 168, in get
207 self.not_empty.wait()
208 File "/usr/lib/python2.7/threading.py", line 243, in wait
209 waiter.acquire()
210Thread: 139707669264128
211 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
212 self.__bootstrap_inner()
213 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
214 self.run()
215 File "/usr/lib/python2.7/threading.py", line 504, in run
216 self.__target(*self.__args, **self.__kwargs)
217 File "/usr/lib/python2.7/dist-packages/paste/httpserver.py", line 866, in worker_thread_callback
218 runnable = self.queue.get()
219 File "/usr/lib/python2.7/Queue.py", line 168, in get
220 self.not_empty.wait()
221 File "/usr/lib/python2.7/threading.py", line 243, in wait
222 waiter.acquire()
223Thread: 139707165964032
224 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
225 self.__bootstrap_inner()
226 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
227 self.run()
228 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 201, in run
229 self._run()
230 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 173, in _run
231 key_filename=self.keyfile)
232 File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 301, in connect
233 t.start_client()
234 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 455, in start_client
235 Random.atfork()
236 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
237 _UserFriendlyRNG.reinit()
238 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
239 _get_singleton().reinit()
240 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 154, in reinit
241 self._lock.acquire()
242Thread: 139708480259840
243 File "/usr/bin/zuul-server", line 10, in <module>
244 sys.exit(main())
245 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/cmd/server.py", line 227, in main
246 server.main()
247 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/cmd/server.py", line 196, in main
248 signal.pause()
249 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/cmd/__init__.py", line 42, in stack_dump_handler
250 log_str += "".join(traceback.format_stack(stack_frame))
251Thread: 139707149178624
252 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
253 self.__bootstrap_inner()
254 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
255 self.run()
256 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 1510, in run
257 Random.atfork()
258 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
259 _UserFriendlyRNG.reinit()
260 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
261 _get_singleton().reinit()
262 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 156, in reinit
263 return _UserFriendlyRNG.reinit(self)
264 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 94, in reinit
265 self._ec.reinit()
266 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 62, in reinit
267 block = self._osrng.read(32*32)
268 File "/usr/lib/python2.7/dist-packages/Crypto/Random/OSRNG/rng_base.py", line 76, in read
269 data = self._read(N)
270Thread: 139707157571328
271 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
272 self.__bootstrap_inner()
273 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
274 self.run()
275 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 126, in run
276 self._handleEvent()
277 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 48, in _handleEvent
278 ts, data = self.connection.getEvent()
279 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 263, in getEvent
280 return self.event_queue.get()
281 File "/usr/lib/python2.7/Queue.py", line 168, in get
282 self.not_empty.wait()
283 File "/usr/lib/python2.7/threading.py", line 243, in wait
284 waiter.acquire()
285Thread: 139708392118016
286 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
287 self.__bootstrap_inner()
288 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
289 self.run()
290 File "/usr/lib/python2.7/threading.py", line 504, in run
291 self.__target(*self.__args, **self.__kwargs)
292 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 821, in _doPollLoop
293 self._pollLoop()
294 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 847, in _pollLoop
295 ret = poll.poll()
296Thread: 139708298422016
297 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
298 self.__bootstrap_inner()
299 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
300 self.run()
301 File "/usr/lib/python2.7/threading.py", line 504, in run
302 self.__target(*self.__args, **self.__kwargs)
303 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 821, in _doPollLoop
304 self._pollLoop()
305 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/gear/__init__.py", line 847, in _pollLoop
306 ret = poll.poll()

summary

During a chat with @thcipriani he conducted a bunch of proof checks. Among others:

  • Confirm host finger print with ssh-keyscan -v ytterbium.wikimedia.org -p 29418
  • /var/lib/zuul/.ssh/known_hosts has encrypted host but has the key
  • ssh hmac/algos selection. Gerrit 29418 uses old ones supported by paramiko
  • paramiko version

A thing that is striking is that running the daemon from the command line in foreground mode (zuul-server -d) worked all fine. At least for 45 minutes. At 14:00 UTC I have killed it and spawned the usual way /etc/init.d/zuul start, that is when it deadlocks with the above trace P3415.

So there is something varying when Zuul is put in daemon / running in context of init.d.

I have filtered out a bunch of threads from P3415 and ends up with three threads P3423

1Same as P3415 with useless threads removed.
2
3Thread: 139707165964032
4 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
5 self.__bootstrap_inner()
6 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
7 self.run()
8 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 201, in run
9 self._run()
10 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 173, in _run
11 key_filename=self.keyfile)
12 File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 301, in connect
13 t.start_client()
14 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 455, in start_client
15 Random.atfork()
16 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
17 _UserFriendlyRNG.reinit()
18 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
19 _get_singleton().reinit()
20 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 154, in reinit
21 self._lock.acquire()
22
23Thread: 139707149178624
24 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
25 self.__bootstrap_inner()
26 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
27 self.run()
28 File "/usr/lib/python2.7/dist-packages/paramiko/transport.py", line 1510, in run
29 Random.atfork()
30 File "/usr/lib/python2.7/dist-packages/Crypto/Random/__init__.py", line 37, in atfork
31 _UserFriendlyRNG.reinit()
32 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 209, in reinit
33 _get_singleton().reinit()
34 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 156, in reinit
35 return _UserFriendlyRNG.reinit(self)
36 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 94, in reinit
37 self._ec.reinit()
38 File "/usr/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py", line 62, in reinit
39 block = self._osrng.read(32*32)
40 File "/usr/lib/python2.7/dist-packages/Crypto/Random/OSRNG/rng_base.py", line 76, in read
41 data = self._read(N)
42
43Thread: 139707157571328
44 File "/usr/lib/python2.7/threading.py", line 524, in __bootstrap
45 self.__bootstrap_inner()
46 File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
47 self.run()
48 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 126, in run
49 self._handleEvent()
50 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 48, in _handleEvent
51 ts, data = self.connection.getEvent()
52 File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 263, in getEvent
53 return self.event_queue.get()
54 File "/usr/lib/python2.7/Queue.py", line 168, in get
55 self.not_empty.wait()
56 File "/usr/lib/python2.7/threading.py", line 243, in wait
57 waiter.acquire()

Namely:

  • zuul.connection.gerrit GerritWatcher._run() connecting to Gerrit over SSH.
    • Locked via Crypto/Random/_UserFriendlyRNG reinit()
  • paramiko.transport invoking Random.atfork() which tries to read(32*32) according to the code from /dev/urandom.
    • Does it really read 32*32 bytes?
    • Is it really from /dev/urandom
  • zuul.connection.gerrit GerritEventConnector() waiting for events to shows up in the queue.

HINT HINT. OpenStack infra had a discussion about a similar thing on http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2014-12-09.log . Then it is quite old:

2014-12-09T03:28:42  <jeblair> clarkb, fungi: good news and bad news on the reproducibility
 front; when i started builds of 1000 fake servers on precise and trusty, it looked like trusty was
 lagging behind measurably.  but then i ended up with 999 threads deadlocked in
 "/usr/local/lib/python2.7/dist-packages/Crypto/Random/_UserFriendlyRNG.py" which has
 had a significant negative impact on my ability to test
2014-12-09T03:29:16  <jeblair> clarkb, fungi: pycrypto is 2.6.1 on trusty and 2.4.1 on precise

2014-12-09T03:31:26  <jeblair> clarkb, fungi: oh neat, it wasn't a deadlock; it unwedged

2014-12-09T03:35:12  <jeblair> interestingly pycrypto 2.6.1 is a release to fix a cve related
to entropy: https://www.dlitz.net/software/pycrypto/

2014-12-09T03:33:01  <clarkb> jeblair: do we just not have the same entropy on trusty?
2014-12-09T03:36:43  <jeblair> however, i don't think it affects us
2014-12-09T03:37:33  <jeblair> clarkb: it reads from urandom, so it shouldn't be blocking on that

2014-12-09T03:52:27  <jeblair> clarkb, fungi: when i did the thread dump, 186 threads were
trying to acquire the lock, 14 were doing computation outside the lock, 1 held the lock and was
doing computation, and 800 threads were releasing the lock
2014-12-09T03:53:21  <jeblair> clarkb, fungi: what i'm getting from this is that we weren't
stuck waiting on entropy, but rather, when threads released the lock, they stopped getting
scheduled.

The jeblair went to sleep / EOD. Fast forward to the day after

2014-12-09T18:32:10  <jeblair> clarkb, fungi: http://paste.openstack.org/show/148067/
2014-12-09T18:32:28  <jeblair> clarkb, fungi: that script takes 3 seconds on our precise
 nodepool config, and 1 minute 12 seconds on trusty
2014-12-09T18:33:09  <jeblair> clarkb, fungi: it exhibits the same behavior where lots
of threads are apparently releasing a lock at the same time that i saw in my artificial test
 yesterday
2014-12-09T18:33:25  <jeblair> clarkb, fungi: i do not know if that is the _cause_ of our
 actual problems, or if it _shares the cause_

2014-12-09T18:38:35  <jeblair> clarkb, fungi: i can reproduce on precise with merely an upgrade of pycrypto from 2.4.1 to 2.6.1

Post lunch:

2014-12-09T21:20:28  <jeblair> clarkb, fungi: warning: GMP or MPIR library not found;
Not building Crypto.PublicKey._fastmath.
2014-12-09T21:42:35  <jeblair> fungi, clarkb: so my best theory is that what i've been seeing in my testing is just pycrypto hogging the cpu by doing lots of math while holding the GIL
2014-12-09T21:43:42  <jeblair> and what looks like threads stuck releasing the lock is threads being starved (it locks and releases a lot, so many of the releases are not at the same point in the cycle)

So that was merely a perf issue due to a missing fast math library. Got solved by adding the lib so pycrypto compiles with GMP:

Install libgmp-dev on nodepool  https://review.openstack.org/140490

Most probably unrelated.

http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2014-12-10.log

2014-12-09T23:37:21  <jeblair> clarkb: reduced to: set {paramiko, ecdsa}
2014-12-10T00:07:47  <jeblair> clarkb, fungi: one line in fact:
2014-12-10T00:07:50  <jeblair> in paramiko/rsakey.py
2014-12-10T00:07:50  <jeblair> -        rsa = RSA.generate(bits, rng.read, progress_func)
2014-12-10T00:07:50  <jeblair> +        rsa = RSA.generate(bits, os.urandom, progress_func)
2014-12-10T00:08:02  <jeblair> this has a familiar ring to it somehow
2014-12-10T00:10:47  <jeblair> https://github.com/paramiko/paramiko/commit/6f211115f49edcea7d23b764d7cf3a84ff12f5f0
2014-12-10T00:11:32  <clarkb> 5) it makes multithreaded python terrible

etc.... sleep time for me.

Change 298957 had a related patch set uploaded (by Paladox):
2.1.0-151-g30a433b-wmf3precise1

https://gerrit.wikimedia.org/r/298957

Change 298957 abandoned by Paladox:
2.1.0-151-g30a433b-wmf3precise1

Reason:
Per hashar

https://gerrit.wikimedia.org/r/298957

Change 298957 restored by Paladox:
2.1.0-151-g30a433b-wmf3precise1

Reason:
Actually on Jessie it runs 2.6.1 and works fine so I am wondering weather because we use an old version on precise that might be the reason.

https://gerrit.wikimedia.org/r/298957

I have finally managed to reproduce the deadlock on the labs instance integration-slave-precise-1002 with zuul_2.1.0-151-g30a433b-wmf1precise1_amd64.deb and a very dummy layout file \O/

Exact same trace with the paramiko transport thread being blocked by UserFriendlyRNG attempting to acquire entropy. A read() is blocked.

Turns out the issue is that python-daemon closes all file descriptor when daemonizing the application. That causes paramiko / Crypto random generator to loose track of /dev/urandom

Paramiko issue: https://github.com/paramiko/paramiko/issues/59

That is apparently solved by switching from PyCrypto Random to os.urandom https://github.com/paramiko/paramiko/commit/6f211115f49edcea7d23b764d7cf3a84ff12f5f0

Nodepool has it fixed in https://review.openstack.org/#/c/269303/ by bumping paramiko to 1.11.6.


I have confirmed that Random.Crypto is imported before the daemonization occurs. In zuul/cmd/server.py:

+ sys.exit(1)
  with daemon.DaemonContext(pidfile=pid):
      server.main()

Then using python -v to look at the import as:

/usr/share/python/zuul/bin/python -v /usr/bin/zuul-server -c /etc/zuul/zuul-server.conf

That shows paramiko / Crypto being imported :(

@hashar does that mean we have to update nodepool but also shoulden we use version paramiko 1.11.6 on precise

@hashar we could do it so it imports it after it daemonizing.

And then potentially upload the workaround to openstack for Zuul.

Change 298957 abandoned by Paladox:
2.1.0-151-g30a433b-wmf3precise1

https://gerrit.wikimedia.org/r/298957

hashar raised the priority of this task from Medium to High.

The root cause is the Zuul refactoring of Gerrit connections. The root class of all zuul CLI imports zuul.lib.connections which import paramiko / Crypto.Random. Thus the random number generator is initialized and points to /dev/urandom.

When the Server command line invokes the daemonization, python-daemon close all file descriptors. Including /dev/urandom which I have confirmed via strace. Then the server is started, attempt to configure the Gerrit connection. That invokes paramiko.SSHClient() which get the SSH headline, then attempt to get some random numbers. The read() cals stall on a closed?? file descriptor and the whole crap deadlocks.

It is a bug in the code refactoring. The zuul.lib.connections should not be global but done in the method that relies on it ZuulApp.configure.connections()

The fix is fairly trivial:

diff --git a/zuul/cmd/__init__.py b/zuul/cmd/__init__.py                                                                                         
index 2902c50..8787575 100644                                                                                                                    
--- a/zuul/cmd/__init__.py
+++ b/zuul/cmd/__init__.py
@@ -26,8 +26,6 @@ import traceback
 
 yappi = extras.try_import('yappi')
 
-import zuul.lib.connections
-
 # Do not import modules that will pull in paramiko which must not be
 # imported until after the daemonization.
 # https://github.com/paramiko/paramiko/issues/59
@@ -91,5 +89,8 @@ class ZuulApp(object):
             logging.basicConfig(level=logging.DEBUG)
 
     def configure_connections(self):
+        # See comment at top of file about zuul imports
+        import zuul.lib.connections
+
         self.connections = zuul.lib.connections.configure_connections(
             self.config)

Long term we want to upgrade to Paramiko 1.11.6 or later which has https://github.com/paramiko/paramiko/commit/6f211115f49edcea7d23b764d7cf3a84ff12f5f0 switching paramiko to use os.urandom instead of a global relying on a file descriptor.

Debian Jessie comes with paramiko 1.15.1. So we should be fine when we migrate Zuul to a Jessie host.

Change 299136 had a related patch set uploaded (by Hashar):
Do not import paramiko before daemonization

https://gerrit.wikimedia.org/r/299136

Change 299139 had a related patch set uploaded (by Hashar):
2.1.0-151-g30a433b-wmf3precise1

https://gerrit.wikimedia.org/r/299139

I have rebuild the Debian package and published it on https://people.wikimedia.org/~hashar/debs/zuul_2.1.0-151-g30a433b/

Also includes a few other unrelated bugfix.

Will want to deploy zuul_2.1.0-151-g30a433b-wmf3precise1_amd64.deb next week.

Change 299136 merged by Hashar:
Do not import paramiko before daemonization

https://gerrit.wikimedia.org/r/299136

Change 299139 merged by Hashar:
2.1.0-151-g30a433b-wmf3precise1

https://gerrit.wikimedia.org/r/299139

Mentioned in SAL [2016-07-18T08:31:45Z] <hashar> gallium: upgrading Zuul 2.1.0-95-g66c8e52-wmf1precise1 .. zuul_2.1.0-151-g30a433b-wmf3precise1 T137525

Fixed by moving an import ( https://gerrit.wikimedia.org/r/299136 )

Pushed / included in 2.1.0-151-g30a433b-wmf3precise1 which I have deployed on gallium at 8:30am UTC. Package also fix a few other issues.

Change 308805 had a related patch set uploaded (by Hashar):
zuul: stop logging paramiko

https://gerrit.wikimedia.org/r/308805

Change 308805 merged by Giuseppe Lavagetto:
zuul: stop logging paramiko

https://gerrit.wikimedia.org/r/308805