Page MenuHomePhabricator

CI / Zuul is no more processing events
Closed, ResolvedPublic

Description

Zuul trace obtained by sending SIGUSR2 which write the dump to the debug.log file P8357

last events:

2019-04-05 20:39:40,514 DEBUG zuul.Scheduler: Run handler sleeping
2019-04-05 20:42:16,006 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 20:47:16,006 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 20:52:16,007 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 20:57:16,007 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 21:02:16,008 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 21:07:16,008 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 21:12:16,009 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 21:17:16,009 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 21:22:16,010 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 21:27:16,010 DEBUG zuul.Gearman: Looking for lost builds
2019-04-05 21:27:50,845 DEBUG zuul.stack_dump: Thread: 140422873605888

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-04-05T21:29:59Z] <hashar> CI / Zuul is no more processing events / T220243

So Zuul has two connection established:

contint1001$ sudo netstat -tp|grep gerrit
tcp6       0      0 contint1001.wikim:53720 gerrit.wikimedia.:29418 ESTABLISHED 24045/python    
tcp6       0      0 contint1001.wikim:53670 gerrit.wikimedia.:29418 ESTABLISHED 24045/python
'gerrit show-connections'
Session    User            Remote Host
--------------------------------------------------------------
2765248a   jenkins-bot     contint1001.wikimedi
a746f4a4   jenkins-bot     contint1001.wikimedi

I think there is one that is deadlocked somehow when trying to get a change:

	Thread: 140422238095104
	  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
	    self.__bootstrap_inner()
	  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
	    self.run()
	  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 128, in run
	    self._handleEvent()
	  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 116, in _handleEvent
	    event.change_number, event.patch_number, refresh=True)
	  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
	    self._updateChange(change, history)
	  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 238, in _updateChange
	    data = self.connection.query(change.number)
	  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 291, in query
	    out, err = self._ssh(cmd)
	  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 378, in _ssh
	    out = stdout.read()
	  File "/usr/lib/python2.7/dist-packages/paramiko/file.py", line 150, in read
	    new_data = self._read(self._DEFAULT_BUFSIZE)
	  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 1217, in _read
	    return self.channel.recv(size)
	  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 596, in recv
	    out = self.in_buffer.read(nbytes, self.timeout)
	  File "/usr/lib/python2.7/dist-packages/paramiko/buffered_pipe.py", line 147, in read
	    self._cv.wait(timeout)
	  File "/usr/lib/python2.7/threading.py", line 340, in wait
	    waiter.acquire()

[2019-04-05 17:28:48,600 +0000] 2765248a jenkins-bot a/75 LOGIN FROM 2620:0:861:1:208:80:154:17

[2019-04-05 20:33:32,056 +0000] a746f4a4 jenkins-bot a/75 gerrit.review.--project.mediawiki/core.--message.Post-merge build succeeded.
[2019-04-05 20:34:06,620 +0000] a746f4a4 jenkins-bot a/75 gerrit.review.--project.mediawiki/extensions/PageTriage.--message.Post-merge build succeeded.
[2019-04-05 20:34:11,225 +0000] a746f4a4 jenkins-bot a/75 gerrit.review.--project.mediawiki/services/parsoid.--message.Main test build failed.
[2019-04-05 20:38:28,123 +0000] a746f4a4 jenkins-bot a/75 gerrit.review.--project.mediawiki/extensions/CloseWikis.--message.Gate pipeline build succeeded.
[2019-04-05 20:38:28,207 +0000] a746f4a4 jenkins-bot a/75 gerrit.query.--format.json.--all-approvals.--comments.--commit-message.--current-patch-set.--dependencies.--files.--patch-sets.--submit-records.492449 14ms 17ms 0
[2019-04-05 20:39:40,326 +0000] a746f4a4 jenkins-bot a/75 gerrit.review.--project.mediawiki/core.--message.Gate pipeline build failed.
[2019-04-05 20:39:40,416 +0000] a746f4a4 jenkins-bot a/75 gerrit.query.--format.json.--all-approvals.--comments.--commit-message.--current-patch-set.--dependencies.--files.--patch-sets.--submit-records.498202 17ms 28ms 0

[2019-04-05 21:37:35,112 +0000] 2765248a jenkins-bot a/75 LOGOUT

[2019-04-05 21:39:02,151 +0000] aa7f532d jenkins-bot a/75 LOGIN FROM 2620:0:861:1:208:80:154:17

Given the last event in Zuul was before 20:39:40

Mentioned in SAL (#wikimedia-operations) [2019-04-05T21:45:48Z] <hashar> thcipriani restarted Gerrit. CI works again # T220243

Then looking in the Zuul error.log:

2019-04-05 21:38:34,742 ERROR zuul.GerritEventConnector: Exception moving Gerrit event:
Traceback (most recent call last):
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 128, in run
    self._handleEvent()
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 116, in _handleEvent
    event.change_number, event.patch_number, refresh=True)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
    self._updateChange(change, history)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 238, in _updateChange
    data = self.connection.query(change.number)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 291, in query
    out, err = self._ssh(cmd)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 387, in _ssh
    raise Exception("Gerrit error executing %s" % command)
Exception: Gerrit error executing gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records 501765
2019-04-05 21:38:34,745 ERROR zuul.GerritEventConnector: Exception moving Gerrit event:
Traceback (most recent call last):
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 128, in run
    self._handleEvent()
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 116, in _handleEvent
    event.change_number, event.patch_number, refresh=True)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
    self._updateChange(change, history)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 238, in _updateChange
    data = self.connection.query(change.number)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 291, in query
    out, err = self._ssh(cmd)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 372, in _ssh
    self._open()
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 361, in _open
    key_filename=self.keyfile)
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 251, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/dist-packages/paramiko/util.py", line 278, in retry_on_signal
    return function()
  File "/usr/lib/python2.7/dist-packages/paramiko/client.py", line 251, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
2019-04-05 21:38:34,757 ERROR zuul.GerritEventConnector: Exception moving Gerrit event:
error: [Errno 111] Connection refused

etc until 21:38:36

hashar added a subscriber: thcipriani.

Solved by restarting Gerrit due to an issue which I think is already tracked somewhere else.

And for the record Gerrit had a thread for Zuul doing:

SSH gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records 501765 (jenkins-bot)

Which was not sending any reply.

SSH gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records 501765 (jenkins-bot)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2089)
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
com.google.common.cache.LocalCache.get(LocalCache.java:3943)
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
com.google.gerrit.server.account.AccountCacheImpl.get(AccountCacheImpl.java:85)
com.google.gerrit.server.events.EventFactory.asAccountAttribute(EventFactory.java:570)
com.google.gerrit.server.events.EventFactory.asChangeAttribute(EventFactory.java:157)
com.google.gerrit.server.query.change.OutputStreamQuery.buildChangeAttribute(OutputStreamQuery.java:239)
com.google.gerrit.server.query.change.OutputStreamQuery.query(OutputStreamQuery.java:204)
com.google.gerrit.sshd.commands.Query.run(Query.java:103)
com.google.gerrit.sshd.SshCommand$1.run(SshCommand.java:36)
com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:459)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:558)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

Or it is because the cache were being flushed and its a redhearing

SSH gerrit flush-caches --cache accounts (thcipriani)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
com.google.common.cache.LocalCache$Segment.clear(LocalCache.java:3180)
com.google.common.cache.LocalCache.clear(LocalCache.java:4263)
com.google.common.cache.LocalCache$LocalManualCache.invalidateAll(LocalCache.java:4903)
com.google.gerrit.server.config.FlushCache.apply(FlushCache.java:56)
com.google.gerrit.server.config.PostCaches.flush(PostCaches.java:128)
com.google.gerrit.server.config.PostCaches.apply(PostCaches.java:88)
com.google.gerrit.sshd.commands.FlushCaches.run(FlushCaches.java:80)
com.google.gerrit.sshd.SshCommand$1.run(SshCommand.java:36)
com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:459)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:558)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

Anyway, not much to investigate there :(

I am pretty sure that is the same issue I have debugged a year later: T246973