For some reason, CI might end up in a stall state. The Zuul scheduler is stuck waiting for a response from Gerrit that never comes and there is no timeout.
Workaround: restart zuul to clear the stuck read()
@hashar guess of what happens is roughly:
- Gerrit fails to open the mysql connection (Cannot open ReviewDb)
- the exception does not cause the gerrit query ssh channel to be closed
- Zuul Paramiko connection apparently does not set a timeout
- the thread is stuck
- the scheduler is halted ...
In zuul/connection/gerrit.py, it uses paramiko.SSHClient().exec_command(). The opened channel defaults to not having any timeout, ie it is blocking waiting for some data.
Thread: 140413570619136 File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap self.__bootstrap_inner() File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner self.run() File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 128, in run self._handleEvent() File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 116, in _handleEvent event.change_number, event.patch_number, refresh=True) File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange self._updateChange(change, history) File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 238, in _updateChange data = self.connection.queryChange(change.number) File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 315, in queryChange return self.query('change:%s' % number) File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 301, in query out, err = self._ssh(cmd) File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 391, in _ssh out = stdout.read() File "/usr/lib/python2.7/dist-packages/paramiko/file.py", line 150, in read new_data = self._read(self._DEFAULT_BUFSIZE) File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 1217, in _read return self.channel.recv(size) File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 596, in recv out = self.in_buffer.read(nbytes, self.timeout) File "/usr/lib/python2.7/dist-packages/paramiko/buffered_pipe.py", line 147, in read self._cv.wait(timeout) File "/usr/lib/python2.7/threading.py", line 340, in wait waiter.acquire()
On Gerrit side I see:
[2020-03-05 08:09:15,471] [SshCommandStart-2] WARN com.google.gerrit.sshd.CommandFactoryProvider : Cannot start command "gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:577096" for user jenkins-bot com.google.inject.ProvisionException: Unable to provision, see the following errors: 1) Cannot open ReviewDb at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:69) (via modules: com.google.gerrit.server.config.GerritGlobalModule -> com.google.gerrit.server.util.ThreadLocalRequestContext$1) while locating com.google.gerrit.reviewdb.server.ReviewDb for the 1st parameter of com.google.gerrit.server.query.change.OutputStreamQuery.<init>(OutputStreamQuery.java:106) while locating com.google.gerrit.server.query.change.OutputStreamQuery for field at com.google.gerrit.sshd.commands.Query.processor(Query.java:27) while locating com.google.gerrit.sshd.commands.Query while locating org.apache.sshd.server.Command annotated with CommandName[gerrit query] 1 error at com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226) at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1053) at com.google.gerrit.sshd.DispatchCommand.start(DispatchCommand.java:95) at com.google.gerrit.sshd.DispatchCommand.start(DispatchCommand.java:122) at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.onStart(CommandFactoryProvider.java:208) at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.access$300(CommandFactoryProvider.java:111) at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline$1.run(CommandFactoryProvider.java:167) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:558) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: com.google.gwtorm.server.OrmException: Cannot open database connection at com.google.gwtorm.jdbc.Database.newConnection(Database.java:130) at com.google.gwtorm.jdbc.JdbcSchema.<init>(JdbcSchema.java:43) at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$13.<init>(Unknown Source) at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$13_Factory_GwtOrm$$14.open(Unknown Source) at com.google.gwtorm.jdbc.Database.open(Database.java:122) at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:39) at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:25) at com.google.gerrit.server.config.RequestScopedReviewDbProvider.get(RequestScopedReviewDbProvider.java:46) at com.google.gerrit.server.config.RequestScopedReviewDbProvider.get(RequestScopedReviewDbProvider.java:27) at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:69) at com.google.gerrit.server.util.ThreadLocalRequestContext$1$$FastClassByGuice$$75e0eb90.invoke(<generated>) at com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264) at com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173) at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185) at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162) at com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:42) at com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:65) at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:113) at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91) at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:306) at com.google.inject.internal.SingleFieldInjector.inject(SingleFieldInjector.java:52) at com.google.inject.internal.MembersInjectorImpl.injectMembers(MembersInjectorImpl.java:147) at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:124) at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91) at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:306) at com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:62) at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1050) ... 13 more Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure The last packet successfully received from the server was 3,474 milliseconds ago. The last packet sent successfully to the server was 0 milliseconds ago. at sun.reflect.GeneratedConstructorAccessor582.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at com.mysql.jdbc.Util.handleNewInstance(Util.java:418) at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3559) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3459) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3900) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:873) at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1710) at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1226) at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194) at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225) at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024) at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779) at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47) at sun.reflect.GeneratedConstructorAccessor63.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at com.mysql.jdbc.Util.handleNewInstance(Util.java:418) at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389) at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330) at com.google.gwtorm.jdbc.SimpleDataSource.getConnection(SimpleDataSource.java:104) at com.google.gwtorm.jdbc.Database.newConnection(Database.java:128) ... 39 more Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost. at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011) at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3469) ... 57 more
That is for l10n-bot change https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Collection/+/577096/ when handling the post merge event. It triggers a mwext-phpunit-coverage-docker-publish job.
In the scheduler the last zuul.source.Gerrit log event is:
2020-03-05 08:09:11,991 INFO zuul.source.Gerrit: Updating <Change 0x7fb4776bbf10 577096,1>