Page MenuHomePhabricator

CI / Zuul not processing changes
Closed, ResolvedPublic

Description

For some reason, CI might end up in a stall state. The Zuul scheduler is stuck waiting for a response from Gerrit that never comes and there is no timeout.

Workaround: restart zuul to clear the stuck read()

@hashar guess of what happens is roughly:

  • Gerrit fails to open the mysql connection (Cannot open ReviewDb)
  • the exception does not cause the gerrit query ssh channel to be closed
  • Zuul Paramiko connection apparently does not set a timeout
  • the thread is stuck
  • the scheduler is halted ...

In zuul/connection/gerrit.py, it uses paramiko.SSHClient().exec_command(). The opened channel defaults to not having any timeout, ie it is blocking waiting for some data.

Zuul thread dump
Thread: 140413570619136
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 128, in run
    self._handleEvent()
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 116, in _handleEvent
    event.change_number, event.patch_number, refresh=True)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
    self._updateChange(change, history)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 238, in _updateChange
    data = self.connection.queryChange(change.number)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 315, in queryChange
    return self.query('change:%s' % number)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 301, in query
    out, err = self._ssh(cmd)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 391, in _ssh
    out = stdout.read()
  File "/usr/lib/python2.7/dist-packages/paramiko/file.py", line 150, in read
    new_data = self._read(self._DEFAULT_BUFSIZE)
  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 1217, in _read
    return self.channel.recv(size)
  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 596, in recv
    out = self.in_buffer.read(nbytes, self.timeout)
  File "/usr/lib/python2.7/dist-packages/paramiko/buffered_pipe.py", line 147, in read
    self._cv.wait(timeout)
  File "/usr/lib/python2.7/threading.py", line 340, in wait
    waiter.acquire()

On Gerrit side I see:

[2020-03-05 08:09:15,471] [SshCommandStart-2] WARN  com.google.gerrit.sshd.CommandFactoryProvider : Cannot start command "gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:577096" for user jenkins-bot
com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Cannot open ReviewDb
  at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:69) (via modules: com.google.gerrit.server.config.GerritGlobalModule -> com.google.gerrit.server.util.ThreadLocalRequestContext$1)
  while locating com.google.gerrit.reviewdb.server.ReviewDb
    for the 1st parameter of com.google.gerrit.server.query.change.OutputStreamQuery.<init>(OutputStreamQuery.java:106)
  while locating com.google.gerrit.server.query.change.OutputStreamQuery
    for field at com.google.gerrit.sshd.commands.Query.processor(Query.java:27)
  while locating com.google.gerrit.sshd.commands.Query
  while locating org.apache.sshd.server.Command annotated with CommandName[gerrit query]

1 error
        at com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
        at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1053)
        at com.google.gerrit.sshd.DispatchCommand.start(DispatchCommand.java:95)
        at com.google.gerrit.sshd.DispatchCommand.start(DispatchCommand.java:122)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.onStart(CommandFactoryProvider.java:208)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.access$300(CommandFactoryProvider.java:111)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline$1.run(CommandFactoryProvider.java:167)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:558)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.gwtorm.server.OrmException: Cannot open database connection
        at com.google.gwtorm.jdbc.Database.newConnection(Database.java:130)
        at com.google.gwtorm.jdbc.JdbcSchema.<init>(JdbcSchema.java:43)
        at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$13.<init>(Unknown Source)
        at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$13_Factory_GwtOrm$$14.open(Unknown Source)
        at com.google.gwtorm.jdbc.Database.open(Database.java:122)
        at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:39)
        at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:25)
        at com.google.gerrit.server.config.RequestScopedReviewDbProvider.get(RequestScopedReviewDbProvider.java:46)
        at com.google.gerrit.server.config.RequestScopedReviewDbProvider.get(RequestScopedReviewDbProvider.java:27)
        at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:69)
        at com.google.gerrit.server.util.ThreadLocalRequestContext$1$$FastClassByGuice$$75e0eb90.invoke(<generated>)
        at com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264)
        at com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
        at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
        at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
        at com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:42)
        at com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:65)
        at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:113)
        at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91)
        at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:306)
        at com.google.inject.internal.SingleFieldInjector.inject(SingleFieldInjector.java:52)
        at com.google.inject.internal.MembersInjectorImpl.injectMembers(MembersInjectorImpl.java:147)
        at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:124)
        at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91)
        at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:306)
        at com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:62)
        at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1050)
        ... 13 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 3,474 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
        at sun.reflect.GeneratedConstructorAccessor582.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:418)
        at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3559)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3459)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3900)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:873)
        at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1710)
        at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1226)
        at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194)
        at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225)
        at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024)
        at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779)
        at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
        at sun.reflect.GeneratedConstructorAccessor63.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:418)
        at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
        at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
        at com.google.gwtorm.jdbc.SimpleDataSource.getConnection(SimpleDataSource.java:104)
        at com.google.gwtorm.jdbc.Database.newConnection(Database.java:128)
        ... 39 more
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3469)
        ... 57 more

That is for l10n-bot change https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Collection/+/577096/ when handling the post merge event. It triggers a mwext-phpunit-coverage-docker-publish job.

In the scheduler the last zuul.source.Gerrit log event is:

2020-03-05 08:09:11,991 INFO zuul.source.Gerrit: Updating <Change 0x7fb4776bbf10 577096,1>

Event Timeline

It seems that Zuul no more receives any events through Gerrit events streaming :(

DannyS712 triaged this task as High priority.EditedMar 5 2020, 9:47 AM
DannyS712 added a project: User-DannyS712.
DannyS712 subscribed.

Was about to report this myself

Could this be related to T246903: Investigate Zuul crashing due to BlueSpice REL1_31 backlog?

Zuul stack dump:

There is apparently a thread stuck while querying Gerrit for a change:

Thread: 140413570619136
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 128, in run
    self._handleEvent()
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 116, in _handleEvent
    event.change_number, event.patch_number, refresh=True)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 173, in _getChange
    self._updateChange(change, history)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/source/gerrit.py", line 238, in _updateChange
    data = self.connection.queryChange(change.number)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 315, in queryChange
    return self.query('change:%s' % number)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 301, in query
    out, err = self._ssh(cmd)
  File "/usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/connection/gerrit.py", line 391, in _ssh
    out = stdout.read()
  File "/usr/lib/python2.7/dist-packages/paramiko/file.py", line 150, in read
    new_data = self._read(self._DEFAULT_BUFSIZE)
  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 1217, in _read
    return self.channel.recv(size)
  File "/usr/lib/python2.7/dist-packages/paramiko/channel.py", line 596, in recv
    out = self.in_buffer.read(nbytes, self.timeout)
  File "/usr/lib/python2.7/dist-packages/paramiko/buffered_pipe.py", line 147, in read
    self._cv.wait(timeout)
  File "/usr/lib/python2.7/threading.py", line 340, in wait
    waiter.acquire()

Mentioned in SAL (#wikimedia-operations) [2020-03-05T09:53:04Z] <hashar> Restarting Zuul, it no more process Gerrit events due to a thread stuck waiting on Gerrit.. T246973

On Gerrit side I see:

[2020-03-05 08:09:15,471] [SshCommandStart-2] WARN  com.google.gerrit.sshd.CommandFactoryProvider : Cannot start command "gerrit query --format json --all-approvals --comments --commit-message --current-patch-set --dependencies --files --patch-sets --submit-records change:577096" for user jenkins-bot
com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Cannot open ReviewDb
  at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:69) (via modules: com.google.gerrit.server.config.GerritGlobalModule -> com.google.gerrit.server.util.ThreadLocalRequestContext$1)
  while locating com.google.gerrit.reviewdb.server.ReviewDb
    for the 1st parameter of com.google.gerrit.server.query.change.OutputStreamQuery.<init>(OutputStreamQuery.java:106)
  while locating com.google.gerrit.server.query.change.OutputStreamQuery
    for field at com.google.gerrit.sshd.commands.Query.processor(Query.java:27)
  while locating com.google.gerrit.sshd.commands.Query
  while locating org.apache.sshd.server.Command annotated with CommandName[gerrit query]

1 error
        at com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
        at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1053)
        at com.google.gerrit.sshd.DispatchCommand.start(DispatchCommand.java:95)
        at com.google.gerrit.sshd.DispatchCommand.start(DispatchCommand.java:122)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.onStart(CommandFactoryProvider.java:208)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.access$300(CommandFactoryProvider.java:111)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline$1.run(CommandFactoryProvider.java:167)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:558)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.gwtorm.server.OrmException: Cannot open database connection
        at com.google.gwtorm.jdbc.Database.newConnection(Database.java:130)
        at com.google.gwtorm.jdbc.JdbcSchema.<init>(JdbcSchema.java:43)
        at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$13.<init>(Unknown Source)
        at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$13_Factory_GwtOrm$$14.open(Unknown Source)
        at com.google.gwtorm.jdbc.Database.open(Database.java:122)
        at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:39)
        at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:25)
        at com.google.gerrit.server.config.RequestScopedReviewDbProvider.get(RequestScopedReviewDbProvider.java:46)
        at com.google.gerrit.server.config.RequestScopedReviewDbProvider.get(RequestScopedReviewDbProvider.java:27)
        at com.google.gerrit.server.util.ThreadLocalRequestContext$1.provideReviewDb(ThreadLocalRequestContext.java:69)
        at com.google.gerrit.server.util.ThreadLocalRequestContext$1$$FastClassByGuice$$75e0eb90.invoke(<generated>)
        at com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264)
        at com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
        at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
        at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
        at com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:42)
        at com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:65)
        at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:113)
        at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91)
        at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:306)
        at com.google.inject.internal.SingleFieldInjector.inject(SingleFieldInjector.java:52)
        at com.google.inject.internal.MembersInjectorImpl.injectMembers(MembersInjectorImpl.java:147)
        at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:124)
        at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:91)
        at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:306)
        at com.google.inject.internal.FactoryProxy.get(FactoryProxy.java:62)
        at com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1050)
        ... 13 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 3,474 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
        at sun.reflect.GeneratedConstructorAccessor582.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:418)
        at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3559)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3459)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3900)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:873)
        at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1710)
        at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1226)
        at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194)
        at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225)
        at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024)
        at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779)
        at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
        at sun.reflect.GeneratedConstructorAccessor63.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:418)
        at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
        at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
        at com.google.gwtorm.jdbc.SimpleDataSource.getConnection(SimpleDataSource.java:104)
        at com.google.gwtorm.jdbc.Database.newConnection(Database.java:128)
        ... 39 more
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3469)
        ... 57 more

That is for l10n-bot change https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Collection/+/577096/ when handling the post merge event. It triggers a mwext-phpunit-coverage-docker-publish job.

In the scheduler the last zuul.source.Gerrit log event is:

2020-03-05 08:09:11,991 INFO zuul.source.Gerrit: Updating <Change 0x7fb4776bbf10 577096,1>

And at this point the thread is locked waiting for data?

My guess of what happens is roughly:

  • Gerrit fails to open the mysql connection (Cannot open ReviewDb)
  • the exception does not cause the gerrit query ssh channel to be closed
  • Zuul Paramiko connection apparently does not set a timeout
  • the thread is stuck
  • the scheduler is halted ...

The fix has been to restart Zuul, though restarting Gerrit might have solved it as well.

Maybe we can add a timeout in zuul/connection/gerrit.py. It uses paramiko.SSHClient().exec_command(). The opened channel defaults to not having any timeout, ie it is blocking waiting for some data.

hashar claimed this task.

Mentioned in SAL (#wikimedia-operations) [2020-03-11T12:54:31Z] <thcipriani> restarting gerrit to try to fix thread deadlock on zuul (cf: T246973 )

Mentioned in SAL (#wikimedia-operations) [2020-03-11T13:01:54Z] <thcipriani> restarting gerrit unstuck the zuul server (T246973)

Reopening since that still needs to be investigated and the root cause acted on.

hashar updated the task description. (Show Details)

I have filed T247591 for the DBA, there might be some errors showing up on the database side.

hashar changed the task status from Open to Stalled.Mar 13 2020, 10:15 AM

Stall for now, the workaround is to restart Gerrit to clear up the socket read() call.

T247591 to investigate the connection issue.

It is unlikely we will figure out the root cause anytime soon, hence I am closing it for now since it hasn't occurred since March 11st.

That can later be monitored in logstash by querying host:gerrit1001 AND "Cannot open ReviewDb" / https://logstash.wikimedia.org/goto/a5e73386bf3e1a2f0650577153e9d5d8

Happened again and restarting Gerrit cause Zuul to eventually flush the whole event queue since it can not find the projects by querying Gerrit.

hashar updated the task description. (Show Details)
thcipriani subscribed.

If the problem here was Gerrit not being able to open a connection to ReviewDB that should have been solved by the 3.2 upgrade that removed ReviewDB. Boldly closing.