Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung))
Closed, Resolved (Public)

Description

Workaround

To remove the deadlock it is recommended to disconnect Jenkins from the Gearman server and reconnect it. This is done on the https://integration.wikimedia.org/ci/manage page:

Uncheck the Gearman enable checkbox, scroll to the bottom and save. That removes the deadlock instantly. After a few seconds, check the box again and save.

If it still fails, restart Jenkins entirely :(

Upstream bug is https://issues.jenkins-ci.org/browse/JENKINS-25867


From James' email to the QA list:

Beta Labs isn't synchronising; AFAICS it hasn't done so since ~11 hours
ago (15:10 UTC on 2014-09-08). I noticed this when prepping a patch for
tomorrow.

Going to https://integration.wikimedia.org/ci/view/Beta/ I found that
"beta-update-databases-eqiad" had been executing for 12 hours, and
initially assumed that we had a run-away update.php issue again. However,
on examination it looks like "deployment-bastion.eqiad", or the Jenkins
executor on it, isn't responding in some way:

pending—Waiting for next available executor on deployment-bastion.eqiad

I terminated the beta-update-databases-eqiad run to see if that would help,
but it just switched over to beta-scap-eqiad being the pending task.

Having chatted with MaxSem, I briefly disabled the deployment-bastion.eqiad
node in the Jenkins interface and then re-enabled it, to no effect.

Any ideas?


Version: unspecified
Severity: normal

Details

Reference
bz70597
bzimport raised the priority of this task to Normal.
bzimport set Reference to bz70597.
bzimport added a subscriber: Unknown Object (MLST).
greg created this task.Sep 9 2014, 2:48 AM

This happens once in a while. It's some sort of deadlock in Jenkins itself. Here's how I generally try to resolve it:

Sometimes you have to do this whole dance twice before Jenkins realizes that there are a bunch of executors it can use.

This deadlock seems to happen most often during or just after a database update that is taking a while to complete.

Thanks! Should we write these down somewhere for the next time it occurs?

(In reply to James Forrester from comment #2)

Thanks! Should we write these down somewhere for the next time it occurs?

Greg was way ahead of you -- https://www.mediawiki.org/w/index.php?title=Continuous_integration/Jenkins&diff=1144517&oldid=1091321

Ha. Thanks, both!

Meta: one of the upcoming themes for RelEng is improving Jenkins performance; this list of issues is a good place to start.

This happened again today. It seems like it always involves the database update job. Maybe related to https://issues.jenkins-ci.org/browse/JENKINS-10944. Apparently the parent build for a matrix job is not supposed to occupy an executor, but sometimes it does. It still seems to be an open bug upstream.

I have manually changed the config for the beta-update-databases-eqiad job in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep Jenkins from confusing itself. If this "works" then the changes should be backported into the JJB configuration to keep it from coming back the next time the job is updated from config.

  • Maximum Total Concurrent Builds: 2
  • Maximum Concurrent Builds Per Node: 2
  • Throttle Matrix master builds: true
  • Throttle Matrix configuration builds: true

(reopening to keep this on my radar for a few days)

greg added a comment.Sep 23 2014, 10:55 PM

Sept 23rd: 22:51 bd808: Jenkins stuck trying to update database in beta again with the dumb "waiting for executors" bug/problem

(just keeping a log of when it occurs)

(In reply to Bryan Davis from comment #6)

I have manually changed the config for the beta-update-databases-eqiad job
in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep
Jenkins from confusing itself. If this "works" then the changes should be
backported into the JJB configuration to keep it from coming back the next
time the job is updated from config.

  • Maximum Total Concurrent Builds: 2
  • Maximum Concurrent Builds Per Node: 2
  • Throttle Matrix master builds: true
  • Throttle Matrix configuration builds: true

    (reopening to keep this on my radar for a few days)

The throttle settings are still in place, so they didn't fix the problem. The fact that we didn't see this error for a couple of weeks is apparently uncorrelated.

That happened again on Oct 23 2014. Looking at Jenkins thread dumps, the deployment-bastion executor threads are locked by the Gearman plugin. It must have some logic error somewhere but I can't really debug Java ;(

Un-licking this cookie. I took a shot at it but the problem seems to be deeper in the stack.

I forgot to update this bug after my debugging session on Oct 24th; here is a rough brain dump:

The Gearman plugin source code is at https://review.openstack.org/openstack-infra/gearman-plugin.git

I have added a logger for the Gearman plugin ( hudson.plugins.gearman.logger @ INFO ):

https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/

Whenever the issue occurs we can switch it to FINE and get some debug messages.

IIRC the bastion executor threads were held in the lock() function within src/main/java/hudson/plugins/gearman/NodeAvailabilityMonitor.java. It takes a worker as a parameter and has a wait(5000).

I think the debug message showed a 'null' worker, possibly ending in a dead end where the node is always considered busy and the thread keeps waiting.

That might be a bad interaction with jobs scheduled by Jenkins itself, such as the hourly database update that also runs on the deployment bastion.

And of course I have no idea how to reproduce it, how to hook a debugger into Jenkins, or how to dump the state of variables :-/
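
To make the suspected failure mode concrete, here is a minimal sketch of the pattern described above (my own illustration, not the plugin's actual source; the class name, fields and isAvailableFor() helper are assumptions, only the lock()/wait(5000) behaviour comes from the thread dumps):

    // Illustrative sketch only -- not hudson.plugins.gearman code.
    // The point: if the worker handed to lock() is null, the availability
    // check can never succeed, so the thread keeps timing out on wait(5000)
    // and retrying forever (the TIMED_WAITING state seen in the dumps).
    public class AvailabilityMonitorSketch {
        private Object heldBy;                    // worker currently holding the node, if any
        private final Object monitor = new Object();

        public void lock(Object worker) throws InterruptedException {
            synchronized (monitor) {
                while (!isAvailableFor(worker)) {
                    monitor.wait(5000);           // matches the 5 second wait noted above
                }
                heldBy = worker;
            }
        }

        public void unlock(Object worker) {
            synchronized (monitor) {
                if (heldBy == worker) {
                    heldBy = null;
                    monitor.notifyAll();
                }
            }
        }

        private boolean isAvailableFor(Object worker) {
            // A null worker never passes this check, which would be the
            // "always considered busy" dead end suspected in this comment.
            return worker != null && (heldBy == null || heldBy == worker);
        }
    }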

hashar added a comment.EditedNov 4 2014, 2:22 PM

Created attachment 17012
Raw thread dump from Jenkins

The deployment-bastion.eqiad node is in a deadlock right now; I took a thread dump from Jenkins using https://integration.wikimedia.org/ci/monitoring?part=threadsDump

The Jenkins Gearman plugin seems to lock the node executors, with threads looking like:

  "Gearman worker deployment-bastion.eqiad_exec-0" daemon prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
	hudson.plugins.gearman.MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422)
	hudson.plugins.gearman.MyGearmanWorkerImpl$GrabJobEventHandler.handleEvent(MyGearmanWorkerImpl.java:99)
	org.gearman.common.GearmanTask.handleGearmanIOEvent(GearmanTask.java:92)
	org.gearman.common.GearmanJobServerSession.handleResSessionEvent(GearmanJobServerSession.java:329)
	org.gearman.common.GearmanJobServerSession.handleSessionEvent(GearmanJobServerSession.java:240)
	org.gearman.common.GearmanJobServerSession.driveSessionIO(GearmanJobServerSession.java:203)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:357)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:701)

Attached:

hashar added a comment.EditedNov 4 2014, 2:24 PM

From the Jenkins log https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/ ( hudson.plugins.gearman.logger @ FINEST ) there is a flood of:

AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1
Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1
Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake

The rest of the executors are held in lock with:

"Gearman worker deployment-bastion.eqiad_exec-2" daemon prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
	hudson.plugins.gearman.MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422)

I disconnected the Jenkins Gearman client, which was holding the lock. That unleashed the jobs that were meant to run on deployment-bastion.eqiad.

I am moving this bug to Continuous integration since its root cause is somewhere in the Jenkins Gearman plugin. That will make it easier for me to find this bug again.

Most executors are held in lock as mentioned in comment #14. I noticed one thread with a different stack trace:

"Gearman worker deployment-bastion.eqiad_exec-1" daemon prio=5 WAITING

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
hudson.plugins.gearman
    .StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
hudson.plugins.gearman
    .StartJobWorker.executeFunction(StartJobWorker.java:114)
org.gearman.worker
    .AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
org.gearman.worker
    .AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
hudson.plugins.gearman.MyGearmanWorkerImpl
    .submitFunction(MyGearmanWorkerImpl.java:590)
hudson.plugins.gearman.MyGearmanWorkerImpl
    .work(MyGearmanWorkerImpl.java:374)
hudson.plugins.gearman.AbstractWorkerThread
    .run(AbstractWorkerThread.java:166)
java.lang.Thread.run(Thread.java:701)
  • Bug 73659 has been marked as a duplicate of this bug.
hashar set Security to None.

So this is still happening and I really need to file it upstream.

The executors are locked by:

Gearman worker integration-slave1006_exec-3
java.lang.Object.wait(Native Method)
hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:310)
hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
java.lang.Thread.run(Thread.java:745)

The Jenkins logger for hudson.plugins.gearman.logger shows a flood of:

Nov 26, 2014 10:24:21 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for >SOME VALUE<

Where >SOME VALUE< is null or one of the executor threads.

Filed upstream at https://storyboard.openstack.org/#!/story/2000030 . Will have to poke Khai Do (upstream author) about it.

hashar renamed this task from Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) to Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung).Nov 26 2014, 10:48 PM
hashar raised the priority of this task from Normal to High.
hashar claimed this task.
hashar renamed this task from Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) to [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung).Dec 1 2014, 11:57 AM
hashar added a project: Upstream.

You can assume that this upstream bug will not be fixed within our lifetime, and we should not invest in learning all of Jenkins' internals to fix it ourselves (it's a significant problem in how Jenkins works).

To keep the Beta Cluster updates from getting stuck indefinitely on a daily basis, you can consider this blocked on T96199: Rewrite beta-update-databases to not use unstable Configuration Matrix.

I earlier upgraded the gearman plugin from 0.1.1 to f2024bd. From git log --reverse --no-merges 0.1.1..master the three interesting changes are:

commit 6de3cdd29bb8c4336a468985af6f8e0e4fd88e66
Author: James E. Blair <jeblair@hp.com>
Date:   Thu Jan 8 08:37:50 2015 -0800

    Protect against partially initialized executer workers
    
    The registerJobs method of an executorworker can be invoked by an
    external caller before the worker is completely initialized by
    its run method.  We protected against that by checking one instance
    variable, but there's still a race condition involving another.
    Add a check for that variable as well.
    
    Change-Id: I8e2cfffb54aa8a4cf8b1e61e9a9184b091054462

That one might solve the issue we are encountering with executors no longer being available.

commit 7abfdbd2d00010a1121cefebf479bcf104e7ef18
Author: James E. Blair <jeblair@hp.com>
Date:   Tue May 5 10:38:25 2015 -0700

    Stop sending status updates
    
    Don't send status updates every 10 seconds.  Only send them at the
    start of a job (to fill in information like worker and expected
    duration, etc).  We don't actually do anything with subsequent
    updates, and if Zuul wants to know how long a job has been running
    it's perfectly capable of working that out on its own.
    
    Change-Id: I4df5f82b3375239df35e3bc4b03e1263026f0a68
commit 65a08e0e959b0853538eabeec030d594a01c4385
Author: Clark Boylan <clark.boylan@gmail.com>
Date:   Mon May 4 18:09:31 2015 -0700

    Fix race between adding job and registering
    
    Gearman plugin had a race between adding jobs to the functionList and
    registering jobs. When registering jobs the functionMap is cleared, when
    adding a job the plugin checks if the job is in the function Map before
    running it. If we happen to trigger registration of jobs when we get a
    response from gearman with a job assignment then the functionMap can be
    empty making us send a work fail instead of running the job.
    
    To make things worse this jenkins worker would not send a subsequent
    GET JOB and would live lock never doing any useful work.
    
    Correct this by making the processing for gearman events synchronous in
    the work loop. This ensures that we never try to clear the function map
    and check against it at the same time via different threads. To make
    this happen the handleSessionEvent() method puts all events on a thread
    safe queue for synchronous processing. This has allowed us to simplify
    the work() loop and basically do the following:
    
      while running:
        init()
        register()
        process one event
        run function if processed
        drive IO
    
    This is much easier to reason about as we essentially only have
    bookkeeping and the code for one thing at a time.
    
    Change-Id: Id537710f6c8276a528ad78afd72c5a7c8e8a16ac

That race might be what is blocking our jobs as well :/
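
The work loop sketched in that commit message would look roughly like this (a simplified sketch under my own names; handleSessionEvent(), dispatch() and driveSessionIO() are placeholders, not the plugin's real methods):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of the synchronous work loop described in commit 65a08e0:
    // other threads only enqueue events; the worker thread processes at most
    // one event per iteration, so registering functions and checking the
    // function map can no longer race against each other.
    class GearmanWorkerLoopSketch implements Runnable {
        private final BlockingQueue<Object> events = new LinkedBlockingQueue<>();
        private volatile boolean running = true;

        // Called from other threads: never touches shared worker state directly.
        void handleSessionEvent(Object event) {
            events.offer(event);
        }

        @Override
        public void run() {
            init();                 // placeholder: connect to the Gearman server
            registerFunctions();    // placeholder: (re)build the function map
            while (running) {
                Object event = events.poll();            // process one event, if any
                if (event != null) {
                    Runnable function = dispatch(event); // placeholder lookup
                    if (function != null) {
                        function.run();                  // run the matched function
                    }
                }
                driveSessionIO();   // placeholder: network bookkeeping
            }
        }

        private void init() {}
        private void registerFunctions() {}
        private Runnable dispatch(Object event) { return null; }
        private void driveSessionIO() {}
    }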

@Antoine, I gave up on storyboard. We are working on it over in the jira issue tracker.. https://issues.jenkins-ci.org/browse/JENKINS-25867

one additional change you should pick up is https://review.openstack.org/#/c/192429/

Ah thanks for switching to JIRA, this way we get mail notifications :-D I followed up there.

Rebuilding the plugin with https://review.openstack.org/#/c/192429/2

git fetch https://review.openstack.org/openstack-infra/gearman-plugin refs/changes/29/192429/2 && git checkout FETCH_HEAD
mvn -Dproject-version="`git describe`-change_192429_2" -DskipTests=true  clean package

Thus upgrading the plugin from 0.1.1-8-gf2024bd to 0.1.1-9-g08e9c42-change_192429_2.

hashar closed this task as Resolved.Jul 22 2015, 1:05 PM

I haven't noticed the error since July 1st, nor do the Jenkins logs show any null lock. So it seems the fix in the gearman plugin resolved it.

hashar reopened this task as Open.Jul 28 2015, 10:34 AM

It happened again :(

Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null

With beta-scap-eqiad and beta-update-databases-eqiad being stuck waiting for an available executor on deployment-bastion.

Marking the node offline and online doesn't remove the lock :-/

The executor threads have:

"Gearman worker deployment-bastion.eqiad_exec-1" prio=5 WAITING
	java.lang.Object.wait(Native Method)
	java.lang.Object.wait(Object.java:503)
	hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
	hudson.plugins.gearman.StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
	hudson.plugins.gearman.StartJobWorker.executeFunction(StartJobWorker.java:114)
	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
	hudson.plugins.gearman.MyGearmanWorkerImpl.submitFunction(MyGearmanWorkerImpl.java:593)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:328)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-2" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-3" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-4" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-5" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.

The workaround I found was to remove the label from the node. Once done, the jobs showed in the queue with 'no node having label deployment-bastion-eqiad'. I then applied the label to the host again and the jobs managed to run.

So maybe it is an issue in Jenkins itself :-}

Change 227440 had a related patch set uploaded (by Hashar):
beta: expand {datacenter} to 'eqiad'

https://gerrit.wikimedia.org/r/227440

Change 227441 had a related patch set uploaded (by Hashar):
beta: disambiguate Jenkins label from node name

https://gerrit.wikimedia.org/r/227441

Change 227440 merged by jenkins-bot:
beta: expand {datacenter} to 'eqiad'

https://gerrit.wikimedia.org/r/227440

Change 227441 merged by jenkins-bot:
beta: disambiguate Jenkins label from node name

https://gerrit.wikimedia.org/r/227441

I renamed the Jenkins label to disambiguate the node name and the label (now BetaClusterBastion).

hashar added a comment.EditedDec 16 2015, 11:41 AM

That still happens from time to time with Jenkins 1.625.3 and the Gearman Plugin 0.1.3.3.01da2d4 (which is 1.3.3 + https://review.openstack.org/#/c/252768/ ).

To remove the deadlock one can either disconnect and reconnect the Gearman plugin or restart Jenkins, as described in the workaround in the task description.

hashar updated the task description. (Show Details)Dec 16 2015, 11:43 AM
hashar updated the task description. (Show Details)
hashar moved this task from Backlog to Reported Upstream on the Upstream board.
hashar removed hashar as the assignee of this task.

Upstream https://review.openstack.org/#/c/252768/ has been abandoned in favor of https://review.openstack.org/#/c/271543/ . It uses a different internal API which should no longer return null. Specifically, it replaces:

- Computer.currentComputer()
+ Jenkins.getActiveInstance().getComputer("")

But that is solely to properly get the master node, on which we run no jobs, so it is unlikely to fix anything for us.
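
For context, the difference between the two calls is roughly this (my own sketch, assuming standard Jenkins core behaviour: Computer.currentComputer() resolves the computer from the calling thread and can return null outside an executor thread, while the master computer can be looked up by its empty-string node name):

    import hudson.model.Computer;
    import jenkins.model.Jenkins;

    // Sketch of the API swap in the upstream change, not plugin code.
    class MasterComputerLookupSketch {
        static Computer oldWay() {
            // Resolved from the calling thread; on a non-executor thread
            // (e.g. a Gearman worker thread) this can be null.
            return Computer.currentComputer();
        }

        static Computer newWay() {
            // Look the master up by its node name ("" for the master),
            // independent of which thread makes the call.
            return Jenkins.getActiveInstance().getComputer("");
        }
    }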

Updated the Gearman plugin to 0.1.3.3.a5164d6

Locked up again. I don't quite understand the instructions in the summary that say you can disable the gearman plugin without restarting Jenkins. Doesn't enabling/disabling a plugin require a restart?

hashar updated the task description. (Show Details)Feb 4 2016, 12:30 PM

Sorry @bd808, the phrasing wasn't perfect. There is no need to disable the Gearman plugin; you just have to disconnect it from the Gearman server, which is done on the Jenkins manage page:

Doing so causes the Gearman client in Jenkins to disconnect from the Zuul Gearman server, stop overriding the Jenkins executors and reset its state, which gets rid of the deadlock.

I have updated the task summary with above screenshot (that shows my leet Paint skills).

I have updated the task summary with above screenshot (that shows my leet Paint skills).

Thanks @hashar. I'd forgotten about that setting and was instead thinking that the instruction was to disable/enable the plugin at https://integration.wikimedia.org/ci/pluginManager/.

Krinkle removed a subscriber: Krinkle.Feb 23 2016, 4:49 PM
Danny_B renamed this task from [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) to Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung).May 31 2016, 3:14 PM
Danny_B removed a subscriber: wikibugs-l-list.
Danny_B removed a subscriber: Language-Team.

Blind search-and-replace was incorrect here. This is not a Language-Team bug, we were only CCed.

hashar lowered the priority of this task from High to Normal.Jul 27 2016, 9:12 AM
hashar closed this task as Resolved.
hashar claimed this task.

This still happens, albeit very rarely nowadays, to the point that it is almost a non-issue. I have only noticed it once over the last few months, and the root cause was unrelated (thread starvation in the Jenkins SSH plugin that caused it to no longer properly connect slaves).

Assuming it is fixed.