Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung))
Closed, Resolved (Public)

Description

Workaround

To remove the deadlock it is recommended to disconnect Jenkins from the Gearman server and reconnect it. This is done on the https://integration.wikimedia.org/ci/manage page:

Uncheck the Gearman enable checkbox, scroll to the bottom and save. That removes the deadlock instantly. After a few seconds, check the box again and save.

If it still fails, restart Jenkins entirely :(

Upstream bug is https://issues.jenkins-ci.org/browse/JENKINS-25867


From James' email to the QA list:

Beta Labs isn't synchronising; AFAICS it hasn't done so since ~11 hours
ago (15:10 UTC on 2014-09-08). I noticed this when prepping a patch for
tomorrow.

Going to https://integration.wikimedia.org/ci/view/Beta/ I found that
"beta-update-databases-eqiad" had been executing for 12 hours, and
initially assumed that we had a run-away update.php issue again. However,
on examination it looks like "deployment-bastion.eqiad", or the Jenkins
executor on it, isn't responding in some way:

pending—Waiting for next available executor on deployment-bastion.eqiad

I terminated the beta-update-databases-eqiad run to see if that would help,
but it just switched over to beta-scap-eqiad being the pending task.

Having chatted with MaxSem, I briefly disabled the deployment-bastion.eqiad
node in the Jenkins interface and then re-enabled it, to no effect.

Any ideas?


Version: unspecified
Severity: normal

Details

Reference
bz70597
bzimport raised the priority of this task to Normal.
bzimport set Reference to bz70597.
bzimport added a subscriber: Unknown Object (MLST).
greg created this task.Sep 9 2014, 2:48 AM

This happens once in a while. It's some sort of deadlock in Jenkins itself. Here's how I generally try to resolve it:

Sometimes you have to do this whole dance twice before Jenkins realizes that there are a bunch of executors it can use.

This deadlock seems to happen most often during or just after a database update that is taking a while to complete.

Thanks! Should we write these down somewhere for the next time it occurs?

(In reply to James Forrester from comment #2)

Thanks! Should we write these down somewhere for the next time it occurs?

Greg was way ahead of you -- https://www.mediawiki.org/w/index.php?title=Continuous_integration/Jenkins&diff=1144517&oldid=1091321

Ha. Thanks, both!

Meta: one of the upcoming themes for RelEng is improving Jenkins performance; this list of issues is a good place to start.

This happened again today. It seems like it always involves the database update job. Maybe related to https://issues.jenkins-ci.org/browse/JENKINS-10944. Apparently the parent build for a matrix job is not supposed to occupy an executor, but sometimes it does. It still seems to be an open bug upstream.

I have manually changed the config for the beta-update-databases-eqiad job in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep Jenkins from confusing itself. If this "works" then the changes should be backported into the JJB configuration to keep it from coming back the next time the job is updated from config.

  • Maximum Total Concurrent Builds: 2
  • Maximum Concurrent Builds Per Node: 2
  • Throttle Matrix master builds: true
  • Throttle Matrix configuration builds: true

(reopening to keep this on my radar for a few days)

greg added a comment.Sep 23 2014, 10:55 PM

Sept 23rd: 22:51 bd808: Jenkins stuck trying to update database in beta again with the dumb "waiting for executors" bug/problem

(just keeping a log of when it occurs)

(In reply to Bryan Davis from comment #6)

I have manually changed the config for the beta-update-databases-eqiad job
in Jenkins to use the "Throttle Concurrent Builds" in an attempt to keep
Jenkins from confusing itself. If this "works" then the changes should be
backported into the JJB configuration to keep it from coming back the next
time the job is updated from config.

  • Maximum Total Concurrent Builds: 2
  • Maximum Concurrent Builds Per Node: 2
  • Throttle Matrix master builds: true
  • Throttle Matrix configuration builds: true

    (reopening to keep this on my radar for a few days)

The throttle settings are still in place, so they didn't fix the problem. The fact that we didn't see this error for a couple of weeks is apparently uncorrelated.

That happened again on Oct 23 2014. Looking at Jenkins thread dumps, the deployment-bastion executor threads are locked by the Gearman plugin. It must have some logic error somewhere but I can't really debug Java ;(

Un-licking this cookie. I took a shot at it but the problem seems to be deeper in the stack.

I forgot to update this bug after my debugging session on Oct 24th; here is a rough brain dump:

The Gearman plugin source code is at https://review.openstack.org/openstack-infra/gearman-plugin.git

I have added a logger for the Gearman plugin ( hudson.plugins.gearman.logger @ INFO ):

https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/

Whenever the issue occurs we can switch it to FINE and get some debug messages.

IIRC the bastion executor threads were held in the lock() function within src/main/java/hudson/plugins/gearman/NodeAvailabilityMonitor.java. It takes a worker as a parameter and has a wait(5000).

I think the debug message showed a 'null' worker, possibly ending in a dead end where the node is always considered busy and the thread keeps waiting.

That might be a bad interaction with jobs scheduled by Jenkins itself, such as the hourly database update that also runs on the deployment bastion.

And of course I have no idea how to reproduce it, how to hook a debugger into Jenkins, or how to dump the state of variables :-/
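
To make the suspected failure mode concrete, here is a minimal sketch of the pattern described above (my own illustration, not the plugin's actual source; the class name, fields and isAvailableFor() helper are assumptions, only the lock()/wait(5000) behaviour comes from the thread dumps):

    // Illustrative sketch only -- not hudson.plugins.gearman code.
    // The point: if the worker handed to lock() is null, the availability
    // check can never succeed, so the thread keeps timing out on wait(5000)
    // and retrying forever (the TIMED_WAITING state seen in the dumps).
    public class AvailabilityMonitorSketch {
        private Object heldBy;                    // worker currently holding the node, if any
        private final Object monitor = new Object();

        public void lock(Object worker) throws InterruptedException {
            synchronized (monitor) {
                while (!isAvailableFor(worker)) {
                    monitor.wait(5000);           // matches the 5 second wait noted above
                }
                heldBy = worker;
            }
        }

        public void unlock(Object worker) {
            synchronized (monitor) {
                if (heldBy == worker) {
                    heldBy = null;
                    monitor.notifyAll();
                }
            }
        }

        private boolean isAvailableFor(Object worker) {
            // A null worker never passes this check, which would be the
            // "always considered busy" dead end suspected in this comment.
            return worker != null && (heldBy == null || heldBy == worker);
        }
    }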

hashar added a comment.EditedNov 4 2014, 2:22 PM

Created attachment 17012
Raw thread dump from Jenkins

The deployment-bastion.eqiad node is in a deadlock right now; I took a thread dump from Jenkins using https://integration.wikimedia.org/ci/monitoring?part=threadsDump

The Jenkins Gearman plugin seems to lock the node executors, with threads looking like:

  "Gearman worker deployment-bastion.eqiad_exec-0" daemon prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
	hudson.plugins.gearman.MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422)
	hudson.plugins.gearman.MyGearmanWorkerImpl$GrabJobEventHandler.handleEvent(MyGearmanWorkerImpl.java:99)
	org.gearman.common.GearmanTask.handleGearmanIOEvent(GearmanTask.java:92)
	org.gearman.common.GearmanJobServerSession.handleResSessionEvent(GearmanJobServerSession.java:329)
	org.gearman.common.GearmanJobServerSession.handleSessionEvent(GearmanJobServerSession.java:240)
	org.gearman.common.GearmanJobServerSession.driveSessionIO(GearmanJobServerSession.java:203)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:357)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:701)

Attached:

hashar added a comment.EditedNov 4 2014, 2:24 PM

From the Jenkins log https://integration.wikimedia.org/ci/log/Plugins%20-%20Gearman/ ( hudson.plugins.gearman.logger @ FINEST ) there is a flood of:

AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1
Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for deployment-bastion.eqiad_exec-1
Nov 04, 2014 2:15:19 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake

The rest of the executors are held in lock with:

"Gearman worker deployment-bastion.eqiad_exec-2" daemon prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
	hudson.plugins.gearman.MyGearmanWorkerImpl.handleSessionEvent(MyGearmanWorkerImpl.java:422)

I disconnected the Jenkins Gearman client, which was holding the lock. That unleashed the jobs that were meant to run on deployment-bastion.eqiad.

I am moving this bug to Continuous integration since its root cause is somewhere in the Jenkins Gearman plugin. That will make it easier for me to find this bug again.

Most executors are held in lock as mentioned in comment #14. I noticed one thread with a different stack trace:

"Gearman worker deployment-bastion.eqiad_exec-1" daemon prio=5 WAITING

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
hudson.plugins.gearman
    .StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
hudson.plugins.gearman
    .StartJobWorker.executeFunction(StartJobWorker.java:114)
org.gearman.worker
    .AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
org.gearman.worker
    .AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
hudson.plugins.gearman.MyGearmanWorkerImpl
    .submitFunction(MyGearmanWorkerImpl.java:590)
hudson.plugins.gearman.MyGearmanWorkerImpl
    .work(MyGearmanWorkerImpl.java:374)
hudson.plugins.gearman.AbstractWorkerThread
    .run(AbstractWorkerThread.java:166)
java.lang.Thread.run(Thread.java:701)
  • Bug 73659 has been marked as a duplicate of this bug.
hashar set Security to None.

So this is still happening and I really need to file it upstream.

The executors are locked by:

Gearman worker integration-slave1006_exec-3
java.lang.Object.wait(Native Method)
hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:388)
hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:310)
hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
java.lang.Thread.run(Thread.java:745)

The Jenkins logger for hudson.plugins.gearman.logger shows a flood of:

Nov 26, 2014 10:24:21 PM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for >SOME VALUE<

Where >SOME VALUE< is null or one of the executor threads.

Filed upstream at https://storyboard.openstack.org/#!/story/2000030 . Will have to poke Khai Do (upstream author) about it.

hashar renamed this task from Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) to Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung).Nov 26 2014, 10:48 PM
hashar raised the priority of this task from Normal to High.
hashar claimed this task.
hashar renamed this task from Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) to [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung).Dec 1 2014, 11:57 AM
hashar added a project: Upstream.

You can assume that this upstream bug will not be fixed within our lifetime, and we should not invest in learning all of Jenkins' internals to fix it ourselves (it's a significant problem in how Jenkins works).

To keep the Beta Cluster updates from getting stuck indefinitely on a daily basis, you can consider this blocked on T96199: Rewrite beta-update-databases to not use unstable Configuration Matrix.

I earlier upgraded the gearman plugin from 0.1.1 to f2024bd. From git log --reverse --no-merges 0.1.1..master the three interesting changes are:

commit 6de3cdd29bb8c4336a468985af6f8e0e4fd88e66
Author: James E. Blair <jeblair@hp.com>
Date:   Thu Jan 8 08:37:50 2015 -0800

    Protect against partially initialized executer workers
    
    The registerJobs method of an executorworker can be invoked by an
    external caller before the worker is completely initialized by
    its run method.  We protected against that by checking one instance
    variable, but there's still a race condition involving another.
    Add a check for that variable as well.
    
    Change-Id: I8e2cfffb54aa8a4cf8b1e61e9a9184b091054462

That one might solve the issue we are encountering with executors no longer being available.

commit 7abfdbd2d00010a1121cefebf479bcf104e7ef18
Author: James E. Blair <jeblair@hp.com>
Date:   Tue May 5 10:38:25 2015 -0700

    Stop sending status updates
    
    Don't send status updates every 10 seconds.  Only send them at the
    start of a job (to fill in information like worker and expected
    duration, etc).  We don't actually do anything with subsequent
    updates, and if Zuul wants to know how long a job has been running
    it's perfectly capable of working that out on its own.
    
    Change-Id: I4df5f82b3375239df35e3bc4b03e1263026f0a68
commit 65a08e0e959b0853538eabeec030d594a01c4385
Author: Clark Boylan <clark.boylan@gmail.com>
Date:   Mon May 4 18:09:31 2015 -0700

    Fix race between adding job and registering
    
    Gearman plugin had a race between adding jobs to the functionList and
    registering jobs. When registering jobs the functionMap is cleared, when
    adding a job the plugin checks if the job is in the function Map before
    running it. If we happen to trigger registration of jobs when we get a
    response from gearman with a job assignment then the functionMap can be
    empty making us send a work fail instead of running the job.
    
    To make things worse this jenkins worker would not send a subsequent
    GET JOB and would live lock never doing any useful work.
    
    Correct this by making the processing for gearman events synchronous in
    the work loop. This ensures that we never try to clear the function map
    and check against it at the same time via different threads. To make
    this happen the handleSessionEvent() method puts all events on a thread
    safe queue for synchronous processing. This has allowed us to simplify
    the work() loop and basically do the following:
    
      while running:
        init()
        register()
        process one event
        run function if processed
        drive IO
    
    This is much easier to reason about as we essentially only have
    bookkeeping and the code for one thing at a time.
    
    Change-Id: Id537710f6c8276a528ad78afd72c5a7c8e8a16ac

That race might be what is blocking our jobs as well :/
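
The work loop sketched in that commit message would look roughly like this (a simplified sketch under my own names; handleSessionEvent(), dispatch() and driveSessionIO() are placeholders, not the plugin's real methods):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of the synchronous work loop described in commit 65a08e0:
    // other threads only enqueue events; the worker thread processes at most
    // one event per iteration, so registering functions and checking the
    // function map can no longer race against each other.
    class GearmanWorkerLoopSketch implements Runnable {
        private final BlockingQueue<Object> events = new LinkedBlockingQueue<>();
        private volatile boolean running = true;

        // Called from other threads: never touches shared worker state directly.
        void handleSessionEvent(Object event) {
            events.offer(event);
        }

        @Override
        public void run() {
            init();                 // placeholder: connect to the Gearman server
            registerFunctions();    // placeholder: (re)build the function map
            while (running) {
                Object event = events.poll();            // process one event, if any
                if (event != null) {
                    Runnable function = dispatch(event); // placeholder lookup
                    if (function != null) {
                        function.run();                  // run the matched function
                    }
                }
                driveSessionIO();   // placeholder: network bookkeeping
            }
        }

        private void init() {}
        private void registerFunctions() {}
        private Runnable dispatch(Object event) { return null; }
        private void driveSessionIO() {}
    }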

@Antoine, I gave up on storyboard. We are working on it over in the jira issue tracker.. https://issues.jenkins-ci.org/browse/JENKINS-25867

one additional change you should pick up is https://review.openstack.org/#/c/192429/

Ah thanks for switching to JIRA, this way we get mail notifications :-D I followed up there.

Rebuilding the plugin with https://review.openstack.org/#/c/192429/2

git fetch https://review.openstack.org/openstack-infra/gearman-plugin refs/changes/29/192429/2 && git checkout FETCH_HEAD
mvn -Dproject-version="`git describe`-change_192429_2" -DskipTests=true  clean package

Thus upgrading the plugin from 0.1.1-8-gf2024bd to 0.1.1-9-g08e9c42-change_192429_2.

hashar closed this task as Resolved.Jul 22 2015, 1:05 PM

I haven't noticed the error since July 1st, nor do the Jenkins logs show any null lock. So it seems the fix in the gearman plugin resolved it.

hashar reopened this task as Open.Jul 28 2015, 10:34 AM

It happened again :(

Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null
Jul 28, 2015 10:30:35 AM FINE hudson.plugins.gearman.NodeAvailabilityMonitor canTake
AvailabilityMonitor canTake request for null

With beta-scap-eqiad and beta-update-databases-eqiad being stuck waiting for an available executor on deployment-bastion.

Marking the node offline and online doesn't remove the lock :-/

The executor threads have:

"Gearman worker deployment-bastion.eqiad_exec-1" prio=5 WAITING
	java.lang.Object.wait(Native Method)
	java.lang.Object.wait(Object.java:503)
	hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:73)
	hudson.plugins.gearman.StartJobWorker.safeExecuteFunction(StartJobWorker.java:196)
	hudson.plugins.gearman.StartJobWorker.executeFunction(StartJobWorker.java:114)
	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:125)
	org.gearman.worker.AbstractGearmanFunction.call(AbstractGearmanFunction.java:22)
	hudson.plugins.gearman.MyGearmanWorkerImpl.submitFunction(MyGearmanWorkerImpl.java:593)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:328)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-2" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-3" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-4" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

"Gearman worker deployment-bastion.eqiad_exec-5" prio=5 TIMED_WAITING
	java.lang.Object.wait(Native Method)
	hudson.plugins.gearman.NodeAvailabilityMonitor.lock(NodeAvailabilityMonitor.java:83)
	hudson.plugins.gearman.MyGearmanWorkerImpl.sendGrabJob(MyGearmanWorkerImpl.java:380)
	hudson.plugins.gearman.MyGearmanWorkerImpl.processSessionEvent(MyGearmanWorkerImpl.java:421)
	hudson.plugins.gearman.MyGearmanWorkerImpl.work(MyGearmanWorkerImpl.java:320)
	hudson.plugins.gearman.AbstractWorkerThread.run(AbstractWorkerThread.java:166)
	java.lang.Thread.run(Thread.java:745)

The node is named deployment-bastion-eqiad, with a label deployment-bastion-eqiad. Jobs are tied to deployment-bastion-eqiad.

The workaround I found was to remove the label from the node. Once done, the jobs showed in the queue with 'no node having label deployment-bastion-eqiad'. I then applied the label to the host again and the jobs managed to run.

So maybe it is an issue in Jenkins itself :-}

Change 227440 had a related patch set uploaded (by Hashar):
beta: expand {datacenter} to 'eqiad'

https://gerrit.wikimedia.org/r/227440

Change 227441 had a related patch set uploaded (by Hashar):
beta: disambiguate Jenkins label from node name

https://gerrit.wikimedia.org/r/227441

Change 227440 merged by jenkins-bot:
beta: expand {datacenter} to 'eqiad'

https://gerrit.wikimedia.org/r/227440

Change 227441 merged by jenkins-bot:
beta: disambiguate Jenkins label from node name

https://gerrit.wikimedia.org/r/227441

I renamed the Jenkins label to disambiguate the node name and the label (now BetaClusterBastion).

hashar added a comment.EditedDec 16 2015, 11:41 AM

That still happens from time to time with Jenkins 1.625.3 and the Gearman Plugin 0.1.3.3.01da2d4 (which is 1.3.3 + https://review.openstack.org/#/c/252768/ ).

To remove the deadlock one can either disconnect and reconnect the Gearman plugin or restart Jenkins, as described in the workaround in the task description.

hashar updated the task description. (Show Details)Dec 16 2015, 11:43 AM
hashar updated the task description. (Show Details)
hashar moved this task from Backlog to Reported Upstream on the Upstream board.
hashar removed hashar as the assignee of this task.

Upstream https://review.openstack.org/#/c/252768/ has been abandoned in favor of https://review.openstack.org/#/c/271543/ . It uses a different internal API which should no longer return null. Specifically, it replaces:

- Computer.currentComputer()
+ Jenkins.getActiveInstance().getComputer("")

But that is solely to properly get the master node, on which we run no jobs, so it is unlikely to fix anything for us.
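
For context, the difference between the two calls is roughly this (my own sketch, assuming standard Jenkins core behaviour: Computer.currentComputer() resolves the computer from the calling thread and can return null outside an executor thread, while the master computer can be looked up by its empty-string node name):

    import hudson.model.Computer;
    import jenkins.model.Jenkins;

    // Sketch of the API swap in the upstream change, not plugin code.
    class MasterComputerLookupSketch {
        static Computer oldWay() {
            // Resolved from the calling thread; on a non-executor thread
            // (e.g. a Gearman worker thread) this can be null.
            return Computer.currentComputer();
        }

        static Computer newWay() {
            // Look the master up by its node name ("" for the master),
            // independent of which thread makes the call.
            return Jenkins.getActiveInstance().getComputer("");
        }
    }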

Updated the Gearman plugin to 0.1.3.3.a5164d6

Locked up again. I don't quite understand the instructions in the summary that say you can disable the gearman plugin without restarting Jenkins. Doesn't enabling/disabling a plugin require a restart?

hashar updated the task description. (Show Details)Feb 4 2016, 12:30 PM

Sorry @bd808, the phrasing wasn't perfect. There is no need to disable the Gearman plugin; you just have to disconnect it from the Gearman server, which is done on the Jenkins manage page:

Doing so causes the Gearman client in Jenkins to disconnect from the Zuul Gearman server, stop overriding the Jenkins executors and reset its state, which gets rid of the deadlock.

I have updated the task summary with above screenshot (that shows my leet Paint skills).

I have updated the task summary with above screenshot (that shows my leet Paint skills).

Thanks @hashar. I'd forgotten about that setting and was instead thinking that the instruction was to disable/enable the plugin at https://integration.wikimedia.org/ci/pluginManager/.

Krinkle removed a subscriber: Krinkle.Feb 23 2016, 4:49 PM
Danny_B renamed this task from [upstream] Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) to Jenkins Gearman plugin has deadlock on executor threads (was: Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung).May 31 2016, 3:14 PM
Danny_B removed a subscriber: wikibugs-l-list.
Danny_B removed a subscriber: Language-Team.

Blind search-and-replace was incorrect here. This is not a Language-Team bug, we were only CCed.

hashar lowered the priority of this task from High to Normal.Jul 27 2016, 9:12 AM
hashar closed this task as Resolved.
hashar claimed this task.

This still happens, albeit very rarely nowadays, to the point that it is almost a non-issue. I have only noticed it once over the last few months, and the root cause was unrelated (thread starvation in the Jenkins SSH plugin that caused it to no longer properly connect slaves).

Assuming it is fixed.