Page MenuHomePhabricator

Jenkins: Builds (for beta cluster and browser tests) are stuck forever if IRC notification failed
Closed, ResolvedPublic

Description

Related upstream: https://issues.jenkins-ci.org/browse/JENKINS-28175

Summary:

  • Connection timeout for IRC
  • Regular build timeout setting does not apply to post-build actions like artefacts and notifications.
  • Cancelling the build doesn't work. It remains stuck.
  • Disconnecting the slave and taking the slave offline has no effect.
  • There seems to be no way to recover besides killing the whole Jenkins server and restarting.

The wmf-insecte Jenkins IRC bot no more show in channels.

We have logs at https://integration.wikimedia.org/ci/log/IRC%20IM%20plugin/

Jenkins full thread dump is at P584 and from there:

Two jobs are blocked:

"Executor #2 for integration-slave-trusty-1016 : executing browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce #234" prio=5 BLOCKED
"Executor #1 for integration-slave-trusty-1012 : executing browsertests-PdfHandler-test2.wikipedia.org-linux-firefox-sauce #494" prio=5 BLOCKED

	hudson.plugins.ircbot.v2.IRCConnectionProvider.getInstance(IRCConnectionProvider.java:14)
	hudson.plugins.ircbot.IrcPublisher.getIMConnection(IrcPublisher.java:102)
	hudson.plugins.im.IMPublisher.sendNotification(IMPublisher.java:374)
	hudson.plugins.im.IMPublisher.notifyChatsOnBuildEnd(IMPublisher.java:585)
	hudson.plugins.im.IMPublisher.notifyOnBuildEnd(IMPublisher.java:304)
	hudson.plugins.im.IMPublisher.perform(IMPublisher.java:291)
        ...

.A configuration submit change is blocked as well:

"Handling POST /ci/configSubmit from X.X.X.X : RequestHandlerThread[#1683]" daemon prio=5 WAITING
	sun.misc.Unsafe.park(Native Method)
	java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
	java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
	org.pircbotx.Channel.getMode(Channel.java:127)
	org.pircbotx.Channel.getModeArgument(Channel.java:182)
	org.pircbotx.Channel.getChannelKey(Channel.java:239)
	org.pircbotx.PircBotX.shutdown(PircBotX.java:2872)
	hudson.plugins.ircbot.v2.IRCConnection.close(IRCConnection.java:102)
	hudson.plugins.im.IMConnectionProvider.releaseConnection(IMConnectionProvider.java:92)
	hudson.plugins.ircbot.v2.IRCConnectionProvider.setDesc(IRCConnectionProvider.java:19)
	hudson.plugins.ircbot.IrcPublisher$DescriptorImpl.configure(IrcPublisher.java:336)
	jenkins.model.Jenkins.configureDescriptor(Jenkins.java:2915)
	jenkins.model.Jenkins.doConfigSubmit(Jenkins.java:2878)
        ...


Some other related threads:

"JenkinsIsBusyListener-thread" daemon prio=5 BLOCKED
 hudson.plugins.im.IMConnectionProvider.currentConnection(IMConnectionProvider.java:83)
 hudson.plugins.im.JenkinsIsBusyListener.setStatus(JenkinsIsBusyListener.java:118)
 hudson.plugins.im.JenkinsIsBusyListener.updateIMStatus(JenkinsIsBusyListener.java:109)
 hudson.plugins.im.JenkinsIsBusyListener.access$000(JenkinsIsBusyListener.java:20)
 hudson.plugins.im.JenkinsIsBusyListener$3.run(JenkinsIsBusyListener.java:98)
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 java.util.concurrent.FutureTask.run(FutureTask.java:262)
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
"IM-Reconnector-Thread" daemon prio=5 BLOCKED
 hudson.plugins.im.IMConnectionProvider$ConnectorRunnable.run(IMConnectionProvider.java:175)
 java.lang.Thread.run(Thread.java:745)

Event Timeline

Krinkle created this task.Apr 15 2015, 7:37 PM
Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description. (Show Details)
Krinkle moved this task to Backlog on the Jenkins board.
Krinkle added a subscriber: Krinkle.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 15 2015, 7:37 PM
hashar added a subscriber: hashar.Apr 15 2015, 7:44 PM
tcp6     160      0 208.80.154.135:39429    193.219.128.49:7000     CLOSE_WAIT  1519/java

That is the Jenkins IRC connection to sendak.freenode.net on port 7000 and there is no other. So the IRC plugin is no more connected and indeed wmf-insecte is not showing in #wikimedia-releng.

I finished the CLOSE_WAIT connection by injecting an ACK packet pretending to be from freenode server:

packit -i eth0  -m inject -s 193.219.128.49 -S 7000 -d 208.80.154.135 -D 39429 -F A

That got rid of the left over connection, but did not trigger anything java side. So Jenkins is still stuck. I cant see any other way beside kill -9.

I have stopped Jenkins then kill -9 it and started it back.

The jobs and config saves were blocked attempting to notify to IRC because of a lock in the plugin. The lock was apparently held by a "jenkinsisbusylistener" thread which had in its stacktrace something like an irc.connection.shutdown method.

So I guess the plugin does not timeout when disconnecting from the remote server, and if something goes wild in the terminaison, it ends up being stuck with a lock held :(

After restarting the JenkinsIsBusyListener-thread looks like:

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Krinkle triaged this task as Normal priority.Apr 17 2015, 12:18 PM
Krinkle set Security to None.
hashar updated the task description. (Show Details)EditedApr 30 2015, 12:08 PM

Jenkins deadlocked again. I took a threaddump available at P584. I have updated this task details with stack traces of blocked threads.

In jenkins.log

Apr 30, 2015 11:21:37 AM hudson.plugins.im.IMConnectionProvider$ConnectorRunnable run
INFO: Trying to reconnect

When I restarted Jenkins we get:

Apr 30, 2015 12:11:18 PM hudson.plugins.im.IMConnectionProvider$ConnectorRunnable run
INFO: Trying to reconnect
Apr 30, 2015 12:11:19 PM hudson.plugins.ircbot.v2.IRCConnection connect
INFO: Connecting to chat.freenode.net:7000 as wmf-insecte using charset UTF-8
Apr 30, 2015 12:11:29 PM hudson.plugins.ircbot.v2.IRCConnection connect
INFO: connected to IRC

So the plugin is stuck trying to acquire a new connection because the hold one is not properly terminated :(

hashar added a comment.EditedApr 30 2015, 1:43 PM

I have downgraded the IRC plugin from 2.26 to 2.25. The upgrade might have caused the issue. Changelog from https://wiki.jenkins-ci.org/display/JENKINS/IRC+Plugin

Version 2.26 (2015-02-19)

  • don't make concurrent builds wait for the previous build (with instant-messaging-plugin 1.33) issue #26892
  • make delay between messages configurable via system property "hudson.plugins.ircbot.messageRate"
  • try to connect to NickServ protected up to 2 minutes in case NickServ is reacting very slowly

Version 2.25 (Apr 2, 2014)

git log https://github.com/jenkinsci/ircbot-plugin/compare/ircbot-2.25...ircbot-2.26

A commit made to IRCConnection.close() changed:

- this.pircConnection.disconnect();
+ this.pircConnection.shutdown(true);

The method shows up in "Handling POST /ci/configSubmit" thread. Been done to fix a leak JENKINS-25349.

hashar updated the task description. (Show Details)Jun 9 2015, 8:12 PM
hashar moved this task from Backlog to Reported Upstream on the Upstream board.
hashar moved this task from Backlog to Reported upstream on the Jenkins board.
Krinkle removed a subscriber: Krinkle.May 19 2016, 5:19 PM
hashar added a comment.Jun 7 2016, 9:01 AM

From discussion on https://github.com/jenkinsci/ircbot-plugin/commit/d18cc7b617155100f8afadb73b324f378c5661da I am going to upgrade our ircbot plugin from 2.25 to 2.27. Upstream has bumped Pircbotx to 2.0.1 which might fix it.

Mentioned in SAL [2016-06-07T09:02:56Z] <hashar> Upgrading Jenkins IRC plugin 2.25..2.27 and instant messaging plugin 1.34..1.35 . The former should fix a deadlock on shutdowning Jenkins | T96183

hashar moved this task from Reported Upstream to Patch merged upstream on the Upstream board.
hashar closed this task as Resolved.May 2 2017, 8:32 PM
hashar claimed this task.

Might have been solved via IRC plugin 2.27.