Page MenuHomePhabricator

Jenkins: Builds (for beta cluster and browser tests) are stuck forever if IRC notification failed
Closed, ResolvedPublic

Description

Related upstream: https://issues.jenkins-ci.org/browse/JENKINS-28175

Summary:

  • Connection timeout for IRC
  • Regular build timeout setting does not apply to post-build actions like artefacts and notifications.
  • Cancelling the build doesn't work. It remains stuck.
  • Disconnecting the slave and taking the slave offline has no effect.
  • There seems to be no way to recover besides killing the whole Jenkins server and restarting.

The wmf-insecte Jenkins IRC bot no more show in channels.

We have logs at https://integration.wikimedia.org/ci/log/IRC%20IM%20plugin/

Jenkins full thread dump is at P584 and from there:

Two jobs are blocked:

"Executor #2 for integration-slave-trusty-1016 : executing browsertests-CentralNotice-en.wikipedia.beta.wmflabs.org-windows_7-internet_explorer-10-sauce #234" prio=5 BLOCKED
"Executor #1 for integration-slave-trusty-1012 : executing browsertests-PdfHandler-test2.wikipedia.org-linux-firefox-sauce #494" prio=5 BLOCKED

	hudson.plugins.ircbot.v2.IRCConnectionProvider.getInstance(IRCConnectionProvider.java:14)
	hudson.plugins.ircbot.IrcPublisher.getIMConnection(IrcPublisher.java:102)
	hudson.plugins.im.IMPublisher.sendNotification(IMPublisher.java:374)
	hudson.plugins.im.IMPublisher.notifyChatsOnBuildEnd(IMPublisher.java:585)
	hudson.plugins.im.IMPublisher.notifyOnBuildEnd(IMPublisher.java:304)
	hudson.plugins.im.IMPublisher.perform(IMPublisher.java:291)
        ...

.A configuration submit change is blocked as well:

"Handling POST /ci/configSubmit from X.X.X.X : RequestHandlerThread[#1683]" daemon prio=5 WAITING
	sun.misc.Unsafe.park(Native Method)
	java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
	java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
	org.pircbotx.Channel.getMode(Channel.java:127)
	org.pircbotx.Channel.getModeArgument(Channel.java:182)
	org.pircbotx.Channel.getChannelKey(Channel.java:239)
	org.pircbotx.PircBotX.shutdown(PircBotX.java:2872)
	hudson.plugins.ircbot.v2.IRCConnection.close(IRCConnection.java:102)
	hudson.plugins.im.IMConnectionProvider.releaseConnection(IMConnectionProvider.java:92)
	hudson.plugins.ircbot.v2.IRCConnectionProvider.setDesc(IRCConnectionProvider.java:19)
	hudson.plugins.ircbot.IrcPublisher$DescriptorImpl.configure(IrcPublisher.java:336)
	jenkins.model.Jenkins.configureDescriptor(Jenkins.java:2915)
	jenkins.model.Jenkins.doConfigSubmit(Jenkins.java:2878)
        ...


Some other related threads:

"JenkinsIsBusyListener-thread" daemon prio=5 BLOCKED
 hudson.plugins.im.IMConnectionProvider.currentConnection(IMConnectionProvider.java:83)
 hudson.plugins.im.JenkinsIsBusyListener.setStatus(JenkinsIsBusyListener.java:118)
 hudson.plugins.im.JenkinsIsBusyListener.updateIMStatus(JenkinsIsBusyListener.java:109)
 hudson.plugins.im.JenkinsIsBusyListener.access$000(JenkinsIsBusyListener.java:20)
 hudson.plugins.im.JenkinsIsBusyListener$3.run(JenkinsIsBusyListener.java:98)
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 java.util.concurrent.FutureTask.run(FutureTask.java:262)
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
 java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
"IM-Reconnector-Thread" daemon prio=5 BLOCKED
 hudson.plugins.im.IMConnectionProvider$ConnectorRunnable.run(IMConnectionProvider.java:175)
 java.lang.Thread.run(Thread.java:745)

Event Timeline

Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description. (Show Details)
Krinkle moved this task to Backlog on the Jenkins board.
Krinkle subscribed.
tcp6     160      0 208.80.154.135:39429    193.219.128.49:7000     CLOSE_WAIT  1519/java

That is the Jenkins IRC connection to sendak.freenode.net on port 7000 and there is no other. So the IRC plugin is no more connected and indeed wmf-insecte is not showing in #wikimedia-releng.

I finished the CLOSE_WAIT connection by injecting an ACK packet pretending to be from freenode server:

packit -i eth0  -m inject -s 193.219.128.49 -S 7000 -d 208.80.154.135 -D 39429 -F A

That got rid of the left over connection, but did not trigger anything java side. So Jenkins is still stuck. I cant see any other way beside kill -9.

I have stopped Jenkins then kill -9 it and started it back.

The jobs and config saves were blocked attempting to notify to IRC because of a lock in the plugin. The lock was apparently held by a "jenkinsisbusylistener" thread which had in its stacktrace something like an irc.connection.shutdown method.

So I guess the plugin does not timeout when disconnecting from the remote server, and if something goes wild in the terminaison, it ends up being stuck with a lock held :(

After restarting the JenkinsIsBusyListener-thread looks like:

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Krinkle triaged this task as Medium priority.Apr 17 2015, 12:18 PM
Krinkle set Security to None.

Jenkins deadlocked again. I took a threaddump available at P584. I have updated this task details with stack traces of blocked threads.

In jenkins.log

Apr 30, 2015 11:21:37 AM hudson.plugins.im.IMConnectionProvider$ConnectorRunnable run
INFO: Trying to reconnect

When I restarted Jenkins we get:

Apr 30, 2015 12:11:18 PM hudson.plugins.im.IMConnectionProvider$ConnectorRunnable run
INFO: Trying to reconnect
Apr 30, 2015 12:11:19 PM hudson.plugins.ircbot.v2.IRCConnection connect
INFO: Connecting to chat.freenode.net:7000 as wmf-insecte using charset UTF-8
Apr 30, 2015 12:11:29 PM hudson.plugins.ircbot.v2.IRCConnection connect
INFO: connected to IRC

So the plugin is stuck trying to acquire a new connection because the hold one is not properly terminated :(

I have downgraded the IRC plugin from 2.26 to 2.25. The upgrade might have caused the issue. Changelog from https://wiki.jenkins-ci.org/display/JENKINS/IRC+Plugin

Version 2.26 (2015-02-19)

  • don't make concurrent builds wait for the previous build (with instant-messaging-plugin 1.33) issue #26892
  • make delay between messages configurable via system property "hudson.plugins.ircbot.messageRate"
  • try to connect to NickServ protected up to 2 minutes in case NickServ is reacting very slowly

Version 2.25 (Apr 2, 2014)

git log https://github.com/jenkinsci/ircbot-plugin/compare/ircbot-2.25...ircbot-2.26

A commit made to IRCConnection.close() changed:

- this.pircConnection.disconnect();
+ this.pircConnection.shutdown(true);

The method shows up in "Handling POST /ci/configSubmit" thread. Been done to fix a leak JENKINS-25349.

hashar moved this task from Backlog to Reported Upstream on the Upstream board.
hashar moved this task from Backlog to Reported upstream on the Jenkins board.

From discussion on https://github.com/jenkinsci/ircbot-plugin/commit/d18cc7b617155100f8afadb73b324f378c5661da I am going to upgrade our ircbot plugin from 2.25 to 2.27. Upstream has bumped Pircbotx to 2.0.1 which might fix it.

Mentioned in SAL [2016-06-07T09:02:56Z] <hashar> Upgrading Jenkins IRC plugin 2.25..2.27 and instant messaging plugin 1.34..1.35 . The former should fix a deadlock on shutdowning Jenkins | T96183

hashar claimed this task.

Might have been solved via IRC plugin 2.27.