WebPageTest job fails intermittently with "java.io.IOException: Unexpected termination of the channel"
Closed, ResolvedPublic

Description

LOG https://integration.wikimedia.org/ci/log/hudson.remoting/

After we updated the job to use to Jessie in Nodepool (to get latest NodeJS) our tests started to fail every now and then. All of them looks like this:

11:44:08 Started by timer
11:44:08 Building remotely on ci-jessie-wikimedia-680465 (ci-jessie-wikimedia) in workspace /home/jenkins/workspace/performance-webpagetest-wmf
11:44:09 Using checkout strategy: Default
11:44:09 Last Built Revision: Revision 0990644dc1780e1bc04c1e09473928308b9ac61d (origin/master)
11:44:12 FATAL: java.io.IOException: Unexpected termination of the channel
11:44:12 java.io.EOFException
11:44:12 	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2634)
11:44:12 	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3109)
11:44:12 	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:851)
11:44:12 	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:348)
11:44:12 	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
11:44:12 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
11:44:12 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
11:44:12 Caused: java.io.IOException: Unexpected termination of the channel
11:44:12 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
11:44:12 Caused: hudson.remoting.RequestAbortedException
11:44:12 	at hudson.remoting.Request.abort(Request.java:307)
11:44:12 	at hudson.remoting.Channel.terminate(Channel.java:896)
11:44:12 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:92)
11:44:12 	at ......remote call to ci-jessie-wikimedia-680465(Native Method)
11:44:12 	at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
11:44:12 	at hudson.remoting.Request.call(Request.java:172)
11:44:12 	at hudson.remoting.Channel.call(Channel.java:829)
11:44:12 	at hudson.FilePath.act(FilePath.java:985)
11:44:12 	at hudson.FilePath.act(FilePath.java:974)
11:44:12 	at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:137)
11:44:12 	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:751)
11:44:12 	at hudson.plugins.git.GitSCM.createClient(GitSCM.java:743)
11:44:12 	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1101)
11:44:12 	at hudson.scm.SCM.checkout(SCM.java:496)
11:44:12 	at hudson.model.AbstractProject.checkout(AbstractProject.java:1281)
11:44:12 	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:604)
11:44:12 	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
11:44:12 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
11:44:12 	at hudson.model.Run.execute(Run.java:1728)
11:44:12 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
11:44:12 	at hudson.model.ResourceController.execute(ResourceController.java:98)
11:44:12 	at hudson.model.Executor.run(Executor.java:405)
11:44:12 IRC notifier plugin: Sending notification to: #wikimedia-perf
11:44:12 ERROR: Step ‘Delete workspace when build is done’ failed: no workspace for performance-webpagetest-wmf #7184
11:44:12 Finished: FAILURE

https://integration.wikimedia.org/ci/job/performance-webpagetest-wmf/7184/

Probably related upstream bugs:

https://issues.jenkins-ci.org/browse/JENKINS-23767
https://issues.jenkins-ci.org/browse/JENKINS-25858

To find all affected builds, on contint1001.wikimedia.org :

find /srv/jenkins/builds  -maxdepth 3 -path '*/*/log' -print0|xargs -0 -n100 grep -lH java.io.EOFException
Peter created this task.May 30 2017, 1:47 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 30 2017, 1:47 PM
Krinkle renamed this task from Failing WebPageTests in Jenkins to WebPageTest job fails intemittently with "java.io.IOException: Unexpected termination of the channel".May 30 2017, 10:35 PM
Krinkle updated the task description. (Show Details)
Krinkle added a subscriber: hashar.
Gilles triaged this task as High priority.May 31 2017, 8:00 PM
Gilles moved this task from Inbox to Blocked on the Performance-Team board.
hashar claimed this task.May 31 2017, 8:15 PM

Workspace is supposedly: /home/jenkins/workspace/performance-webpagetest-wmf

The stack trace comes from the Git Client Plugin:

at org.jenkinsci.plugins.gitclient.Git.getClient(Git.java:137)

And they have released a new version:

2.4.6 (May 24, 2017)

(error) Fix clone failures when workspace path contains specific special characters (JENKINS-43931, JENKINS-44041, JENKINS-44127, and JENKINS-44420)

https://issues.jenkins-ci.org/browse/JENKINS-43931
https://issues.jenkins-ci.org/browse/JENKINS-44041
https://issues.jenkins-ci.org/browse/JENKINS-44127
https://issues.jenkins-ci.org/browse/JENKINS-44420

Fixed by https://github.com/jenkinsci/git-client-plugin/pull/243

Who knows why it randomly happens. Lets upgrade the plugin and claim it is fixed :]

Mentioned in SAL (#wikimedia-releng) [2017-05-31T20:21:10Z] <hashar> Jenkins: upgrading git-client-plugin 2.4.5..2.4.6 T166557

Should be good now. Lets monitor for a day or so and we can then mark this solved.

hashar closed this task as Resolved.Jun 5 2017, 10:31 AM

That was an issue in the Jenkins plugin for git. Solved by upgrading.

Peter reopened this task as Open.Jun 14 2017, 5:57 PM

We still got it: https://integration.wikimedia.org/ci/job/performance-webpagetest-wmf/7506/console

17:19:35 FATAL: java.io.IOException: Unexpected termination of the channel

https://integration.wikimedia.org/ci/job/performance-webpagetest-wmf/7506/console has the exact same stack trace. It ran on an instance named ci-jessie-wikimedia-700525.

There is apparently 3 seconds elapsing at the beginning before the communication channel between the Jenkins master and slave is terminated. https://wiki.jenkins-ci.org/display/JENKINS/Remoting+issue has a bunch of documentation about such issue and does mention a 3 seconds timeout.

The build ran on June 14th at 17:19:32 and terminated at 17:19:36:

17:19:32 Started by timer
17:19:32 Building remotely on ci-jessie-wikimedia-700525 (ci-jessie-wikimedia) in workspace /home/jenkins/workspace/performance-webpagetest-wmf
17:19:33 Using checkout strategy: Default
17:19:33 Last Built Revision: Revision a6e81e7713aae5d3c167f44bd0d6ff6359c1fdb4 (origin/master)
17:19:35 FATAL: java.io.IOException: Unexpected termination of the channel
17:19:35 java.io.EOFException
...
17:19:36 Caused: java.io.IOException: Unexpected termination of the channel
...
17:19:36 Caused: hudson.remoting.RequestAbortedException
...
17:19:36 IRC notifier plugin: Sending notification to: #wikimedia-perf
17:19:36 ERROR: Step ‘Delete workspace when build is done’ failed: no workspace for performance-webpagetest-wmf #7506
17:19:36 Finished: FAILURE

In the Jenkins server side logs I see:

17:19:25 INFO: [hudson.slaves.SlaveComputer tryReconnect] Attempting to reconnect ci-jessie-wikimedia-700525
17:19:31 WARNING: [hudson.node_monitors.ResponseTimeMonitor$1 monitor] Making ci-jessie-wikimedia-700525 offline because it’s not responding 
17:19:31 INFO: [hudson.plugins.gearman.AbstractWorkerThread run] ---- Starting Worker ci-jessie-wikimedia-700525_exec-0 (Wed Jun 14 17:19:31 UTC 2017)
17:19:31 INFO: [hudson.plugins.gearman.MyGearmanWorkerImpl setFunctions] ---- Worker ci-jessie-wikimedia-700525_exec-0 registering 308 functions
17:19:31 INFO: [hudson.plugins.gearman.MyGearmanWorkerImpl work] ---- Worker ci-jessie-wikimedia-700525_exec-0 starting work
17:19:32 [06/14/17 17:19:32] SSH Launch of ci-jessie-wikimedia-700525 on 10.68.21.189 completed in 7,444 ms

The connection to the slave got lost. It is reconnected and the jobs it can run are registered back to Zuul Gearman server. What is puzzling me is:

17:19:31 ... Worker ci-jessie-wikimedia-700525_exec-0 starting work

Which might indicate that it started a work from Zuul. Maybe that is racing with the web performance job that is not triggered by Zuul but by the Jenkins scheduler. I am dismissing that since from the zuul log no builds got triggered at that time.

Then still in Jenkins logs:

17:19:35 SEVERE: [hudson.remoting.SynchronousCommandTransport$ReaderThread run] I/O error in channel ci-jessie-wikimedia-700525

[hudson.plugins.gearman.AbstractWorkerThread stop]
17:19:35 INFO: ---- ci-jessie-wikimedia-700525_exec-0 Request to stop AWT: hudson.plugins.gearman.ExecutorWorkerThread@38392461
17:19:35 INFO: ---- ci-jessie-wikimedia-700525_exec-0   Thread: Thread[Gearman worker ci-jessie-wikimedia-700525_exec-0,5,main] name: Gearman worker ci-jessie-wikimedia-700525_exec-0
17:19:35 INFO: ---- ci-jessie-wikimedia-700525_exec-0   Worker: ci-jessie-wikimedia-700525_exec-0
17:19:35 INFO: ---- ci-jessie-wikimedia-700525_exec-0   Interrupting worker
17:19:35 INFO: ---- ci-jessie-wikimedia-700525_exec-0 Stop request done

[hudson.plugins.gearman.MyGearmanWorkerImpl shutDownWorker]
17:19:35 INFO: ---- Worker ci-jessie-wikimedia-700525_exec-0 commencing shutdown
17:19:35 INFO: ---- Worker ci-jessie-wikimedia-700525_exec-0 completed shutdown

17:19:36 INFO: [org.jruby.javasupport.JavaMethod invokeDirectWithExceptionHandling] Taking single use slave 'ci-jessie-wikimedia-700525' offline because of labels (ci-jessie-wikimedia)
hashar added a comment.EditedJun 15 2017, 8:32 AM

I found a bunch of errors in jenkins.log such as: SEVERE: [hudson.remoting.SynchronousCommandTransport$ReaderThread run] I/O error in channel ci-jessie-wikimedia-XXXXX

From this morning, two instances had the same error, although they were not processing builds:

06:55:52 I/O error in channel ci-jessie-wikimedia-701473
06:55:52 I/O error in channel ci-jessie-wikimedia-701475

One caused a build failure:

06:55:51 I/O error in channel ci-jessie-wikimedia-701472
mwext-testextension-hhvm-jessie #7826 main build action completed: FAILURE

https://integration.wikimedia.org/ci/job/mwext-testextension-hhvm-jessie/7826/console

06:55:49 Syncing...
06:55:49 + rsync --archive --compress --contimeout 3 rsync://castor.integration.eqiad.wmflabs:/caches/castor-mw-ext-and-skins/master/mwext-testextension-hhvm-jessie/ /home/jenkins
06:55:51 FATAL: command execution failed
06:55:51 java.io.EOFException
06:55:51 	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2634)
06:55:51 	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3109)
06:55:51 	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:851)
06:55:51 	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:348)
06:55:51 	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
06:55:51 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
06:55:51 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
06:55:51 Caused: java.io.IOException: Unexpected termination of the channel
06:55:51 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
06:55:51 Caused: java.io.IOException: Backing channel 'ci-jessie-wikimedia-701472' is disconnected.
06:55:51 	at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
06:55:51 	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
06:55:51 	at com.sun.proxy.$Proxy71.isAlive(Unknown Source)
06:55:51 	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043)
06:55:51 	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035)
06:55:51 	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
06:55:51 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
06:55:51 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
06:55:51 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
06:55:51 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:779)
06:55:51 	at hudson.model.Build$BuildExecution.build(Build.java:206)
06:55:51 	at hudson.model.Build$BuildExecution.doRun(Build.java:163)
06:55:51 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:534)
06:55:51 	at hudson.model.Run.execute(Run.java:1728)
06:55:51 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
06:55:51 	at hudson.model.ResourceController.execute(ResourceController.java:98)
06:55:51 	at hudson.model.Executor.run(Executor.java:405)
06:55:51 Build step 'Execute shell' marked build as failure

So somehow Jenkins loose connections to the agent for whatever reason. And it is not solely for the web performance jobs.

All the stacktraces mention java.io.EOFException. The Remoting issue doc mentions:

Failure mode on the master:
On master, the details of the connection loss is captured into the slave log file, and not the build log. So you'll have to go find archived slave launch logs under $JENKINS_HOME/logs/slaves/*/slave.log*, which records 10 more recent connection logs to that slave. This information is also captured in the support bundle from Support Core Plugin under /nodes/slave/*/launchLog/*. If you look at the bottom of them, you can see how the last 10 connections to the slave has ended.

Unfortunately since we delete the instances, the logs are garbage collected.

But I found a node that had the issue with logs still being around

ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:24] [SSH] Opening SSH connection to 10.68.23.40:22.
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:24] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:24] [SSH] Authentication successful.
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:24] [SSH] The remote users environment is:
...
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:24] [SSH] Checking java version of java
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:25] [SSH] java -version returned 1.7.0_131.
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:25] [SSH] Starting sftp client.
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:25] [SSH] Copying latest slave.jar...
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:25] [SSH] Copied 719,269 bytes.
ci-jessie-wikimedia-701557/slave.log.1:Expanded the channel window size to 4MB
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:25] [SSH] Starting slave process: cd "/home/jenkins" && java  -jar slave.jar
ci-jessie-wikimedia-701557/slave.log.1:<===[JENKINS REMOTING CAPACITY]===>^@^@^@channel started
ci-jessie-wikimedia-701557/slave.log.1:Slave.jar version: 3.7
ci-jessie-wikimedia-701557/slave.log.1:This is a Unix agent
ci-jessie-wikimedia-701557/slave.log.1:Evacuated stdout
ci-jessie-wikimedia-701557/slave.log.1:Agent successfully connected and online
ci-jessie-wikimedia-701557/slave.log.1:Slave JVM has not reported exit code. Is it still running?
ci-jessie-wikimedia-701557/slave.log.1:ERROR: Connection terminated
ci-jessie-wikimedia-701557/slave.log.1:[m[8mha:////4NMGEJDq8IZuRNXJVEAXgI8X8nOAql+8gFC5sOsysWZfAAAAWB+LCAAAAAAAAP9b85aBtbiIQSmjNKU4P08vOT+vOD8nVc8DzHWtSE4tKMnMz/PLL0ldFVf2c+b/lb5MDAwVRQxSaBqcITRIIQMEMIIUFgAAckCEiWAAAAA=[m[0mjava.io.EOFException
ci-jessie-wikimedia-701557/slave.log.1: at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2634)
ci-jessie-wikimedia-701557/slave.log.1: at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3109)
ci-jessie-wikimedia-701557/slave.log.1: at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:851)
ci-jessie-wikimedia-701557/slave.log.1: at java.io.ObjectInputStream.<init>(ObjectInputStream.java:348)
ci-jessie-wikimedia-701557/slave.log.1: at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
ci-jessie-wikimedia-701557/slave.log.1: at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
ci-jessie-wikimedia-701557/slave.log.1: at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
ci-jessie-wikimedia-701557/slave.log.1:Caused: java.io.IOException: Unexpected termination of the channel
ci-jessie-wikimedia-701557/slave.log.1: at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)
ci-jessie-wikimedia-701557/slave.log.1:[06/15/17 08:47:45] [SSH] Connection closed.

Reconnection:

ci-jessie-wikimedia-701525/slave.log:Agent successfully connected and online
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] Opening SSH connection to 10.68.23.40:22.
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] Authentication successful.
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] The remote users environment is:
...
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] Checking java version of java
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] java -version returned 1.7.0_131.
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] Starting sftp client.
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] Copying latest slave.jar...
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] Copied 719,269 bytes.
ci-jessie-wikimedia-701557/slave.log:Expanded the channel window size to 4MB
ci-jessie-wikimedia-701557/slave.log:[06/15/17 08:47:56] [SSH] Starting slave process: cd "/home/jenkins" && java  -jar slave.jar
ci-jessie-wikimedia-701557/slave.log:<===[JENKINS REMOTING CAPACITY]===>^@^@^@channel started
ci-jessie-wikimedia-701557/slave.log:Slave.jar version: 3.7
ci-jessie-wikimedia-701557/slave.log:This is a Unix agent
ci-jessie-wikimedia-701557/slave.log:Evacuated stdout
ci-jessie-wikimedia-701557/slave.log:Agent successfully connected and online
08:47:24 Attempting to reconnect ci-jessie-wikimedia-701557
08:47:33 Making ci-jessie-wikimedia-701557 offline because it’s not responding
hashar added a comment.EditedJun 15 2017, 9:21 AM

From a zgrep -c 'I/O' /var/log/jenkins/jenkins*:

Log fileFile timeI/O match
jenkins.lognow12
jenkins.log.1Jun 15 6:2591
jenkins.log.2.gzJun 14 6:2554
jenkins.log.3.gzJun 13 6:2561
jenkins.log.4.gzJun 11 6:2573
jenkins.log.5.gzJun 9 6:2535
jenkins.log.6.gzJun 8 6:2555
jenkins.log.7.gzJun 7 6:2575
jenkins.log.8.gzJun 4 6:2528
jenkins.log.9.gzJun 2 6:2538
jenkins.log.10.gzJun 1 6:2556
jenkins.log.11.gzMay 31 6:2578
jenkins.log.12.gzMay 29 6:2564
jenkins.log.13.gzMay 26 6:25111
jenkins.log.14.gzMay 24 6:2547
jenkins.log.15.gzMay 23 6:2599
oldoldold
jenkins.log.17.gzFeb 15 20:06 3
jenkins.log.18.gzFeb 15 6:2510
jenkins.log.19.gzFeb 14 6:252
jenkins.log.20.gzFeb 13 6:25 0

I am absolutely at what could be the issue. It could be a nasty bug in Jenkins itself or the network sessions being interrupted somehow. But really I am clueless :(

Aklapper renamed this task from WebPageTest job fails intemittently with "java.io.IOException: Unexpected termination of the channel" to WebPageTest job fails intermittently with "java.io.IOException: Unexpected termination of the channel".Jun 15 2017, 10:55 AM
hashar updated the task description. (Show Details)Jun 15 2017, 2:22 PM
hashar updated the task description. (Show Details)Jun 15 2017, 2:39 PM
hashar updated the task description. (Show Details)Jun 15 2017, 2:47 PM

See https://wiki.jenkins-ci.org/display/JENKINS/Remoting+issue for removing issues (which the traceback leads to)

Finally had a node showing the issue and kept around by nodepool. Node is ci-jessie-wikimedia-715440

On Jenkins side:

Jun 27 08:33:41 WARNING: Making ci-jessie-wikimedia-715440 offline because it’s not responding
Jun 27 08:33:45 SSH Launch of ci-jessie-wikimedia-715440 on 10.68.20.246 completed in 9,159 ms
Jun 27 08:33:48 SEVERE: [hudson.remoting.SynchronousCommandTransport$ReaderThread run] I/O error in channel ci-jessie-wikimedia-715440
Jun 27 08:33:48 contint1001 jenkins[22715]: WARNING: [hudson.plugins.gearman.MyGearmanWorkerImpl submitFunction] ---- Worker ci-jessie-wikimedia-715440_exec-0 exception while executing function hudson.plugins.gearman.StartJobWorker

The connection is retried a few times and eventually the agent is put online. From the agent log

[06/27/17 08:34:06] [SSH] Opening SSH connection to 10.68.20.246:22.
[06/27/17 08:34:06] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
[06/27/17 08:34:06] [SSH] Authentication successful.
[06/27/17 08:34:06] [SSH] The remote users environment is:
<snip>
[06/27/17 08:34:06] [SSH] Checking java version of java
[06/27/17 08:34:06] [SSH] java -version returned 1.7.0_131.
[06/27/17 08:34:06] [SSH] Starting sftp client.
[06/27/17 08:34:06] [SSH] Copying latest slave.jar...
[06/27/17 08:34:06] [SSH] Copied 719,269 bytes.
Expanded the channel window size to 4MB
[06/27/17 08:34:06] [SSH] Starting slave process: cd "/home/jenkins" && java  -jar slave.jar
<===[JENKINS REMOTING CAPACITY]===>channel started
Slave.jar version: 3.7
This is a Unix agent
Evacuated stdout
Agent successfully connected and online

Although eventually it is connected:

Jun 27 08:34:09 contint1001 jenkins[22715]: [06/27/17 08:34:09] SSH Launch of ci-jessie-wikimedia-715440 on 10.68.20.246 completed in 3,468 ms

... the computer page shows a Connection was broken

java.io.EOFException
	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2634)
	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3109)
	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:851)
	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:348)
	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
Caused: java.io.IOException: Unexpected termination of the channel
	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:73)

And from https://integration.wikimedia.org/ci/log/SEVERE/ and https://integration.wikimedia.org/ci/log/hudson.remoting/

Jun 27, 2017 8:33:48 AM SEVERE hudson.remoting.SynchronousCommandTransport$ReaderThread run
I/O error in channel ci-jessie-wikimedia-715440
java.io.EOFException
Jun 27, 2017 8:33:55 AM SEVERE hudson.remoting.SynchronousCommandTransport$ReaderThread run
I/O error in channel ci-jessie-wikimedia-715440
java.io.EOFException
Jun 27, 2017 8:34:03 AM SEVERE hudson.remoting.SynchronousCommandTransport$ReaderThread run
I/O error in channel ci-jessie-wikimedia-715440
java.io.EOFException

Zuul tried to schedule a build but the host got an I/O error meantime which causes the status of the build to be None:

2017-06-27 08:33:39,597 INFO zuul.Gearman: Launch job mwext-VisualEditor-publish (uuid: 6fcec9a20b61401caf386d67dd02738b) for change <Change 0x7f60f1af5a90 361489,1> with dependent changes []
2017-06-27 08:33:48,504 DEBUG zuul.IndependentPipelineManager: Build <Build 6fcec9a20b61401caf386d67dd02738b of mwext-VisualEditor-publish on <Worker Unknown>> completed
2017-06-27 08:33:48,504 DEBUG zuul.IndependentPipelineManager: Item <QueueItem 0x7f60f44f6210 for <Change 0x7f60f1af5a90 361489,1> in postmerge> status is now:
 Project mediawiki/extensions/VisualEditor change 361489,1 based on None
  mwext-VisualEditor-publish: None

Zuul reschedules it with a different uuid:

2017-06-27 08:33:48,508 INFO zuul.Gearman: Launch job mwext-VisualEditor-publish (uuid: 569bcbf276944c928b9d81d2d897efcb) for change <Change 0x7f60f1af5a90 361489,1> with dependent changes []

And it ends up on the same node (there is apparently race condition between the node being faulty and the function being unregistered in Zuul).

I have logged on the instance. dmesg starts at 08:32:53 (P5635) the last lines show:

[Tue Jun 27 08:32:55 2017] intel_rapl: no valid rapl domains found in package 0
[Tue Jun 27 08:32:56 2017] intel_rapl: no valid rapl domains found in package 0
[Tue Jun 27 08:32:57 2017] random: nonblocking pool is initialized
[Tue Jun 27 08:35:27 2017] hrtimer: interrupt took 6398177 ns

cloud-init last lines:

Cloud-init v. 0.7.6 running 'modules:final' at Tue, 27 Jun 2017 08:33:07 +0000. Up 14.14 seconds.
Cloud-init v. 0.7.6 finished at Tue, 27 Jun 2017 08:33:07 +0000. Datasource DataSourceEc2.  Up 14.31 seconds

top shows ksoftirqd has some high IRQ CPU usage and high TIME+

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                   
  3 root      20   0       0      0      0 S  18.6  0.0  14:38.22 ksoftirqd/0                                                                                               
 13 root      20   0       0      0      0 S   0.0  0.0   0:17.60 ksoftirqd/1

Nothing conclusive :(

I have made the slave online again and it managed to run a build just fine. So really I have no idea what might be happening :-(

hashar added a comment.Aug 2 2017, 9:55 AM

It is still occurring though less frequently.

Peter added a comment.Aug 29 2017, 5:15 AM

I haven't seen this after I git back from vacation (a week ago). Maybe it was fixed when we moved to the dedicated slave(s)?

Krinkle closed this task as Resolved.Sep 4 2017, 7:44 PM

Let's re-open if we encounter it again.