integration-slave-precise-1011 and integration-slave-precise-1014 went offline
Closed, ResolvedPublic

Assigned To

Authored By

	hashar
	Aug 24 2015, 2:35 PM

Description

The slave https://integration.wikimedia.org/ci/computer/integration-slave-precise-1011/ went offline, it apparently rejects the SSH key:

[08/24/15 14:34:34] [SSH] Opening SSH connection to 10.68.17.70:22.
ERROR: Server rejected the 1 private key(s) for jenkins-deploy (credentialId:ae711ff4-813e-4462-9a27-21bdbd4fdcb9/method:publickey)
[08/24/15 14:34:34] [SSH] Authentication failed.
hudson.AbortException: Authentication failed.
	at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1178)
	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:701)
	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:696)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
[08/24/15 14:34:34] Launch failed - cleaning up connection
[08/24/15 14:34:34] [SSH] Connection closed.

Related Objects

Mentioned Here: T109497: elasticsearch 1.6.0 fails to start after reboot

Event Timeline

hashar created this task. Aug 24 2015, 2:35 PM

hashar raised the priority of this task to Needs Triage.

hashar updated the task description.

hashar added a project: Continuous-Integration-Infrastructure.

hashar subscribed.

Restricted Application added a subscriber: Aklapper. Aug 24 2015, 2:35 PM

Can't SSH to it. Hard rebooting the instance via the Horizon dashboard.

There seems to be an issue with Elasticsearch:

 * Starting Elasticsearch Server       [fail]
[fail]
[fail]
[fail]
[fail]

Used Salt to kill the hung startup script:

root@integration-saltmaster:~# salt 'integration-slave-precise-1011*' cmd.run 'ps -A|grep elast'
integration-slave-precise-1011.integration.eqiad.wmflabs:
     1132 ?        00:00:00 S20elasticsearc
root@integration-saltmaster:~# salt 'integration-slave-precise-1011*' cmd.run 'kill 1132'
integration-slave-precise-1011.integration.eqiad.wmflabs:
root@integration-saltmaster:~# salt 'integration-slave-precise-1011*' cmd.run 'ps -A|grep elast'
integration-slave-precise-1011.integration.eqiad.wmflabs:
root@integration-saltmaster:~#
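
The ps/kill round trip above could be wrapped in a small helper. This is only a minimal sketch, assuming `ps -A` style input; `pids_matching` is a hypothetical name and not part of Salt or of the commands actually run here:

```shell
# pids_matching PATTERN
# Reads `ps -A` style lines on stdin (columns: PID TTY TIME CMD) and prints
# the PID of every line whose command column contains PATTERN as a substring.
pids_matching() {
    awk -v pat="$1" 'index($4, pat) > 0 { print $1 }'
}
```

On the minion it could be used as `ps -A | pids_matching elast`, with the resulting PID passed to `kill` as in the session above.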

Upgraded the packages and removed a stale lock file, /var/lib/puppet/state/agent_catalog_run.lock, dating from Aug 17th 15:06.
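
Before deleting such a lock by hand, it may be worth checking that no agent still owns it. A rough sketch, assuming the lock file holds the PID of the Puppet agent that created it (an assumption, not verified on this instance); `lock_is_stale` is a hypothetical helper name:

```shell
# lock_is_stale FILE
# Returns 0 when FILE exists but does not name a live process, 1 otherwise.
# Assumption: the file content is a single PID. Run as root so that
# `kill -0` is not refused for permission reasons.
lock_is_stale() {
    lock="$1"
    [ -f "$lock" ] || return 1           # no lock file: nothing to clean up
    pid=$(cat "$lock" 2>/dev/null)
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;         # empty or non-numeric: treat as stale
    esac
    if kill -0 "$pid" 2>/dev/null; then
        return 1                         # owner is still running
    fi
    return 0
}
```

For example: `lock_is_stale /var/lib/puppet/state/agent_catalog_run.lock && rm /var/lib/puppet/state/agent_catalog_run.lock`.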

Running Puppet then hangs while starting Elasticsearch:

4089 pts/1    S+     0:16                      \_ /usr/bin/ruby /usr/bin/puppet agent -tv
4983 ?        Ss     0:00                          \_ /bin/sh /etc/init.d/elasticsearch start
5308 ?        S      0:00                              \_ sleep 1

root@integration-slave-precise-1011:~# /etc/init.d/elasticsearch start
 * Starting Elasticsearch Server
touch: cannot touch `/var/run/elasticsearch/elasticsearch.pid': No such file or directory
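
The error suggests the PID directory is missing when the init script runs. On Precise /var/run lives on a tmpfs, so a directory created there does not survive a reboot. One possible workaround, sketched here and not necessarily the fix adopted in T109497, is to recreate the directory before starting the service. `ensure_pid_dir` is a hypothetical helper, and the `elasticsearch` owner in the usage line is an assumption:

```shell
# ensure_pid_dir DIR [OWNER]
# Creates DIR (and any missing parents) and, when OWNER is given,
# hands the directory over to that user[:group].
ensure_pid_dir() {
    dir="$1"
    owner="${2:-}"
    mkdir -p "$dir" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dir" || return 1
    fi
    return 0
}
```

For example: `ensure_pid_dir /var/run/elasticsearch elasticsearch && /etc/init.d/elasticsearch start`.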

That is T109497

The machine is back. The elasticsearch issue is T109497: elasticsearch 1.6.0 fails to start after reboot

hashar renamed this task from integration-slave-precise-1011 went offline to integration-slave-precise-1011 and integration-slave-precise-1014 went offline. Aug 24 2015, 3:11 PM

hashar set Security to None.
