Page MenuHomePhabricator

integration-slave-precise-1011 and integration-slave-precise-1014 went offline
Closed, ResolvedPublic

Description

The slave https://integration.wikimedia.org/ci/computer/integration-slave-precise-1011/ went offline, it apparently rejects the SSH key:

[08/24/15 14:34:34] [SSH] Opening SSH connection to 10.68.17.70:22.
ERROR: Server rejected the 1 private key(s) for jenkins-deploy (credentialId:ae711ff4-813e-4462-9a27-21bdbd4fdcb9/method:publickey)
[08/24/15 14:34:34] [SSH] Authentication failed.
hudson.AbortException: Authentication failed.
	at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1178)
	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:701)
	at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:696)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
[08/24/15 14:34:34] Launch failed - cleaning up connection
[08/24/15 14:34:34] [SSH] Connection closed.

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar subscribed.

Can't SSH to it. Hard rebooting the instance via the Horizon dashboard.

Seems there is an issue with ElasticSearch:

 * Starting Elasticsearch Server       [80G [74G[[31mfail[39;49m]
[74G[[31mfail[39;49m]
[74G[[31mfail[39;49m]
[74G[[31mfail[39;49m]
[74G[[31mfail[39;49m]

Went through Salt to kill the startup script:

root@integration-saltmaster:~# salt 'integration-slave-precise-1011*' cmd.run 'ps -A|grep elast'
integration-slave-precise-1011.integration.eqiad.wmflabs:
     1132 ?        00:00:00 S20elasticsearc
root@integration-saltmaster:~# salt 'integration-slave-precise-1011*' cmd.run 'kill 1132'
integration-slave-precise-1011.integration.eqiad.wmflabs:
root@integration-saltmaster:~# salt 'integration-slave-precise-1011*' cmd.run 'ps -A|grep elast'
integration-slave-precise-1011.integration.eqiad.wmflabs:
root@integration-saltmaster:~#

Upgraded the packages and I removed a stalled lock /var/lib/puppet/state/agent_catalog_run.lock from Aug 17th 15:06.

Running puppet deadlocks on starting elasticsearch:

4089 pts/1    S+     0:16                      \_ /usr/bin/ruby /usr/bin/puppet agent -tv
4983 ?        Ss     0:00                          \_ /bin/sh /etc/init.d/elasticsearch start
5308 ?        S      0:00                              \_ sleep 1
root@integration-slave-precise-1011:~# /etc/init.d/elasticsearch start
 * Starting Elasticsearch Server                                                                                                        touch: cannot touch `/var/run/elasticsearch/elasticsearch.pid': No such file or directory

That is T109497

hashar claimed this task.

The machine is back. The elasticsearch issue is T109497: elasticsearch 1.6.0 fails to start after reboot

hashar renamed this task from integration-slave-precise-1011 went offline to integration-slave-precise-1011 and integration-slave-precise-1014 went offline.Aug 24 2015, 3:11 PM
hashar set Security to None.