integration-slave-trusty-1012, -1013, and -1015 unresponsive
Closed, Resolved · Public

Description

These three instances have been unresponsive for the last 24 hours. I'm also not able to establish an SSH connection.

integration-slave-precise-1011.eqiad.wmflabs: 1
integration-slave-precise-1012.eqiad.wmflabs: 1
integration-slave-precise-1013.eqiad.wmflabs: 1
integration-slave-precise-1014.eqiad.wmflabs: 1
integration-slave-trusty-1011.eqiad.wmflabs: 1
integration-slave-trusty-1014.eqiad.wmflabs: 1
integration-slave-trusty-1016.eqiad.wmflabs: 1
integration-slave-trusty-1017.eqiad.wmflabs: 1

Connection closed:

  • integration-slave-trusty-1012.eqiad.wmflabs
  • integration-slave-trusty-1013.eqiad.wmflabs
  • integration-slave-trusty-1015.eqiad.wmflabs
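A check like the one above can be approximated without a full SSH login. The following is a minimal sketch, not the command that was actually run for this task: it assumes that a healthy instance sends its SSH identification string (a line starting with `SSH-`) right after the TCP connection opens, and treats a refused, timed-out, or immediately closed connection as unresponsive. The function names `ssh_banner` and `split_by_responsiveness` are hypothetical.

```python
import socket


def ssh_banner(host, port=22, timeout=5.0):
    """Return the server's SSH identification line, or None if none arrives.

    Hypothetical helper: it only reads the first bytes the server sends and
    does not authenticate or run anything on the host.
    """
    try:
        # The timeout applies to the connect and to the following recv().
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(256)
    except OSError:
        # Covers DNS failure, refused connection, reset, and timeout.
        return None
    # An empty read means the peer closed the connection without a banner.
    for line in data.decode("ascii", errors="replace").splitlines():
        if line.startswith("SSH-"):
            return line.strip()
    return None


def split_by_responsiveness(hosts, probe=ssh_banner):
    """Partition hosts into (responsive, unresponsive) using the given probe."""
    responsive, unresponsive = [], []
    for host in hosts:
        if probe(host):
            responsive.append(host)
        else:
            unresponsive.append(host)
    return responsive, unresponsive
```

Calling `split_by_responsiveness(["integration-slave-trusty-1012.eqiad.wmflabs", "integration-slave-trusty-1014.eqiad.wmflabs"])` from a machine that can resolve the `.eqiad.wmflabs` names would return the two lists. A host that still accepts connections and sends a banner but is otherwise wedged would be reported as responsive, so this only narrows down the candidates.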

YuviPanda: Krinkle: I get "Connection closed by Unknown", usually this means something OOM'd
YuviPanda: Krinkle: no output in 'get console output' due to wikitech flakiness, I presume.

Krinkle created this task. Jun 7 2015, 8:42 PM
Krinkle updated the task description.
Krinkle raised the priority of this task to Unbreak Now!.
Krinkle added subscribers: Krinkle, hashar, yuvipanda.
Restricted Application added a subscriber: Aklapper. Jun 7 2015, 8:42 PM

https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/19823/consoleFull was on integration-slave-trusty-1012 and timed out after 30 minutes. Should we depool the stuck trusty slaves for now?

I marked all 3 slaves as offline in Jenkins.

hashar closed this task as Resolved. Jun 8 2015, 8:55 AM
hashar claimed this task.

Seems it was a transient labs issue. I have rebooted the three instances and repooled them:

integration-slave-trusty-1012.eqiad.wmflabs
integration-slave-trusty-1013.eqiad.wmflabs
integration-slave-trusty-1015.eqiad.wmflabs

Restricted Application added subscribers: Jay8g, TerraCodes.