maintenance-disconnect-full-disks gets stuck
Closed, ResolvedPublic

Description

Sometimes the maintenance-disconnect-full-disks job gets stuck, and someone has to abort it manually before it can run again (https://tools.wmflabs.org/sal/log/AWW62T8JoDEJc1hAtCPG).

The job should have a timeout.
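A Jenkins Pipeline `timeout` step would cover this. A minimal sketch — the real script lives in integration/config, and the node label and stage body here are placeholders, not the actual job:

```groovy
// Sketch: abort the build automatically instead of requiring a manual abort.
// Label and body are illustrative placeholders.
timeout(time: 5, unit: 'MINUTES') {
    node('contint1001') {
        stage('disconnect full disks') {
            // iterate over agents, disconnect those whose disk is full
        }
    }
}
```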

Restricted Application added a subscriber: Aklapper. · Sep 11 2018, 5:59 PM
thcipriani triaged this task as Normal priority. · Sep 11 2018, 5:59 PM
thcipriani moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

Change 460174 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Refactor maintenance to timeout after 5 minutes

https://gerrit.wikimedia.org/r/460174

Mentioned in SAL (#wikimedia-releng) [2018-09-13T08:42:11Z] <hashar> aborted maintenance-disconnect-full-disks job | T204077

hashar added a subscriber: hashar. · Sep 13 2018, 8:43 AM

The job eventually got stuck (console of build #2389) with:

Started by user thcipriani
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] node
Running on contint1001 in /srv/jenkins-slave/workspace/maintenance-disconnect-full-disks@5
[Pipeline] End of Pipeline
java.lang.ArrayIndexOutOfBoundsException: 0
	at org.jenkinsci.plugins.workflow.cps.DSL$ThreadTaskImpl.invokeBody(DSL.java:588)
	at org.jenkinsci.plugins.workflow.cps.DSL$ThreadTaskImpl.eval(DSL.java:559)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:184)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Finished: FAILURE

Some other builds did complete afterwards, though, I guess because the job can run concurrently.

> Mentioned in SAL (#wikimedia-releng) [2018-09-13T08:42:11Z] <hashar> aborted maintenance-disconnect-full-disks job | T204077

I don't see this in the logs for this job; the last aborted build was https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/1974/, which I aborted myself.

> The job got stuck eventually ( console of build #2389 ) with: java.lang.ArrayIndexOutOfBoundsException: 0 (full stack trace quoted above)

That was me working on the job last night; the build lasted less than a second, so nothing hung there.

Change 460174 merged by jenkins-bot:
[integration/config@master] Refactor maintenance to timeout after 5 minutes

https://gerrit.wikimedia.org/r/460174

thcipriani closed this task as Resolved. · Sep 14 2018, 4:22 PM

Deployed this Thursday, and a build actually did time out yesterday without anyone having to abort it \o/ :
https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/2641/console

Calling this resolved.

hashar reopened this task as Open. · Sep 15 2018, 12:47 PM

It got stuck again an hour or so ago :\

A potential theory: the script iterates over all slaves, but a Nodepool instance might be deleted during execution, and that would leave the job stuck. Maybe skip all slaves with a hostname prefix of ci-jessie?
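That skip could be a simple hostname check inside the loop. A sketch, assuming the script walks Jenkins' computer list — the actual integration/config script may be structured differently:

```groovy
// Sketch: skip ephemeral Nodepool agents (hostname prefix ci-jessie),
// which Nodepool can delete while the loop is still running.
import jenkins.model.Jenkins

for (computer in Jenkins.instance.computers) {
    if (computer.name.startsWith('ci-jessie')) {
        continue  // ephemeral instance, may vanish mid-iteration
    }
    // ... check disk usage and disconnect the agent if its disk is full ...
}
```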

Change 461174 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Wrap maintenance with timeout

https://gerrit.wikimedia.org/r/461174

Change 461174 merged by jenkins-bot:
[integration/config@master] Wrap maintenance with timeout

https://gerrit.wikimedia.org/r/461174
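Presumably the difference from the first patch is where the timeout sits: a `timeout` placed inside the `node` block cannot fire while the build is still waiting for an executor, so the wrapping has to sit outside the node allocation. A sketch of the two placements — assumed from the patch subjects, not the actual diffs:

```groovy
// Inner placement: bounds only the body, not the wait for an executor.
node('contint1001') {
    timeout(time: 5, unit: 'MINUTES') {
        // maintenance work
    }
}

// Outer placement: bounds the whole run, including node allocation.
timeout(time: 5, unit: 'MINUTES') {
    node('contint1001') {
        // maintenance work
    }
}
```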

thcipriani closed this task as Resolved. · Sep 21 2018, 3:33 PM

https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/4319/console got stuck in the same way I'd observed between first closing this task and deploying the second patch, but this time it aborted on its own.

Closing this again.

Timeout wrapping solved it, I guess. Thank you!