Sometimes the maintenance-disconnect-full-disks job gets stuck and someone has to abort it manually before it can run again (https://tools.wmflabs.org/sal/log/AWW62T8JoDEJc1hAtCPG).
It should have a timeout.
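A sketch of what such a timeout could look like in the job's Pipeline definition (this is an assumption about the job's structure, not the actual integration/config code; the node label is illustrative, and the 5-minute figure matches the later patch title):

```groovy
// Hypothetical sketch: wrap the maintenance work in a timeout so a hung
// disk check is aborted automatically instead of blocking forever.
timeout(time: 5, unit: 'MINUTES') {
    node('contint1001') {
        // ... per-agent disk-space checks and disconnect logic ...
    }
}
```

With the `timeout` step on the outside, Jenkins aborts the enclosed body once the limit is reached, so no manual intervention is needed.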
Subject | Repo | Branch | Lines +/-
---|---|---|---
Wrap maintenance with timeout | integration/config | master | +15 -20
Refactor maintenance to timeout after 5 minutes | integration/config | master | +77 -89
Change 460174 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Refactor maintenance to timeout after 5 minutes
Mentioned in SAL (#wikimedia-releng) [2018-09-13T08:42:11Z] <hashar> aborted maintenance-disconnect-full-disks job | T204077
The job got stuck eventually (console of build #2389) with:

```
Started by user thcipriani
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] node
Running on contint1001 in /srv/jenkins-slave/workspace/maintenance-disconnect-full-disks@5
[Pipeline] End of Pipeline
java.lang.ArrayIndexOutOfBoundsException: 0
	at org.jenkinsci.plugins.workflow.cps.DSL$ThreadTaskImpl.invokeBody(DSL.java:588)
	at org.jenkinsci.plugins.workflow.cps.DSL$ThreadTaskImpl.eval(DSL.java:559)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:184)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Finished: FAILURE
```
Some later builds did complete, though, presumably because the job can run concurrently.
I don't see this in the logs for this job; the last aborted build was https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/1974/, which I aborted myself while working on the job last night. That build ran for < 1 second, so nothing hung there.
Change 460174 merged by jenkins-bot:
[integration/config@master] Refactor maintenance to timeout after 5 minutes
Deployed this Thursday, and a build actually did time out yesterday without anyone having to abort it \o/ :
https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/2641/console
Calling this resolved.
It was stuck again an hour or so ago :\
A potential theory: the script iterates over all slaves, but a Nodepool instance might get deleted during execution, which would cause the job to hang. Maybe skip all slaves whose hostname starts with ci-jessie?
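The proposed skip could look roughly like this (a sketch only: the way the job actually enumerates agents is an assumption, and the loop body is a placeholder for the real disk-space check):

```groovy
import jenkins.model.Jenkins

// Hypothetical sketch: ignore ephemeral Nodepool agents (hostname prefix
// "ci-jessie"), since they may be deleted while the maintenance job runs.
for (computer in Jenkins.instance.computers) {
    if (computer.name.startsWith('ci-jessie')) {
        continue  // transient Nodepool instance, skip it
    }
    // ... disk-space check / disconnect logic for this agent ...
}
```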
Change 461174 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Wrap maintenance with timeout
Change 461174 merged by jenkins-bot:
[integration/config@master] Wrap maintenance with timeout
https://integration.wikimedia.org/ci/job/maintenance-disconnect-full-disks/4319/console got stuck in the same way I'd observed between first closing this task and deploying the second patch, but this time it aborted on its own.
Closing this again.