Page MenuHomePhabricator

castor-save-workspace-cache aborted during postbuild
Open, Needs TriagePublic

Description

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php80-noselenium-docker/25222/console apparently ran properly. During the post build step, the castor-save-workspace-cache failed:

15:18:44 [PostBuildScript] - [INFO] Executing post build scripts.
15:18:44 Waiting for the completion of castor-save-workspace-cache
15:18:44 [PostBuildScript] - [ERROR] An error occured during post-build processing.
15:20:18 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: hudson.AbortException: castor-save-workspace-cache aborted.
15:20:18 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
15:20:18 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
15:20:18 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
15:20:18 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
15:20:18 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
15:20:18 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
15:20:18 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
15:20:18 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
15:20:18 	at hudson.model.Build$BuildExecution.post2(Build.java:179)
15:20:18 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
15:20:18 	at hudson.model.Run.execute(Run.java:1925)
15:20:18 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
15:20:18 	at hudson.model.ResourceController.execute(ResourceController.java:101)
15:20:18 	at hudson.model.Executor.run(Executor.java:442)
15:20:18 Caused by: hudson.AbortException: castor-save-workspace-cache aborted.
15:20:18 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:180)
15:20:18 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
15:20:18 	... 13 more
15:20:18 Build step 'Execute scripts' changed build result to FAILURE
15:20:18 Build step 'Execute scripts' marked build as failure

The build was for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalPreferences/+/978582 in gate-and-submit and I don't think it got canceled by Zuul.

Note how the the job triggering waited 1m34 before getting aborted.

Versions of plugins appearing in the stacktrace:

postbuildscript3.2.0-550.v88192b_d3e922
parameterized-trigger787.v665fcf2a_830b_

I have patched the parameterized-trigger plugin in T282893 to have it cancel the triggered job when the parent job receives an InterruptedException. The frame hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:180) is:

src/main/java/hudson/plugins/parameterizedtrigger/TriggerBuilder.java
179                             } catch (CancellationException x) {
180                                 throw new AbortException(p.getFullDisplayName() +" aborted.");
181                             }

Which mean the castor-save-workspace-cache caused a CancellationException? If I read it right it neither:

  • got interrupted (my patch cause a line to be printed "Build aborting: cancelling queued project..."
  • nor started (the code would print a line with the url to the job followed by "started")

So my guess is it got cancelled while future.waitForStart(). Maybe there is a timeout. At the time the issue happened, there were a lot of build going on and I highly suspect there was a fairly large queue of jobs waiting.

The job has a 1 minute build timeout, but that is less than the observed 1m34 and the job was in the queue rather than start.

Event Timeline

We should both fix this but also castor-save-workspace-cache successes or failures shouldn't ever affect the job result; it's an infrastructure change, and either way the test result is what it is at this point.

I'm seeing this quite a lot today, see e.g. this search if you need more examples.

Happened in two gate-and-submit builds (1, 2) for the same change today.

Again today at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1027192, happened consistently two times in a row. Reported by @pmiazga.

4 May 22:26
5 May 11:08:
 exec docker run --entrypoint=bash --volume /srv/jenkins/workspace/mediawiki-quibble-vendor-mysql-php81@2/src:/src --volume /srv/jenkins/workspace/mediawiki-quibble-vendor-mysql-php81@2/cache:/cache --volume /srv/jenkins/workspace/mediawiki-quibble-vendor-mysql-php81@2/log:/log --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-vendor-mysql-php81 --label jenkins.build=564 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/quibble-buster-php81:1.7.0-s2 -c '([[ "$ZUUL_PIPELINE" =~ ^(gate-and-submit|postmerge) ]] && npm cache verify) || :'
++ /usr/bin/env
++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
Cache verified and compressed (/cache/npm/_cacache)
Content verified: 886 (35735934 bytes)
Index entries: 886
Finished in 3.752s
[PostBuildScript] - [INFO] Executing post build scripts.
Waiting for the completion of castor-save-workspace-cache
[PostBuildScript] - [ERROR] An error occured during post-build processing.
org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: hudson.AbortException: castor-save-workspace-cache aborted.
	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
	at hudson.model.Build$BuildExecution.post2(Build.java:179)
	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
	at hudson.model.Run.execute(Run.java:1918)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:442)
Caused by: hudson.AbortException: castor-save-workspace-cache aborted.
	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:180)
	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
	... 13 more
Build step 'Execute scripts' changed build result to FAILURE

This has been happening a lot during the hackathon.

The wrapped exception is hudson.AbortException which extends IOException. The job got aborted for whatever reason, in the postbuildscript plugin:

	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
168     private boolean processBuildSteps(boolean endOfMatrixBuild) throws PostBuildScriptException {
169 
170         try {
171             boolean everyStepSuccessful = true;
172 +----- 16 lines: for (PostBuildStep postBuildStep : config.getBuildSteps()) {-------------------------------------------------------------------------------------------------------------
188             return everyStepSuccessful;
189         } catch (IOException | InterruptedException ioe) {
190             throw new PostBuildScriptException(ioe);
191         }           
192     }

That is caught and rethrown wrapped in PostBuildScriptException.

	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
87     private boolean processScripts(boolean endOfMatrixBuild) throws PostBuildScriptException {
88         @SuppressWarnings("NonShortCircuitBooleanExpression")
89         boolean everyScriptSuccessful = processScriptFiles(endOfMatrixBuild)
90             & processGroovyScripts(endOfMatrixBuild)
91             & processBuildSteps(endOfMatrixBuild);
92         return everyScriptSuccessful || failOrUnstable();
93     }

It let it through.

	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
76     public boolean process(boolean endOfMatrixBuild) {
77         logger.info(Messages.PostBuildScript_ExecutingPostBuildScripts());
78         try {
79             return processScripts(endOfMatrixBuild);
80         } catch (PostBuildScriptException pse) {                                           
81             logger.error(Messages.PostBuildScript_ProblemOccured(), pse);
82             failOrUnstable();
83             return false;
84         }
85     }

That is the parent caller and if there is any exception it mark it FAILED (or UNSTABLE if configured).

Looking at the wrapped AbortException:

Caused by: hudson.AbortException: castor-save-workspace-cache aborted.
	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:180)
	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)

That is different from the InterruptException I have fixed with https://github.com/jenkinsci/parameterized-trigger-plugin/pull/363/ (while looking at T282893). My fix was to rethrow the InterruptException in order to have it caught:

src/main/java/hudson/plugins/parameterizedtrigger/TriggerBuilder.java
152                         for (QueueTaskFuture<AbstractBuild> future : futures.get(p)) {
153                             try {
154 +---------- 47 lines: if (future != null) {-----------------------------------------------------------------------------------------------------------------------------------------------
201                             } catch (CancellationException x) {
202                                 throw new AbortException(p.getFullDisplayName() + " aborted.");
203                             }
204                         }

That is the AbortException that is not handled by the postbuildscript plugin.

I am pasting a WIP fix here in case I am hit by a winning lottery ticket:

--- a/src/main/java/org/jenkinsci/plugins/postbuildscript/processor/Processor.java
+++ b/src/main/java/org/jenkinsci/plugins/postbuildscript/processor/Processor.java
@@ -1,6 +1,7 @@
 package org.jenkinsci.plugins.postbuildscript.processor;
 
 import com.google.common.base.Strings;
+import hudson.AbortException;
 import hudson.EnvVars;
 import hudson.FilePath;
 import hudson.Launcher;
@@ -177,11 +178,16 @@ public class Processor {
                 }
 
                 for (BuildStep buildStep : postBuildStep.getBuildSteps()) {
-                    if (!buildStep.perform(build, launcher, listener)) {
-                        everyStepSuccessful = false;
-                        if (postBuildStep.isStopOnFailure()) {
-                            return false;
-                        }
+                    boolean buildSucceed;
+                    try {
+                        buildSucceed = buildStep.perform(build, launcher, listener);
+                    } catch (AbortException e) {
+                        buildSucceed = false;
+                    }
+                    everyStepSuccessful &= buildSucceed;
+
+                    if (postBuildStep.isStopOnFailure()) {
+                        return false;
                     }
                 }
             }

I have sent a patch upstream https://github.com/jenkinsci/postbuildscript-plugin/pull/219

It is not ideal though but that at least would cause cancelled builds as failures which is an improvement.

I have updated the postbuild script plugin to include https://github.com/jenkinsci/postbuildscript-plugin/pull/219/ . That should make it understand the triggered job got cancelled and cut the stacktrace. It is still going to mark the build as a failure (instead of ignoring the error), I think that is the same as https://issues.jenkins.io/browse/JENKINS-32278

My patch got merged AND released by upstream:

* f6b0b80 - (tag: 3.4.1-695.vf6b_0b_8053979) Upgrade version to 3.4.1 (4 days ago) <Daniel Heid>
*   5772c31 - Merge pull request #219 from hashar/handle-cancelled-steps (4 days ago) <Daniel Heid>
|\  
| *   84d1bd1 - Merge branch 'main' into handle-cancelled-steps (4 days ago) <Daniel Heid>

There is another code path I need to handle though: a job being cancelled while it is in the build queue. That raises a different exception.

When a build is running and get cancelled, its result is marked as ABORTED

Waiting for the completion of downstream-project
downstream-project #5 started.
downstream-project #5 completed. Result was ABORTED
Finished: SUCCESS

And the server logs:

2024-10-16 19:28:33.293+0000 [id=88]	INFO	hudson.model.Run#execute: downstream-project #5 aborted
java.lang.InterruptedException
	at java.base/java.lang.Object.wait(Native Method)

But if the build hasn't started and sit in the build queue and get cancelled from there:

Waiting for the completion of downstream-project
Received cancellation exception
Build step 'Execute scripts' changed build result to FAILURE
Build step 'Execute scripts' marked build as failure
Finished: FAILURE

The later is because the build steps raises a CancellationException which the parameterized-trigger plugin does not pass through it is build map to have it be ignored / never block the build. That, I think, bubbles up and cause the build step to be failing.

Instead both exception should be caught, InterruptedException for ABORTED and CancellationException for NOT_BUILT, which would be the actual result of the build step. The result thus must be passed through the BlockingBehaviour which would determine the actual result of the build or the build step. In our case we want to never block, eg ignore the result.

Short story, I have filed a pull request to Upstream : https://github.com/jenkinsci/parameterized-trigger-plugin/pull/400

I have updated the postbuild script on the CI Jenkins since my PR got merged and a release has been created.

For the Parameterized trigger plugin, I have refreshed https://github.com/jenkinsci/parameterized-trigger-plugin/pull/400 which failed due to some style (fixable by using mvn spotless:apply).

I have packaged it locally with a specific version using:

mvn -Dset.changelist -Dchangelist=$(git describe --tags)-pr400 -DskipTests clean package
$ unzip -p target/parameterized-trigger.hpi META-INF/MANIFEST.MF|grep Plugin-Version
Plugin-Version: 806.vf6fff3e28c3e-31-g9f7e52b-pr400

I will upload it to the CI Jenkins then restart it to have both plugins upgraded.

I have restarted the CI Jenkins

Parameterized Trigger806.vf6fff3e28c3e-31-g9f7e52b-pr400
PostBuildScript3.4.1-695.vf6b_0b_8053979

And if I did it right, the castor job being canceled or aborted would be entirely ignored by the parent build.

In order to verify, I went to search builds that had logs matching the patch I have made to the Parameterized trigger plugin:

find /srv/jenkins/builds -maxdepth 3 -type f -name log -exec egrep --color -H "(Build aborting: cancelling|Not built:)" {} \+

Not built

A build which was waiting for a triggered project which never started:

00:00:37.918 [PostBuildScript] - [INFO] Executing post build scripts.
00:00:37.920 Waiting for the completion of castor-save-workspace-cache
00:04:38.831 Not built: castor-save-workspace-cache has been cancelled while waiting in the queue.

That one got cancelled by receiving a CancellationException after four minutes. The option Fail this build step if the triggered build is worse than or equal to, which is set to never, is now properly managed and ignored.

Build aborting

The other build has:

00:01:45.403 Waiting for the completion of castor-save-workspace-cache
00:03:33.938 Build aborting: cancelling queued project castor-save-workspace-cache

That is the caller job being aborted by Zuul, the triggered job is aborted. Fail this build step if the triggered build is worse than or equal to which is set to never, ignores it. That is catching the InterruptedException which I guess should be rethrown, that is what I did for T282893.