Page MenuHomePhabricator

Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied"
Closed, ResolvedPublic

Description

https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/112436/console

21:09:00 [mwext-php72-phan-docker] $ /bin/bash -xe /tmp/jenkins3384731697099042303.sh
21:09:00 + set -eux
21:09:00 + mkdir -m 2777 -p log
21:09:00 mkdir: cannot create directory ‘log’: Permission denied
21:09:00 Build step 'Execute shell' marked build as failure

The jobs usually passing after resubmit. Have seen this on various jobs, this one is just an example.

This might be due to postbuild script not being executed properly when a job is cancelled T299995


Upstream https://github.com/jenkinsci/parameterized-trigger-plugin/pull/363/

Event Timeline

From that same run:

13:08:58 [mwext-php72-phan-docker] $ /bin/bash -xe /tmp/jenkins7676320888490307308.sh
13:08:58 + mkdir -m 2777 -p cache

^ worked without error. Bizarre.

hashar claimed this task.

It might be due to a job completing abruptly and not running the publisher step that wipe up the workspace. Roughly it runs a find . -delete as root. That cleaner did run when the job failed to create the log directory and thus the next build on the host worked. It is essentially self fixing.

The previous run of that job on that agent apparently did run the find -delete. So I don't know what might have happened :-\

I am not quite sure it will happen, seems to be just a one off error. But thank you to have reported it at least!

hashar reopened this task as Open.EditedMay 18 2021, 9:16 PM

I assumed it was just a one time issue when clearly it is not :(

I guess we can revisit how the directories are created on the host so that instead of mkdir -m 2777 {cache,log,src} we use a container that runs as nobody, ensure the directory are wiped and recreate them.

On integration-agent-docker-1002.integration.eqiad1.wikimedia.cloud there is a job workspace owned by root:root which would explain the failure to create a log directory:

drwxr-xr-x 3 root root 4096 May 25 07:11 /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2

It got generated as part of this build: https://integration.wikimedia.org/ci/job/quibble-composer-mysql-php74-noselenium-docker/6967/console

As jenkins-deploy user it mkdir -m 2777 -p cache. The build eventually fails due to:

07:11:37 [PostBuildScript] - [INFO] Executing post build scripts.
07:11:37 Waiting for the completion of castor-save-workspace-cache
07:11:37 [PostBuildScript] - [ERROR] An error occured during post-build processing.
07:11:37 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
07:11:37 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
07:11:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:803)
07:11:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:752)
07:11:37 	at hudson.model.Build$BuildExecution.post2(Build.java:177)
07:11:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:697)
07:11:37 	at hudson.model.Run.execute(Run.java:1932)
07:11:37 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
07:11:37 	at hudson.model.ResourceController.execute(ResourceController.java:97)
07:11:37 	at hudson.model.Executor.run(Executor.java:429)
07:11:37 Caused by: java.lang.InterruptedException
07:11:37 	at java.base/java.lang.Object.wait(Native Method)
07:11:37 	at java.base/java.lang.Object.wait(Object.java:328)
07:11:37 	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:79)
07:11:37 	at hudson.model.queue.FutureImpl.waitForStart(FutureImpl.java:68)
07:11:37 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:146)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
07:11:37 	... 13 more
07:11:37 Build step 'Execute scripts' changed build result to FAILURE
07:11:37 Build step 'Execute scripts' marked build as failure

No idea what java.lang.InterruptedException is about.

The job then runs:

07:11:38 Clearing /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2/cache
07:11:38 + exec docker run --volume /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=quibble-composer-mysql-php74-noselenium-docker --label jenkins.build=6967 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.2.1 clear

And finally:

rmdir /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2

Another build of the exact same job ran concurrently https://integration.wikimedia.org/ci/job/quibble-composer-mysql-php74-noselenium-docker/6965/ and got aborted by Zuul. That job ran in quibble-composer-mysql-php74-noselenium-docker. The post build script do:

07:11:45 exec docker run --volume /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=quibble-composer-mysql-php74-noselenium-docker --label jenkins.build=6965 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.2.1 clear

07:11:49 + docker ps -q --filter label=jenkins.job=quibble-composer-mysql-php74-noselenium-docker --filter label=jenkins.build=6965
07:11:49 + xargs --no-run-if-empty docker stop
07:11:49 + exec docker run --user=root --volume /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker:/workspace --entrypoint=/usr/bin/find --security-opt seccomp=unconfined --init --rm --label jenkins.job=quibble-composer-mysql-php74-noselenium-docker --label jenkins.build=6965 --env-file /dev/fd/63 docker-registry.wikimedia.org/wikimedia-buster:latest /workspace -mindepth 1 -delete

And finally:

07:11:52 + rmdir /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker
07:11:52 Finshed: ABORTED

I guess there is a race condition somewhere that causes the cache directory to be recreated by the Docker daemon somehow :-\

@hashar did you unassign because this is an upstream bug? Reassigning to move to Release-Engineering-Team (Radar) if that's the case.

Change 965199 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: ls WORKSPACE after it got wiped

https://gerrit.wikimedia.org/r/965199

Mentioned in SAL (#wikimedia-releng) [2023-10-13T09:56:07Z] <hashar> integration: updating Jenkins jobs to list potentially remaining files after clearing the WORKSPACE | https://gerrit.wikimedia.org/r/c/integration/config/+/965199 | T282893

Change 965199 merged by jenkins-bot:

[integration/config@master] jjb: ls WORKSPACE after it got wiped

https://gerrit.wikimedia.org/r/965199

Change 966220 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: ls WORKSPACE before creating dirs

https://gerrit.wikimedia.org/r/966220

Mentioned in SAL (#wikimedia-releng) [2023-10-16T15:14:58Z] <hashar> integration: updating Jenkins jobs to list remaining files in WORKSPACE before creating cache/log/src directories https://gerrit.wikimedia.org/r/c/integration/config/+/966220 # T282893

Change 966220 merged by jenkins-bot:

[integration/config@master] jjb: ls WORKSPACE before creating dirs

https://gerrit.wikimedia.org/r/966220

Some notes today:

Seems like the workspace directory is owned by root at the start of the job.

This happened on https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47923/console

The first step in the build is:

ls -laF --color=always
mkdir -m 2777 -p "cache"

Which resulted in this:

14:26:20 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins13039068635877563154.sh
14:26:20 + ls -laF --color=always
14:26:20 total 12
14:26:20 drwxr-xr-x  3 root           root    4096 Nov 20 19:34 ./
14:26:20 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 21:19 ../
14:26:20 drwxr-xr-x  2 root           root    4096 Nov 20 19:34 cache/
14:26:20 + mkdir -m 2777 -p cache

So the build didn't fail, because the mkdir -p noticed the directory existed and went on with it's life.

But the next step fails:

21:26:22 + ls -laF --color=always
21:26:22 total 12
21:26:22 drwxr-xr-x  3 root           root    4096 Nov 20 19:34 ./
21:26:22 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 21:19 ../
21:26:22 drwxr-xr-x  2 root           root    4096 Nov 20 19:34 cache/
21:26:22 + mkdir -m 2777 -p log
21:26:22 mkdir: cannot create directory ‘log’: Permission denied

The question is: how did a workspace end up owned by root?

The one before that was a success https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47900/console and everything was empty in the workspace at that time:

16:11:02 Listing potentially remaining files in workspace for T282893
16:11:02 + ls -laF --color=always
16:11:02 total 8
16:11:02 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 16:11 ./
16:11:02 drwxrwxr-x 47 jenkins-deploy wikidev 4096 Nov 20 16:10 ../

Then the next build of that job on the same instance was https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47917/ which had an empty workspace (same state at the end of previous build as expected):

19:31:34 + ls -laF --color=always
19:31:35 total 8
19:31:35 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:31 ./
19:31:35 drwxrwxr-x 46 jenkins-deploy wikidev 4096 Nov 20 19:31 ../

But that build got CANCELED by Zuul apparently when it was processing the postbuild scripts. The end of the build console starting with the last log output by Quibble

19:33:51 INFO:backend.MySQL:Terminating MySQL
19:33:53 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins197830715735037777.sh
19:33:53 + set -eu
19:33:53 + ERROR_FILES=(mw-dberror.log mw-error.log)
19:33:53 + echo 'Asserting empty files: mw-dberror.log mw-error.log'
19:33:53 Asserting empty files: mw-dberror.log mw-error.log
19:33:53 ++ cd /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/log
19:33:53 ++ ls mw-dberror.log mw-error.log
19:33:53 ++ :
19:33:53 + log_files=
19:33:53 + '[' '!' '' ']'
19:33:53 + echo 'No error files created by MediaWiki. GOOD'
19:33:53 No error files created by MediaWiki. GOOD
19:33:53 + exit 0
19:33:53 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:53 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins17618125566903362774.sh
19:33:53 + set -o pipefail
19:33:53 ++ pwd
19:33:53 + exec docker run --entrypoint=/bin/rm --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/log:/log --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/buster:latest -fR /log/rawSeleniumVideoGrabs
19:33:53 ++ /usr/bin/env
19:33:53 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:33:54 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:54 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins1508151838611447409.sh
19:33:54 + find log/ -name 'mw-debug-*.log' -exec gzip '{}' +
PostBuildScript

Hide Details
19:33:54 Archiving artifacts
19:33:55 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:55 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -eu /tmp/jenkins14314850712442410323.sh
19:33:55 + chmod 2777 src
19:33:55 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins17614952840139516995.sh
19:33:55 + set -o pipefail
19:33:55 ++ pwd
19:33:55 ++ pwd
19:33:55 ++ pwd
19:33:55 + exec docker run --entrypoint=bash --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/src:/src --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/log:/log --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/quibble-buster-php74:1.5.6-s1 -c '([[ "$ZUUL_PIPELINE" =~ ^(gate-and-submit|postmerge) ]] && npm cache verify) || :'
19:33:55 ++ /usr/bin/env
19:33:55 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:33:59 Cache verified and compressed (/cache/npm/_cacache)
19:33:59 Content verified: 900 (24983649 bytes)
19:33:59 Content garbage-collected: 3 (6735 bytes)
19:33:59 Index entries: 900
19:33:59 Finished in 3.009s
19:33:59 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:59 Waiting for the completion of castor-save-workspace-cache
19:34:37 [PostBuildScript] - [ERROR] An error occured during post-build processing.
19:34:37 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
19:34:37 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
19:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
19:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
19:34:37 	at hudson.model.Build$BuildExecution.post2(Build.java:179)
19:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
19:34:37 	at hudson.model.Run.execute(Run.java:1925)
19:34:37 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
19:34:37 	at hudson.model.ResourceController.execute(ResourceController.java:101)
19:34:37 	at hudson.model.Executor.run(Executor.java:442)
19:34:37 Caused by: java.lang.InterruptedException
19:34:37 	at java.base/java.lang.Object.wait(Native Method)
19:34:37 	at java.base/java.lang.Object.wait(Object.java:328)
19:34:37 	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:78)
19:34:37 	at hudson.model.queue.FutureImpl.waitForStart(FutureImpl.java:70)
19:34:37 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:151)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
19:34:37 	... 13 more
19:34:37 Build step 'Execute scripts' changed build result to FAILURE
19:34:37 Build step 'Execute scripts' marked build as failure
19:34:37 [PostBuildScript] - [INFO] Executing post build scripts.
19:34:37 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins9623261731729621052.sh
19:34:37 + echo 'Clearing /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache'
19:34:37 Clearing /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache
19:34:37 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins10462326678330743819.sh
19:34:37 + set -o pipefail
19:34:37 ++ pwd
19:34:37 + exec docker run --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.3.0 clear
19:34:37 ++ /usr/bin/env
19:34:37 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:34:39 [PostBuildScript] - [INFO] Executing post build scripts.
19:34:39 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins14588386181888260330.sh
19:34:39 + set -euxo pipefail
19:34:39 + docker ps -q --filter label=jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --filter label=jenkins.build=47917
19:34:39 + xargs --no-run-if-empty docker stop
19:34:39 [PostBuildScript] - [INFO] Executing post build scripts.
19:34:39 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins11250983883235539153.sh
19:34:39 + set -o pipefail
19:34:39 + exec docker run --entrypoint=/usr/bin/find --user=root --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker:/workspace --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/buster:latest /workspace -mindepth 1 -delete
19:34:39 ++ /usr/bin/env
19:34:39 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:34:41 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins6529230831931776331.sh
19:34:41 + echo 'Listing potentially remaining files in workspace for T282893'
19:34:41 Listing potentially remaining files in workspace for T282893
19:34:41 + ls -laF --color=always
19:34:41 total 8
19:34:41 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:34 ./
19:34:41 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 19:33 ../
19:34:42 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins10906938629317061761.sh
19:34:42 + set -u
19:34:42 + rmdir /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker
19:34:42 Finished: FAILURE

But the debug ls -l states there is nothing left in the workspace:

19:34:41 Listing potentially remaining files in workspace for T282893
19:34:41 + ls -laF --color=always
19:34:41 total 8
19:34:41 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:34 ./
19:34:41 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 19:33 ../

The next build of that job on that instance is then https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47923/console which has the workspace erroneously owned by root:root as well as a "new" $WORKSPACE/cache which was not showing up in the last output of the previous build:

21:26:20 Started by user unknown or anonymous
21:26:20 Running as SYSTEM
21:26:20 Building remotely on integration-agent-docker-1047 (pipelinelib Docker blubber) in workspace /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker
21:26:20 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins13039068635877563154.sh
21:26:20 + ls -laF --color=always
21:26:20 total 12
21:26:20 drwxr-xr-x  3 root           root    4096 Nov 20 19:34 ./
21:26:20 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 21:19 ../
21:26:20 drwxr-xr-x  2 root           root    4096 Nov 20 19:34 cache/
21:26:20 + mkdir -m 2777 -p cache

So my fringe theory holds: even though tentatively everything got erased in build 47917 there is something holding or recreating the cache directory. Possibly the castor-save-workspace-cache (since the build ran from gate-and-submit it does executes). Alternatively one the container takes take time to be reaped off, possibly cause the SIGTERM is ignored/shallowed, Docker then wait 10 seconds before sending a SIGKILL and would maybe magically recreate the cache directory with the default root:root.

Maybe Docker has some verbose logs to keep track of what it is doing, possibly by setting:

/etc/docker/daemon.json
  {
    "log-driver": "none"
+   "debug": true
  }

Timestamps are interesting here. This job happened at 2023-11-20T21:26:20.128000 and the cache directory shows: Nov 20 19:34.

OK, what about this job in general around that time?

sqlite> select timestamp, builtOn, job_name, job_id, result from jobs where job_name = 'mediawiki-quibble-apitests-vendor-php74-docker' and timestamp like '2023-11-20T19%';
2023-11-20T19:35:02.295000|integration-agent-docker-1049|mediawiki-quibble-apitests-vendor-php74-docker|47918|SUCCESS
2023-11-20T19:31:34.874000|integration-agent-docker-1047|mediawiki-quibble-apitests-vendor-php74-docker|47917|FAILURE
2023-11-20T19:07:47.439000|integration-agent-docker-1048|mediawiki-quibble-apitests-vendor-php74-docker|47916|SUCCESS

Bingo: one job happened on that node at 19:31 and was a failure. That looks promising as the culprit: https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47917/console

That job created the cache directory at 19:31 (timestamps below are my local time where 12 == 19):

12:31:37 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins2237295320681026535.sh
12:31:37 + set -eux
12:31:37 + ls -laF --color=always
12:31:37 total 12
12:31:37 drwxr-xr-x  3 jenkins-deploy wikidev 4096 Nov 20 19:31 ./
12:31:37 drwxrwxr-x 46 jenkins-deploy wikidev 4096 Nov 20 19:31 ../
12:31:37 drwxrwsrwx  4 jenkins-deploy wikidev 4096 Nov 20 19:31 cache/

And then it failed because there was a java exception waiting on castor-workspace-save triggered at 19:33:

12:33:59 Waiting for the completion of castor-save-workspace-cache
12:34:37 [PostBuildScript] - [ERROR] An error occured during post-build processing.
12:34:37 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
12:34:37 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
12:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
12:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
12:34:37 	at hudson.model.Build$BuildExecution.post2(Build.java:179)
12:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
12:34:37 	at hudson.model.Run.execute(Run.java:1925)
12:34:37 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
12:34:37 	at hudson.model.ResourceController.execute(ResourceController.java:101)
12:34:37 	at hudson.model.Executor.run(Executor.java:442)
12:34:37 Caused by: java.lang.InterruptedException
12:34:37 	at java.base/java.lang.Object.wait(Native Method)
12:34:37 	at java.base/java.lang.Object.wait(Object.java:328)
12:34:37 	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:78)
12:34:37 	at hudson.model.queue.FutureImpl.waitForStart(FutureImpl.java:70)
12:34:37 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:151)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
12:34:37 	... 13 more

It says the workdir is empty at the end of the build:

12:34:41 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins6529230831931776331.sh
12:34:41 + echo 'Listing potentially remaining files in workspace for T282893'
12:34:41 Listing potentially remaining files in workspace for T282893
12:34:41 + ls -laF --color=always
12:34:41 total 8
12:34:41 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:34 ./
12:34:41 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 19:33 ../

This is the same exception that @hashar noted in T282893#7111804

So there's something happening there with that exception during waiting for the completion of castor-save-workspace-cache. Potentially one of the other jobs in flight around that time?

I note they all succeeded:

sqlite> select timestamp, builtOn, job_name, job_id, result from jobs where job_name = 'castor-save-workspace-cache' and timestamp like '2023-11-20T19:33%';
2023-11-20T19:33:56.470000|integration-castor05|castor-save-workspace-cache|4112336|SUCCESS
2023-11-20T19:33:41.069000|integration-castor05|castor-save-workspace-cache|4112335|SUCCESS
2023-11-20T19:33:37.688000|integration-castor05|castor-save-workspace-cache|4112334|SUCCESS
2023-11-20T19:33:28.150000|integration-castor05|castor-save-workspace-cache|4112333|SUCCESS
2023-11-20T19:33:25.050000|integration-castor05|castor-save-workspace-cache|4112332|SUCCESS
2023-11-20T19:33:20.262000|integration-castor05|castor-save-workspace-cache|4112331|SUCCESS
2023-11-20T19:33:17.372000|integration-castor05|castor-save-workspace-cache|4112330|SUCCESS
2023-11-20T19:33:13.098000|integration-castor05|castor-save-workspace-cache|4112329|SUCCESS
2023-11-20T19:33:09.321000|integration-castor05|castor-save-workspace-cache|4112328|SUCCESS
2023-11-20T19:33:03.633000|integration-castor05|castor-save-workspace-cache|4112327|SUCCESS
sqlite> select timestamp, builtOn, job_name, job_id, result from jobs where job_name = 'castor-save-workspace-cache' and timestamp like '2023-11-20T19:34%';
2023-11-20T19:34:57.731000|integration-castor05|castor-save-workspace-cache|4112343|SUCCESS
2023-11-20T19:34:54.726000|integration-castor05|castor-save-workspace-cache|4112342|SUCCESS
2023-11-20T19:34:51.608000|integration-castor05|castor-save-workspace-cache|4112341|SUCCESS
2023-11-20T19:34:48.570000|integration-castor05|castor-save-workspace-cache|4112340|SUCCESS
2023-11-20T19:34:45.167000|integration-castor05|castor-save-workspace-cache|4112339|SUCCESS
2023-11-20T19:34:41.603000|integration-castor05|castor-save-workspace-cache|4112338|SUCCESS
2023-11-20T19:34:35.846000|integration-castor05|castor-save-workspace-cache|4112337|SUCCESS

Here's the castor job that 47917 lost track of: https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/4112339/console

It runs at 19:34:46—5 seconds after 47917 is finished. So the workspace's cache directory has been gone for 5 seconds.

Then castor-save-workspace-cache runs rsync. For the --rsync-path on the remote, it uses docker. So it's running docker on integration-agent-docker-1047 with the workspace's cache directory mounted as a volume, but that path no longer exists on integration-agent-docker-1047.

So docker is creating that directory. As root. Demo:

$ whoami
thcipriani
$ ls | grep -c this-dir-doesnt-exist
0
$ docker run --rm -it --user $(id -u):$(id -g) --volume $(pwd)/this-dir-doesnt-exist:/test --entrypoint /bin/true debian:bookworm
$ ls -l | grep this-dir-doesnt-exist
drwxr-xr-x 2 root       root          4096 Nov 20 17:04 this-dir-doesnt-exist

From last night the sequence is:

https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47917/console got cancelled

19:33:59 postbuild step triggers the castor-save-workspace-cache job
19:34:37 the build step get a org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException

The cache directory get erased 19:34:37 Clearing /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache

19:34:41 the ls -l shows everything is empty
19:34:42 Finished: FAILURE

Meanwhile the castor-save-workspace job is still in the queue and looking at its build description:

45 sec waiting;
3.3 sec build duration;
48 sec total from scheduled to completion.

The job rsync from whatever instance ran the build to fetch the caches and store them on the integration-castor05 instance with a trick which is that it is NOT rsync being run on the remote side but:

docker run --rm -i --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --entrypoint=/usr/bin/rsync ...

Which started at 19:34:45 (after the upstream project has completed) and completed at 19:34:48:

19:34:45 Started by upstream project "mediawiki-quibble-apitests-vendor-php74-docker" build number 47917
19:34:45 originally caused by:
19:34:45  Started by user unknown or anonymous
19:34:45 Running as SYSTEM
19:34:45 Building remotely on integration-castor05 (castor) in workspace /srv/jenkins/workspace/castor-save-workspace-cache
19:34:45 [ssh-agent] Looking for ssh-agent implementation...
19:34:45 [ssh-agent]   Exec ssh-agent (binary ssh-agent on a remote machine)
19:34:45 $ ssh-agent
19:34:45 SSH_AUTH_SOCK=/tmp/ssh-KqAj54SYMydS/agent.185518
19:34:45 SSH_AGENT_PID=185520
19:34:45 [ssh-agent] Started.
19:34:45 Running ssh-add (command line suppressed)
19:34:45 Identity added: /srv/jenkins/workspace/castor-save-workspace-cache@tmp/private_key_13591602856893160767.key (/srv/jenkins/workspace/castor-save-workspace-cache@tmp/private_key_13591602856893160767.key)
19:34:45 [ssh-agent] Using credentials jenkins-deploy (key to connect to labs instances set up with role::ci::slave::labs::common)
19:34:46 [castor-save-workspace-cache] $ /bin/bash -xe /tmp/jenkins227135908481326102.sh
19:34:46 + set -eu +x
19:34:46 Defined: CASTOR_NAMESPACE="mediawiki-core/master/mediawiki-quibble-apitests-vendor-php74-docker"
19:34:46 Creating directory holding cache:
19:34:46 Syncing cache
19:34:46 From.. 172.16.3.56:/srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache
19:34:46 To.... /srv/castor/mediawiki-core/master/mediawiki-quibble-apitests-vendor-php74-docker
19:34:46 + rsync --archive '--rsh=/usr/bin/ssh -a -T -o ConnectTimeout=6 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' '--rsync-path=docker run --rm -i --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --entrypoint=/usr/bin/rsync docker-registry.wikimedia.org/releng/castor:0.3.0' --delete-delay --delay-updates jenkins-deploy@172.16.3.56:/cache/ /srv/castor/mediawiki-core/master/mediawiki-quibble-apitests-vendor-php74-docker
19:34:46 Warning: Permanently added '172.16.3.56' (ECDSA) to the list of known hosts.
19:34:48 + echo -e '\nDone'
19:34:48 
19:34:48 Done
19:34:48 $ ssh-agent -k
19:34:48 unset SSH_AUTH_SOCK;
19:34:48 unset SSH_AGENT_PID;
19:34:48 echo Agent pid 185520 killed;
19:34:48 [ssh-agent] Stopped.
19:34:48 Finished: SUCCESS

Given it ran after the build has completed and the workspace got erased, docker unhelpfully creates a cache directory owned by root:root which is NOT garbage collected since the upstream build has already completed.

--rsync-path is evaluated by sh so we could add a shortcircuit to skip the rsync with something such as --rsync-path='[ -d path/to/cache ] && docker rsync....

The downstream project should be canceled which is most probably a bug in Postbuildscript Jenkins plugin or in the Parametrized trigger Jenkins plugin.

I think the issue comes from Parameterized trigger commit 6d4461c79. In TriggerBuilder.perform():

                        for (QueueTaskFuture<AbstractBuild> future : futures.get(p)) {
                            try {
                                    Run startedRun = future.waitForStart();                
...
                            } catch (CancellationException x) {                        
                                throw new AbortException(p.getFullDisplayName() +" aborted.");                                                                                                
                            }

It thus has logic to catch a CancellationException but does not catch InterruptedException which is thrown by waitForStart(). From Jenkins core:

core/src/main/java/hudson/model/queue/QueueTaskFuture.java
public interface QueueTaskFuture<R extends Executable> extends Future<R> {

    /**
     * Short for {@code getStartCondition().get()}
     */
    R waitForStart() throws InterruptedException, ExecutionException;

Thus I guess the plugin should wrap waitForStart() to catch InterruptedException then cancel the build:

core/src/main/java/hudson/model/queue/FutureImpl.java
public final class FutureImpl extends AsyncFutureImpl<Executable> implements QueueTaskFuture<Executable> {
    @Override
    public Executable waitForStart() throws InterruptedException, ExecutionException {
        return getStartCondition().get();
    }

    @Override
    public boolean cancel(boolean mayInterruptIfRunning) {
        Queue q = Jenkins.get().getQueue();
        synchronized (this) {
            if (!executors.isEmpty()) {
                if (mayInterruptIfRunning)
                    for (Executor e : executors)
                        e.interrupt();
                return mayInterruptIfRunning;
            }
            return q.cancel(task);
        }
    }

I have forked the plugin and pushed a dumb commit in at https://github.com/hashar/parameterized-trigger-plugin/tree/cancel-waiting-builds

https://github.com/jenkinsci/parameterized-trigger-plugin/compare/master...hashar:parameterized-trigger-plugin:cancel-waiting-builds#diff-1ee1a56287e0d391189d085cdcf4dd5ddc9a1c6d45856622c444b68aab50688a

src/main/java/hudson/plugins/parameterizedtrigger/TriggerBuilder.java
                                 if (future != null ) {
                                     listener.getLogger().println("Waiting for the completion of " + HyperlinkNote.encodeTo('/'+ p.getUrl(), p.getFullDisplayName()));
-                                    Run startedRun = future.waitForStart();
+                                    Run startedRun;
+                                    try {
+                                        startedRun = future.waitForStart();
+                                    } catch (InterruptedException x) {
+                                        future.cancel(true);
+                                        continue;
+                                    }
                                     listener.getLogger().println(HyperlinkNote.encodeTo('/' + startedRun.getUrl(), startedRun.getFullDisplayName()) + " started.");

I have tried to write a test to cover it but eventually gave up since well.. I don't know much about Java or Jenkins internals.

I guess I can at least catch the exception, log it and raise it again. Regardless it might be helpful to update the plugin a bit in production.

I have send a dummy patch as https://github.com/jenkinsci/parameterized-trigger-plugin/pull/363

which can be build with eg:

mvn clean package  -Drevision=2.44 -Dchangelist=-1-g$(git rev-parse --short HEAD)-T282893 \
  && (unzip -p target/parameterized-trigger.jar META-INF/MANIFEST.MF | egrep ^Plugin-Version)
...
Plugin-Version: 2.44-1-g76a5de9-T282893

~~Uploaded at https://people.wikimedia.org/~hashar/T282893/parameterized-trigger-2.44-1-g76a5de9-T282893.hpi~~

That is bugged, the canceled build ends up being flagged as a SUCCESS ;)

Mentioned in SAL (#wikimedia-operations) [2023-11-23T06:23:57Z] <hashar> Restarting CI Jenkins for plugin update # T282893

My pull request has been merged and upstream switched on continuous delivery so we get a new release automagically (787.v665fcf2a_830b_).

I went to apply the update on https://integration.wikimedia.org/ci/ which also upgraded:

  • Caffeine API
  • Script Security
  • Matrix Project
  • Conditional BuildStep
  • Parameterized Trigger

Tentatively marking it as resolved