Page MenuHomePhabricator

Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied"
Open, MediumPublic

Description

https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/112436/console

21:09:00 [mwext-php72-phan-docker] $ /bin/bash -xe /tmp/jenkins3384731697099042303.sh
21:09:00 + set -eux
21:09:00 + mkdir -m 2777 -p log
21:09:00 mkdir: cannot create directory ‘log’: Permission denied
21:09:00 Build step 'Execute shell' marked build as failure

The jobs usually passing after resubmit. Have seen this on various jobs, this one is just an example.

This might be due to postbuild script not being executed properly when a job is cancelled T299995


Upstream https://github.com/jenkinsci/parameterized-trigger-plugin/pull/363/

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 965199 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: ls WORKSPACE after it got wiped

https://gerrit.wikimedia.org/r/965199

Mentioned in SAL (#wikimedia-releng) [2023-10-13T09:56:07Z] <hashar> integration: updating Jenkins jobs to list potentially remaining files after clearing the WORKSPACE | https://gerrit.wikimedia.org/r/c/integration/config/+/965199 | T282893

Change 965199 merged by jenkins-bot:

[integration/config@master] jjb: ls WORKSPACE after it got wiped

https://gerrit.wikimedia.org/r/965199

Change 966220 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: ls WORKSPACE before creating dirs

https://gerrit.wikimedia.org/r/966220

Mentioned in SAL (#wikimedia-releng) [2023-10-16T15:14:58Z] <hashar> integration: updating Jenkins jobs to list remaining files in WORKSPACE before creating cache/log/src directories https://gerrit.wikimedia.org/r/c/integration/config/+/966220 # T282893

Change 966220 merged by jenkins-bot:

[integration/config@master] jjb: ls WORKSPACE before creating dirs

https://gerrit.wikimedia.org/r/966220

Some notes today:

Seems like the workspace directory is owned by root at the start of the job.

This happened on https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47923/console

The first step in the build is:

ls -laF --color=always
mkdir -m 2777 -p "cache"

Which resulted in this:

14:26:20 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins13039068635877563154.sh
14:26:20 + ls -laF --color=always
14:26:20 total 12
14:26:20 drwxr-xr-x  3 root           root    4096 Nov 20 19:34 ./
14:26:20 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 21:19 ../
14:26:20 drwxr-xr-x  2 root           root    4096 Nov 20 19:34 cache/
14:26:20 + mkdir -m 2777 -p cache

So the build didn't fail, because the mkdir -p noticed the directory existed and went on with it's life.

But the next step fails:

21:26:22 + ls -laF --color=always
21:26:22 total 12
21:26:22 drwxr-xr-x  3 root           root    4096 Nov 20 19:34 ./
21:26:22 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 21:19 ../
21:26:22 drwxr-xr-x  2 root           root    4096 Nov 20 19:34 cache/
21:26:22 + mkdir -m 2777 -p log
21:26:22 mkdir: cannot create directory ‘log’: Permission denied

The question is: how did a workspace end up owned by root?

The one before that was a success https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47900/console and everything was empty in the workspace at that time:

16:11:02 Listing potentially remaining files in workspace for T282893
16:11:02 + ls -laF --color=always
16:11:02 total 8
16:11:02 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 16:11 ./
16:11:02 drwxrwxr-x 47 jenkins-deploy wikidev 4096 Nov 20 16:10 ../

Then the next build of that job on the same instance was https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47917/ which had an empty workspace (same state at the end of previous build as expected):

19:31:34 + ls -laF --color=always
19:31:35 total 8
19:31:35 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:31 ./
19:31:35 drwxrwxr-x 46 jenkins-deploy wikidev 4096 Nov 20 19:31 ../

But that build got CANCELED by Zuul apparently when it was processing the postbuild scripts. The end of the build console starting with the last log output by Quibble

19:33:51 INFO:backend.MySQL:Terminating MySQL
19:33:53 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins197830715735037777.sh
19:33:53 + set -eu
19:33:53 + ERROR_FILES=(mw-dberror.log mw-error.log)
19:33:53 + echo 'Asserting empty files: mw-dberror.log mw-error.log'
19:33:53 Asserting empty files: mw-dberror.log mw-error.log
19:33:53 ++ cd /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/log
19:33:53 ++ ls mw-dberror.log mw-error.log
19:33:53 ++ :
19:33:53 + log_files=
19:33:53 + '[' '!' '' ']'
19:33:53 + echo 'No error files created by MediaWiki. GOOD'
19:33:53 No error files created by MediaWiki. GOOD
19:33:53 + exit 0
19:33:53 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:53 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins17618125566903362774.sh
19:33:53 + set -o pipefail
19:33:53 ++ pwd
19:33:53 + exec docker run --entrypoint=/bin/rm --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/log:/log --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/buster:latest -fR /log/rawSeleniumVideoGrabs
19:33:53 ++ /usr/bin/env
19:33:53 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:33:54 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:54 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins1508151838611447409.sh
19:33:54 + find log/ -name 'mw-debug-*.log' -exec gzip '{}' +
PostBuildScript

Hide Details
19:33:54 Archiving artifacts
19:33:55 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:55 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -eu /tmp/jenkins14314850712442410323.sh
19:33:55 + chmod 2777 src
19:33:55 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins17614952840139516995.sh
19:33:55 + set -o pipefail
19:33:55 ++ pwd
19:33:55 ++ pwd
19:33:55 ++ pwd
19:33:55 + exec docker run --entrypoint=bash --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/src:/src --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/log:/log --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/quibble-buster-php74:1.5.6-s1 -c '([[ "$ZUUL_PIPELINE" =~ ^(gate-and-submit|postmerge) ]] && npm cache verify) || :'
19:33:55 ++ /usr/bin/env
19:33:55 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:33:59 Cache verified and compressed (/cache/npm/_cacache)
19:33:59 Content verified: 900 (24983649 bytes)
19:33:59 Content garbage-collected: 3 (6735 bytes)
19:33:59 Index entries: 900
19:33:59 Finished in 3.009s
19:33:59 [PostBuildScript] - [INFO] Executing post build scripts.
19:33:59 Waiting for the completion of castor-save-workspace-cache
19:34:37 [PostBuildScript] - [ERROR] An error occured during post-build processing.
19:34:37 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
19:34:37 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
19:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
19:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
19:34:37 	at hudson.model.Build$BuildExecution.post2(Build.java:179)
19:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
19:34:37 	at hudson.model.Run.execute(Run.java:1925)
19:34:37 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
19:34:37 	at hudson.model.ResourceController.execute(ResourceController.java:101)
19:34:37 	at hudson.model.Executor.run(Executor.java:442)
19:34:37 Caused by: java.lang.InterruptedException
19:34:37 	at java.base/java.lang.Object.wait(Native Method)
19:34:37 	at java.base/java.lang.Object.wait(Object.java:328)
19:34:37 	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:78)
19:34:37 	at hudson.model.queue.FutureImpl.waitForStart(FutureImpl.java:70)
19:34:37 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:151)
19:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
19:34:37 	... 13 more
19:34:37 Build step 'Execute scripts' changed build result to FAILURE
19:34:37 Build step 'Execute scripts' marked build as failure
19:34:37 [PostBuildScript] - [INFO] Executing post build scripts.
19:34:37 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins9623261731729621052.sh
19:34:37 + echo 'Clearing /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache'
19:34:37 Clearing /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache
19:34:37 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins10462326678330743819.sh
19:34:37 + set -o pipefail
19:34:37 ++ pwd
19:34:37 + exec docker run --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.3.0 clear
19:34:37 ++ /usr/bin/env
19:34:37 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:34:39 [PostBuildScript] - [INFO] Executing post build scripts.
19:34:39 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins14588386181888260330.sh
19:34:39 + set -euxo pipefail
19:34:39 + docker ps -q --filter label=jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --filter label=jenkins.build=47917
19:34:39 + xargs --no-run-if-empty docker stop
19:34:39 [PostBuildScript] - [INFO] Executing post build scripts.
19:34:39 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash /tmp/jenkins11250983883235539153.sh
19:34:39 + set -o pipefail
19:34:39 + exec docker run --entrypoint=/usr/bin/find --user=root --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker:/workspace --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-quibble-apitests-vendor-php74-docker --label jenkins.build=47917 --env-file /dev/fd/63 docker-registry.wikimedia.org/buster:latest /workspace -mindepth 1 -delete
19:34:39 ++ /usr/bin/env
19:34:39 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
19:34:41 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins6529230831931776331.sh
19:34:41 + echo 'Listing potentially remaining files in workspace for T282893'
19:34:41 Listing potentially remaining files in workspace for T282893
19:34:41 + ls -laF --color=always
19:34:41 total 8
19:34:41 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:34 ./
19:34:41 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 19:33 ../
19:34:42 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins10906938629317061761.sh
19:34:42 + set -u
19:34:42 + rmdir /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker
19:34:42 Finished: FAILURE

But the debug ls -l states there is nothing left in the workspace:

19:34:41 Listing potentially remaining files in workspace for T282893
19:34:41 + ls -laF --color=always
19:34:41 total 8
19:34:41 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:34 ./
19:34:41 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 19:33 ../

The next build of that job on that instance is then https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47923/console which has the workspace erroneously owned by root:root as well as a "new" $WORKSPACE/cache which was not showing up in the last output of the previous build:

21:26:20 Started by user unknown or anonymous
21:26:20 Running as SYSTEM
21:26:20 Building remotely on integration-agent-docker-1047 (pipelinelib Docker blubber) in workspace /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker
21:26:20 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins13039068635877563154.sh
21:26:20 + ls -laF --color=always
21:26:20 total 12
21:26:20 drwxr-xr-x  3 root           root    4096 Nov 20 19:34 ./
21:26:20 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 21:19 ../
21:26:20 drwxr-xr-x  2 root           root    4096 Nov 20 19:34 cache/
21:26:20 + mkdir -m 2777 -p cache

So my fringe theory holds: even though tentatively everything got erased in build 47917 there is something holding or recreating the cache directory. Possibly the castor-save-workspace-cache (since the build ran from gate-and-submit it does executes). Alternatively one the container takes take time to be reaped off, possibly cause the SIGTERM is ignored/shallowed, Docker then wait 10 seconds before sending a SIGKILL and would maybe magically recreate the cache directory with the default root:root.

Maybe Docker has some verbose logs to keep track of what it is doing, possibly by setting:

/etc/docker/daemon.json
  {
    "log-driver": "none"
+   "debug": true
  }

Timestamps are interesting here. This job happened at 2023-11-20T21:26:20.128000 and the cache directory shows: Nov 20 19:34.

OK, what about this job in general around that time?

sqlite> select timestamp, builtOn, job_name, job_id, result from jobs where job_name = 'mediawiki-quibble-apitests-vendor-php74-docker' and timestamp like '2023-11-20T19%';
2023-11-20T19:35:02.295000|integration-agent-docker-1049|mediawiki-quibble-apitests-vendor-php74-docker|47918|SUCCESS
2023-11-20T19:31:34.874000|integration-agent-docker-1047|mediawiki-quibble-apitests-vendor-php74-docker|47917|FAILURE
2023-11-20T19:07:47.439000|integration-agent-docker-1048|mediawiki-quibble-apitests-vendor-php74-docker|47916|SUCCESS

Bingo: one job happened on that node at 19:31 and was a failure. That looks promising as the culprit: https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47917/console

That job created the cache directory at 19:31 (timestamps below are my local time where 12 == 19):

12:31:37 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins2237295320681026535.sh
12:31:37 + set -eux
12:31:37 + ls -laF --color=always
12:31:37 total 12
12:31:37 drwxr-xr-x  3 jenkins-deploy wikidev 4096 Nov 20 19:31 ./
12:31:37 drwxrwxr-x 46 jenkins-deploy wikidev 4096 Nov 20 19:31 ../
12:31:37 drwxrwsrwx  4 jenkins-deploy wikidev 4096 Nov 20 19:31 cache/

And then it failed because there was a java exception waiting on castor-workspace-save triggered at 19:33:

12:33:59 Waiting for the completion of castor-save-workspace-cache
12:34:37 [PostBuildScript] - [ERROR] An error occured during post-build processing.
12:34:37 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
12:34:37 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
12:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
12:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
12:34:37 	at hudson.model.Build$BuildExecution.post2(Build.java:179)
12:34:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
12:34:37 	at hudson.model.Run.execute(Run.java:1925)
12:34:37 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
12:34:37 	at hudson.model.ResourceController.execute(ResourceController.java:101)
12:34:37 	at hudson.model.Executor.run(Executor.java:442)
12:34:37 Caused by: java.lang.InterruptedException
12:34:37 	at java.base/java.lang.Object.wait(Native Method)
12:34:37 	at java.base/java.lang.Object.wait(Object.java:328)
12:34:37 	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:78)
12:34:37 	at hudson.model.queue.FutureImpl.waitForStart(FutureImpl.java:70)
12:34:37 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:151)
12:34:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
12:34:37 	... 13 more

It says the workdir is empty at the end of the build:

12:34:41 [mediawiki-quibble-apitests-vendor-php74-docker] $ /bin/bash -xe /tmp/jenkins6529230831931776331.sh
12:34:41 + echo 'Listing potentially remaining files in workspace for T282893'
12:34:41 Listing potentially remaining files in workspace for T282893
12:34:41 + ls -laF --color=always
12:34:41 total 8
12:34:41 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Nov 20 19:34 ./
12:34:41 drwxrwxr-x 45 jenkins-deploy wikidev 4096 Nov 20 19:33 ../

This is the same exception that @hashar noted in T282893#7111804

So there's something happening there with that exception during waiting for the completion of castor-save-workspace-cache. Potentially one of the other jobs in flight around that time?

I note they all succeeded:

sqlite> select timestamp, builtOn, job_name, job_id, result from jobs where job_name = 'castor-save-workspace-cache' and timestamp like '2023-11-20T19:33%';
2023-11-20T19:33:56.470000|integration-castor05|castor-save-workspace-cache|4112336|SUCCESS
2023-11-20T19:33:41.069000|integration-castor05|castor-save-workspace-cache|4112335|SUCCESS
2023-11-20T19:33:37.688000|integration-castor05|castor-save-workspace-cache|4112334|SUCCESS
2023-11-20T19:33:28.150000|integration-castor05|castor-save-workspace-cache|4112333|SUCCESS
2023-11-20T19:33:25.050000|integration-castor05|castor-save-workspace-cache|4112332|SUCCESS
2023-11-20T19:33:20.262000|integration-castor05|castor-save-workspace-cache|4112331|SUCCESS
2023-11-20T19:33:17.372000|integration-castor05|castor-save-workspace-cache|4112330|SUCCESS
2023-11-20T19:33:13.098000|integration-castor05|castor-save-workspace-cache|4112329|SUCCESS
2023-11-20T19:33:09.321000|integration-castor05|castor-save-workspace-cache|4112328|SUCCESS
2023-11-20T19:33:03.633000|integration-castor05|castor-save-workspace-cache|4112327|SUCCESS
sqlite> select timestamp, builtOn, job_name, job_id, result from jobs where job_name = 'castor-save-workspace-cache' and timestamp like '2023-11-20T19:34%';
2023-11-20T19:34:57.731000|integration-castor05|castor-save-workspace-cache|4112343|SUCCESS
2023-11-20T19:34:54.726000|integration-castor05|castor-save-workspace-cache|4112342|SUCCESS
2023-11-20T19:34:51.608000|integration-castor05|castor-save-workspace-cache|4112341|SUCCESS
2023-11-20T19:34:48.570000|integration-castor05|castor-save-workspace-cache|4112340|SUCCESS
2023-11-20T19:34:45.167000|integration-castor05|castor-save-workspace-cache|4112339|SUCCESS
2023-11-20T19:34:41.603000|integration-castor05|castor-save-workspace-cache|4112338|SUCCESS
2023-11-20T19:34:35.846000|integration-castor05|castor-save-workspace-cache|4112337|SUCCESS

Here's the castor job that 47917 lost track of: https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/4112339/console

It runs at 19:34:46—5 seconds after 47917 is finished. So the workspace's cache directory has been gone for 5 seconds.

Then castor-save-workspace-cache runs rsync. For the --rsync-path on the remote, it uses docker. So it's running docker on integration-agent-docker-1047 with the workspace's cache directory mounted as a volume, but that path no longer exists on integration-agent-docker-1047.

So docker is creating that directory. As root. Demo:

$ whoami
thcipriani
$ ls | grep -c this-dir-doesnt-exist
0
$ docker run --rm -it --user $(id -u):$(id -g) --volume $(pwd)/this-dir-doesnt-exist:/test --entrypoint /bin/true debian:bookworm
$ ls -l | grep this-dir-doesnt-exist
drwxr-xr-x 2 root       root          4096 Nov 20 17:04 this-dir-doesnt-exist

From last night the sequence is:

https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74-docker/47917/console got cancelled

19:33:59 postbuild step triggers the castor-save-workspace-cache job
19:34:37 the build step get a org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException

The cache directory get erased 19:34:37 Clearing /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache

19:34:41 the ls -l shows everything is empty
19:34:42 Finished: FAILURE

Meanwhile the castor-save-workspace job is still in the queue and looking at its build description:

45 sec waiting;
3.3 sec build duration;
48 sec total from scheduled to completion.

The job rsync from whatever instance ran the build to fetch the caches and store them on the integration-castor05 instance with a trick which is that it is NOT rsync being run on the remote side but:

docker run --rm -i --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --entrypoint=/usr/bin/rsync ...

Which started at 19:34:45 (after the upstream project has completed) and completed at 19:34:48:

19:34:45 Started by upstream project "mediawiki-quibble-apitests-vendor-php74-docker" build number 47917
19:34:45 originally caused by:
19:34:45  Started by user unknown or anonymous
19:34:45 Running as SYSTEM
19:34:45 Building remotely on integration-castor05 (castor) in workspace /srv/jenkins/workspace/castor-save-workspace-cache
19:34:45 [ssh-agent] Looking for ssh-agent implementation...
19:34:45 [ssh-agent]   Exec ssh-agent (binary ssh-agent on a remote machine)
19:34:45 $ ssh-agent
19:34:45 SSH_AUTH_SOCK=/tmp/ssh-KqAj54SYMydS/agent.185518
19:34:45 SSH_AGENT_PID=185520
19:34:45 [ssh-agent] Started.
19:34:45 Running ssh-add (command line suppressed)
19:34:45 Identity added: /srv/jenkins/workspace/castor-save-workspace-cache@tmp/private_key_13591602856893160767.key (/srv/jenkins/workspace/castor-save-workspace-cache@tmp/private_key_13591602856893160767.key)
19:34:45 [ssh-agent] Using credentials jenkins-deploy (key to connect to labs instances set up with role::ci::slave::labs::common)
19:34:46 [castor-save-workspace-cache] $ /bin/bash -xe /tmp/jenkins227135908481326102.sh
19:34:46 + set -eu +x
19:34:46 Defined: CASTOR_NAMESPACE="mediawiki-core/master/mediawiki-quibble-apitests-vendor-php74-docker"
19:34:46 Creating directory holding cache:
19:34:46 Syncing cache
19:34:46 From.. 172.16.3.56:/srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache
19:34:46 To.... /srv/castor/mediawiki-core/master/mediawiki-quibble-apitests-vendor-php74-docker
19:34:46 + rsync --archive '--rsh=/usr/bin/ssh -a -T -o ConnectTimeout=6 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' '--rsync-path=docker run --rm -i --volume /srv/jenkins/workspace/mediawiki-quibble-apitests-vendor-php74-docker/cache:/cache --entrypoint=/usr/bin/rsync docker-registry.wikimedia.org/releng/castor:0.3.0' --delete-delay --delay-updates jenkins-deploy@172.16.3.56:/cache/ /srv/castor/mediawiki-core/master/mediawiki-quibble-apitests-vendor-php74-docker
19:34:46 Warning: Permanently added '172.16.3.56' (ECDSA) to the list of known hosts.
19:34:48 + echo -e '\nDone'
19:34:48 
19:34:48 Done
19:34:48 $ ssh-agent -k
19:34:48 unset SSH_AUTH_SOCK;
19:34:48 unset SSH_AGENT_PID;
19:34:48 echo Agent pid 185520 killed;
19:34:48 [ssh-agent] Stopped.
19:34:48 Finished: SUCCESS

Given it ran after the build has completed and the workspace got erased, docker unhelpfully creates a cache directory owned by root:root which is NOT garbage collected since the upstream build has already completed.

--rsync-path is evaluated by sh so we could add a shortcircuit to skip the rsync with something such as --rsync-path='[ -d path/to/cache ] && docker rsync....

The downstream project should be canceled which is most probably a bug in Postbuildscript Jenkins plugin or in the Parametrized trigger Jenkins plugin.

I think the issue comes from Parameterized trigger commit 6d4461c79. In TriggerBuilder.perform():

                        for (QueueTaskFuture<AbstractBuild> future : futures.get(p)) {
                            try {
                                    Run startedRun = future.waitForStart();                
...
                            } catch (CancellationException x) {                        
                                throw new AbortException(p.getFullDisplayName() +" aborted.");                                                                                                
                            }

It thus has logic to catch a CancellationException but does not catch InterruptedException which is thrown by waitForStart(). From Jenkins core:

core/src/main/java/hudson/model/queue/QueueTaskFuture.java
public interface QueueTaskFuture<R extends Executable> extends Future<R> {

    /**
     * Short for {@code getStartCondition().get()}
     */
    R waitForStart() throws InterruptedException, ExecutionException;

Thus I guess the plugin should wrap waitForStart() to catch InterruptedException then cancel the build:

core/src/main/java/hudson/model/queue/FutureImpl.java
public final class FutureImpl extends AsyncFutureImpl<Executable> implements QueueTaskFuture<Executable> {
    @Override
    public Executable waitForStart() throws InterruptedException, ExecutionException {
        return getStartCondition().get();
    }

    @Override
    public boolean cancel(boolean mayInterruptIfRunning) {
        Queue q = Jenkins.get().getQueue();
        synchronized (this) {
            if (!executors.isEmpty()) {
                if (mayInterruptIfRunning)
                    for (Executor e : executors)
                        e.interrupt();
                return mayInterruptIfRunning;
            }
            return q.cancel(task);
        }
    }

I have forked the plugin and pushed a dumb commit in at https://github.com/hashar/parameterized-trigger-plugin/tree/cancel-waiting-builds

https://github.com/jenkinsci/parameterized-trigger-plugin/compare/master...hashar:parameterized-trigger-plugin:cancel-waiting-builds#diff-1ee1a56287e0d391189d085cdcf4dd5ddc9a1c6d45856622c444b68aab50688a

src/main/java/hudson/plugins/parameterizedtrigger/TriggerBuilder.java
                                 if (future != null ) {
                                     listener.getLogger().println("Waiting for the completion of " + HyperlinkNote.encodeTo('/'+ p.getUrl(), p.getFullDisplayName()));
-                                    Run startedRun = future.waitForStart();
+                                    Run startedRun;
+                                    try {
+                                        startedRun = future.waitForStart();
+                                    } catch (InterruptedException x) {
+                                        future.cancel(true);
+                                        continue;
+                                    }
                                     listener.getLogger().println(HyperlinkNote.encodeTo('/' + startedRun.getUrl(), startedRun.getFullDisplayName()) + " started.");

I have tried to write a test to cover it but eventually gave up since well.. I don't know much about Java or Jenkins internals.

I guess I can at least catch the exception, log it and raise it again. Regardless it might be helpful to update the plugin a bit in production.

I have send a dummy patch as https://github.com/jenkinsci/parameterized-trigger-plugin/pull/363

which can be build with eg:

mvn clean package  -Drevision=2.44 -Dchangelist=-1-g$(git rev-parse --short HEAD)-T282893 \
  && (unzip -p target/parameterized-trigger.jar META-INF/MANIFEST.MF | egrep ^Plugin-Version)
...
Plugin-Version: 2.44-1-g76a5de9-T282893

~~Uploaded at https://people.wikimedia.org/~hashar/T282893/parameterized-trigger-2.44-1-g76a5de9-T282893.hpi~~

That is bugged, the canceled build ends up being flagged as a SUCCESS ;)

Mentioned in SAL (#wikimedia-operations) [2023-11-23T06:23:57Z] <hashar> Restarting CI Jenkins for plugin update # T282893

My pull request has been merged and upstream switched on continuous delivery so we get a new release automagically (787.v665fcf2a_830b_).

I went to apply the update on https://integration.wikimedia.org/ci/ which also upgraded:

  • Caffeine API
  • Script Security
  • Matrix Project
  • Conditional BuildStep
  • Parameterized Trigger

Tentatively marking it as resolved

That is still happening at eg https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium-docker/141383/console which had the build starting with a root owned cache directory:

15:17:04 + ls -laF --color=always
15:17:04 total 12
15:17:04 drwxr-xr-x  3 root           root    4096 Dec  1 15:27 ./
15:17:04 drwxrwxr-x 51 jenkins-deploy wikidev 4096 Dec  8 15:15 ../
15:17:04 drwxr-xr-x  2 root           root    4096 Dec  1 15:27 cache/

Note the build was the 3rd concurrently running one on that host as indicated by the workspace directory having a @3 suffix (/srv/jenkins/workspace/quibble-vendor-mysql-php74-noselenium-docker@3).

The leftover cache is from December 1st which unfortunately got erased half an hour ago since we only keep jobs for 7 days :) But I am assuming the postbuildscript still has some issues here and there.

Change #1028900 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] Add protection against running docker with a nonexistent host path

https://gerrit.wikimedia.org/r/1028900

Change #1028900 merged by jenkins-bot:

[integration/config@master] Add protection against running docker with a nonexistent host path

https://gerrit.wikimedia.org/r/1028900

Change #1032515 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] doxygen-publish job: Add docker-cache-dir to builders

https://gerrit.wikimedia.org/r/1032515

Change #1032515 merged by jenkins-bot:

[integration/config@master] doxygen-publish job: Add docker-cache-dir to builders

https://gerrit.wikimedia.org/r/1032515

Change #1032534 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] Create cache and log dirs where needed

https://gerrit.wikimedia.org/r/1032534

Change #1032534 merged by jenkins-bot:

[integration/config@master] Create cache and log dirs where needed

https://gerrit.wikimedia.org/r/1032534

Deployment of https://gerrit.wikimedia.org/r/c/integration/config/+/1028900 revealed that many jobs (e.g. helm-lint) were having root-owned cache or log directories auto-created by docker because they didn't exist before they were used as volumes in a docker run command. I added a workaround for that in https://gerrit.wikimedia.org/r/c/integration/config/+/1032534.

A failure refers to this task:

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php82-noselenium/4954/console

21:43:58 + exec docker run --volume /srv/jenkins/workspace/quibble-vendor-mysql-php82-noselenium/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=quibble-vendor-mysql-php82-noselenium --label jenkins.build=4954 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.3.0 load
21:43:58 ++ /usr/bin/env
21:43:58 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
21:43:59 Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/master/quibble-vendor-mysql-php82-noselenium"
21:43:59 Syncing...
21:43:59 rsync: [Receiver] mkdir "/nonexistent" failed: Permission denied (13)
21:43:59 rsync error: error in file IO (code 11) at main.c(787) [Receiver=3.2.3]
21:43:59 
21:43:59 Done
21:43:59 [quibble-vendor-mysql-php82-noselenium] $ /bin/bash -xe /tmp/jenkins13628120926765879763.sh
21:43:59 + set -eux
21:43:59 + ls -laF --color=always
21:43:59 total 12
21:43:59 drwxr-xr-x  3 root           root    4096 May 18 19:24 ./
21:43:59 drwxrwxr-x 21 jenkins-deploy wikidev 4096 May 18 19:24 ../
21:43:59 drwxr-xr-x  2 root           root    4096 May 18 19:24 cache/
21:43:59 + mkdir -m 2777 -p log
21:43:59 mkdir: cannot create directory ‘log’: Permission denied
21:43:59 Build step 'Execute shell' marked build as failure
21:43:59 [PostBuildScript] - [INFO] Executing post build scripts.
21:43:59 [quibble-vendor-mysql-php82-noselenium] $ /bin/bash /tmp/jenkins868051900845874466.sh
21:43:59 + set -o pipefail
21:43:59 ++ pwd
21:43:59 + '[' '!' -d /srv/jenkins/workspace/quibble-vendor-mysql-php82-noselenium/log ']'
21:43:59 ++ pwd
21:43:59 + echo 'T282893: Expected host directory /srv/jenkins/workspace/quibble-vendor-mysql-php82-noselenium/log to exist but it does not.'
21:43:59 T282893: Expected host directory /srv/jenkins/workspace/quibble-vendor-mysql-php82-noselenium/log to exist but it does not.
21:43:59 + exit 1

Also
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-selenium/11708/console

A build on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/1048488 failed with "T282893: Expected host directory /srv/jenkins/workspace/quibble-vendor-mysql-php82-noselenium/log to exist but it does not."

I'm not sure what the current status of this work is. Should we report these failures, or are they a known problem? (They do seem to happen a lot less often than they used to, this is the first one I saw in a couple of weeks.)

A build on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/1048488 failed with "T282893: Expected host directory /srv/jenkins/workspace/quibble-vendor-mysql-php82-noselenium/log to exist but it does not."

I'm not sure what the current status of this work is. Should we report these failures, or are they a known problem? (They do seem to happen a lot less often than they used to, this is the first one I saw in a couple of weeks.)

Reporting on this task is still very welcome, we're whack-a-mole-ing the last of these into submission. Glad to hear it's happening less!

I've been seeing failures with this error message once a week or so recently. I (and others) usually comment in Gerrit with this task number, so you can find most of them here: https://gerrit.wikimedia.org/r/q/T282893

I see why this is still happening.

Order of operations:

  1. Job runs on a node, finishes successfully
  2. Job triggers the castor cache to save the workspace's /cache directory
  3. Job loses track of the castor job and continues to wipe the /cache dir
  4. Meanwhile, castor-save-workspace-cache.sh rsyncs the cache dirs to the castor host using docker in the rsync-path with the cache volume mounted. And the docker --volume parameter creates the host directory as root (since it doesn't exist)
  5. New job runs in that workspace and fails because the /cache dir is owned by root.

Details/specifics

The last job to run in the same workspace as the failed job (18185) mentioned by @Pppery was failed job (17976).

If you look at the PostBuildScript of 17976, you'll see:

01:15:54 [PostBuildScript] - [INFO] Executing post build scripts.
01:15:54 Waiting for the completion of castor-save-workspace-cache
01:17:39 Build aborting: cancelling queued project castor-save-workspace-cache
01:17:39 [PostBuildScript] - [ERROR] An error occured during post-build processing.
01:17:39 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
01:17:39 	at PluginClassLoader for postbuildscript//org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)

And then the job clears the directory:

01:17:52 [wmf-quibble-vendor-mysql-php74@2] $ /bin/bash -xe /tmp/jenkins3907462189240054580.sh
01:17:52 + echo 'Listing potentially remaining files in workspace for T282893'
01:17:52 Listing potentially remaining files in workspace for T282893
01:17:52 + ls -laF --color=always
01:17:52 total 8
01:17:52 drwxr-xr-x  2 jenkins-deploy wikidev 4096 Sep 17 07:17 ./
01:17:52 drwxrwxr-x 39 jenkins-deploy wikidev 4096 Sep 17 07:17 ../
01:17:52 [wmf-quibble-vendor-mysql-php74@2] $ /bin/bash -xe /tmp/jenkins11651668535019301603.sh
01:17:52 + set -u
01:17:52 + rmdir /srv/jenkins/workspace/wmf-quibble-vendor-mysql-php74@2

THEN, in 18185, you see:

09:20:18 + ls -laF --color=always
09:20:18 total 12
09:20:18 drwxr-xr-x  3 root           root    4096 Sep 17 07:18 ./
09:20:18 drwxrwxr-x 37 jenkins-deploy wikidev 4096 Sep 18 15:19 ../
09:20:18 drwxr-xr-x  2 root           root    4096 Sep 17 07:18 cache/

Note the timestamps:

  • 2024-09-17 07:17: In 17976, the parent directory of the cache directory was removed
  • 2024-09-17 07:18: In 18185, the cache directory exists with this ctime

This means, something created that directory between those two jobs. And the only thing that could have is the lost castor-workspace-save-cache job. And the way that job mounts directories would cause this problem.

This comment was removed by Pppery.

Change #1074573 had a related patch set uploaded (by Thcipriani; author: Thcipriani):

[integration/config@master] castor-save-workspacecache: check remote dir exists

https://gerrit.wikimedia.org/r/1074573

Change #1143972 had a related patch set uploaded (by Hashar; author: Bartosz Dziewoński):

[integration/config@master] Omit noisy `ls` debugging commands when not needed

https://gerrit.wikimedia.org/r/1143972

Mentioned in SAL (#wikimedia-releng) [2025-05-12T09:52:33Z] <hashar> Updating all jobs for https://gerrit.wikimedia.org/r/c/integration/config/+/1143972 "Omit noisy ls debugging commands when not needed" # T282893 & T393847

Change #1143972 merged by jenkins-bot:

[integration/config@master] Omit noisy `ls` debugging commands when not needed

https://gerrit.wikimedia.org/r/1143972

Still seen.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1152349
https://integration.wikimedia.org/ci/job/mwgate-node20/57567/console

+ exec docker run --volume /srv/jenkins/workspace/mwgate-node20/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=mwgate-node20 --label jenkins.build=57567 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.4.0 load
++ set +x
Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/wmf-1.45.0-wmf.3/mwgate-node20"
Syncing...
rsync: [sender] change_dir "/castor-mw-ext-and-skins/wmf-1.45.0-wmf.3/mwgate-node20" (in caches) failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1819) [Receiver=3.2.3]
rsync: [Receiver] read error: Connection reset by peer (104)

Done
[mwgate-node20] $ /bin/bash -xe /tmp/jenkins2685256128437656199.sh
+ set -eux
+ mkdir -m 2777 -p log
mkdir: cannot create directory ‘log’: Permission denied
+ ls -laF --color=always
total 12
drwxr-xr-x  3 root           root    4096 May 30 19:52 ./
drwxrwxr-x 37 jenkins-deploy wikidev 4096 May 30 20:11 ../
drwxr-xr-x  2 root           root    4096 May 30 19:52 cache/
+ false
Build step 'Execute shell' marked build as failure

Change #1152248 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: set cache directory ownership to nobody:nogroup

https://gerrit.wikimedia.org/r/1152248

Change #1152248 merged by jenkins-bot:

[integration/config@master] jjb: set cache directory ownership to nobody:nogroup

https://gerrit.wikimedia.org/r/1152248

Change #1152761 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Revert "jjb: drop obsolete creation of log & src dirs"

https://gerrit.wikimedia.org/r/1152761

Change #1152761 merged by jenkins-bot:

[integration/config@master] Revert "jjb: drop obsolete creation of log & src dirs"

https://gerrit.wikimedia.org/r/1152761