Page MenuHomePhabricator

Various CI jobs failing after "mkdir: cannot create directory ‘log’: Permission denied"
Open, Needs TriagePublic

Description

https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/112436/console

21:09:00 [mwext-php72-phan-docker] $ /bin/bash -xe /tmp/jenkins3384731697099042303.sh
21:09:00 + set -eux
21:09:00 + mkdir -m 2777 -p log
21:09:00 mkdir: cannot create directory ‘log’: Permission denied
21:09:00 Build step 'Execute shell' marked build as failure

The jobs usually passing after resubmit. Have seen this on various jobs, this one is just an example.

Event Timeline

From that same run:

13:08:58 [mwext-php72-phan-docker] $ /bin/bash -xe /tmp/jenkins7676320888490307308.sh
13:08:58 + mkdir -m 2777 -p cache

^ worked without error. Bizarre.

hashar claimed this task.

It might be due to a job completing abruptly and not running the publisher step that wipe up the workspace. Roughly it runs a find . -delete as root. That cleaner did run when the job failed to create the log directory and thus the next build on the host worked. It is essentially self fixing.

The previous run of that job on that agent apparently did run the find -delete. So I don't know what might have happened :-\

I am not quite sure it will happen, seems to be just a one off error. But thank you to have reported it at least!

hashar reopened this task as Open.EditedTue, May 18, 9:16 PM

I assumed it was just a one time issue when clearly it is not :(

I guess we can revisit how the directories are created on the host so that instead of mkdir -m 2777 {cache,log,src} we use a container that runs as nobody, ensure the directory are wiped and recreate them.

On integration-agent-docker-1002.integration.eqiad1.wikimedia.cloud there is a job workspace owned by root:root which would explain the failure to create a log directory:

drwxr-xr-x 3 root root 4096 May 25 07:11 /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2

It got generated as part of this build: https://integration.wikimedia.org/ci/job/quibble-composer-mysql-php74-noselenium-docker/6967/console

As jenkins-deploy user it mkdir -m 2777 -p cache. The build eventually fails due to:

07:11:37 [PostBuildScript] - [INFO] Executing post build scripts.
07:11:37 Waiting for the completion of castor-save-workspace-cache
07:11:37 [PostBuildScript] - [ERROR] An error occured during post-build processing.
07:11:37 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:190)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processScripts(Processor.java:91)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:79)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.process(Processor.java:73)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.PostBuildScript.perform(PostBuildScript.java:116)
07:11:37 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
07:11:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:803)
07:11:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:752)
07:11:37 	at hudson.model.Build$BuildExecution.post2(Build.java:177)
07:11:37 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:697)
07:11:37 	at hudson.model.Run.execute(Run.java:1932)
07:11:37 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
07:11:37 	at hudson.model.ResourceController.execute(ResourceController.java:97)
07:11:37 	at hudson.model.Executor.run(Executor.java:429)
07:11:37 Caused by: java.lang.InterruptedException
07:11:37 	at java.base/java.lang.Object.wait(Native Method)
07:11:37 	at java.base/java.lang.Object.wait(Object.java:328)
07:11:37 	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:79)
07:11:37 	at hudson.model.queue.FutureImpl.waitForStart(FutureImpl.java:68)
07:11:37 	at hudson.plugins.parameterizedtrigger.TriggerBuilder.perform(TriggerBuilder.java:146)
07:11:37 	at org.jenkinsci.plugins.postbuildscript.processor.Processor.processBuildSteps(Processor.java:180)
07:11:37 	... 13 more
07:11:37 Build step 'Execute scripts' changed build result to FAILURE
07:11:37 Build step 'Execute scripts' marked build as failure

No idea what java.lang.InterruptedException is about.

The job then runs:

07:11:38 Clearing /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2/cache
07:11:38 + exec docker run --volume /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=quibble-composer-mysql-php74-noselenium-docker --label jenkins.build=6967 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.2.1 clear

And finally:

rmdir /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker@2

Another build of the exact same job ran concurrently https://integration.wikimedia.org/ci/job/quibble-composer-mysql-php74-noselenium-docker/6965/ and got aborted by Zuul. That job ran in quibble-composer-mysql-php74-noselenium-docker. The post build script do:

07:11:45 exec docker run --volume /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker/cache:/cache --security-opt seccomp=unconfined --init --rm --label jenkins.job=quibble-composer-mysql-php74-noselenium-docker --label jenkins.build=6965 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/castor:0.2.1 clear

07:11:49 + docker ps -q --filter label=jenkins.job=quibble-composer-mysql-php74-noselenium-docker --filter label=jenkins.build=6965
07:11:49 + xargs --no-run-if-empty docker stop
07:11:49 + exec docker run --user=root --volume /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker:/workspace --entrypoint=/usr/bin/find --security-opt seccomp=unconfined --init --rm --label jenkins.job=quibble-composer-mysql-php74-noselenium-docker --label jenkins.build=6965 --env-file /dev/fd/63 docker-registry.wikimedia.org/wikimedia-buster:latest /workspace -mindepth 1 -delete

And finally:

07:11:52 + rmdir /srv/jenkins/workspace/workspace/quibble-composer-mysql-php74-noselenium-docker
07:11:52 Finshed: ABORTED

I guess there is a race condition somewhere that causes the cache directory to be recreated by the Docker daemon somehow :-\