jenkins.job=mediawiki-i18n-check-docker fails: could not lock config file .gitconfig: File exists
Closed, Resolved | Public | BUG REPORT

Description

For about two weeks, the jenkins.job=mediawiki-i18n-check-docker job has been failing intermittently without showing a real error.

Example: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/893960 with https://integration.wikimedia.org/ci/job/mediawiki-i18n-check-docker/123994/console

A recheck then fixes the issue.

+ git fetch --quiet --update-head-ok --depth 2 git://contint1002.wikimedia.org/mediawiki/extensions/DiscussionTools +refs/zuul/REL1_39/Z3f9e336152c94858861b98ac8d8d3491:refs/zuul/REL1_39/Z3f9e336152c94858861b98ac8d8d3491
+ '[' -z REL1_39 ']'
+ git checkout -B REL1_39 FETCH_HEAD
Switched to a new branch 'REL1_39'
+ set +x
+ git submodule --quiet update --init --recursive
[mediawiki-i18n-check-docker] $ /bin/bash /tmp/jenkins17197787001907672904.sh
+ cd src
++ pwd
+ git config --global --add safe.directory /srv/jenkins/workspace/mediawiki-i18n-check-docker/src
error: could not lock config file /mnt/home/jenkins-deploy/.gitconfig: File exists

This is due to T329266, which added the safe.directory setting.

Event Timeline

Recent example: https://integration.wikimedia.org/ci/job/mediawiki-i18n-check-docker/127709/console

09:15:02 [mediawiki-i18n-check-docker] $ /bin/bash /tmp/jenkins17197787001907672904.sh
09:15:02 + cd src
09:15:02 ++ pwd
09:15:02 + git config --global --add safe.directory /srv/jenkins/workspace/mediawiki-i18n-check-docker/src
09:15:02 error: could not lock config file /mnt/home/jenkins-deploy/.gitconfig: File exists
09:15:02 Build step 'Execute shell' marked build as failure

Fully manual review of big changes is time consuming.

Today about 200 repos failed without showing a real error :-(

Example: https://integration.wikimedia.org/ci/job/mediawiki-i18n-check-docker/128796/console

11:45:00 + git fetch --quiet --update-head-ok --depth 2 git://contint1002.wikimedia.org/mediawiki/extensions/DiscussionTools +refs/zuul/REL1_39/Z3f9e336152c94858861b98ac8d8d3491:refs/zuul/REL1_39/Z3f9e336152c94858861b98ac8d8d3491
11:45:02 + '[' -z REL1_39 ']'
11:45:02 + git checkout -B REL1_39 FETCH_HEAD
11:45:02 Switched to a new branch 'REL1_39'
11:45:02 + set +x
11:45:02 + git submodule --quiet update --init --recursive
11:45:02 [mediawiki-i18n-check-docker] $ /bin/bash /tmp/jenkins17197787001907672904.sh
11:45:02 + cd src
11:45:02 ++ pwd
11:45:02 + git config --global --add safe.directory /srv/jenkins/workspace/mediawiki-i18n-check-docker/src
11:45:02 error: could not lock config file /mnt/home/jenkins-deploy/.gitconfig: File exists
11:45:02 Build step 'Execute shell' marked build as failure
11:45:02 [PostBuildScript] - [INFO] Executing post build scripts.
11:45:03 [mediawiki-i18n-check-docker] $ /bin/bash -xe /tmp/jenkins9911740476435647155.sh

And an example of a different error: https://integration.wikimedia.org/ci/job/mediawiki-i18n-check-docker/128796/console; I'm not sure what's causing the error here:

13:26:15 Building remotely on integration-agent-docker-1023 (pipelinelib Docker blubber) in workspace /srv/jenkins/workspace/mediawiki-i18n-check-docker@4
13:26:16 [mediawiki-i18n-check-docker@4] $ /bin/bash -xe /tmp/jenkins6763737696491111990.sh
13:26:16 + set -eux
13:26:16 + mkdir -m 2777 -p src
13:26:16 [mediawiki-i18n-check-docker@4] $ /bin/bash /tmp/jenkins8326925099231278844.sh
13:26:16 + set -o pipefail
13:26:16 + exec docker run --entrypoint=/usr/bin/find --user=nobody --volume /srv/jenkins/workspace/mediawiki-i18n-check-docker@4:/workspace --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-i18n-check-docker --label jenkins.build=128796 --env-file /dev/fd/63 docker-registry.wikimedia.org/buster:latest /workspace/src -mindepth 1 -delete
13:26:16 ++ /usr/bin/env
13:26:16 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
13:26:17 [mediawiki-i18n-check-docker@4] $ /bin/bash /tmp/jenkins9222105025128184359.sh
13:26:17 + set -o pipefail
13:26:17 ++ pwd
13:26:17 ++ pwd
13:26:17 + exec docker run --volume /srv/jenkins/workspace/mediawiki-i18n-check-docker@4/src:/src --volume /srv/jenkins/workspace/mediawiki-i18n-check-docker@4/cache:/cache --volume /srv/git:/srv/git:ro --security-opt seccomp=unconfined --init --rm --label jenkins.job=mediawiki-i18n-check-docker --label jenkins.build=128796 --env-file /dev/fd/63 docker-registry.wikimedia.org/releng/ci-src-setup-simple:0.4.2-s1
13:26:17 ++ /usr/bin/env
13:26:17 ++ egrep -v '^(HOME|SHELL|PATH|LOGNAME|MAIL)='
13:26:18 + git init
13:26:19 Initialized empty Git repository in /src/.git/
13:26:19 + git fetch --quiet --update-head-ok --depth 2 git://contint2001.wikimedia.org/mediawiki/extensions/CommentStreams +refs/zuul/master/Zac8e2d574e744cb6a91aa9031608fc82:refs/zuul/master/Zac8e2d574e744cb6a91aa9031608fc82
13:26:19 + '[' -z master ']'
13:26:19 + git checkout -B master FETCH_HEAD
13:26:19 Reset branch 'master'
13:26:19 + set +x
13:26:19 + git submodule --quiet update --init --recursive
13:26:19 [mediawiki-i18n-check-docker@4] $ /bin/bash /tmp/jenkins13449682070583816511.sh
13:26:19 + cd src
13:26:19 ++ pwd
13:26:19 + git config --global --add safe.directory /srv/jenkins/workspace/mediawiki-i18n-check-docker@4/src
13:26:19 ++ mktemp
13:26:19 + additions=/tmp/tmp.u9XQKja3le
13:26:19 + git show FETCH_HEAD -U0
13:26:19 + grep '^+'
13:26:19 Build step 'Execute shell' marked build as failure
13:26:19 [PostBuildScript] - [INFO] Executing post build scripts.
13:26:19 [mediawiki-i18n-check-docker@4] $ /bin/bash -xe /tmp/jenkins4904502183614127051.sh
13:26:19 + set -euxo pipefail
13:26:19 + docker ps -q --filter label=jenkins.job=mediawiki-i18n-check-docker --filter label=jenkins.build=128796
13:26:19 + xargs --no-run-if-empty docker stop
13:26:19 [PostBuildScript] - [INFO] Executing post build scripts.

Today about 200 repos failed without showing a real error :-(

The main reason for the huge number is that I had a network error on my side and restarted the export a few minutes later, without checking whether there were already patches waiting for review.

But that does not explain why all (?) second patches per repo fail for mediawiki-i18n-check-docker.

Still happening:

08:47:36 + git config --global --add safe.directory /srv/jenkins/workspace/mediawiki-i18n-check-docker@2/src
08:47:36 error: could not lock config file /mnt/home/jenkins-deploy/.gitconfig: File exists
08:47:36 Build step 'Execute shell' marked build as failure

https://integration.wikimedia.org/ci/job/mediawiki-i18n-check-docker/154955/console

I'd like to request assistance with this task.

Reviewing translation updates manually is tedious and often boring work. We'd like to keep the number of patches needing manual review as low as possible.

We'd appreciate any help towards making the i18n-checker job more reliable.

Oh. A lot of these are running in parallel, yes? This seems like a parallelism problem operating on the .gitconfig file; i.e., multiple jobs trying to edit at the same time.

This looks like fallout from T329266. Since this job isn't running within a container, every build tries to modify the same .gitconfig file at the same time, which fails because the first build has already created a .gitconfig.lock.
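The failure mode can be reproduced without Git at all. The following is a minimal sketch, assuming Git's usual lock-file convention (exclusively creating <file>.lock before rewriting <file>). The helper name try_lock and the temporary directory are made up for illustration; this is not what the job runs.

```python
import os
import tempfile


def try_lock(config_path):
    """Try to take a Git-style lock by exclusively creating <config_path>.lock.

    Returns the lock path on success, or None if another writer holds it.
    """
    lock_path = config_path + ".lock"
    try:
        # O_EXCL makes the creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.close(fd)
    return lock_path


with tempfile.TemporaryDirectory() as home:
    # Hypothetical stand-in for /mnt/home/jenkins-deploy/.gitconfig
    gitconfig = os.path.join(home, ".gitconfig")

    first = try_lock(gitconfig)    # first build takes the lock
    second = try_lock(gitconfig)   # concurrent build: "File exists"
    os.remove(first)               # first build finishes and releases the lock
    third = try_lock(gitconfig)    # a later build (or a recheck) succeeds
    os.remove(third)
```

Here the second call returns None while the first lock is held, and the third call succeeds once the lock is released, which would be consistent with a recheck passing.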

Yeah, we push a lot of patches to Gerrit in a short time frame.

hashar renamed this task from "jenkins.job=mediawiki-i18n-check-docker fails without a real error" to "jenkins.job=mediawiki-i18n-check-docker fails: could not lock config file .gitconfig: File exists". Feb 20 2024, 9:35 AM
hashar updated the task description.
hashar updated the task description.

Change 1005049 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: enhance mediawiki-i18n-check job

https://gerrit.wikimedia.org/r/1005049

Another issue is the check failing WITHOUT any output such as:

13:26:19 + git show FETCH_HEAD -U0
13:26:19 + grep '^+'
13:26:19 Build step 'Execute shell' marked build as failure

This happens when the proposed change is behind the target branch, such as in:

* (master)
| * (change 1234,99) Localisation update
|/
*  whatever common parent

Under the hood, Zuul merges the change into the branch and marks the result with a reference (which shows up in the Jenkins build parameters as ZUUL_REF). So you get:

* (refs/zuul/master/Z987654321) Merge commit 'change (1234,99)' into HEAD
| \
| * (change 1234,99) Localisation update
* | (master)
|/
*  whatever common parent

CI then fetches that refs/zuul/master/Z987654321. The script runs git show FETCH_HEAD, which for a merge commit defaults to a combined diff, and that is empty when the merge was clean. With no diff, nothing matches ^+, grep exits non-zero, and the build fails with a mysterious empty output.
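A minimal sketch of the failing step, assuming the check boils down to piping the diff into grep '^+'. The function grep_additions is a hypothetical stand-in for that pipeline, and the sample diff and its message key are invented. It only mimics grep's exit status, which is what Jenkins ends up seeing.

```python
def grep_additions(diff_text):
    """Mimic `grep '^+'`: return (exit_status, matching_lines)."""
    matches = [line for line in diff_text.splitlines() if line.startswith("+")]
    # grep exits 0 when at least one line matched and 1 when nothing matched.
    return (0 if matches else 1), matches


# Invented diff of an ordinary (non-merge) commit: grep finds the added lines.
normal_diff = "\n".join([
    "--- a/i18n/is.json",
    "+++ b/i18n/is.json",
    "@@ -10,0 +11 @@",
    '+\t"example-message": "example text",',
])

# A clean merge commit yields no diff by default: nothing for grep to match.
merge_diff = ""

status_normal, lines_normal = grep_additions(normal_diff)
status_merge, lines_merge = grep_additions(merge_diff)
```

The empty input gives a non-zero status with no output at all, which is the "empty error" seen in the console.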

That exactly matches:

But that does not explain why all (?) second patches per repo fail for mediawiki-i18n-check-docker.

I guess that is because the second patch makes Zuul create a merge commit. The fix is to use git show -m --first-parent, which shows the difference for a merge commit relative to its first parent (master). I have battle-tested that in a tool named git-changed-in-head, which I have ported to Quibble:

+        # Some explanations for the git command below:
+        # HEAD^ will not exist for an initial commit, we thus need `git show`
+        # --name-only: strip patch payload, only report the file being altered
+        # --diff-filter=ACM: only care about files Added, Copied or Modified
+        # --find-renames=100%: renamed files that had a slight change would be
+        #                      considered modified and thus included.
+        # -m: show differences for merge commits ...
+        # --first-parent: ... but only follow the first parent commit
+        # --format=format: : strip out the commit summary
+        cmd = [
+            'git', 'show', 'HEAD',
+            '--name-only',
+            '--diff-filter=ACM',
+            '--find-renames=100%',
+            '-m',
+            '--first-parent',
+            '--format=format:',
+        ]

Change 1005774 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: fix mediawiki-i18n-check-docker empty error

https://gerrit.wikimedia.org/r/1005774

The test case I used was a failing build of the checker, for mediawiki/extensions/GlobalBlocking change 1005424. The job failed with an empty error:

++ mktemp
+ additions=/tmp/tmp.RB2gqyOAHT
+ git show FETCH_HEAD -U0
+ grep '^+'
Build step 'Execute shell' marked build as failure

Which is how I found out the empty error issue.

After deploying the JJB patches, the job passes now:

set -euo pipefail; git show -m --first-parent --find-renames=100% FETCH_HEAD -U0 | grep '\''^+'\'' | tee /log/additions.txt'
+++ b/i18n/is.json
+	"globalblocking-block-ipinvalid": "IP-staðfangið ($1) sem þú ritaðir er ógilt.\nVinsamlegast athugaðu að þú getur ekki ritað notandanafn!",
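For readers who prefer Python, the deployed command can be approximated as below. This is a sketch only: the function names additions_command and added_lines are hypothetical, the flags are the ones visible in the log line above, and the actual job remains a shell pipeline.

```python
import subprocess


def additions_command(ref="FETCH_HEAD"):
    """Build a `git show` invocation with the flags used after the fix."""
    return [
        "git", "show",
        "-m",                     # show a diff for merge commits ...
        "--first-parent",         # ... against the first parent only
        "--find-renames=100%",    # only exact renames count as renames
        "-U0",                    # no context lines around each change
        ref,
    ]


def added_lines(repo_dir, ref="FETCH_HEAD"):
    """Return the lines starting with '+' in the diff of `ref`.

    An empty list means "nothing was added"; a failing git command raises
    subprocess.CalledProcessError instead, so the two cases stay distinct.
    """
    result = subprocess.run(
        additions_command(ref),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line.startswith("+")]
```

Unlike the grep pipeline, this variant does not turn "no added lines" into a failure, which may or may not be desirable for a checker that expects every localisation update to add something.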

Change 1005774 merged by jenkins-bot:

[integration/config@master] jjb: fix mediawiki-i18n-check-docker empty error

https://gerrit.wikimedia.org/r/1005774

Change 1005049 merged by jenkins-bot:

[integration/config@master] jjb: enhance mediawiki-i18n-check job

https://gerrit.wikimedia.org/r/1005049

Summary

For about two weeks, the jenkins.job=mediawiki-i18n-check-docker job has been failing intermittently without showing a real error.

I am confident that was due to some of the l10n changes being tested by CI as merge commits of the change plus the tip of the branch. Since git show does not show the diff of a merge by default, that caused the empty error. Using git show -m --first-parent fixed it.

error: could not lock config file /mnt/home/jenkins-deploy/.gitconfig: File exists

That was due to a race condition between builds using git config to set safe.directory=$(pwd) in the shared /mnt/home/jenkins-deploy/.gitconfig. The fix I made is to move that logic so it executes inside a container, as it should.
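For jobs that cannot be moved into a container, one alternative (not the fix applied here) would be to give each build its own global config file through the GIT_CONFIG_GLOBAL environment variable, available in Git 2.32 and later. The sketch below only prepares the environment and the command. The helper names and the .gitconfig-build file name are assumptions for illustration.

```python
import os


def per_build_git_env(workspace, base_env=None):
    """Return an environment where Git's global config is private to one build.

    GIT_CONFIG_GLOBAL (Git >= 2.32) replaces ~/.gitconfig, so concurrent
    builds no longer compete for the same .gitconfig.lock.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Hypothetical file name; any path unique to the build would do.
    env["GIT_CONFIG_GLOBAL"] = os.path.join(workspace, ".gitconfig-build")
    return env


def safe_directory_command(src_dir):
    """Same command as in the job log, to be run with the environment above."""
    return ["git", "config", "--global", "--add", "safe.directory", src_dir]


env_a = per_build_git_env("/srv/jenkins/workspace/mediawiki-i18n-check-docker", {})
env_b = per_build_git_env("/srv/jenkins/workspace/mediawiki-i18n-check-docker@2", {})
```

Each workspace gets a distinct config path, so two builds would write to different files rather than contending for one lock.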

I believe the issues are fixed; at least the fix worked for the test case I had (rebuilding the last failing build in CI). One thing left to check is whether the job still catches issues ;)

Thanks so much @hashar. Let's give it a few more days to see if there are any unexpected issues and then close.

One thing I noticed is that I have to scroll a bit more now that all the additions are visible in the log. Not a problem for me though.