Multiple *-pipeline-test jobs failing to load pipelinelib with git error
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue:

What happens?:

https://integration.wikimedia.org/ci/job/striker-pipeline-test/478/console:

Started by upstream project "trigger-striker-pipeline-test" build number 478
originally caused by:
 Started by user unknown or anonymous
Loading library wikimedia-integration-pipelinelib@master
Attempting to resolve master from remote references...
 > git --version # timeout=10
 > git --version # timeout=10
 > git ls-remote https://gerrit.wikimedia.org/r/integration/pipelinelib # timeout=10
ERROR: Checkout failed
java.io.IOException: error=0, Failed to exec spawn helper: pid: 1312994, exit value: 1
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:314)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:244)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1110)
Caused: java.io.IOException: Cannot run program "git": error=0, Failed to exec spawn helper: pid: 1312994, exit value: 1

Event Timeline

A build from a few days ago (2025-02-16) shows what is expected to happen in the currently crashing section of test code:

Started by upstream project "trigger-striker-pipeline-test" build number 474
originally caused by:
 Started by user unknown or anonymous
Loading library wikimedia-integration-pipelinelib@master
Attempting to resolve master from remote references...
 > git --version # timeout=10
 > git --version # 'git version 2.30.2'
 > git ls-remote -- https://gerrit.wikimedia.org/r/integration/pipelinelib # timeout=10
Found match: refs/heads/master revision f87b0b853aac54ccf13abd66f570105326a58177
Using checkout strategy: SpecificRevisionBuildChooser
Last Built Revision: Revision f87b0b853aac54ccf13abd66f570105326a58177 (master)

I think the failures are happening on the Jenkins server itself. The crash has been very reproducible today whenever Gerrit/Zuul/Jenkins triggers the job. It does not appear to be a system-wide issue with git, if the following console session is to be believed:

$ ssh contint.wikimedia.org
$ hostname -f
contint1002.wikimedia.org
$ sudo su -
# sudo -iu jenkins git --version
git version 2.30.2
# sudo -iu jenkins git ls-remote https://gerrit.wikimedia.org/r/integration/pipelinelib
f87b0b853aac54ccf13abd66f570105326a58177        HEAD
1d86c2f9c92546712a4b6a16e222f7b703136aa9        refs/changes/00/451400/1
2a34242e4bed8c70ffc50982a45f819afb8ba4a2        refs/changes/00/451400/2
48e7a66978e00c51098398ab9d618496d7fc67f2        refs/changes/00/451400/meta
...
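
The manual check above can also be scripted. This is only a minimal sketch, not something run as part of this task: `parse_ls_remote` and `resolve_branch` are hypothetical helper names, and the 10-second timeout simply mirrors the `timeout=10` seen in the Jenkins console output. Like the shell session, it runs git from a fresh process, so it can only confirm that git and the network work for the calling user; it would not reproduce a failure that happens inside the long-running Jenkins JVM.

```python
import subprocess

# Repository URL taken from the console output above.
REPO = "https://gerrit.wikimedia.org/r/integration/pipelinelib"


def parse_ls_remote(output):
    """Map ref name -> commit SHA from `git ls-remote` output."""
    refs = {}
    for line in output.splitlines():
        parts = line.split(None, 1)  # "<sha><whitespace><ref>"
        if len(parts) == 2:
            sha, ref = parts
            refs[ref.strip()] = sha
    return refs


def resolve_branch(branch="master", repo=REPO):
    """Return the SHA for refs/heads/<branch>, or None if it is not listed."""
    out = subprocess.run(
        ["git", "ls-remote", "--", repo],
        check=True, capture_output=True, text=True, timeout=10,
    ).stdout
    return parse_ls_remote(out).get(f"refs/heads/{branch}")
```

Running `resolve_branch()` as the jenkins user should return the same revision that a healthy build reports after "Found match: refs/heads/master revision".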
bd808 renamed this task from striker-pipeline-test failing to load pipelinelib with git error to Multiple *-pipeline-test jobs failing to load pipelinelib with git error. Feb 19 2025, 6:13 PM
bd808 added subscribers: KartikMistry, hashar, Aklapper, abi_.
bd808 triaged this task as Unbreak Now! priority. Feb 19 2025, 6:16 PM

Bumping priority to UBN! as this regression is blocking at least three production-deployed services from passing CI.

Trying to narrow down when things went sideways: https://integration.wikimedia.org/ci/view/All%20jobs/job/cxserver-pipeline-test/840/ last passed at 2025-02-18T09:19:32Z.

Started by upstream project "trigger-cxserver-pipeline-test" build number 839
originally caused by:
 Started by user unknown or anonymous
Loading library wikimedia-integration-pipelinelib@master
Attempting to resolve master from remote references...
 > git --version # timeout=10
 > git --version # 'git version 2.30.2'
 > git ls-remote -- https://gerrit.wikimedia.org/r/integration/pipelinelib # timeout=10
Found match: refs/heads/master revision
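
The "when did it last pass" lookup can be sketched against the Jenkins JSON remote API, which exposes a `lastSuccessfulBuild` permalink per job. Assumptions here: the API is readable anonymously on this instance, and the helper names (`last_success_api_url`, `to_utc_iso`, `last_success`) are hypothetical, chosen for illustration only.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen

# Base URL taken from the job links in this task.
JENKINS = "https://integration.wikimedia.org/ci"


def last_success_api_url(job, base=JENKINS):
    """Build the JSON API URL for a job's last successful build."""
    return f"{base}/job/{job}/lastSuccessfulBuild/api/json"


def to_utc_iso(timestamp_ms):
    """Jenkins reports build start time as milliseconds since the epoch."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def last_success(job, base=JENKINS):
    """Return (build number, UTC start time) of the last successful build."""
    with urlopen(last_success_api_url(job, base), timeout=10) as resp:
        build = json.load(resp)
    return build["number"], to_utc_iso(build["timestamp"])
```

Calling `last_success("cxserver-pipeline-test")` for each affected job gives a rough window in which the regression started, which can then be compared against the Apt history on the host.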

This feels very similar to T385553/T377803, and Java was upgraded yesterday on contint1002 per the Apt history log file, so that matches too. Maybe let's try restarting the Jenkins service, since that fixes the Puppet case of this error?

That seems reasonable. I'm also seeing that the git-client plugin is at 5.0.0, but there was a fix in 5.0.1 related to listing remote branches: https://github.com/jenkinsci/git-client-plugin/pull/1242

Mentioned in SAL (#wikimedia-operations) [2025-02-19T19:35:12Z] <dduvall> restarting jenkins to fix git related issues following java update (T386755)

Mentioned in SAL (#wikimedia-releng) [2025-02-19T19:35:27Z] <dduvall> restarting jenkins to fix git related issues following java update (T386755)

Following a botched "safe" restart and subsequent systemctl restart jenkins, the issue seems to be resolved.

Started by user dduvall
Loading library wikimedia-integration-pipelinelib@master
Caching library wikimedia-integration-pipelinelib@master
Attempting to resolve master from remote references...
 > git --version # timeout=10
 > git --version # 'git version 2.30.2'
 > git ls-remote -- https://gerrit.wikimedia.org/r/integration/pipelinelib # timeout=10
Found match: refs/heads/master revision f87b0b853aac54ccf13abd66f570105326a58177
Using checkout strategy: SpecificRevisionBuildChooser
Selected Git installation does not exist. Using Default
hashar added a subscriber: dancy.

This occurred again (T394817) and @dancy found in the journal:

May 20 17:03:14 contint1002 jenkins[512699]: Incorrect Java version: 17.0.14+7-Debian-1deb11u1
May 20 17:03:14 contint1002 jenkins[512699]: jspawnhelper version 17.0.15+6-Debian-1deb11u1
May 20 17:03:14 contint1002 jenkins[512699]: This command is not for general use and should only be run as the result of a call to
May 20 17:03:14 contint1002 jenkins[512699]: ProcessBuilder.start() or Runtime.exec() in a java application

The root cause is that we upgraded Java and did not restart Jenkins immediately. That leads to a Java version mismatch when Jenkins spawns a subprocess (as I understand it).
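
A rough way to spot this condition from the journal is to look for the two version lines quoted above and compare them. This is a sketch under assumptions: the regular expressions are copied from the wording in this particular journal excerpt and may not match other JDK builds, and `find_spawn_helper_mismatch` is a hypothetical helper name. The reading that the first version belongs to the still-running JVM and the second to the upgraded on-disk helper follows the explanation in this task rather than anything verified independently.

```python
import re

# Patterns based on the journal lines quoted in this task (assumed wording).
RUNNING_RE = re.compile(r"Incorrect Java version: (\S+)")
HELPER_RE = re.compile(r"jspawnhelper version (\S+)")


def find_spawn_helper_mismatch(journal_lines):
    """Return (running_jvm, on_disk_helper) if the journal shows two
    different versions, otherwise None."""
    running = helper = None
    for line in journal_lines:
        match = RUNNING_RE.search(line)
        if match:
            running = match.group(1)
        match = HELPER_RE.search(line)
        if match:
            helper = match.group(1)
    if running and helper and running != helper:
        return running, helper
    return None
```

Fed the output of something like `journalctl -u jenkins`, a non-None result would suggest that Jenkins needs a restart after a Java package upgrade.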