
'No such file or directory' CI failures in multiple repos
Closed, Resolved · Public · BUG REPORT

Description

These kinds of errors show up all over CI. Not sure what caused them.

`FileNotFoundError: [Errno 2] No such file or directory: './node_modules/.bin/grunt': './node_modules/.bin/grunt'` on https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71031/consoleFull but no obvious `npm install`
[...]
/34f1dfea39f06111c930967d7ecece596c9ff4bc.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/1eb0d80069cf91af36d06672de4f94197cb8fbb6.zip" (from composer/files/symfony/console/.~tmp~/1eb0d80069cf91af36d06672de4f94197cb8fbb6.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/3fa0ff2ed97fbda7025533c40b4641063f5de0b3.zip" (from composer/files/symfony/console/.~tmp~/3fa0ff2ed97fbda7025533c40b4641063f5de0b3.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/5aad8e7bb181551d19e204c4e7c5bca91ab6d892.zip" (from composer/files/symfony/console/.~tmp~/5aad8e7bb181551d19e204c4e7c5bca91ab6d892.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/6add331cbe4f3a99ee595b224fc2ec32d1594946.zip" (from composer/files/symfony/console/.~tmp~/6add331cbe4f3a99ee595b224fc2ec32d1594946.zip): No such file or directory (2)
22:50:48 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [generator=3.1.2]
22:50:48 
22:50:48 Done
22:50:48 [mediawiki-core-php72-phan-docker] $ /bin/bash -xe /tmp/jenkins8753552369144288140.sh
22:50:48 FATAL: command execution failed
22:50:48 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to integration-agent-docker-1038
22:50:48 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1797)
22:50:48 		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
22:50:48 		at hudson.remoting.Channel.call(Channel.java:1001)
22:50:48 		at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1122)
22:50:48 		at hudson.Launcher$ProcStarter.start(Launcher.java:507)
22:50:48 		at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:143)
22:50:48 		at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:91)
22:50:48 		at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
[...]

https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71027/console
https://integration.wikimedia.org/ci/job/mediawiki-core-php72-phan-docker/61638/console

Event Timeline

Zabe renamed this task from 'No such file or directory' CI failures in multplie repos to 'No such file or directory' CI failures in multiple repos. Jan 26 2022, 11:34 PM

Maybe related to T252071: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye?

Mentioned in SAL (#wikimedia-releng) [2022-01-26T20:29:12Z] <hashar> Completed migration of integration-agent-docker-XXXX instances from Stretch to Bullseye - T252071

Zabe triaged this task as High priority. Jan 26 2022, 11:46 PM

Looking at https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71027/consoleFull

23:15:39 ............................................................. 3416 / 4775 ( 71%)
23:15:39 ............................................................. 3477 / 4775 ( 72%)
23:15:40 ............................................................. 3538 / 4775 ( 74%)
23:15:53 ............................[3fce00e6b3fa0eac7d55b4ed] [no req]   MWException: MessagesEn.php is missing.
23:15:53 Backtrace:
23:15:53 from /workspace/src/includes/cache/localisation/LocalisationCache.php(498)
23:15:53 #0 /workspace/src/includes/cache/localisation/LocalisationCache.php(370): LocalisationCache->initLanguage(string)

The file has vanished? /workspace/src from inside the container is a volume mount from the host:

docker run --volume /srv/jenkins/workspace/workspace/wmf-quibble-core-vendor-mysql-php72-docker/src:/workspace/src
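
As a sanity check, the mounts of a running container can be inspected from the host; a quick sketch, where the container ID is a placeholder:

# List each mount's host-side source and in-container destination.
docker inspect -f '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' <container-id>
# If the host-side source directory is deleted or replaced while the container runs,
# files appear to "vanish" from the destination path inside the container.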

https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71032/console

23:25:12 INFO:zuul.Cloner:Updating origin remote in repo mediawiki/core to https://gerrit.wikimedia.org/r/mediawiki/core
23:25:15 INFO:zuul.Cloner:upstream repo has branch master
23:25:54 INFO:zuul.Cloner:Falling back to branch master
23:25:54 ERROR:zuul.Repo:Unable to initialize repo for /workspace/src
23:25:54 Traceback (most recent call last):
23:25:54   File "/usr/local/lib/python3.7/dist-packages/zuul/merger/merger.py", line 90, in createRepoObject
23:25:54     self._ensure_cloned()
23:25:54   File "/usr/local/lib/python3.7/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
23:25:54     git.Repo.clone_from(self.remote_url, self.local_path)
23:25:54   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 988, in clone_from
23:25:54     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
23:25:54   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 939, in _clone
23:25:54     finalize_process(proc, stderr=stderr)
23:25:54   File "/usr/lib/python3/dist-packages/git/util.py", line 333, in finalize_process
23:25:54     proc.wait(**kwargs)
23:25:54   File "/usr/lib/python3/dist-packages/git/cmd.py", line 415, in wait
23:25:54     raise GitCommandError(self.args, status, errstr)
23:25:54 git.exc.GitCommandError: Cmd('git') failed due to: exit code(1)
23:25:54   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/core /workspace/src
23:25:54   stderr: 'Cloning into '/workspace/src'...
23:25:54 /workspace/src/.git: No such file or directory
23:25:54 '

The process is cloning the repo, but .git is never written?
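
One way a clone can end like this is if something else removes the target directory while git is still writing to it; a minimal sketch of that race, using a throwaway path:

# Simulate two builds fighting over one directory: start a clone, then wipe it mid-flight.
git clone https://gerrit.wikimedia.org/r/mediawiki/core /tmp/shared-src &
sleep 5
rm -rf /tmp/shared-src    # a concurrent "build" cleaning the same workspace
wait                      # the clone fails once its working files disappear, much like the log above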


It is a bit late here (1am); a few ideas:

  • some other process or another job running on the machine happens to be deleting files from the disk
  • the Docker version we use is broken?
  • or maybe the underlying drive / Ceph has issues?

If it happens solely on a specific CI agent, it can be put offline to prevent builds from executing.
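
For the record, an agent can be depooled without clicking through the UI; a sketch using the standard Jenkins REST endpoint, where the agent name and credentials are placeholders:

# Toggle an agent offline so no further builds are scheduled on it.
curl -X POST -u "$USER:$API_TOKEN" \
  'https://integration.wikimedia.org/ci/computer/<agent-name>/toggleOffline?offlineMessage=T300214'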

It is neither Ceph nor Docker. Somehow multiple agents in Jenkins are connected to the same instance integration-agent-docker-1033:

hashar@integration-agent-docker-1033:~$ ps -u jenkins-deploy f
    PID TTY      STAT   TIME COMMAND
 588996 ?        S      0:00 sshd: jenkins-deploy@notty
 588999 ?        Ssl    0:00  \_ docker run --rm -i --volume /srv/jenkins/workspace/workspace/mediawiki-quibble-vendor-sqlite-php72-docker/cache:/cache --entrypoint=/usr/bin/rsync docker-reg
  37832 ?        S      0:43 sshd: jenkins-deploy@notty
  37873 ?        Ssl    3:26  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 589048 ?        Sl     0:00      \_ docker run --entrypoint=/usr/bin/find --user=nobody --volume /srv/jenkins/workspace/workspace/mwext-node12-rundoc-docker:/workspace --security-opt seccom
 589050 ?        Z      0:00          \_ [bash] <defunct>
  37848 ?        S      1:04 sshd: jenkins-deploy@notty
  37884 ?        Ssl    4:10  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 585446 ?        Sl     0:00      \_ docker run --entrypoint=/usr/bin/find --user=root --volume /srv/jenkins/workspace/workspace/wmf-quibble-selenium-php72-docker:/workspace --security-opt s
 585448 ?        Z      0:00          \_ [bash] <defunct>
  37852 ?        S      0:45 sshd: jenkins-deploy@notty
  37874 ?        Ssl    3:38  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 589061 ?        Sl     0:00      \_ docker run --entrypoint=/usr/bin/find --user=nobody --volume /srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker:/workspace --sec
 589063 ?        Z      0:00          \_ [bash] <defunct>
  37838 ?        S      0:35 sshd: jenkins-deploy@notty
  37872 ?        Ssl    3:13  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 586755 ?        Sl     0:00      \_ docker run --tmpfs /workspace/db:size=320M --volume /srv/jenkins/workspace/workspace/wmf-quibble-core-vendor-mysql-php72-docker/src:/workspace/src --volu
 586761 ?        Z      0:00          \_ [bash] <defunct>
   2002 ?        S      1:13 sshd: jenkins-deploy@notty
   2006 ?        Ssl    5:05  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
   1965 ?        Ss     0:40 /lib/systemd/systemd --user

The primary would trigger multiple builds of the same job (ex: wmf-quibble-core-vendor-mysql-php72-docker) on all those agents, but each of them ends up sharing the same directory on the machine: /srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker. That is definitely messy and makes all builds unreliable.
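
A quick way to spot this condition on any instance is to count the Jenkins remoting processes, since a healthy agent should run exactly one (assuming the launch command matches the ps output above):

# A count above 1 means several Jenkins agents are attached to this same instance.
pgrep -c -f 'remoting.jar'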

The agent configurations were pointing to their WMCS instances' IPs. https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1027/jobConfigHistory/showDiffFiles?timestamp1=2022-01-26_18-53-32&timestamp2=2022-01-26_18-53-41

That config diff shows that the agent integration-agent-docker-1027 has been created with the IP 172.16.6.180, which actually belongs to agent 1026.

I created the instance via https://integration.wikimedia.org/ci/computer/new, which offers to copy the configuration from another node. I guess that copies over the IP from the source node to the new node (along with all the other parameters, which is handy).
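
One could audit for this mistake by pulling each agent's config.xml and comparing the configured hosts; a rough sketch, assuming the SSH launcher stores the host in a <host> element (agent names and credentials are placeholders):

# Print each agent's configured host; duplicated hosts reveal colliding agents.
for agent in integration-agent-docker-1026 integration-agent-docker-1027; do
  printf '%s %s\n' "$agent" "$(curl -s -u "$USER:$API_TOKEN" \
    "https://integration.wikimedia.org/ci/computer/$agent/config.xml" \
    | grep -o '<host>[^<]*</host>')"
done | sort -k2 | uniq -f 1 -D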

My fringe 1:30am theory is that Jenkins immediately establishes a connection before the final parameters are saved, such as changing the IP to the proper one. That would be a Jenkins bug, and a big note should be added to https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup

hashar lowered the priority of this task from High to Medium. Jan 27 2022, 12:37 AM

Solved by teamwork over IRC in #wikimedia-releng, with a bunch of nice suggestions and leads, and by ruling out other bits of the infra.

hashar claimed this task.

Mentioned in SAL (#wikimedia-releng) [2022-01-27T16:00:28Z] <hashar> Pooling back agents 1035 1036 1037 1038 , they could not connect due to ssh host mismatch since yesterday they all got attached to instance 1033 and accepted that host key # T300214

The summary of the issue is that we had several Jenkins agents connected to the same instance, so jobs ended up sharing the same workspace directory; each workspace was used by two different builds at once, which caused the issues encountered.

When I created the agents I used the Copy From Existing Agent option; my intuition was that maybe Jenkins would start connecting the new agent in the background, albeit using the copied agent's IP. I was not able to reproduce that this morning: the new agent does not connect until it is saved.

What I think happened is that I created a fleet of agents, forgot to adjust their IP addresses, and saved them, so the four or five new agents ended up piling onto each other. Or something along those lines.

A follow-up task is to use the instance FQDN rather than its IP address in the agent configuration. That is T300224
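
Before switching, one can verify the FQDN resolves to the instance's current address; for example (the exact FQDN scheme below is illustrative):

# Resolve the instance name; the result should match the IP currently in the agent config.
getent hosts integration-agent-docker-1027.integration.eqiad1.wikimedia.cloud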

Given I cannot reproduce it, and everything is properly configured and connected now, I am closing this.