'No such file or directory' CI failures in multiple repos
Closed, ResolvedPublicBUG REPORT
Actions

Assigned To

Authored By

	Zabe
	Jan 26 2022, 11:34 PM

Description

These kind of errors show up all over CI. Not sure what caused these.

`FileNotFoundError: [Errno 2] No such file or directory: './node_modules/.bin/grunt': './node_modules/.bin/grunt'` on https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71031/consoleFull but no obvious `npm install

[...]
/34f1dfea39f06111c930967d7ecece596c9ff4bc.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/1eb0d80069cf91af36d06672de4f94197cb8fbb6.zip" (from composer/files/symfony/console/.~tmp~/1eb0d80069cf91af36d06672de4f94197cb8fbb6.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/3fa0ff2ed97fbda7025533c40b4641063f5de0b3.zip" (from composer/files/symfony/console/.~tmp~/3fa0ff2ed97fbda7025533c40b4641063f5de0b3.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/5aad8e7bb181551d19e204c4e7c5bca91ab6d892.zip" (from composer/files/symfony/console/.~tmp~/5aad8e7bb181551d19e204c4e7c5bca91ab6d892.zip): No such file or directory (2)
22:50:48 rsync: rename failed for "/cache/composer/files/symfony/console/6add331cbe4f3a99ee595b224fc2ec32d1594946.zip" (from composer/files/symfony/console/.~tmp~/6add331cbe4f3a99ee595b224fc2ec32d1594946.zip): No such file or directory (2)
22:50:48 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1668) [generator=3.1.2]
22:50:48 
22:50:48 Done
22:50:48 [mediawiki-core-php72-phan-docker] $ /bin/bash -xe /tmp/jenkins8753552369144288140.sh
22:50:48 FATAL: command execution failed
22:50:48 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to integration-agent-docker-1038
22:50:48 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1797)
22:50:48 		at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
22:50:48 		at hudson.remoting.Channel.call(Channel.java:1001)
22:50:48 		at hudson.Launcher$RemoteLauncher.launch(Launcher.java:1122)
22:50:48 		at hudson.Launcher$ProcStarter.start(Launcher.java:507)
22:50:48 		at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:143)
22:50:48 		at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:91)
22:50:48 		at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
[...]

https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71027/console
https://integration.wikimedia.org/ci/job/mediawiki-core-php72-phan-docker/61638/console

Related Objects
Search...

Status	Subtype	Assigned	Task
Resolved		hashar	T292729 TAR_ENTRY_ERROR ENOSPC: no space left on device
Resolved		hashar	T252071 Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye
Resolved	BUG REPORT	hashar	T300214 'No such file or directory' CI failures in multiple repos

Event Timeline

Zabe created this task.Jan 26 2022, 11:34 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 26 2022, 11:34 PM

Zabe renamed this task from 'No such file or directory' CI failures in multplie repos to 'No such file or directory' CI failures in multiple repos.Jan 26 2022, 11:34 PM

In T252071#7654453, @Stashbot wrote:

Mentioned in SAL (#wikimedia-releng) [2022-01-26T20:29:12Z] <hashar> Completed migration of integration-agent-docker-XXXX instances from Stretch to Bullseye - T252071

@hashar potentially fallout from T252071?

• brennen added projects: User-brennen, Release-Engineering-Team.Jan 26 2022, 11:38 PM

• brennen moved this task from Backlog to Radar on the User-brennen board.

Zabe triaged this task as High priority.Jan 26 2022, 11:46 PM

Looking at https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71027/consoleFull

23:15:39 ............................................................. 3416 / 4775 ( 71%)
23:15:39 ............................................................. 3477 / 4775 ( 72%)
23:15:40 ............................................................. 3538 / 4775 ( 74%)
23:15:53 ............................[3fce00e6b3fa0eac7d55b4ed] [no req]   MWException: MessagesEn.php is missing.
23:15:53 Backtrace:
23:15:53 from /workspace/src/includes/cache/localisation/LocalisationCache.php(498)
23:15:53 #0 /workspace/src/includes/cache/localisation/LocalisationCache.php(370): LocalisationCache->initLanguage(string)

The file has vanished? /workspace/src from inside the container is a volume mount from the host:

docker run --volume /srv/jenkins/workspace/workspace/wmf-quibble-core-vendor-mysql-php72-docker/src:/workspace/src

https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71032/console

23:25:12 INFO:zuul.Cloner:Updating origin remote in repo mediawiki/core to https://gerrit.wikimedia.org/r/mediawiki/core
23:25:15 INFO:zuul.Cloner:upstream repo has branch master
23:25:54 INFO:zuul.Cloner:Falling back to branch master
23:25:54 ERROR:zuul.Repo:Unable to initialize repo for /workspace/src
23:25:54 Traceback (most recent call last):
23:25:54   File "/usr/local/lib/python3.7/dist-packages/zuul/merger/merger.py", line 90, in createRepoObject
23:25:54     self._ensure_cloned()
23:25:54   File "/usr/local/lib/python3.7/dist-packages/zuul/merger/merger.py", line 63, in _ensure_cloned
23:25:54     git.Repo.clone_from(self.remote_url, self.local_path)
23:25:54   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 988, in clone_from
23:25:54     return cls._clone(git, url, to_path, GitCmdObjectDB, progress, **kwargs)
23:25:54   File "/usr/lib/python3/dist-packages/git/repo/base.py", line 939, in _clone
23:25:54     finalize_process(proc, stderr=stderr)
23:25:54   File "/usr/lib/python3/dist-packages/git/util.py", line 333, in finalize_process
23:25:54     proc.wait(**kwargs)
23:25:54   File "/usr/lib/python3/dist-packages/git/cmd.py", line 415, in wait
23:25:54     raise GitCommandError(self.args, status, errstr)
23:25:54 git.exc.GitCommandError: Cmd('git') failed due to: exit code(1)
23:25:54   cmdline: git clone -v https://gerrit.wikimedia.org/r/mediawiki/core /workspace/src
23:25:54   stderr: 'Cloning into '/workspace/src'...
23:25:54 /workspace/src/.git: No such file or directory
23:25:54 '

The process is cloning the repo but .git is never written ?

It is a bit late here (1am), the few ideas:

some other process or other job running on the machine happens to be deleting files from the disc
the Docker version we use is broken?
or maybe the underlying drive / Ceph has issue ?

If it happens solely on a specific CI agent, it can be put offline to prevent builds from executing.

It is neither Ceph nor Docker. Somehow multiple agents in Jenkins are connected to the same instance integration-agent-docker-1033:

hashar@integration-agent-docker-1033:~$ ps -u jenkins-deploy f
    PID TTY      STAT   TIME COMMAND
 588996 ?        S      0:00 sshd: jenkins-deploy@notty
 588999 ?        Ssl    0:00  \_ docker run --rm -i --volume /srv/jenkins/workspace/workspace/mediawiki-quibble-vendor-sqlite-php72-docker/cache:/cache --entrypoint=/usr/bin/rsync docker-reg
  37832 ?        S      0:43 sshd: jenkins-deploy@notty
  37873 ?        Ssl    3:26  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 589048 ?        Sl     0:00      \_ docker run --entrypoint=/usr/bin/find --user=nobody --volume /srv/jenkins/workspace/workspace/mwext-node12-rundoc-docker:/workspace --security-opt seccom
 589050 ?        Z      0:00          \_ [bash] <defunct>
  37848 ?        S      1:04 sshd: jenkins-deploy@notty
  37884 ?        Ssl    4:10  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 585446 ?        Sl     0:00      \_ docker run --entrypoint=/usr/bin/find --user=root --volume /srv/jenkins/workspace/workspace/wmf-quibble-selenium-php72-docker:/workspace --security-opt s
 585448 ?        Z      0:00          \_ [bash] <defunct>
  37852 ?        S      0:45 sshd: jenkins-deploy@notty
  37874 ?        Ssl    3:38  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 589061 ?        Sl     0:00      \_ docker run --entrypoint=/usr/bin/find --user=nobody --volume /srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker:/workspace --sec
 589063 ?        Z      0:00          \_ [bash] <defunct>
  37838 ?        S      0:35 sshd: jenkins-deploy@notty
  37872 ?        Ssl    3:13  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
 586755 ?        Sl     0:00      \_ docker run --tmpfs /workspace/db:size=320M --volume /srv/jenkins/workspace/workspace/wmf-quibble-core-vendor-mysql-php72-docker/src:/workspace/src --volu
 586761 ?        Z      0:00          \_ [bash] <defunct>
   2002 ?        S      1:13 sshd: jenkins-deploy@notty
   2006 ?        Ssl    5:05  \_ /usr/bin/java -jar remoting.jar -workDir /srv/jenkins/workspace -jar-cache /srv/jenkins/workspace/remoting/jarCache
   1965 ?        Ss     0:40 /lib/systemd/systemd --user

The primary would trigger multiple builds of the same job (ex: wmf-quibble-core-vendor-mysql-php72-docker) on all those agents but each of them end up sharing the same directory on the machine: /srv/jenkins/workspace/workspace/quibble-vendor-mysql-php72-selenium-docker. Which is definitely messy, makes all builds unreliable.

The agent configuration were pointing to their WMCS instances IP. https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1027/jobConfigHistory/showDiffFiles?timestamp1=2022-01-26_18-53-32&timestamp2=2022-01-26_18-53-41

That config diff shows that the agent integration-agent-docker-1027 has been created with the IP 172.16.6.180 which is actually the agent 1026

I have created the instance by https://integration.wikimedia.org/ci/computer/new which offers to copy the configuration from another node. I guess that borrows the IP from the source node to the new node (as well as all other parameters which is handy).

My fringe 1am30 theory is that Jenkins immediately establish a connection before the final parameters are saved such as ... changing the IP to the proper one. That would be a bug for Jenkins and a big note added to https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup

Jdforrester-WMF mentioned this in T300224: Switch CI agent config from IPs to FQDN now it's possible.Jan 27 2022, 12:37 AM

Solved by team work over IRC #wikimedia-releng with bunch of nice suggestions, leads and ruling out other bits of the infra.

• bd808 added a parent task: T252071: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye.Jan 27 2022, 12:49 AM

• bd808 changed the subtype of this task from "Task" to "Bug Report".

Stashbot added a comment.Jan 27 2022, 9:05 AM

This comment was removed by hashar.

hashar closed this task as Resolved.Jan 27 2022, 9:07 AM

hashar claimed this task.

hashar reopened this task as Open.Jan 27 2022, 9:16 AM

Mentioned in SAL (#wikimedia-releng) [2022-01-27T16:00:28Z] <hashar> Pooling back agents 1035 1036 1037 1038 , they could not connect due to ssh host mismatch since yesterday they all got attached to instance 1033 and accepted that host key # T300214

The summary of the issue is that we had several Jenkins agents connected to the same instance and job ended up sharing the same workspace directory thus the workspaces were used by two different builds which caused the issues encountered.

When I have created the agents I have used the Copy From Existing Agent, my intuition is that maybe Jenkins would start connecting the new agent in the background albeit using the copied agent IP. I was not able to reproduce it this morning: the new agent does not connect until it is saved.

What I think happened is that I have created a fleet of agents and forgot to adjust their IP addresses, saved them and the four or five new agents ended up piling on each others. Or something along that way.

A follow up task is to use the instance FQDN rather than it is IP address in the agent configuration. That is T300224

Given I can not reproduce and everything is properly configured and connected now, I am closing this.

'No such file or directory' CI failures in multiple reposClosed, ResolvedPublicBUG REPORTActions

Description

Related ObjectsSearch...

Event Timeline

'No such file or directory' CI failures in multiple repos
Closed, ResolvedPublicBUG REPORT
Actions

Related Objects
Search...