Page MenuHomePhabricator

Zuul-cloner should use git-clean to reset workspace
Closed, ResolvedPublic

Description

I noticed that the mediawiki-core-npm job (uses zuul-clone, runs in labs) does not clear the working space. The working directory is preserved by Jenkins (including the directories left behind by zuul-cloner), and zuul-cloner itself also does nothing to reset the working space.

This is causing a wide variety of issues:

  • node_modules persists between jobs. This is causing old versions to stay in memory even after a newer release was made upstream. It is also causing jobs that modify node_modules to affect future jobs. And no longer used modules (e.g. grunt-jscs-checker was renamed to grunt-jscs) to persist indefinitely in node_modules on the slaves.
  • Random files in random directories stay behind (e.g. files that used to be in version control in a past version but have been deleted from the repository).
  • Pattern detection is matching unexpected and invalid files (e.g. resources/src/**/*.js is now matching files that may be invalid or aren't supposed to be there anymore).
  • Files from future versions existing in tests against older branches.

The -npm job is declared with git-scm, (jenkins) wipe-workspace: false, (git) clean: true.

This is a regression.

Related Objects

Event Timeline

Krinkle assigned this task to hashar.
Krinkle raised the priority of this task from to Unbreak Now!.
Krinkle updated the task description. (Show Details)
Krinkle changed Security from none to None.
Krinkle added subscribers: Krinkle, hashar.

When preparing a repository, Zuul cloner does a checkout and force the working tree to match (discarding uncommitted changes) http://gitpython.readthedocs.org/en/stable/reference.html#git.refs.head.HEAD.reset

The reason is that for the mediawiki integration job, we would get core + vendor + extensions. On a first run, we clone all of them, on a second run, the workspace already contains the cloned repo so we just check them out. Using git clean for mediawiki/core would discard the previously cloned extensions (under /extensions/ ) which would force it to reclone all of them.

Can we just use npm prune to ensure we have an appropriate /node_module?

node_modules is just one of the many things that may stay behind. As I mentioned above, this also causes other untracked files to stay behind. Especially when switching between branches. Sooner or later causes test failures like we had last week (requiring one to manually purge workspaces on each slave). Or worse (and just as likely) it may prevent tests from failing that should be failing. Masking errors.

It also caused tests to fail due to js/css files from older branches to show up in newer branches that either used to be in an ignore list. Or don't pass newer rules (e.g. foo/bar.js exists in 1.22, deleted in 1.23, in 1.24 we added a stricter coding style rule, foo/bar.js naturally doesn't validate obviously shouldn't show up in a 1.24 build).

Instead of making an exception for the regular case (by hacking around npm, which is just one of many things that can and will go wrong), maybe try to preserve or optimise those extension clones somehow?

It seems wasteful anyway to have so many clones all over the place. On each Jenkins slave we have between 1 and 3 (concurrency) workspaces for each job, in each of them a clone for the main repository and its upstream/downstream dependencies. Which means an average slave will have 100s of clones of mediawiki-core and about a dozen clones of each extension.

Maybe we can do something like we do on prod slaves: Clone from local drive using hard links (instead of from Gerrit each time). Except without Gerrit replication of course (unless that's easy to do to labs). We would manually add any required repos to a local git cache and updating that first, then do a fast clone from there before fetching from Zuul?

Either way, I won't defend the need for clean workspaces. That should be obvious. It is already a massive sacrifice to preserve .git between builds (and use git-clean instead of Jenkins' rm -rf solution). Not even git-cleaning is an unacceptable regression. This was not an issue before Zuul-cloner.

Other work arounds I can imagine: Inside Zuul-cloner, run git clean -dffx in each extension, and then run it dry in core, filter out extensions from the list and manually rm -rf those.

Or: Don't nest clones. Instead clone them next to each other and put symlinks in place.

As for performance, yes. We can't re-clone all of MediaWiki core each extension build. But how did we do this before Zuul-cloner? And how are we going to do this in disposable sandboxes? We need a different solution.

I would use npm prune as a workaround for the npm job. I lack spare cycles to investigate the impact of having zuul-cloner to run git clean -xqdf (or -ff).

@hashar That is not an option.

Both composer and npm have proven in the past that they do not support long-lived local directories. Things like trying to cover every possible edge case with different package versions over time is too much work. It is very likely that upgrading npm or composer (while keeping vendor and node_modules from previous builds) will result in left-behind files, cache conflict, or incompatible directory trees.

The battle field left behind after such failure is a pain to debug. For developers locally or production the solution is easy: "Just rm -rf it and re-run 'foo install'". And as such that is also the common response on bug reports about these. It's nearly impossible maintain package managers like these with infinite look-ahead and look-behind support, arbitrarily skipping versions etc. These kind of environments should use disposable sandboxes or at least properly clear the workspace.

Also as pointed out, this is not limited to composer or npm. Git itself is also terrible at it. And builds themselves also produc additional files.

In a perfect world these files are always cleaned up in teardown. But that's not the case in reality. We're dealing with draft commits and continuously iterated alpha software. The CI environment can not demand software to be bug-free as this very environment is the testing ground.

I don't want to waste time and time again debugging problems caused by bugs in upstream software or minor issues in unit tests, or projects switching to using a submodule having to notify us, or me clearing a flooded workspace or tmp directory. Those are valid bugs and they should be addressed upstream/downstream, but they should not result in a Jenkins slave becoming unstable. That is not necessary and easily prevented.

Zuul-cloner is too buggy at the moment. It should never have been merged upstream or deployed here whilst lacking essential git and CI aspects such as loading submodules and clearing the workspace.

I acknowledged that Zuul cloner not cleaning the workspace can be problematic. I am not denying it, I simply lack cycles to add support for clearing the repositories and investigate the side effects that comes with it.

The example issue in this task is with node_module congaing extra packages from a previous run, which is exactly what npm prune is meant for and imho an acceptable workaround until zuul-cloner is taught to git clean.

gerritbot added a subscriber: gerritbot.

Change 184838 had a related patch set uploaded (by Krinkle):
Ensure the repository configuration lock is released

https://gerrit.wikimedia.org/r/184838

Patch-For-Review

Change 185686 had a related patch set uploaded (by Ori.livneh):
Reset & clean workspace repo for better hygiene

https://gerrit.wikimedia.org/r/185686

Patch-For-Review

Krinkle added a subscriber: ori.

Change 188291 had a related patch set uploaded (by Krinkle):
npm-set-env.sh: Purge node_modules

https://gerrit.wikimedia.org/r/188291

Patch-For-Review

Change 188291 merged by jenkins-bot:
npm-set-env.sh: Purge node_modules

https://gerrit.wikimedia.org/r/188291

Change 185686 abandoned by Hashar:
Reset & clean workspace repo for better hygiene

Reason:
See upstream change https://review.openstack.org/148121

https://gerrit.wikimedia.org/r/185686

greg lowered the priority of this task from Unbreak Now! to High.Mar 27 2015, 4:57 PM
greg added a subscriber: greg.

This is unbreak now! but no movement in a long time.... what should happen next here?

Probably backport https://review.openstack.org/148121, or upgrade our Zuul to a newer upstream.

Krinkle lowered the priority of this task from High to Lowest.Apr 7 2015, 2:07 PM
Krinkle raised the priority of this task from Lowest to High.
Krinkle moved this task from Untriaged to Backlog on the Continuous-Integration-Infrastructure board.
Krinkle changed the task status from Open to Stalled.Apr 16 2015, 3:11 AM
Krinkle removed a project: Patch-For-Review.
Krinkle removed a subscriber: gerritbot.
Krinkle renamed this task from Zuul-cloner forgets to clear workspace to Zuul-cloner should use git-clean to reset workspace.Apr 20 2015, 11:02 PM
ori removed ori as the assignee of this task.Jun 19 2015, 5:32 PM
hashar claimed this task.

I think I got Ori patch ( https://review.openstack.org/#/c/148121/ Reset & clean workspace repo for better hygiene
) cherry picked ages ago.

In below git branch --contains:

The patch has been merged upstream (origin/master)
It is included in our Precise and Trusty Debian packages (wikimedia/debian/{precise,trusty}-wikimedia)

$ git branch -r --contains 1403a9ee056eb4583566dbb58191d1d2556c18b3
  origin/HEAD -> origin/master
  origin/master
  wikimedia/debian/precise-wikimedia
  wikimedia/debian/trusty-wikimedia
  wikimedia/labs-tox-deployment
  wikimedia/patch-queue/debian/precise-wikimedia
  wikimedia/upstream