Page MenuHomePhabricator

Castor: mediawiki-core-qunit-jessie node_modules cache ineffective
Closed, ResolvedPublic

Description

"Castor" caches the node_modules directory after each gate-and-submit pipeline job. And it loads it at the beginning of all pipeline's jobs.

However, the job always runs:

00:00:33.100 + rm -rf node_modules
00:00:33.101 + npm install

(From https://github.com/wikimedia/integration-config/blob/561a760837/jjb/macro.yaml#L51 and https://github.com/wikimedia/integration-config/blob/561a760837/jjb/macro.yaml#L151)

As such, the cache is ignored. The original reason for this probably doesn't apply anymore (it's mostly because the directory was retained between runs because the workspace wasn't cleaned up, afaik we do this everywhere now and/or run on Nodepool where each instances is destroyed afterward anyway).

However, even if we make sure that 4e9a7314d2b5c21899f2c329f7e5748c8b55abef is no longer a problem (node_modules preserved between npm-setup-oid jobs that point node_modules to a deploy.git-style repo), there may be other issues we need to think about. For example:

  1. We don't clear the cache periodically (it'd grow indefinitely as unused dependencies will linger).
    • May cause unexpected side-effects in case of npm modules that use autodiscovery to change their behaviour - it would find a module that is no longer there on fresh install.
    • May cause code to pass even if a package is wrongly removed from package.json.
  2. We don't clear the cache after upgrading npm in CI. In theory the cache consumer in npm is meant to be backward-compatible, but I don't know for how many releases back that support goes or how well tested this is by upstream.
  3. Zuul will abort Jenkins jobs when their result is no longer relevant (e.g. patch-set is updated, or dependant commit failed). This may abort an npm-install process prematurely without it being able to leave behind a clean state.
    • Shouldn't be an issue since we only run castor-save after a completed and successful build.

Next steps:

  • Verify 4e9a7314d2b5c is no longer a problem.
  • Draft a change that changes rm -rf node_modules; npm install to npm prune && npm install (should address point 1).

Event Timeline

Krinkle added a subscriber: hashar.

On IRC, @hashar reminded me that we aren't using castor to cache the $WORKSPACE/node_modules directory. Instead, we use castor to save the NPM cache directory at $HOME/.npm. This means none of the potential corruption risks apply. It also means that contrary to my assumption, we are in fact benefitting from the cache.

At runtime, npm just validates the cache and copies the appropiate packages and versions to the build directory. Which works fine and is very safe.

hashar triaged this task as Medium priority.

That came up when looking at oojs/ui npm job ( T155483 ). It takes roughly 12 minutes, 5:30 minutes being the npm install step. The slowness reason is that although npm benefit from ~/.npm/ cache, it still has to compile native modules every single time.

For ruby / bundler, we cache the vendor directory which is the equivalent of node_modules. Ie ruby jobs have the compiled modules and thus the install step ends up lightning fast when dependencies have not changed.

A note: npm/node-gyp do not support caching the compiled module:

Interesting one is node-pre-gyp which list a few more.

So most probably we could namespace the cache by using the package version and node ABI version among others. Example:

$ node -p 'process.versions.modules;'
51
$

I am willing to try caching node_modules, as a trial on the few repositories that have long install time (eg: oojs/ui). Ideally with a way for developers to clear the cache on demand.

I am willing to try caching node_modules, as a trial on the few repositories that have long install time (eg: oojs/ui). Ideally with a way for developers to clear the cache on demand.

Maybe whenever a recheck is issued from gerrit we should rm -rf node_modules and get rid of our cache? Is that sensible and possible?

From the description:

• Draft a change that changes rm -rf node_modules; npm install to npm prune && npm install (should address point 1).

Would it be better to run npm prune && npm update, that would install missing dependencies and update the existing packages to the extent the semver range is satisfied.

From the description:

• Draft a change that changes rm -rf node_modules; npm install to npm prune && npm install (should address point 1).

Would it be better to run npm prune && npm update, that would install missing dependencies and update the existing packages to the extent the semver range is satisfied.

Yes, we should. When caching at the server level, this isn't needed as NPM will do that no matter what. But when (effectively) preserving part of the workspace, npm install will do minimal effort since developers often run this every time before npm test or even whenever changing one of the source files on disk when working. It shouldn't pull in the latest expansion of sem ver every time (which requires a brief network roundtrip, too). However in CI we should definitely switch from npm install to npm update if we cache node_modules. Otherwise we might consider a commit as passing when in fact a fresh install would fail.

@hashar I am curious to know if such a change could be implemented on per-repository basis.

On my usual working directory of OOUI, on a MacBook Air with a 4Mbps connection:

CommandTime
rm -rf node_modules20.529
npm install2:43.07
npm update12.305

The first time I ran npm update yesterday it took quite some time. I forgot to time it but it was surely upwards of a minute. I suppose if you run it regularly, as we will, the amount of time it takes will reduce. npm prune took only a second because I had just run all these commands.

hashar added a subscriber: Jdrewniak.

I found a good candidate: wikimedia/portals @Jdrewniak filled T152386 to get node_modules cached. So I guess we can try to switch it to npm prune && npm update.

Change 346152 had a related patch set uploaded (by Hashar):
[integration/config@master] Cache node_modules

https://gerrit.wikimedia.org/r/346152

Change 395610 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] (WIP) Cache node_modules between runs

https://gerrit.wikimedia.org/r/395610

Change 346152 abandoned by Hashar:
castor: cache binaries on a per job basis

Reason:
Castor was badly conceived requiring to mention each of the paths we want to save.

On Docker containers, the caches are all pointing at /cache thanks to XDG_CACHE_CONFIG or at least can be easily made to point there.

So the idea is to:

npm install
cp node_modules /cache/node_modules
npm test

Then if tests pass, castor would save the node_modules.

On the next build, the node_modules is restored and "npm prune && npm install" bring it up to date with package.json changes.

A basic implementation is https://gerrit.wikimedia.org/r/#/c/395610/

https://gerrit.wikimedia.org/r/346152

I am migrating the npm jobs to Docker containers. https://gerrit.wikimedia.org/r/#/c/395610/ would potentially save and restore node_modules and bring it up to date via a 'npm prune && npm install'.

Krinkle claimed this task.

Castor strategy for npm was changed from caching node_modules in workspace, to instead caching the .npm cache directory (normally in HOME).

This working as expected. Aside from being more stable, it is also good for another reason, which is that soon we'll use npm ci in more repos (based on the lock file) which always clears node_modules as part of the install process, so that wouldn't have worked anyway.

Change 395610 abandoned by Hashar:
(WIP) Cache node_modules between runs

Reason:
I have abandoned the idea indeed and the more recent npm is way faster at least.

https://gerrit.wikimedia.org/r/395610