Page MenuHomePhabricator

Fresh lacks a persistent npm cache across multiple sessions
Open, Needs TriagePublic

Description

While inside a Fresh container, I could go doing several npm ci commands which wipe the node_modules and redownload dependencies. That could benefit a cache to be setup inside the container to avoid redownloading large dependencies.

I noticed that on performance/fresnel which depends on @mdn/browser-compat-data which is apparently 12 MBytes.

Details

Related Changes in Gerrit:

Event Timeline

It inherits XDG_CACHE_HOME from the CI images and npm does use /cache as an incontainer cache. Of course the cache vanishes when the container is terminated, so maybe the cache could be stored inside the current working directory (eg .fresh-cache ).

That could be a shared cache at XDG_CACHE_HOME/fresh / $HOME/.cache/fresh, then I guess it could be poisoned? Else it can be namespaced by the current working directory.

Based on the title and task description, I believe you're asking for an in-container cache. However, as far as I know, we allow npm to cache within a given session already, such that repeat commands etc do not need to re-download things.

I generally encourage long-running sessions, treating Fresh similar to how you might treat ssh to vagrant or ssh to production. This way makes it easy to iterate on a patch and periodically re-running a lint or unit test etc. It also avoids accidentally running commands outside the container as you won't mix the same Terminal tab between Fresh and non-Fresh. E.g. when working on repo X, I'll have two tabs one for Fresh and one for git/misc stuff.

When working on a project, I keep that open for quite a while, sometimes multiple days. It's pretty light/cheap in practice. I use npm install locally so that the current directory functions as a persistent cache between sessions as well. In current versions of npm, npm-install behaves very close to npm-ci with the exception that npm-ci distrusts the current working directory and re-populates it from central cache or network. I'm not sure that this is needed for local development, especially if we are considering storing the cache in that same working directory.

That could be a shared cache at XDG_CACHE_HOME/fresh / $HOME/.cache/fresh, then I guess it could be poisoned? Else it can be namespaced by the current working directory.

Yeah, I've intentionally not shared it in the past as that makes it no longer an isolated environment where I can trust it to be able to run on a untrusted code and not affect other projects I work on or contribute to. Namespacing the cache would bring up issues of gargage collection, and corruption/concurrent writes. I've personally did not that that worth the trade-off to invest in. (And I'd rather not introduce bugs or disk space regressions by releasing a version without those mitigations in place). So the speed trade-off has felt to me like the worthwhile choice, especially when combined with long-running sessions and using npm-install, you basically get a persistent per-project cache for free already.

There's some best practices / documentation issue in here for sure though. It might be worth polling how people use Fresh. E.g. do they keep long-running sesions like me? If not, why not? Would they benefit from it? Are there reasons not to? Do they use npm-ci? Why/Why not? Would they benefit from using npm-install instead?

That could be a shared cache at XDG_CACHE_HOME/fresh / $HOME/.cache/fresh, then I guess it could be poisoned?

+1 from me for this idea – see T383325#11388856.

It might be worth polling how people use Fresh. E.g. do they keep long-running sesions like me? If not, why not? Would they benefit from it? Are there reasons not to? Do they use npm-ci? Why/Why not? Would they benefit from using npm-install instead?

I have a setup with very short-running sessions (one Fresh per command), because I find that easier to work with – it’s pretty rare for me to use an interactive Fresh terminal, because there I’ll be missing my usual shell prompt and aliases, and also I’ll have to switch tmux windows all the time if I want to do “write” git operations (like git add -p or git commit).

Just a drive-by idea: an optional, durable cache could be added as a volume rather than a bind mount to local disk. Containers could be named with a pattern (fresh_npm_cache...) that made them relatively easy to discover with {docker,podman} volume ls. When the option has been used fresh could emit a notice on termination reminding the user that there is cache just hanging out on disk now. With some thought about the cli api it seems like it would also be possible to keep multiple caches (if someone can explain a use case for that) and pick the particular one you want to create/reuse at startup.

Change #1214567 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[fresh@master] Bind-mount $XDG_CACHE_HOME/fresh-node as /cache

https://gerrit.wikimedia.org/r/1214567

Just a drive-by idea: an optional, durable cache could be added as a volume rather than a bind mount to local disk. Containers could be named with a pattern (fresh_npm_cache...) that made them relatively easy to discover with {docker,podman} volume ls.

Sounds reasonable to me, but I don’t know how volumes work, so I just uploaded a change to bind-mount a host directory :) it seems to work in my limited local testing (though I haven’t installed it as my “real” Fresh yet).

When the option has been used fresh could emit a notice on termination reminding the user that there is cache just hanging out on disk now. With some thought about the cli api it seems like it would also be possible to keep multiple caches (if someone can explain a use case for that) and pick the particular one you want to create/reuse at startup.

FWIW, what I had in mind was no option / CLI API at all. I think this should be enabled by default, so that developers in a Cypress-using extension can run fresh-node -- npm install and then fresh-node -- npm run cypress:run (or similar) and it will Just Work (Cypress having downloaded itself into the shared cache directory in a postinstall script if necessary). Though of course that doesn’t rule out options to change the cache “name” or to disable the cache.

Krinkle renamed this task from Fresh lacks an in container npm cache to Fresh lacks a persistent npm cache across multiple sessions.Wed, Jan 21, 12:23 AM

[…]
FWIW, what I had in mind was no option / CLI API at all. I think this should be enabled by default, so that developers in a Cypress-using extension can run fresh-node -- npm install and then fresh-node -- npm run cypress:run (or similar) and it will Just Work (Cypress having downloaded itself into the shared cache directory in a postinstall script if necessary). Though of course that doesn’t rule out options to change the cache “name” or to disable the cache.

I default workflow that I recommend, which is reflected in docs for MediaWiki-Docker (docs) and Fresh (docs), is to start a session and then work in that session.

There are a few reasons for this:

  • Minimal learning curve for these tools to things specific to it, not to Avoid fragmented learning by not wrapping or forking an entire ecosystem. There should never be a question of how to do X in Fresh or X in MediaWiki-Docker, other than the installation and configuration of those tools themselves. You simply look for how you do things with MediaWiki, Node.js, npm, etc in general. In the past, I've seen significant loss of productivity and confusion among beginners due to MediaWiki-Vagrant, MediaWiki-Docker-Dev, and MediaWiki-Docker-Quickstart by requiring everything everywhere to be customised to it (How do install this extension? How do I change this setting? How do I run PHPUnit? How do I generate PHPUnit code coverage? How do I run phpcs? How do I run ESLint/QUnit/Wdio?) Those use cases have existing manuals already, and those should just work. An experienced and patient person may mentally try to map phpunit includes/ResourceLoader to the aburd word soup that is docker compose exec mediawiki bash composer phpunit -- includes/ResourceLoader, but this is imho not a reasonable thing for anyone to try to contextualise and remember. It is an unreasonable amount of cognitive context. The below points only make this worse.
  • Naturally correct. Minimise mistakes by mentally separating this space into a session that occupies one of your terminal tabs. That way you're less likely to run an npm or php command outside their respective containers which either fails (and cause confusion, frustration, and waste time) or has unmitigated security implications.
  • Avoid running into a wall. These shortcuts work for the common case, but eventually fall down when something somewhere requires you to interact with something else in the container, and then you're not there and have to start a session anyway. I'd rather take tiny step to have people start a session, and then be generally fine, then to have run into a wall later. This includes state sharing and debugging between commands.
  • Delay from command overhead. There is a notable delay with these shortcuts from starting a new container or new session on every command.
  • Delay from lack of caching. The subject of this task is caching, which isn't a problem if you run commands in a session.

The original task description here suggests persisting the cache for performance reasons, but if I understand your commend correctly, your motivation is not speed but correctness because Cypress is broken currently. This is a bug in Cypress. If I recall correctly, Puppeteer fixed a similar bug years ago and no longer depends on writing files outside the project directory. However, even before that was fixed upstreamed, we supported it by setting its environment variable. Perhaps Cypress supports something similar?

In any case, the potential for such bugs is another reason to use sessions rather than ephemeral instances. If I recall correctly, the Puppeteer bug would not have been mitigated by this proposal, because it wrote Chromium to a temporary file, not the npm cache directory, with its location written to a config file in node_modules in the current project folder.

See also T253924 where the topic of one-off commands came up. I ended up merging that feature, because there were plenty of commands that don't need state. The prime example being something like npm run lint or node_modules/.bin/eslint by an IDE automation or bash aliases like alias nlint='fresh-node -- npm run lint' where perhaps you were already invoking npm programmatically, but want to have it run those in Fresh instead. That makes perfect sense, and the use cases covered by these, I believe, all work correctly today. This is similar to docker-compose-exec, which, despite not recommending by default for regular use, I do find useful in programmatic use. However, I stand by that for ad-hoc use on the command line, you're better off using a session.

As for why caches aren't shared, this was not an oversight but an intentional choice. I'm happy to revisit that decision, but I do want us to think through the implications:

  • Cache corruptions. As we see regularly in WMF CI and its use of Castor, when we have multiple instances sharing the same cache, these concurent writers can corrupt the cache directory. This introduces a new and novel problem that poeple are not familiar with in local development. This is not something Fresh users should have to understand or deal with, and thus why I suggest limiting this to an option rather than the default. Especially if this essentially to workaround a Cypress-specific defect.
  • Cache growth. How do we limit the cache size when it is 1) shared across unrelated Fresh instances (thus we can't instruct npm to prune based on current project dependencies and their versions) and 2) persisted outside Docker when no instance is running?
  • Cache integrity. Sharing the cache with the host npm cache is out of the question for security reasons. A dedicated cache might work, although that still leaves a potential for "spooky action at a distance" taking people by surprise. I think when I launch fresh in project X I have a reasonable expectation that random scripts I ran a month ago for project Y cannot affect what I am doing now for project X. Sharing the cache compromises that expectation. In theory, npm-cache is a basic HTTP cache that verifies integrity at runtime, so npm-cache specifically might be safe to share. However, I see that the current patch already proposes a much wider XDG_CACHE_HOME sharing. In part because the Cypress bug requires it, because Cypress presumably places binaries there outside the npm-cache, and these are (I presume) not verified by a signature or content hash at runtime.

I default workflow that I recommend, which is reflected in docs for MediaWiki-Docker (docs) and Fresh (docs), is to start a session and then work in that session.

I’m aware of this and already wrote in T349276#11388910 that that’s not my workflow. I don’t think insisting on our respective workflows at each other will get us anywhere – I’d rather have a cache setup that works with either workflow.

To recap the Cypress situation (and reduce the number of assumptions going around), my understanding is this:

  • By default, Cypress stores its application (Electron binary etc.) in the $XDG_CACHE_DIR; this can be overridden using $CYPRESS_CACHE_FOLDER. The application is installed there in a post-install script during npm install, or manually afterwards.
    • The default works fine in CI, because the $XDG_CACHE_DIR is persisted across CI runs – thus, repositories that stick to the default don’t need to re-download Cypress each time, and should in theory even be able to survive an outage of the Cypress CDN.
    • However, the default doesn’t work well in Fresh, because the cache directory in the container is ephemeral. This is especially true if you use short-lived Fresh containers (fresh-node -- npm install and fresh-node -- npm test don’t share a cache, so the Cypress command in the test script will be missing the application); but it’s still true in long-lived Fresh session, because then you still need to know to run npm install (or npx cypress install) at the beginning of the session.
  • For this reason, several repositories include CYPRESS_CACHE_FOLDER=./cypress/.cache in their .npmrc and/or package.json files (codesearch), so that the Cypress application will be stored in a more persistent place where it will survive across Cypress invocations.
    • This is awful in CI, because it means that each CI run will re-download Cypress multiple times (once per Cypress-using extension). There’s no sharing of the cache across extensions or across CI runs. And if the CDN goes down, our CI breaks.
    • But it works better for local development.
    • Note that npm is deprecating support for arbitrary environment variables in .npmrc, so once that’s removed, we’re probably looking at a future where Cypress is downloaded twice per test runs – first during npm install into $XDG_CACHE_DIR (because we can no longer control $CYPRESS_CACHE_FOLDER), and then again into $CYPRESS_CACHE_FOLDER during a custom post-install script.

Cache corruptions. As we see regularly in WMF CI and its use of Castor, when we have multiple instances sharing the same cache, these concurent writers can corrupt the cache directory. This introduces a new and novel problem that poeple are not familiar with in local development.

How so? The normal npm cache is also shared machine-wide. If we take “no Fresh” as the “familiar” setup that we’re comparing to, then people might already be getting these errors locally, and adding a persistent cache to Fresh doesn’t add a new and novel problem.


I think for the Cypress situation, a per-project cache would also work, and might avoid some of the issues you raised. (As I see it, the most important thing is that we find some solution that will allow extensions to remove their custom $CYPRESS_CACHE_FOLDER configuration, so that the shared cache dir will start to work in CI.) Of course, a per-project persistent cache also adds some problems compared to a single machine-wide Fresh cache:

  • Location. If we put it in the project dir, we have to find a place that Git won’t try to track. node_modules/.cache? A Fresh-specific directory that we magically add to .git/info/exclude? On the other hand, if we put it below $XDG_CACHE_DIR, we have to split it by project dir somehow. Hash the project dir path (unreadable)? Copy its path, possibly escaped in a systemd-path-like fashion (might run into path length limits)?
  • Efficiency. A per-project cache is going to be less efficient than a shared cache – it will use more bandwidth (because multiple projects will download the same data, e.g. npm packages or Cypress applications) and more storage space (because they’ll store that data in their separate cache dirs).