
Figure out how to garbage collect the npm cache
Closed, Resolved · Public

Description

We have Jenkins agent instances overflowing their /srv partition. With MediaWiki selenium builds each taking 11 GB, 3 concurrent builds (33 GB) barely fit in the 36 GB partition.

I have rediscovered that the npm cache is ever-growing; currently:

Jenkins job                          Disk size
wmf-quibble-selenium-php74-docker    5.4 GB
wmf-quibble-selenium-php81-docker    5.8 GB

That comes from looking at Castor:

hashar@integration-castor05:~$ du -m -s  /srv/castor/castor-mw-ext-and-skins/master/wmf-quibble*selenium*/npm/_cacache
5436	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php74-docker/npm/_cacache
5822	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php81-docker/npm/_cacache

Npm has a command to maintain the cache, npm cache verify, which is documented to garbage collect it. If I add it to the build https://gerrit.wikimedia.org/r/c/mediawiki/core/+/932048, that yields:

> selenium-test
> npm cache verify; echo $?; echo done; exit 1; wdio ./tests/selenium/wdio.conf.js

Cache verified and compressed (/cache/npm/_cacache)
Content verified: 8284 (425870788 bytes)
Content garbage-collected: 11580 (5099415924 bytes)
Index entries: 8284
Finished in 38.675s

So it verified ~425 MB and garbage collected ~5 GB.

It looks like maybe we should have Quibble garbage collect for us, as a tear-down command, when the build is a success?

Event Timeline

It looks like maybe we should have Quibble garbage collect for us, as a tear-down command, when the build is a success?

Seems reasonable to me. If it's possible to do after reporting the test results, to avoid adding more wait to the build time, that would be nice.
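
In other words, the tear-down step would essentially amount to the following (the /cache/npm location is taken from the build output above; how exactly it gets wired into Quibble or the job definition is left open):

# Hypothetical tear-down step, run after test results are reported and before
# the cache is saved back to castor: garbage collect the npm cache in place.
npm cache verify --cache /cache/npm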

I am trying to understand how npm cache verify does the garbage collection. The cache is a content-addressable store based on sha512 checksums, and the implementation is in cacache, specifically:

https://github.com/npm/cacache/blob/2ae6d2d9dda028700e0bcfc7f0b5f8dc9d9c6e40/lib/verify.js#L92

That says:

Implements a naive mark-and-sweep tracing garbage collector.

The algorithm is basically as follows:
1. Read (and filter) all index entries ("pointers")
2. Mark each integrity value as "live"
3. Read entire filesystem tree in content-vX/ dir
4. If content is live, verify its checksum and delete it if it fails
5. If content is not marked as live, rm it.
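
To illustrate the shape of that algorithm, here is a toy bash sketch over a deliberately simplified layout (one JSON entry per file under index/, content files named by their hex sha512 digest under content/); cacache's real on-disk format differs, this only shows the mark-and-sweep idea:

# Toy mark-and-sweep over a simplified cache layout (NOT cacache's real format).
cache=./toy-cache

# 1+2. Mark: every digest referenced from an index entry is "live".
jq -r .digest "$cache"/index/*.json | sort -u > "$cache/live"

# 3-5. Sweep the content tree: re-checksum live files, delete corrupt or
# unreferenced ones.
for file in "$cache"/content/*; do
    digest=$(basename "$file")
    if grep -qxF "$digest" "$cache/live"; then
        # live: verify its checksum and delete it if it fails
        echo "$digest  $file" | sha512sum --check --quiet || rm -f "$file"
    else
        # not referenced by any index entry: garbage
        rm -f "$file"
    fi
done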

Then I don't quite understand what counts as live content. I am wondering what happens when the same cache is used for different repositories (i.e. sharing the cache for npm test runs against Flow and Wikibase might cause entries to be garbage collected by one repo while the materials are still used by the other repo). It is not a concern for the wmf-quibble-selenium jobs though: regardless of the repo triggering them, we npm install for each extension, so the cache is populated the same way.

Grepping through npm/_cacache/index-v5, each file has a list of keys stored, and there are a bunch of duplicates:

$ grep -hR '"key"' .|cut -f2|cut -d, -f1|sort|uniq -c|sort -n|tail -n100
...
    107 {"key":"security-advisory:@storybook/addon-docs:ovulJcD/auWmGs0Knw/bAVchtapocROR0YFZV6nQSZaIkHvyOjCqEEYIZrFZIiXyRZVRSiKXIktiqf344bUiZg=="
    107 {"key":"security-advisory:@storybook/csf-tools:EylZbH4Z0h3z+CLCvxCUH1YCOJP54TWmt8RoXARpDyX5LUJbFkM7q520kvoPiLm833CrEyskYDgiy5/VhthQPA=="
    108 {"key":"security-advisory:@storybook/core-server:UizeKWrZgddn6Hki+tic6oTVQTcHSRDb8O+UDjN2WiHA697NLziwGSQW3ZlwiamNvs9f19PFjTT/9f5y9Ilh+Q=="
    149 {"key":"security-advisory:lighthouse:zoyxh4irMC2e6mQ60f/Us6CjiveoAwXfUGXC7bbneJWwJwqIcbiUMATDrahfLRJKs5zvzkiGaqdCS4aiS5kGyg=="
    152 {"key":"security-advisory:lighthouse:NEhHHNJzF3aS1a0jVQemdZD8OYxYo1JXmmsolbuKTrffoOQecCG2BSo+/0QRpEAgtEwGwq/MUMoXozDhGG0qHQ=="
    187 {"key":"security-advisory:@wdio/devtools-service:zszdkiw+MM+Sgh1xiyjn98S/wBzqrWnlZtgj99A4ith8EJCQvD/vbCPjpvsbKOhVjnc6ZbHK8qYDwN8LZ6jz+g=="

$ grep -R '"key":"security-advisory:@wdio/devtools-service'|cut -f2|jq .time|cut -b1-10| xargs -n1 -I{} date --date='@{}' +'%Y-%m-%d %T'|sort -gr
...
2022-11-18 17:41:07
2022-11-18 17:37:49
2022-11-18 17:37:49
2022-11-18 05:14:00
2022-11-17 07:18:27
2022-11-16 05:16:49
2022-11-16 04:12:20
2022-11-16 04:09:04
2022-11-16 04:09:04
2022-11-15 09:26:59
2022-11-14 22:12:19
2022-11-14 22:09:08
2022-11-14 22:09:08

So I guess they are old entries that can be garbage collected.

Change 932195 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: gc quibble jobs npm cache before saving it

https://gerrit.wikimedia.org/r/932195

I went to try with https://gerrit.wikimedia.org/r/mediawiki/extensions/TimedMediaHandler, doing (roughly sketched below):

  • a full install
  • listing the cache
  • installing an extra dependency (qunit)
  • reverting the changes to package.json / package-lock.json to remove qunit
  • npm ci && npm cache verify
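
Roughly, in shell (the final grep is merely one way of checking whether qunit is still referenced in the cache index; npm's configured cache location is used):

git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/TimedMediaHandler
cd TimedMediaHandler
npm ci                                            # full install, populating the cache
du -sh "$(npm config get cache)/_cacache"         # look at the cache size
npm install --save-dev qunit                      # install an extra dependency
git checkout -- package.json package-lock.json    # revert, removing qunit again
npm ci && npm cache verify                        # reinstall and garbage collect
grep -Rl qunit "$(npm config get cache)/_cacache/index-v5"   # qunit entries are still indexed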

The qunit package is still in the cache.

So even if the job cache is shared between different repositories, it looks like entries are kept.

I am not entirely sure what npm cache verify does exactly, but it surely drops obsolete entries while retaining entries that can still potentially be used.

Mentioned in SAL (#wikimedia-releng) [2023-06-26T13:11:05Z] <hashar> Updated wmf-quibble* jobs to run npm cache verify in order to garbage collect the npm cache before saving it. https://gerrit.wikimedia.org/r/c/integration/config/+/932195 | T340092

I have manually triggered builds for the two largest jobs using ZUUL_PIPELINE=postmerge.

The npm cache sizes on integration-castor05:

$ du -s -m /srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium*/npm/
5656	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php74-docker/npm/
6041	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php81-docker/npm/

For the php74 job:

Cache verified and compressed (/cache/npm/_cacache)
Content verified: 9073 (471 940 339 bytes)
Content garbage-collected: 11802 (5 246 555 479 bytes)  # <----- 5.2 GBytes!
Index entries: 9073
Finished in 30.952s

For the php81 job:

Cache verified and compressed (/cache/npm/_cacache)
Content verified: 6409 (376 527 973 bytes)
Content garbage-collected: 1768 (743 234 191 bytes)
Index entries: 6409
Finished in 24.843s

But they haven't shrunk on integration-castor05. That is because I triggered the builds against mediawiki/core (which has its own cache) rather than against an extension/skin.

Building against mediawiki/extensions/WikimediaMessages:

For the php74 job:

Cache verified and compressed (/cache/npm/_cacache)
Content verified: 9079 (471 952 157 bytes)
Content garbage-collected: 11808 (5 262 263 872 bytes)
Index entries: 9079
Finished in 31.946s

For the php81 job:

Cache verified and compressed (/cache/npm/_cacache)
Content verified: 9092 (472 137 756 bytes)
Content garbage-collected: 12580 (5 662 779 556 bytes)
Index entries: 9092
Finished in 30.932s

And on disk (du -s -m, in MB):

612	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php74-docker/npm/
614	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php81-docker/npm/

Change 932195 merged by jenkins-bot:

[integration/config@master] jjb: gc quibble jobs npm cache before saving it

https://gerrit.wikimedia.org/r/932195

hashar claimed this task.

Some other large caches:

...
1563	castor-mw-ext-and-skins/master/mediawiki-quibble-apitests-composer-php74-docker/npm
1682	mediawiki-core/master/wmf-quibble-selenium-php72-docker/npm
1950	castor-mw-ext-and-skins/master/mediawiki-quibble-apitests-vendor-php74-docker/npm
2119	mediawiki-vendor/master/wmf-quibble-selenium-php74-docker/npm
2119	mediawiki-vendor/master/wmf-quibble-selenium-php81-docker/npm

They will eventually be garbage collected as builds complete in postmerge or gate-and-submit*.

After a couple months:

integration-castor05
$ du -s -m /srv/castor/*/master/*/npm|sort -n|tail -n10
390	/srv/castor/castor-mw-ext-and-skins/master/mediawiki-quibble-apitests-vendor-php74-docker/npm
434	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php81-docker/npm
434	/srv/castor/mediawiki-core/master/wmf-quibble-selenium-php74-docker/npm
435	/srv/castor/mediawiki-core/master/wmf-quibble-selenium-php81-docker/npm
441	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php74-docker/npm
464	/srv/castor/castor-mw-ext-and-skins/master/quibble-donationinterface-REL1_35-php73-docker/npm
471	/srv/castor/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php74-selenium-docker/npm
617	/srv/castor/mediawiki-vendor/master/wmf-quibble-selenium-php81-docker/npm
618	/srv/castor/mediawiki-vendor/master/wmf-quibble-selenium-php74-docker/npm
1563	/srv/castor/castor-mw-ext-and-skins/master/mediawiki-quibble-apitests-composer-php74-docker/npm

Looks like the npm caches are well contained. The last one is for the job mediawiki-quibble-apitests-composer-php74-docker, which, although it is still defined in Jenkins, is no longer triggered since https://gerrit.wikimedia.org/r/c/integration/config/+/887340. I have thus manually erased that cache.
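
Erasing it simply means removing the directory on integration-castor05 (path taken from the listing above):

rm -rf /srv/castor/castor-mw-ext-and-skins/master/mediawiki-quibble-apitests-composer-php74-docker/npm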

Some more time later:

309	/srv/castor/mediawiki-services-parsoid/master/quibble-composer-mysql-php80-docker/npm
309	/srv/castor/mediawiki-services-parsoid/master/quibble-composer-mysql-php81-docker/npm
502	/srv/castor/castor-mw-ext-and-skins/master/mediawiki-quibble-apitests-vendor-php74-docker/npm
527	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php81-docker/npm
594	/srv/castor/castor-mw-ext-and-skins/master/quibble-vendor-mysql-php74-selenium-docker/npm
601	/srv/castor/castor-mw-ext-and-skins/master/wmf-quibble-selenium-php74-docker/npm
661	/srv/castor/mediawiki-core/master/wmf-quibble-selenium-php81-docker/npm
796	/srv/castor/mediawiki-core/master/wmf-quibble-selenium-php74-docker/npm
937	/srv/castor/mediawiki-vendor/master/wmf-quibble-selenium-php81-docker/npm
938	/srv/castor/mediawiki-vendor/master/wmf-quibble-selenium-php74-docker/npm

I guess the caches are more or less contained now :)