[tracking] Disposable VMs need a cache for package managers
Closed, Resolved (Public)

Description

@dduvall raised the concern that the one-off instances would lack caching. Currently, for pip/npm/bundle, we point the caches to jenkins-bot so they are reused across builds. With Nodepool there is no cache, which would force us to download everything from the package managers every time.

Some ideas:

  • a shared NFS directory, but the cache could easily be poisoned or corrupted.
  • observe which dependencies are used, collect them, and rebuild a cache daily, shared via a read-only NFS directory. A couple of problems: collecting the dependencies takes work, and package managers would attempt to write to the read-only cache directory whenever a dependency is not already satisfied.
  • use a proxy. Besides setting up the proxy itself, the jobs would need http_proxy / https_proxy to be set (see the sketch after this list). The compiled packages would most likely not be cached though :-/
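To illustrate the proxy option, a minimal sketch of what each job would have to export; the proxy host and port below are made up:

```bash
# Hypothetical caching proxy; the real host/port would come from whatever proxy we deploy.
export http_proxy="http://proxy.integration.eqiad.wmflabs:3128"
export https_proxy="$http_proxy"
export no_proxy="localhost,127.0.0.1"
# Most package managers honor these environment variables for downloads,
# but anything they compile locally would still not be cached.
pip install -r requirements.txt
```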

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description. (Show Details)
hashar added subscribers: hashar, dduvall, zeljkofilipin.

From discussing with OpenStack folks: they have some jobs that download Linux distributions and are looking for a cache/mirroring solution. Their RFC (i.e. spec) is at https://review.openstack.org/#/c/194477/ .

FWICT, that's a proposal around caching nodepool images, not packages or other dependencies that various jobs require. In my mind, they are separate problems with only marginal overlap: CI base images will be almost completely homogeneous in our case, while dependent system/gem/pip/composer/npm packages vary widely from job to job.

Travis implements a user-/job-specific system that restores and caches specific directories before and after each job executes, storing the data in S3. We could implement something similar but it would require a reliable central store, and the whole setup seems a little 'brute force' to me.

Another possibility that @hashar and I discussed was to provide separate read-only caches for the specific packaging systems—read-only to protect against the corruption that might occur during concurrent updates. Each cache would augment the package manager's read-write destination within the workspace and be periodically updated to include new packages. The update process could be scheduled or triggered at the end of each job as long as we can reliably audit which packages were installed locally during execution.

This was discussed in https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-10-06-13.59.html (see point 5).

> Travis implements a user-/job-specific system that restores and caches specific directories before and after each job executes, storing the data in S3. We could implement something similar but it would require a reliable central store, and the whole setup seems a little 'brute force' to me.

With tar and s3cmd this would probably be a shell one-liner (sketch below). If we can't get a Swift or Ceph object store for labs from ops, we could use rsync to an integration instance.
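A hedged sketch of what that one-liner could look like; the ci-cache bucket and the cached directories are hypothetical, and $WORKSPACE / $JOB_NAME are the usual Jenkins variables:

```bash
# Save the package manager caches after the build (bucket name and paths are examples).
tar -czf /tmp/cache.tar.gz -C "$WORKSPACE" .npm .cache/pip vendor/bundle
s3cmd put /tmp/cache.tar.gz "s3://ci-cache/${JOB_NAME}.tar.gz"

# Restore them at the start of the next build; tolerate a missing archive.
s3cmd get --force "s3://ci-cache/${JOB_NAME}.tar.gz" /tmp/cache.tar.gz \
  && tar -xzf /tmp/cache.tar.gz -C "$WORKSPACE" || true
```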

If we go this way, to keep isolation we need to make sure that only the Nodepool instances running gate-and-submit or post-merge jobs have permission to update the caches, not those running test/check jobs.

I looked a bit at setting up a shared proxy using Squid 3 and its ability to do SSL man-in-the-middle (ssl-bump). That causes a few troubles: we would need to inject our own CA, and apparently that does not work with npm, which has the npmjs.org CA embedded :-/

I am going to try devpi for Python modules. It supports fetching packages from PyPI on demand and serving them from a local cache.
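Pointing pip at a devpi instance should only take a pip.conf; a rough sketch, assuming devpi-server runs on a made-up host with its default root/pypi mirror index:

```bash
# Hypothetical devpi host; 3141 is devpi-server's default port and
# root/pypi its default index that mirrors PyPI.
mkdir -p ~/.pip
cat > ~/.pip/pip.conf <<'EOF'
[global]
index-url = http://devpi.integration.eqiad.wmflabs:3141/root/pypi/+simple/
EOF
pip install -r requirements.txt  # now goes through the devpi cache
```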

There are similar utilities for other package managers:

Haven't looked at composer/packagist for now.

If we go that way, we would have to maintain 3-4 different caching proxies. It is probably easier to rely on third-party tools that have already worked out the tricky details, which would save us a lot on the initial implementation.

Note: that does not cache native modules that require compilation. PyPI does have the concept of wheels, which are precompiled modules, so at least Python will benefit from the caching, provided the module ships wheels.

I have created the instance pmcache.integration.eqiad.wmflabs to test this. The package managers can be pointed at it via dot files in the home directory.
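For the record, the kind of home-directory dot files that means; the ports below are placeholders, since each caching proxy would listen on its own:

```bash
# All ports here are made up; pmcache.integration.eqiad.wmflabs is the test instance.
cat > ~/.npmrc <<'EOF'
registry = http://pmcache.integration.eqiad.wmflabs:4873/
EOF

cat > ~/.gemrc <<'EOF'
:sources:
  - http://pmcache.integration.eqiad.wmflabs:9292/
EOF

mkdir -p ~/.pip
cat > ~/.pip/pip.conf <<'EOF'
[global]
index-url = http://pmcache.integration.eqiad.wmflabs:3141/root/pypi/+simple/
EOF
```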

Also found Sonatype Nexus, which is a Java/Maven repository manager. It seems to support proxying/caching for both npm and gem.

hashar renamed this task from Disposable VMs need a cache for package managers to [tracking] Disposable VMs need a cache for package managers.Oct 7 2015, 8:10 PM
hashar triaged this task as Medium priority.
hashar moved this task from Backlog to In-progress on the Continuous-Integration-Scaling board.
hashar added a project: Tracking-Neverending.
hashar set Security to None.

I gave Artifactory (similar to Sonatype Nexus) a try, but support for npm/pip/gem is not included in the open-source version. Looking at the upstream code, all the related code seems to actually be public, so in theory we could rebuild Artifactory with the missing bits enabled (which would effectively give us the Pro version), but I am unsure whether that is legally acceptable.

Another idea was to set up a Squid proxy for HTTPS requests and configure it to act as a man in the middle. That can be done using Squid's ssl-bump feature and custom certificates. There are a few problems though:

  • we would need to install our certificate on all clients (a sketch follows this list)
  • we would have to figure out how to get the cert trusted by each package manager (though maybe they just trust whatever is under /etc/ssl/certs)
  • Squid 3.4 on Debian Jessie does not come with SSL support due to a license conflict with OpenSSL. We could craft our own package or backport Squid 3.5, which supports GnuTLS.
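Installing the CA on a Debian client is the easy part; getting each package manager to trust it is the open question. A sketch, assuming a self-generated CA file named ci-squid-ca.crt (hypothetical):

```bash
# Trust our MITM CA system-wide (Debian); the .crt name is hypothetical.
sudo cp ci-squid-ca.crt /usr/local/share/ca-certificates/ci-squid-ca.crt
sudo update-ca-certificates

# npm ships its own CA bundle and ignores /etc/ssl/certs by default,
# so it has to be pointed at the CA explicitly.
npm config set cafile /usr/local/share/ca-certificates/ci-squid-ca.crt
```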

We should definitely shy away from NFS as a solution for this.

I have filed T116017 about setting up a cache store / restore system, probably using rsync and a dedicated instance with a bunch of disk space. Once we have a Swift store, we can migrate the system to Swift.
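Roughly, such an rsync-based store/restore could look like the sketch below; the central instance name, rsync module, and cached paths are hypothetical, and the actual work is tracked in T116017:

```bash
# Hypothetical central cache instance and rsync module.
CACHE_HOST=cache-store.integration.eqiad.wmflabs
NAMESPACE="${JOB_NAME}"

# Restore: pull whatever a previous run saved; tolerate an empty cache.
rsync -a "rsync://${CACHE_HOST}/caches/${NAMESPACE}/" "${HOME}/" || true

# ... build runs here, package managers fill ~/.npm, ~/.cache/pip, etc. ...

# Save: push the caches back, preserving paths relative to $HOME
# ("/./" marks where the relative path starts for rsync --relative).
rsync -aR "${HOME}/./.npm" "${HOME}/./.cache/pip" \
  "rsync://${CACHE_HOST}/caches/${NAMESPACE}/"
```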

I have added back T116038 as a child task. It is part of the evaluation.

> We should definitely shy away from NFS as a solution for this.

Definitely. We do not even have an NFS share on the Nodepool instances :-}

Change 264327 had a related patch set uploaded (by Hashar):
castor: package managers cache storage

https://gerrit.wikimedia.org/r/264327

I did a first pass at a cache store/restore system based on rsync, investigated as part of T116017.

Currently deployed on https://integration.wikimedia.org/ci/job/integration-jjb-config-diff/

Change 264327 merged by jenkins-bot:
castor: package managers cache storage

https://gerrit.wikimedia.org/r/264327

Change 265502 had a related patch set uploaded (by Hashar):
Enable castor on {name}-tox-{toxenv}-jessie jobs

https://gerrit.wikimedia.org/r/265502

Change 265502 merged by jenkins-bot:
Enable castor on {name}-tox-{toxenv}-jessie jobs

https://gerrit.wikimedia.org/r/265502

So I went with an rsync-based approach, which I have nicknamed CASTOR, for CAche STORage. It is implemented in jjb/castor.yaml (Jenkins Job Builder), and I am going to enable it on the jobs running on Nodepool.

That will let us migrate the JavaScript jobs (T119143).

hashar claimed this task.

Being bold: this is solved by adding, to the JJB jobs running on Nodepool:

builders:
  - castor-load
  # rest of builders
publishers:
  # Other publishers
  - castor-save

I am updating the jobs.

Change 265741 had a related patch set uploaded (by Hashar):
Enable castor on tox-{toxenv}-jessie jobs

https://gerrit.wikimedia.org/r/265741

Change 265744 had a related patch set uploaded (by Hashar):
Enable castor on tox-jessie job

https://gerrit.wikimedia.org/r/265744

Change 265741 merged by jenkins-bot:
Enable castor on tox-{toxenv}-jessie jobs

https://gerrit.wikimedia.org/r/265741

Change 265744 merged by jenkins-bot:
Enable castor on tox-jessie job

https://gerrit.wikimedia.org/r/265744

Change 265747 had a related patch set uploaded (by Hashar):
Enable castor on rake-jessie

https://gerrit.wikimedia.org/r/265747

I have enabled castor on the rake-jessie job as well and gave it a try on https://gerrit.wikimedia.org/r/#/c/252698/2, which is merged. I re-enqueued it in gate-and-submit to populate the cache, then did a 'recheck'.

Build time went from 42 seconds to 13 seconds. Cache pays off!

Change 265747 merged by jenkins-bot:
Enable castor on rake-jessie

https://gerrit.wikimedia.org/r/265747