We're currently using 23 m1.medium instances as Jenkins agents to run Docker-based jobs, each with a single executor. This configuration has a few potential drawbacks, and I'd like to explore alternatives.
- Docker image caches are ineffective when spread across so many machines. While we shouldn't rely on Docker caching for basic functionality, I think it will become an increasingly important optimization in our CI stack as we run more container-based jobs and as pipeline adoption grows. Note that this will be a larger issue for jobs that build images with a high degree of possible variance, i.e. pipeline jobs that use Blubber.
- We control the way these containers are executed (jobs using Blubber run as non-root within the container), so running each job in its own OpenStack VM seems like unnecessary overhead without much added security benefit.
- Limiting even an m1.medium to a single executor seems to leave the instance's VCPUs underutilized. Looking at grafana-labs, most integration-slave-docker-* instances rarely peak above 50% user CPU utilization, and the cluster's average utilization is also quite low.
One simple alternative might be to migrate to larger instance types (m1.xlarge) and raise the number of executors on each node's agent to something more aggressive (e.g. num_cpus * 1.25, though we'd likely want to experiment).
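As a rough sketch of that sizing math (the 1.25 factor is only a starting point to experiment with, and the VCPU counts below are assumptions about the OpenStack flavors, not verified values):

```python
import math

def executor_count(vcpus: int, factor: float = 1.25) -> int:
    """Suggested Jenkins executor count for a node.

    Rounds up so small instances still get at least a modest
    oversubscription; the factor is a tuning knob, not a measured value.
    """
    return math.ceil(vcpus * factor)

# Hypothetical flavor sizes for illustration only.
print(executor_count(8))  # e.g. an 8-VCPU m1.xlarge -> 10 executors
print(executor_count(4))  # e.g. a 4-VCPU flavor -> 5 executors
```

The point of rounding up rather than down is that executors mostly wait on I/O (image pulls, apt, network), so a mild oversubscription shouldn't starve the CPUs; whether 1.25 is the right factor is exactly what we'd want to measure.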
I'd like to hear thoughts from others on this. I vaguely remember there being a reason for the single-executor setup, but does that reason still apply to container-based builds? Can we solve the problem in other ways?