
Spike: Evaluate experimental Docker based CI w/ scap builds
Closed, Resolved (Public)

Description

Local tests with Docker have proven fruitful enough to set up a labs instance for further experimentation. Below are some questions I hope we can answer following the spike.

Apparent benefits and open questions

  • Efficient caching of build dependencies through native Docker image caching
    • What would be the storage requirement here and can we easily set up a central cache?
    • What kinds of dependencies can be easily cached? (system packages vs. system packages + composer/gem/pip/etc.)
  • Simplification of CI infrastructure
    • Pre-provisioned slaves should require far fewer dependencies
    • CloudBees Docker plugin is well maintained and simple to employ (there's a JJB wrapper, too)
    • What are the benefits and drawbacks of provided base Dockerfiles/images vs. simply deferring to the repo?
  • Lighter weight isolation
    • Is Docker isolation sufficient?
      • Should Security weigh in?
      • What does the attack surface look like? Better or worse than Nodepool instances?
    • What's the overhead between builds?
    • How many executors can we support per instance?
    • How well can we clean up? (see the rough sketch after this list)
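
For context, a couple of quick checks along these lines on the labs instance should help answer the overhead and cleanup questions; the base image here (debian:jessie) is just an example, not a recommendation:

  # Rough gauge of per-build container start/stop overhead
  time docker run --rm debian:jessie /bin/true

  # Containers started with --rm clean up their filesystems automatically;
  # anything left behind shows up here and can be removed afterwards
  docker ps -a --filter status=exited
  docker images -f dangling=true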

Event Timeline

hashar triaged this task as Medium priority. Nov 18 2016, 2:56 PM

From our notes for the 12/8 meeting with Ops regarding the direction of CI:

Proof of concept

A recent proof of concept was implemented to explore a Docker-based alternative to the current Nodepool approach. It focuses on three main areas: simplification of infrastructure, lighter-weight isolation, and efficient caching. It comprises the following infrastructural changes:

  • One permanent slave in the integration project (integration-slave-docker-1000)
  • Minimal ops/puppet configuration to provision the slave as a Jenkins worker (no additional dependencies installed beyond the Jenkins slave base dependencies + Docker)
  • Generalized Jenkins job with a minimal shell script for the Docker build, run, artifact copying, and cleanup; a rough sketch of such a wrapper follows this list. (Note: the experiment began with the assumption that the official CloudBees Docker Custom Build Environment Plugin would provide this portion of the implementation, but the plugin had notable limitations and problematic behaviors while adding very little benefit.)
  • Generalized Harbormaster build plan for integration between Differential and Jenkins
  • Dockerfiles implemented for both malu and scap to test viability of repo-deferred build environment configuration
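
For reference, a rough sketch of what the generalized job's shell portion could look like (the image names, artifact path, and exact flags are illustrative assumptions; only the Dockerfile.ci convention is taken from the PoC):

  #!/bin/bash
  # Hypothetical wrapper run by the generalized Jenkins job. BUILD_NUMBER and
  # WORKSPACE are standard Jenkins environment variables; per-build naming
  # keeps concurrent executors from colliding.
  set -eu

  image="ci/malu:${BUILD_NUMBER}"
  container="ci-malu-${BUILD_NUMBER}"

  # Build the environment described by the repo's Dockerfile.ci
  docker build -f Dockerfile.ci -t "$image" .

  # Run the build entrypoint inside the container; the Jenkins slave process
  # itself stays on the Docker host
  docker run --name "$container" "$image"

  # Copy artifacts out of the stopped container into the Jenkins workspace
  # (the /srv/artifacts path is an assumption, not part of the PoC)
  docker cp "$container:/srv/artifacts" "$WORKSPACE/" || true

  # Remove the container; image layers are kept for cache reuse
  docker rm "$container"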

The proof of concept revealed some immediately apparent benefits and some potential drawbacks:

Benefits

  • Simplification of CI slave configuration over both the permanent slave and Nodepool approaches. Since all environment configuration for the container is deferred to the repo’s Dockerfile.ci, there is little configuration to be done by ops/puppet. The setup simply requires the common labs configuration, a Jenkins slave, and Docker. With the right version pinning, the deltas between Puppet runs are unlikely to cause disruption to the stability of the worker.
  • Lighter weight isolation which translates to less overall build overhead
  • Fewer “moving parts.” Going back to permanent instances (or even dedicated hardware) means less load on OpenStack and other Ops-team infrastructure. Note that this setup is not mutually exclusive with automated scaling, but the lack of it may not be all that relevant.
  • Improved security within the worker itself. The Jenkins slave process, which is vulnerable to slave-to-master execution exploits, runs on the Docker host, while the build entrypoint runs within the container context. This execution model may reduce the overall attack surface of a Jenkins slave, but we’d need further investigation by Security or Operations to confirm this.
  • Greater concurrency within the worker itself. Multiple executors per worker are now entirely possible, without the risk of filesystem collision, given the container-based isolation.
  • Deferred configuration. Developers write and maintain the Dockerfile.ci directly (e.g. sudo apt-get install libfoo-dev) instead of having to craft a puppet.git patch and get releng to refresh the image.
  • Docker’s efficient, incremental, low-level caching can be utilized to cache system dependencies and packages. Note that making full use of this caching mechanism can be slightly tricky and will require devs to have some basic Docker knowledge (illustrated just below).
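
To illustrate the caching point, assuming a hypothetical repo whose Dockerfile.ci installs system packages in a RUN step before COPYing the source tree:

  # First build: every layer is built from scratch
  docker build -f Dockerfile.ci -t ci/malu:latest .

  # Rebuild after editing only the source tree: the apt-get RUN layer is
  # served from cache because its instruction string is unchanged; Docker
  # invalidates the cache only from the first COPY whose file checksums
  # differ, so just the source layer (and anything after it) is rebuilt
  docker build -f Dockerfile.ci -t ci/malu:latest .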

Drawbacks

  • Docker does not perform any sort of cache cleanup on its own, and the image layers can grow quite large on a single host. We’ll need some sort of cleanup process (with a mutex) to keep worker filesystem usage in check; a rough sketch of such a job appears after this list.
  • Developers have to have some basic Docker knowledge to make full use of its incremental/layered caching, primarily the way Docker uses RUN strings and COPY mtimes/checksums to identify/cache intermediate layers. We can of course provide documentation and base images to help in this regard.
  • If we want to ease the configuration burden placed on developers, reduce overall duplication of basic configuration, and achieve greater consistency with environments further down the pipeline, we will need to maintain a registry and a suite of base images.
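
One option for the cleanup drawback above is a periodic job along these lines, run under a mutex so it never races itself (and, with the same lock taken by the build wrapper, never races an in-flight build); the lock path and retention policy are illustrative:

  #!/bin/bash
  # Hypothetical periodic cleanup for a Docker CI worker
  set -eu

  exec 9>/var/lock/docker-ci-cleanup.lock
  flock -n 9 || exit 0   # another holder of the lock is active; try again later

  # Remove stopped build containers
  docker ps -aq --filter status=exited | xargs -r docker rm

  # Remove dangling (untagged intermediate) images to keep disk usage in check
  docker images -qf dangling=true | xargs -r docker rmi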

Open questions

  • Is Docker a thing for production or for labs? Heard whispers about Rkt? The direction that Ops takes will have implications for how we proceed with an efficient and consistent release pipeline, so communication and collaboration are important here.
    • If, for example, we’re headed toward a containerized future in production, would we need the CI infrastructure to be congruent with staging and production so that we’re promoting image builds, not rebuilding at each stage?
  • Are there other requirements on the ops side?
  • Is this the right level of abstraction for repo maintainers or could they benefit from something even easier? (e.g. something like travis.yml)
  • Are permanent instances sufficient for this setup or should we pursue dedicated hardware? (Either way, we have an upper bound for scaling as we do with current Labs/Nodepool quotas.)
  • If/how to build and maintain base images (a rough sketch of the workflow follows this list)
  • Will we have a Wikimedia equivalent of Dockerhub? (tools as a private registry) (we do already!)
  • How to garbage collect intermediate and unused (dangling) images?
  • Can we centralize the Docker cache so that intermediate images can be reused across the fleet?
  • Do we really need Jenkins at all since there’s only a single generalized job? Or could this setup work just as well with “pure” Harbormaster workers?
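
On the base image question, the workflow would presumably look something like the following; the registry host and image name are placeholders, not an existing service:

  # Build and publish a shared base image to a WMF-run registry
  docker build -t docker-registry.example.org/ci/php-base:0.1 .
  docker push docker-registry.example.org/ci/php-base:0.1

  # Individual repos would then start their Dockerfile.ci from it, e.g.:
  #   FROM docker-registry.example.org/ci/php-base:0.1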

Conclusion and follow-up

All in all, this was a successful proof of concept. However, in addition to our open questions, the meeting with Ops and a follow-up meeting with @chasemp revealed that some of the additional needs around caching/hosting of images and cleanup might be met by Kubernetes. Since Ops is already investigating k8s for production and it's already working well for Tool Labs, we should pursue a k8s PoC in the same vein and see how it stacks up to this one.