
Spike: Evaluate experimental Docker based CI w/ scap builds
Closed, Resolved (Public)

Description

Local tests with Docker have proven fruitful enough to set up a labs instance for further experimentation. Below are some questions I hope we can answer following the spike.

Apparent benefits and open questions

  • Efficient caching of build dependencies through native Docker image caching
    • What would be the storage requirement here and can we easily set up a central cache?
    • What kinds of dependencies can be easily cached? (system packages vs. system packages + composer/gem/pip/etc.)
  • Simplification of CI infrastructure
    • Pre-provisioned slaves should require far fewer dependencies
    • CloudBees Docker plugin is well maintained and simple to employ (there's a JJB wrapper, too)
    • What are the benefits and drawbacks of provided base Dockerfiles/images vs. simply deferring to the repo?
  • Lighter weight isolation
    • Is Docker isolation sufficient?
      • Should Security weigh in?
      • What does the attack surface look like? Better or worse than Nodepool instances?
    • What's the overhead between builds?
    • How many executors can we support per instance?
    • How well can we clean up? (see the rough sketch after this list)
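
For context, a couple of quick checks along these lines on the labs instance should help answer the overhead and cleanup questions; the base image here (debian:jessie) is just an example, not a recommendation:

  # Rough gauge of per-build container start/stop overhead
  time docker run --rm debian:jessie /bin/true

  # Containers started with --rm clean up their filesystems automatically;
  # anything left behind shows up here and can be removed afterwards
  docker ps -a --filter status=exited
  docker images -f dangling=true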

Event Timeline

hashar triaged this task as Medium priority. Nov 18 2016, 2:56 PM

From our notes for the 12/8 meeting with Ops regarding the direction of CI:

Proof of concept

A recent proof of concept was implemented to explore a Docker-based alternative to the current Nodepool approach. It focuses on three main areas: simplification of infrastructure, lighter-weight isolation, and efficient caching. It comprises the following infrastructural changes:

  • One permanent slave in the integration project (integration-slave-docker-1000)
  • Minimal ops/puppet configuration to provision the slave as a Jenkins worker (no additional dependencies installed beyond the Jenkins slave base dependencies + Docker)
  • Generalized Jenkins job with a minimal shell script for the Docker build, run, artifact copying, and cleanup; a rough sketch of such a wrapper follows this list. (Note: the experiment began with the assumption that the official CloudBees Docker Custom Build Environment Plugin would provide this portion of the implementation, but the plugin had notable limitations and problematic behaviors while adding very little benefit.)
  • Generalized Harbormaster build plan for integration between Differential and Jenkins
  • Dockerfiles implemented for both malu and scap to test viability of repo-deferred build environment configuration
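
For reference, a rough sketch of what the generalized job's shell portion could look like (the image names, artifact path, and exact flags are illustrative assumptions; only the Dockerfile.ci convention is taken from the PoC):

  #!/bin/bash
  # Hypothetical wrapper run by the generalized Jenkins job. BUILD_NUMBER and
  # WORKSPACE are standard Jenkins environment variables; per-build naming
  # keeps concurrent executors from colliding.
  set -eu

  image="ci/malu:${BUILD_NUMBER}"
  container="ci-malu-${BUILD_NUMBER}"

  # Build the environment described by the repo's Dockerfile.ci
  docker build -f Dockerfile.ci -t "$image" .

  # Run the build entrypoint inside the container; the Jenkins slave process
  # itself stays on the Docker host
  docker run --name "$container" "$image"

  # Copy artifacts out of the stopped container into the Jenkins workspace
  # (the /srv/artifacts path is an assumption, not part of the PoC)
  docker cp "$container:/srv/artifacts" "$WORKSPACE/" || true

  # Remove the container; image layers are kept for cache reuse
  docker rm "$container"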

The proof of concept revealed some immediately apparent benefits and some potential drawbacks:

Benefits

  • Simplification of CI slave configuration over both the permanent slave and Nodepool approaches. Since all environment configuration for the container is deferred to the repo’s Dockerfile.ci, there is little configuration to be done by ops/puppet. The setup simply requires the common labs configuration, a Jenkins slave, and Docker. With the right version pinning, the deltas between Puppet runs are unlikely to cause disruption to the stability of the worker.
  • Lighter weight isolation which translates to less overall build overhead
  • Fewer “moving parts.” Going back to permanent instances (or even dedicated hardware) means less load on OpenStack and other Ops-team infrastructure. Note that this setup is not mutually exclusive with automated scaling, but the lack of it may not be all that relevant.
  • Improved security within the worker itself. The Jenkins slave process, which is vulnerable to slave-to-master execution exploits, runs on the Docker host, while the build entrypoint runs within the container context. This execution model may reduce the overall attack surface of a Jenkins slave, but we’d need further investigation by Security or Operations to confirm this.
  • Greater concurrency within the worker itself. Multiple executors per worker are now entirely possible, without the risk of filesystem collision, given the container-based isolation.
  • Deferred configuration. Developers write and maintain the Dockerfile.ci directly (e.g. sudo apt-get install libfoo-dev) instead of having to craft a puppet.git patch and get releng to refresh the image.
  • Docker’s efficient, incremental, low-level caching can be utilized to cache system dependencies and packages. Note that making full use of this caching mechanism can be slightly tricky and will require devs to have some basic Docker knowledge (illustrated just below).
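
To illustrate the caching point, assuming a hypothetical repo whose Dockerfile.ci installs system packages in a RUN step before COPYing the source tree:

  # First build: every layer is built from scratch
  docker build -f Dockerfile.ci -t ci/malu:latest .

  # Rebuild after editing only the source tree: the apt-get RUN layer is
  # served from cache because its instruction string is unchanged; Docker
  # invalidates the cache only from the first COPY whose file checksums
  # differ, so just the source layer (and anything after it) is rebuilt
  docker build -f Dockerfile.ci -t ci/malu:latest .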

Drawbacks

  • Docker does not perform any sort of cache cleanup on its own, and the image layers can grow quite large on a single host. We’ll need some sort of cleanup process (with a mutex) to keep worker filesystem usage in check; a rough sketch of such a job appears after this list.
  • Developers have to have some basic Docker knowledge to make full use of its incremental/layered caching, primarily the way Docker uses RUN strings and COPY mtimes/checksums to identify/cache intermediate layers. We can of course provide documentation and base images to help in this regard.
  • If we want to ease the configuration burden placed on developers, reduce overall duplication of basic configuration, and achieve greater consistency with environments further down the pipeline, we will need to maintain a registry and a suite of base images.
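
One option for the cleanup drawback above is a periodic job along these lines, run under a mutex so it never races itself (and, with the same lock taken by the build wrapper, never races an in-flight build); the lock path and retention policy are illustrative:

  #!/bin/bash
  # Hypothetical periodic cleanup for a Docker CI worker
  set -eu

  exec 9>/var/lock/docker-ci-cleanup.lock
  flock -n 9 || exit 0   # another holder of the lock is active; try again later

  # Remove stopped build containers
  docker ps -aq --filter status=exited | xargs -r docker rm

  # Remove dangling (untagged intermediate) images to keep disk usage in check
  docker images -qf dangling=true | xargs -r docker rmi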

Open questions

  • Is Docker a thing for production or for labs? Heard whispers about Rkt? The direction that Ops takes will have implications for how we proceed with an efficient and consistent release pipeline, so communication and collaboration are important here.
    • If, for example, we’re headed toward a containerized future in production, would we need the CI infrastructure to be congruent with staging and production so that we’re promoting image builds, not rebuilding at each stage?
  • Are there other requirements on the ops side?
  • Is this the right level of abstraction for repo maintainers or could they benefit from something even easier? (e.g. something like travis.yml)
  • Are permanent instances sufficient for this setup or should we pursue dedicated hardware? (Either way, we have an upper bound for scaling as we do with current Labs/Nodepool quotas.)
  • If/how to build and maintain base images (a rough sketch of the workflow follows this list)
  • Will we have a Wikimedia equivalent of Dockerhub? (tools as a private registry) (we do already!)
  • How to garbage collect intermediate and unused (dangling) images?
  • Can we centralize the Docker cache so that intermediate images can be reused across the fleet?
  • Do we really need Jenkins at all since there’s only a single generalized job? Or could this setup work just as well with “pure” Harbormaster workers?
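
On the base image question, the workflow would presumably look something like the following; the registry host and image name are placeholders, not an existing service:

  # Build and publish a shared base image to a WMF-run registry
  docker build -t docker-registry.example.org/ci/php-base:0.1 .
  docker push docker-registry.example.org/ci/php-base:0.1

  # Individual repos would then start their Dockerfile.ci from it, e.g.:
  #   FROM docker-registry.example.org/ci/php-base:0.1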

Conclusion and follow-up

All in all, this was a successful proof of concept. However, in addition to our open questions, the meeting with Ops and a follow-up meeting with @chasemp revealed that some of the additional needs around caching/hosting of images and cleanup might be met by Kubernetes. Since Ops is already investigating k8s for production and it's already working well for Tool Labs, we should pursue a k8s PoC in the same vein and see how it stacks up to this one.