We (@thcipriani and I, at least) discussed this during early exploration of using GitLab for CI, and concluded it was the right thing to do. I realized we're not doing it currently, might be worth some discussion.
Background docs:
Change 724472 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):
[operations/puppet@production] gitlab-runner: restrict allowed images and services
This seems reasonable and expected. gitlab.wikimedia.org is not a general purpose code forge but instead for projects in support of Wikimedia projects, just like Gerrit and our current CI infra.
@mmodell points out that the templates provided for .gitlab-ci.yml by default point to upstream Docker images.
Is this intended to be a security limitation or performance or ...?
Will we be allowed to use the base bullseye:latest image and apt-get install whatever we want? If so, can other apt repositories be added and used? At that point, what's the difference in allowing upstream or arbitrary docker images?
Or will integration/config still be used to gatekeep what packages can be installed?
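To make that concrete, the kind of job I have in mind would look something like this (a hypothetical .gitlab-ci.yml sketch; the image name, packages, and make target are just examples):

build:
  # Hypothetical job: start from a plain Debian base image and install build deps at job time.
  image: debian:bullseye
  before_script:
    - apt-get update -qq
    - apt-get install -y --no-install-recommends build-essential git
  script:
    - make test   # placeholder for the project's real test entry point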
At that point, what's the difference in allowing upstream or arbitrary docker images?
That's a fair question, and this is exactly the discussion I'd like to have. I think there's probably a reasonable argument to be made that if we trust the people deciding what Docker images to use, then this is just an unnecessary technical constraint. Certainly it seems to go against the grain of upstream's intended and documented usage to some extent.
I think maybe there's an equally reasonable argument that limiting the set of images would allow us to better ensure that they're audited and prevent situations where, e.g., somebody's effectively running a malicious binary blob by accident.
I think that many of the upstream Docker images should be fine, for instance images maintained by trustworthy organizations / open source projects. I don't know if there is a way to permit only a list of blessed images / orgs via some kind of allow-list, but that would also require extra maintenance work.
There is also the advantage of network locality with our own registry, which would improve performance over fetching images from the public internet. Even with a local cache, pulling from external registries might be a significant performance hit (though I don't have a good sense of how much impact that would have overall).
+1 from a security perspective, IMO. Smaller attack surface and, in theory, more audited than a random image from hub.docker.com. But is there a sizable risk reduction over using popular deb/ubuntu FOSS images or even something like alpine? Probably not.
I think part of my question got missed, specifically:
Will we be allowed to use the base bullseye:latest image and apt-get install whatever we want? If so, can other apt repositories be added and used?
If this *is* allowed, then there's less reason to use docker hub-based images.
In case it helps, here are some non-hypothetical scenarios of past requests:
I have a hard time seeing, in any of those cases, why restricting the use of Docker Hub-based images is anything other than busywork for people (it might not feel like that because it's also the status quo). In #1 especially, it seems it would be slower and worse for performance to have each build download and cache golang tarballs instead of pulling a cached, shared golang image.
Unless we stop using npm (and probably pip/composer too!) and don't allow arbitrary internet access, CI will be running arbitrary binary blobs downloaded from random places that no one has reviewed. So I don't know what is really gained security-wise here, given we're already pwned.
Thinking a bit more... I don't know if GitLab supports it, but one compromise would be to allow so-called "official" Docker Hub images like https://hub.docker.com/_/golang and https://hub.docker.com/_/rust - see https://github.com/docker-library/official-images
At least for me I think that plus apt-get would address all my concerns and use-cases.
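To sketch the kind of use I mean (the image tag and package here are illustrative, not a specific proposal):

test:
  # "Official" golang image from Docker Hub, plus a little apt on top.
  image: golang:1.17-bullseye
  before_script:
    - apt-get update -qq && apt-get install -y --no-install-recommends libsqlite3-dev
  script:
    - go test ./...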
To answer the original question:
Will we be allowed to use the base bullseye:latest image and apt-get install whatever we want? If so, can other apt repositories be added and used?
We haven't done anything to prohibit this, and doing so hasn't been on my radar personally at least.
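So, for example, nothing in the current setup should stop a job along these lines (sketch only; the repository URL, key, and package names are placeholders, not an endorsement):

build:
  image: debian:bullseye
  before_script:
    # Add a third-party apt repository inside the job, then install from it.
    - apt-get update -qq && apt-get install -y --no-install-recommends ca-certificates curl gnupg
    - curl -fsSL https://example.org/apt/key.gpg | gpg --dearmor -o /usr/share/keyrings/example.gpg
    - echo "deb [signed-by=/usr/share/keyrings/example.gpg] https://example.org/apt bullseye main" > /etc/apt/sources.list.d/example.list
    - apt-get update -qq && apt-get install -y example-package
  script:
    - make   # placeholder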
Thanks for clearly articulating the case against the limitation. It is a fairly compelling one, especially from a busywork-reduction point of view.
As I understand it, the status quo with regards to the CI security model is that we don't want to allow code execution on the persistent CI agents and don't trust Docker to be a sufficient isolation model in the sense that we don't currently allow sudo/root execution within the Docker containers.
I haven't analysed this recently or thought about it much, and I don't know Linux kernel details well. But, I would not be surprised if the ability to choose a third-party container would be considered equivalent to having root within the Docker process. Can someone confirm/deny that? I imagine that for the container to be initialised, its local concept of root does exist and could perhaps run some amount of image code before the designated "nobody"-user command springs into action etc.
Given that apt-get is allowed, users will have root access on all containers, whether they're third-party or not.
Neither root nor apt-get is allowed in the CI agents we have today with Jenkins/Zuul. It would be good to document somewhere the security analysis and risk assessment if we decide to accept this for GitLab CI going forward, or (if that is the case) that RelEng was already comfortable with this for many years but just didn't enable it. It is my understanding that many hundreds of work hours have been spent on working around this restriction in the name of security.
I understand this task is about the image registry, but I don't think we should take for granted that root will be allowed in GitLab just because GitLab happens to work that way by default and we happen to have it set up to run those containers directly on persistent VMs. This also seems contrary to upstream expectations and security support, as I'm guessing GitLab.com (and GitHub/Travis for that matter) don't run their users' root-enabled containers directly on persistent VMs or hardware they care about. Travis did have a short-lived experiment with running (non-root!) user containers directly on their cloud, but, despite not being root-enabled, even that was quickly ended after misuse, in favour of returning to a disposable VM pool.
Thanks everyone for the thoughtful input. I'm going to try to circle some folks up in the coming few weeks and think through the issues here, document approaches more thoroughly, etc.
In the meanwhile, noting so I don't lose track of it that it's interesting to look through the CI templates provided upstream:
https://gitlab.com/gitlab-org/gitlab/tree/master/lib/gitlab/ci/templates
Some of these lean on actions like running apt:
15:58:18 brennen@inertia:~/code/gitlab-org/gitlab/lib/gitlab/ci/templates (master=) ❁ git grep apt
Android.gitlab-ci.yml: - apt-get --quiet update --yes
Android.gitlab-ci.yml: - apt-get --quiet install --yes wget tar unzip lib32stdc++6 lib32z1
Android.latest.gitlab-ci.yml: - apt-get --quiet update --yes
Android.latest.gitlab-ci.yml: - apt-get --quiet install --yes wget tar unzip lib32stdc++6 lib32z1
C++.gitlab-ci.yml: # - apt update && apt -y install make autoconf
Chef.gitlab-ci.yml:# - apt-get update
Chef.gitlab-ci.yml:# - apt-get -y install rsync
Chef.gitlab-ci.yml:# - apt-get update
Chef.gitlab-ci.yml:# - apt-get -y install rsync
Clojure.gitlab-ci.yml: # - apt-get update -y
Crystal.gitlab-ci.yml: - apt-get update -qq && apt-get install -y -qq libxml2-dev
Django.gitlab-ci.yml: # - apt-get update -q && apt-get install nodejs -yqq
Grails.gitlab-ci.yml: - apt-get update -qq && apt-get install -y -qq unzip
Julia.gitlab-ci.yml: - apt-get update -qq && apt-get install -y git # needed by Documenter
Laravel.gitlab-ci.yml: - apt-get update -yqq
Laravel.gitlab-ci.yml: - apt-get install gnupg -yqq
Laravel.gitlab-ci.yml: - apt-get install git nodejs libcurl4-gnutls-dev libicu-dev libmcrypt-dev libvpx-dev libjpeg-dev libpng-dev libxpm-dev zlib1g-dev libfreetype6-dev libxml2-dev libexp
PHP.gitlab-ci.yml: - apt-get update -yqq
PHP.gitlab-ci.yml: - apt-get install -yqq git libpq-dev libcurl4-gnutls-dev libicu-dev libvpx-dev libjpeg-dev libpng-dev libxpm-dev zlib1g-dev libfreetype6-dev libxml2-dev libexpat1-dev li
Pages/JBake.gitlab-ci.yml: - apt-get update -qq && apt-get install -y -qq unzip zip
Pages/Jigsaw.gitlab-ci.yml: - apt-get update -yqq
Pages/Jigsaw.gitlab-ci.yml: - apt-get install -yqq gnupg zlib1g-dev libpng-dev
Pages/Jigsaw.gitlab-ci.yml: - apt-get install -yqq nodejs
Pages/Middleman.gitlab-ci.yml: - apt-get update -yqqq
Pages/Middleman.gitlab-ci.yml: - apt-get install -y nodejs
Pages/Middleman.gitlab-ci.yml: - apt-get update -yqqq
Pages/Middleman.gitlab-ci.yml: - apt-get install -y nodejs
Pages/Octopress.gitlab-ci.yml: - apt-get update -qq && apt-get install -qq nodejs
Ruby.gitlab-ci.yml: # - apt-get update -q && apt-get install nodejs -yqq
Rust.gitlab-ci.yml:# - apt-get update -yqq
Rust.gitlab-ci.yml:# - apt-get install -yqq --no-install-recommends build-essential
Scala.gitlab-ci.yml: - apt-get update -yqq
Scala.gitlab-ci.yml: - apt-get install apt-transport-https -yqq
Scala.gitlab-ci.yml: - echo "deb http://dl.bintray.com/sbt/debian /" | tee -a /etc/apt/sources.list.d/sbt.list
Scala.gitlab-ci.yml: - apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
Scala.gitlab-ci.yml: - apt-get update -yqq
Scala.gitlab-ci.yml: - apt-get install sbt -yqq
Security/Coverage-Fuzzing.gitlab-ci.yml: - if [ -x "$(command -v apt-get)" ] ; then apt-get update && apt-get install -y wget; fi
...but lots of others seem clearly written for images that are already batteries-included.
It does seem like some additional security modeling could help us pinpoint exactly what the risks and attack surfaces are or are not when running containerized jobs. I'm not going to do a good job of that, but I do feel there's some important differentiation to be done in this conversation.
In my view, there is an important distinction between the risks of root vs. non-root execution on one axis, and verified vs. arbitrary sources on the other.
I suppose you could consider these the four quadrants of those two axes (root/user, verified/arbitrary). What risks are introduced along each dimension?
Root execution: the effective UID of the process inside the container is root. My understanding, at least with Docker, is that this also means the UID is mapped to root on the host. The main risk here is privilege escalation to host root, and the main defense against that is the set of capabilities with which Docker (plus apparmor or similar) has configured the container. Container escape is possible if there is a vulnerability in that set of capabilities or in the kernel, and again container escape means getting host root.
User (non-root) execution: the effective UID of the process inside the container is some non-root user mapped to some non-root host user. It's my understanding that there is less risk of container escape under this condition, as those escapes usually require root techniques, so an attacker would first have to deal with their user-mode dilemma. If they did somehow escape without first gaining root in the container, they would escape as a non-root host user.
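As a quick empirical check of which of these cases a given runner configuration puts a job in, a throwaway job like the following is enough (just a sketch; image name is arbitrary):

whoami-check:
  image: debian:bullseye
  script:
    - id                       # effective UID/GID inside the job container
    - cat /proc/self/uid_map   # how container UIDs map to host UIDs; without user-namespace
                               # remapping, container root (0) is also host UID 0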
Verified sources: files across the filesystem, both root-owned and not (including things like sudo or setuid(0) executables), come only from sources that have been authenticated and authorized by WMF. For things like Debian packages and apt, we're really talking about which GPG keys are present in the root image and whether the root image comes from a source under WMF control. For base images, we're talking about restricting the registries from which container runtimes can pull.
This seems like a mechanism with a wide attack surface for reasons already mentioned, namely that it's difficult to audit everything from even a single source. However, we can decide which third parties we trust enough to allow directly or to mirror.
And there's interaction here with the effective process UID being root or user. When it comes to things like sudo and setuid and other root owned things, ensuring some degree of source authn/z means ensuring that non-root processes stay non-root.
Arbitrary sources: now we get into a tricky area, I think, because which sources do we mean exactly, and when does this interact with root/user execution in a problematic way?
If we're talking about either apt packages or root images, this means that files across the filesystem, both root-owned and not (again, including sudo or setuid executables), can come from anywhere. In that case it effectively negates any enforcement of the runtime process user, and we lose the first layer of protection against root container escape.
If we're talking about project package managers (npm and so forth), this has the same risk but only if we're executing as root and/or writing root-owned files that can be later exploited.
And then there's project code. It's arbitrary because we're executing against unmerged patchsets.
FWIW there are some options to avoid this like:
Not sure if gitlab-runner is compatible with those though.
Ok, this is a mess of words now. Sorry. :) IMO:
1. We must ensure container entrypoint processes run as non-root.
2. We must restrict image and apt sources, because otherwise we can't ensure the integrity of root-owned files, and so we can't ensure effective non-root execution.
3. We shouldn't care that much about arbitrary sources being installed from during unprivileged image build time (Blubber enforces this restriction for us now) or at runtime, so long as 1 and 2 are in place.
TBH this list of requirements (specifically #2) seems mostly like the status quo of CI, which is fine (not a regression!), but I worry this doesn't set us up for what people expect as "self-serve CI" listed on the future CI requirements.
So the containers will be running in a VM environment, no? And we can ensure that they restore to a snapshot or known-good baseline before testing a new patch? If we limit the absolute resource usage of the virtual machine and ensure a clean wipe/reset after each run then I think that root escape would have minimal real impact? Am I missing something?
I've seen a lot of attempts at rootless Docker (and other container runtimes) in image building toolchains, e.g. w/ buildkit, kaniko, buildpacks, etc. but I'm not familiar enough to say whether we can do that across the board for all CI containers. It's definitely worth looking into.
I was just made aware of the research @Jelto has been doing around possible platforms for GitLab CI in T286958: Document long-term requirements for GitLab job runners and here. Perhaps there's more info there.
I think yes and no. For example, we have these restrictions currently for the PipelineLib/Blubber based deployment pipelines. However, they still afford a high degree of self-service by allowing the project to define any number of image variants to build and/or execute. Container base images are restricted to those maintained by WMF as are—by extension—apt sources/packages, but the images can be extended according to the Blubber configs before being executed by stages in the pipeline.yaml. It doesn't allow for completely ad-hoc containers but it is still self-service and I would argue that it's the constraints/contract enforced by Blubber and PipelineLib that have allowed the system to be made available to developers without much operational concern.
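As a rough illustration of what that contract looks like in practice (Blubber v4-ish syntax from memory; the exact keys, base image, and packages here are illustrative, not a real project's config):

version: v4
base: docker-registry.wikimedia.org/bullseye    # base images limited to the WMF registry
apt:
  packages: [ca-certificates, python3]          # packages come via the image's WMF apt sources
variants:
  test:
    apt:
      packages: [python3-pytest]
    entrypoint: [pytest]                        # what the pipeline stage executes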
See T291017: Proof of concept for Buildpacks as WMF image build tool in GitLab for some recently enumerated requirements (see the mind map) for evaluating Blubber and Cloud Native Buildpacks as a possible Blubber replacement. We're really trying to strike the right balance between "developer empowering" and "operating standards."
As of now we're using WMCS VPSs, but even then I don't think we want to or can do a wipe after each run. The future may look very different (see T286958: Document long-term requirements for GitLab job runners).
Executing as root probably makes it easier to poison the shared runners, but the worries are the same either way—either they poison the runner as root or they use the container to do something bad.
These worries won't be alleviated by limiting our base images, but an allowlist of base images is meant to make it harder to use CI to mine crypto or escalate privilege.
I see at least two solutions:
This is status quo for current CI.
Problems
FWIW, I really want ephemeral runners. I think this is how CI systems all work now.
Problems
None of the above is GitLab specific, and I don't want to hold up GitLab progress until we find a perfect solution. I also don't want to make it hard for people to get things done in CI (unless those people are crypto miners or would-be DDoSers).
WMCS, KDE, and our current CI all limit base images with the goal of ensuring trusted groups can use CI while adding roadblocks for bad people.
While KDE limits base images, their config is a little more open than ours:
allowed_images = ["kdeorg/*:*", "ubuntu:*", "debian:*", "fedora:*", "centos/*:*", "opensuse/*:*", "python:*", "ruby:*", "mcr.microsoft.com/powershell:*", "registry.gitlab.com/gitlab-org/security-products/analyzers/*:*"]
^ @brennen recently proposed doing something like this as a good compromise, and I think that makes sense
Change 737801 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):
[operations/puppet@production] gitlab runners: define an allowlist for images
Agreed that different base images make it harder to escalate privileges, but I don't think it really makes any difference for crypto mining.
This seems like a good idea to me. What will the criteria be for adding new upstream images? (My request is of course going to be for rust:* and rustlang/rust:nightly. I and I think addshore also use registry.gitlab.com/gitlab-org/release-cli:latest to have CI create new GitLab releases.)
Change 737801 abandoned by Brennen Bearnes:
[operations/puppet@production] gitlab runners: define an allowlist for images
Reason:
Rebasing 724472 instead.
What will the criteria be for adding new upstream images? (My request is of course going to be for rust:* and rustlang/rust:nightly. I and I think addshore also use registry.gitlab.com/gitlab-org/release-cli:latest to have CI create new GitLab releases.)
Essentially I think it's "seems reasonable and is actively maintained", but we could probably come up with a more detailed list.
First pass on https://gerrit.wikimedia.org/r/c/operations/puppet/+/724472 including those suggestions.
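For reference, the relevant part of the runner configuration ends up looking roughly like this (paraphrased, not the literal puppet template; the patch above is authoritative):

[[runners]]
  executor = "docker"
  [runners.docker]
    # Only images matching these patterns can be used by jobs; allowed_services
    # is restricted in the same change.
    allowed_images = [
      "docker-registry.wikimedia.org/*",
      "centos/*:*",
      "debian:*",
      "fedora:*",
      "opensuse/*:*",
      "ubuntu:*",
      "python:*",
      "ruby:*",
      "rust:*",
      "rustlang/rust:nightly",
      "registry.gitlab.com/gitlab-org/*"
    ]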
Talked about doing this as part of Release-Engineering-Team (GitLab-a-thon 🦊) next week. Changed story point value to reflect remaining work.
Tested against a shared runner in WMCS. Works as expected. Sample error message:
ERROR: The "hashicorp/terraform" image is not present on list of allowed images:
- docker-registry.wikimedia.org/*
- centos/*:*
- debian:*
- fedora:*
- opensuse/*:*
- ubuntu:*
- python:*
- ruby:*
- rust:*
- rustlang/rust:nightly
- registry.gitlab.com/gitlab-org/*
Please check runner's configuration: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#restricting-docker-images-and-services
ERROR: Failed to remove network for build
ERROR: Preparation failed: disallowed image
Will be retried in 3s ...
ERROR: Job failed (system failure): disallowed image
Nice! Don't suppose there's a way to get it to spit out a custom message that says they're locked down intentionally at Wikimedia GitLab and point at a page where people can ask for a specific image to be added to the list if there's some special extra need? Otherwise people will be stranded without knowing exactly where to go. But the error is already pretty clear on what images are allowed.
Don't suppose there's a way to get it to spit out a custom message that says they're locked down intentionally at Wikimedia GitLab and point at a page where people can ask for a specific image to be added to the list if there's some special extra need?
That would definitely be nice to have. I don't see anything obvious to hook into in allowed_images.go. The dirty hack that leaps to mind is a fake docker-registry.wikimedia.org/WIKIMEDIA_IMAGES_DELIBERATELY_LOCKED_SEE_NAME_OF_HELPFUL_WIKI_PAGE or something.
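In config terms, something like:

allowed_images = [
  "docker-registry.wikimedia.org/*",
  # Sentinel entry whose name is the message; it never has to resolve to a real image.
  "docker-registry.wikimedia.org/WIKIMEDIA_IMAGES_DELIBERATELY_LOCKED_SEE_NAME_OF_HELPFUL_WIKI_PAGE"
]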
Change 724472 merged by Jelto:
[operations/puppet@production] gitlab runner: restrict docker images and services
Noting from IRC:
<jelto> brennen: I merged gitlab runner: restrict docker images and services (https://gerrit.wikimedia.org/r/c/operations/puppet/+/724472) and re-registered Trusted Runners. I did not re-registered WMCS runners
I re-registered the shared runners.
Thanks!