
Limit GitLab shared runners to images from Wikimedia Docker registry
Closed, Resolved · Public · 2 Estimated Story Points

Description

We (@thcipriani and I, at least) discussed this during early exploration of using GitLab for CI, and concluded it was the right thing to do. I realized we're not doing it currently; it might be worth some discussion.

Background docs:

Event Timeline

Change 724472 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[operations/puppet@production] gitlab-runner: restrict allowed images and services

https://gerrit.wikimedia.org/r/724472
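
(For reference, the runner-side knob here is the allowed_images / allowed_services setting of the Docker executor. A minimal sketch of what the Puppet-managed config.toml might contain, illustrative only; the actual change is in the Gerrit patch above:)

  # /etc/gitlab-runner/config.toml (excerpt); illustrative sketch, not the patch contents
  [[runners]]
    name = "shared-runner-example"   # hypothetical runner name
    executor = "docker"
    [runners.docker]
      image = "docker-registry.wikimedia.org/bullseye:latest"   # hypothetical default image
      # Only allow jobs to request images and services from the Wikimedia registry:
      allowed_images = ["docker-registry.wikimedia.org/*"]
      allowed_services = ["docker-registry.wikimedia.org/*"]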

This seems reasonable and expected. gitlab.wikimedia.org is not a general-purpose code forge but is instead for projects in support of Wikimedia projects, just like Gerrit and our current CI infra.

@mmodell points out that the templates provided for .gitlab-ci.yml by default point to upstream Docker images.

Example: https://gitlab.wikimedia.org/releng/ddd/-/blob/3191421f27679959e377d34fb9142492cfefaffe/.gitlab-ci.yml#L16
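
(For context, the pattern in question is a template default along these lines; a hypothetical excerpt, not the linked file verbatim:)

  # Hypothetical template default: the job image is pulled from Docker Hub
  # rather than docker-registry.wikimedia.org
  image: node:latest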

Tagging Security-Team for awareness, per discussion.

Is this intended to be a security limitation or performance or ...?

Will we be allowed to use the base bullseye:latest image and apt-get install whatever we want? If so, can other apt repositories be added and used? At that point, what's the difference in allowing upstream or arbitrary docker images?

Or will integration/config still be used to gatekeep what packages can be installed?

At that point, what's the difference in allowing upstream or arbitrary docker images?

That's a fair question, and this is exactly the discussion I'd like to have. I think there's probably a reasonable argument to be made that if we trust the people deciding what Docker images to use, then this is just an unnecessary technical constraint. Certainly it seems to go against the grain of upstream's intended and documented usage to some extent.

I think maybe there's an equally reasonable argument that limiting the set of images would allow us to better ensure that they're audited and prevent situations where, e.g., somebody's effectively running a malicious binary blob by accident.

I think that many of the upstream Docker images should be fine, for instance images maintained by trustworthy organizations / open source projects. I don't know if there is a way to permit a set of blessed images / orgs via some kind of allow-list, but that would also require extra maintenance work.

There is also the advantage of network locality with our own registry, which would improve performance over fetching images from the public internet. Even with a local cache that might be a significant performance hit (though I don't have a good sense of how much impact that would have overall).

I think maybe there's an equally reasonable argument that limiting the set of images would allow us to better ensure that they're audited and prevent situations where, e.g., somebody's effectively running a malicious binary blob by accident.

+1 from a security perspective, IMO. Smaller attack surface and, in theory, more audited than a random image from hub.docker.com. But is there a sizable risk reduction over using popular deb/ubuntu FOSS images or even something like alpine? Probably not.

I think part of my question got missed, specifically:

Will we be allowed to use the base bullseye:latest image and apt-get install whatever we want? If so, can other apt repositories be added and used?

If this *is* allowed, then there's less reason to use docker hub-based images.
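
(Concretely, what I'm asking is whether a job like the following would be allowed; a hypothetical sketch, and the registry image name is an assumption:)

  # Hypothetical job: registry-hosted Debian base plus apt-get,
  # instead of an upstream Docker Hub image
  test:
    image: docker-registry.wikimedia.org/bullseye:latest   # assumed image name
    before_script:
      - apt-get update -yqq
      - apt-get install -yqq python3.9 python3-pytest
    script:
      - python3.9 -m pytest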

In case it helps, here are some non-hypothetical scenarios of past requests:

  1. Someone wants to use a new golang version that isn't packaged in Debian yet (T283425). They could either use the upstream golang image (https://hub.docker.com/_/golang) or pull a base Debian image from our registry and use curl/wget to download a pre-built version from https://golang.org/dl/, extract it, and use it.
  2. Someone wants to use the latest upstream Rust version (T256827). This means updating the integration/config Rust images every ~6 weeks when a new release comes out. Those images are identical to the upstream Rust images, so it's just duplicating work instead of pulling the upstream images.
  3. Someone wants to use a new Python version that is packaged in Debian and in active use in production, but isn't in a published image yet and has been -1'd to go into integration/config (T289222). They could just use a Debian base image and apt-get install python3.9, or use an upstream python:3.9 image.

I have a hard time seeing in any of those cases why restricting the use of docker hub-based images is anything other than causing busy work for people (it might not feel like that because it's also the status quo). In #1 especially, it seems it would be slower and worse for performance to have each build download and cache golang tarballs instead of caching a shared golang image.
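
For illustration, the two options in scenario 1 look roughly like this (hypothetical job names; the registry image name and Go version are assumptions):

  # Option A: upstream golang image from Docker Hub (disallowed under the proposed restriction)
  build-go-upstream:
    image: golang:1.17
    script:
      - go build ./...

  # Option B: Wikimedia-registry base image, downloading the toolchain on every build
  build-go-registry:
    image: docker-registry.wikimedia.org/bullseye:latest   # assumed image name
    before_script:
      - apt-get update -yqq && apt-get install -yqq curl ca-certificates
      - curl -fsSL https://golang.org/dl/go1.17.linux-amd64.tar.gz | tar -xz -C /usr/local
      - export PATH="/usr/local/go/bin:$PATH"
    script:
      - go build ./...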

I think maybe there's an equally reasonable argument that limiting the set of images would allow us to better ensure that they're audited and prevent situations where, e.g., somebody's effectively running a malicious binary blob by accident.

Unless we stop using npm (and probably pip/composer too!) and don't allow arbitrary internet access, CI will be running arbitrary binary blobs downloaded from random places that no one has reviewed. So I don't know what is really gained security-wise here, given we're already pwned.

Thinking a bit more... I don't know if GitLab supports it, but one compromise would be to allow so-called "official" Docker Hub images like https://hub.docker.com/_/golang and https://hub.docker.com/_/rust - see https://github.com/docker-library/official-images

At least for me I think that plus apt-get would address all my concerns and use-cases.

To answer the original question:

Will we be allowed to use the base bullseye:latest image and apt-get install whatever we want? If so, can other apt repositories be added and used?

We haven't done anything to prohibit this, and doing so hasn't been on my radar personally at least.

Thanks for clearly articulating the case against the limitation. It is a fairly compelling one, especially from a busywork-reduction point of view.

As I understand it, the status quo with regard to the CI security model is that we don't want to allow code execution on the persistent CI agents, and we don't trust Docker to be a sufficient isolation model, in the sense that we don't currently allow sudo/root execution within the Docker containers.

I haven't analysed this recently or thought about it much, and I don't know Linux kernel details well. But, I would not be surprised if the ability to choose a third-party container would be considered equivalent to having root within the Docker process. Can someone confirm/deny that? I imagine that for the container to be initialised, its local concept of root does exist and could perhaps run some amount of image code before the designated "nobody"-user command springs into action etc.

But, I would not be surprised if the ability to choose a third-party container would be considered equivalent to having root within the Docker process. Can someone confirm/deny that? I imagine that for the container to be initialised, its local concept of root does exist and could perhaps run some amount of image code before the designated "nobody"-user command springs into action etc.

Given that apt-get is allowed, users will have root access on all containers, whether they're third-party or not.

But, I would not be surprised if the ability to choose a third-party container would be considered equivalent to having root within the Docker process. Can someone confirm/deny that? I imagine that for the container to be initialised, its local concept of root does exist and could perhaps run some amount of image code before the designated "nobody"-user command springs into action etc.

Given that apt-get is allowed, users will have root access on all containers, whether they're third-party or not.

Neither root nor apt-get is allowed in the CI agents we have today with Jenkins/Zuul. It would be good to document somewhere the security analysis and risk assessment if we decide to accept this for GitLab CI going forward, or (if that is the case) that RelEng was comfortable with this for many years already but just didn't enable it. It is my understanding that many hundreds of work hours have been spent on working around this restriction in the name of security.

I understand this task is about the image registry, but I don't think we should take for granted that root will be allowed in GitLab, just because GitLab happens to work that way by default, and because we happen to have it set up to run those containers directly on persistent VMs. This also seems contrary to upstream expectation and security support, as I'm guessing GitLab.com (and GitHub/Travis for that matter) don't run their users' root-enabled containers directly on persistent VMs or hardware they care about. Travis did have a short-lived experiment with running (non-root!) user containers directly on their cloud, but despite not being root-enabled even that was quickly ended after misuse, in favour of returning to a disposable VM pool.

Thanks everyone for the thoughtful input. I'm going to try to circle some folks up in the coming few weeks and think through the issues here, document approaches more thoroughly, etc.


In the meantime (noting this so I don't lose track of it): it's interesting to look through the CI templates provided upstream:

https://gitlab.com/gitlab-org/gitlab/tree/master/lib/gitlab/ci/templates

Some of these lean on actions like running apt:

15:58:18 brennen@inertia:~/code/gitlab-org/gitlab/lib/gitlab/ci/templates (master=) ❁ git grep apt
Android.gitlab-ci.yml:  - apt-get --quiet update --yes
Android.gitlab-ci.yml:  - apt-get --quiet install --yes wget tar unzip lib32stdc++6 lib32z1
Android.latest.gitlab-ci.yml:  - apt-get --quiet update --yes
Android.latest.gitlab-ci.yml:  - apt-get --quiet install --yes wget tar unzip lib32stdc++6 lib32z1
C++.gitlab-ci.yml:  #   - apt update && apt -y install make autoconf
Chef.gitlab-ci.yml:#     - apt-get update
Chef.gitlab-ci.yml:#     - apt-get -y install rsync
Chef.gitlab-ci.yml:#     - apt-get update
Chef.gitlab-ci.yml:#     - apt-get -y install rsync
Clojure.gitlab-ci.yml:  # - apt-get update -y
Crystal.gitlab-ci.yml:  - apt-get update -qq && apt-get install -y -qq libxml2-dev
Django.gitlab-ci.yml:  # - apt-get update -q && apt-get install nodejs -yqq
Grails.gitlab-ci.yml:  - apt-get update -qq && apt-get install -y -qq unzip
Julia.gitlab-ci.yml:    - apt-get update -qq && apt-get install -y git  # needed by Documenter
Laravel.gitlab-ci.yml:  - apt-get update -yqq
Laravel.gitlab-ci.yml:  - apt-get install gnupg -yqq
Laravel.gitlab-ci.yml:  - apt-get install git nodejs libcurl4-gnutls-dev libicu-dev libmcrypt-dev libvpx-dev libjpeg-dev libpng-dev libxpm-dev zlib1g-dev libfreetype6-dev libxml2-dev libexp
PHP.gitlab-ci.yml:  - apt-get update -yqq
PHP.gitlab-ci.yml:  - apt-get install -yqq git libpq-dev libcurl4-gnutls-dev libicu-dev libvpx-dev libjpeg-dev libpng-dev libxpm-dev zlib1g-dev libfreetype6-dev libxml2-dev libexpat1-dev li
Pages/JBake.gitlab-ci.yml:  - apt-get update -qq && apt-get install -y -qq unzip zip
Pages/Jigsaw.gitlab-ci.yml:  - apt-get update -yqq
Pages/Jigsaw.gitlab-ci.yml:  - apt-get install -yqq gnupg zlib1g-dev libpng-dev
Pages/Jigsaw.gitlab-ci.yml:  - apt-get install -yqq nodejs
Pages/Middleman.gitlab-ci.yml:    - apt-get update -yqqq
Pages/Middleman.gitlab-ci.yml:    - apt-get install -y nodejs
Pages/Middleman.gitlab-ci.yml:    - apt-get update -yqqq
Pages/Middleman.gitlab-ci.yml:    - apt-get install -y nodejs
Pages/Octopress.gitlab-ci.yml:    - apt-get update -qq && apt-get install -qq nodejs
Ruby.gitlab-ci.yml:  # - apt-get update -q && apt-get install nodejs -yqq
Rust.gitlab-ci.yml:#   - apt-get update -yqq
Rust.gitlab-ci.yml:#   - apt-get install -yqq --no-install-recommends build-essential
Scala.gitlab-ci.yml:  - apt-get update -yqq
Scala.gitlab-ci.yml:  - apt-get install apt-transport-https -yqq
Scala.gitlab-ci.yml:  - echo "deb http://dl.bintray.com/sbt/debian /" | tee -a /etc/apt/sources.list.d/sbt.list
Scala.gitlab-ci.yml:  - apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
Scala.gitlab-ci.yml:  - apt-get update -yqq
Scala.gitlab-ci.yml:  - apt-get install sbt -yqq
Security/Coverage-Fuzzing.gitlab-ci.yml:    - if [ -x "$(command -v apt-get)" ] ; then apt-get update && apt-get install -y wget; fi

...but lots of others seem clearly written for images that are already batteries-included.

It does seem like some additional security modeling could help us pinpoint exactly what the risks and attack surfaces are or are not when running containerized jobs. I'm not going to do a good job of that, but I do feel there's some important differentiation to be done in this conversation.

In my view, there is an important distinction between the risks of:

  1. Containerized root execution of files from authenticated/authorized sources.
  2. Containerized user (non-root) execution of files from authenticated/authorized sources.
  3. Containerized root execution of files from arbitrary sources.
  4. Containerized user execution of files from arbitrary sources.

I suppose you could consider these the four quadrants of two axes: root/user and verified/arbitrary. What risks are introduced along each dimension?

Containerized root execution

In other words, the effective UID of the process inside the container is root. My understanding, at least with Docker, is that it also means the UID is mapped to root on the host. The main risk here is privilege escalation to host root, and the main defense against that is the set of capabilities with which Docker (+ apparmor/whatever) has configured the container. Container escape is possible if there is a vulnerability in that set of capabilities or in the kernel, and again container escape means getting host root.

Containerized user (non-root) execution

The effective UID of the process inside the container is some non-root user mapped to some non-root host user. It's my understanding that there is less risk of container escape under this condition as those escapes usually require root techniques and so an attacker would first have to deal with their user mode dilemma. If they did somehow escape without first gaining root in the container, they would escape as a non-root host user.

Authenticated/authorized sources

Files across the filesystem, both root-owned and not (including things like sudo or setuid(0) executables), come only from sources that have been authenticated and authorized by WMF. When it comes to things like Debian packages and apt, we're really talking about whether GPG keys are present in the root image and whether the root image comes from a source under WMF control. When we're talking about base images, we're talking about restricting the registries from which container runtimes can pull.

This seems like a mechanism with a wide attack surface for reasons already mentioned, namely that it's difficult to audit everything from even a single source. However, we can decide which third-parties we trust enough to allow directly or mirror.

And there's interaction here with the effective process UID being root or user. When it comes to things like sudo and setuid and other root-owned things, ensuring some degree of source authn/z means ensuring that non-root processes stay non-root.

Arbitrary sources

Now we get into a tricky area, I think, because which sources do we mean exactly, and when does this interact with root/user execution in a problematic way?

If we're talking about either apt packages or root images, this always means files across the filesystem, both root-owned and not (again, this includes sudo or setuid executables), can come from anywhere. In this case, it effectively negates any enforcement of the runtime process user, and we lose the first layer of protection against root container escape.

If we're talking about project package managers (npm and so forth), this has the same risk but only if we're executing as root and/or writing root-owned files that can be later exploited.

And then there's project code. It's arbitrary because we're executing against unmerged patchsets.


Ok, this is a mess of words now. Sorry. :) IMO:

  1. We must ensure container entrypoint processes are run non-root.
  2. We must restrict image and apt sources, because otherwise we can't ensure the integrity of root-owned files and so we can't ensure effective non-root execution.
  3. We should not care that much about arbitrary sources from which packages are installed so long as that occurs during non-root image build steps or runtime processes (Blubber enforces this restriction for us now in the deployment pipeline) and so long as 1 and 2 are satisfied.

In other words, the effective UID of the process inside the container is root. My understanding, at least with Docker, is that it also means the UID is mapped to root on the host.

FWIW there are some options to avoid this like:

Not sure if gitlab-runner is compatible with those though.

Ok, this is a mess of words now. Sorry. :) IMO:

  1. We must ensure container entrypoint processes are run non-root.
  2. We must restrict image and apt sources, because otherwise we can't ensure the integrity of root-owned files and so we can't ensure effective non-root execution.
  3. We should not care that much about arbitrary sources that packages are installed from during unprivileged image build time (Blubber enforces this restriction for us now) or at runtime, so long as 1 and 2 are in place.

TBH this list of requirements (specifically #2) seems mostly like the status quo of CI, which is fine (not a regression!), but I worry this doesn't set us up for what people expect from the "self-serve CI" listed in the future CI requirements.

So the containers will be running in a VM environment, no? And we can ensure that they restore to a snapshot or known-good baseline before testing a new patch? If we limit the absolute resource usage of the virtual machine and ensure a clean wipe/reset after each run then I think that root escape would have minimal real impact? Am I missing something?

In other words, the effective UID of the process inside the container is root. My understanding, at least with Docker, is that it also means the UID is mapped to root on the host.

FWIW there are some options to avoid this like:

Not sure if gitlab-runner is compatible with those though.

I've seen a lot of attempts at rootless Docker (and other container runtimes) in image building toolchains, e.g. w/ buildkit, kaniko, buildpacks, etc. but I'm not familiar enough to say whether we can do that across the board for all CI containers. It's definitely worth looking into.
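
For reference, the kaniko pattern that upstream documents for GitLab CI looks roughly like the following; a sketch only, registry and paths are placeholders, and I haven't validated it on our runners. It avoids docker-in-docker and privileged mode, though the kaniko container itself still runs as root, and pushing would additionally need registry credentials in /kaniko/.docker/config.json (omitted here):

  # Sketch of the upstream-documented kaniko image-build job (placeholders throughout)
  build-image:
    image:
      name: gcr.io/kaniko-project/executor:debug
      entrypoint: [""]
    script:
      - >-
        /kaniko/executor
        --context "$CI_PROJECT_DIR"
        --dockerfile "$CI_PROJECT_DIR/Dockerfile"
        --destination "registry.example.org/group/project:$CI_COMMIT_SHORT_SHA"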

I was just made aware of the research @Jelto has been doing around possible platforms for GitLab CI in T286958: Document long-term requirements for GitLab job runners and here. Perhaps there's more info there.

  1. We must ensure container entrypoint processes are run non-root.
  2. We must restrict image and apt sources, because otherwise we can't ensure the integrity of root-owned files and so we can't ensure effective non-root execution.
  3. We should not care that much about arbitrary sources that packages are installed from during unprivileged image build time (Blubber enforces this restriction for us now) or at runtime, so long as 1 and 2 are in place.

TBH this list of requirements (specifically #2) seems mostly like the status quo of CI, which is fine (not a regression!), but I worry this doesn't set us up for what people expect from the "self-serve CI" listed in the future CI requirements.

I think yes and no. For example, we have these restrictions currently for the PipelineLib/Blubber based deployment pipelines. However, they still afford a high degree of self-service by allowing the project to define any number of image variants to build and/or execute. Container base images are restricted to those maintained by WMF as are—by extension—apt sources/packages, but the images can be extended according to the Blubber configs before being executed by stages in the pipeline.yaml. It doesn't allow for completely ad-hoc containers but it is still self-service and I would argue that it's the constraints/contract enforced by Blubber and PipelineLib that have allowed the system to be made available to developers without much operational concern.

See T291017: Proof of concept for Buildpacks as WMF image build tool in GitLab for some recently enumerated requirements (see the mind map) for evaluating Blubber and Cloud Native Buildpacks as a possible Blubber replacement. We're really trying to strike the right balance between "developer empowering" and "operating standards."

So the containers will be running in a VM environment, no? And we can ensure that they restore to a snapshot or known-good baseline before testing a new patch? If we limit the absolute resource usage of the virtual machine and ensure a clean wipe/reset after each run then I think that root escape would have minimal real impact? Am I missing something?

As of now we're using WMCS VPSs, but even then I don't think we want to or can do a wipe after each run. The future may look very different (see T286958: Document long-term requirements for GitLab job runners).

What are we worried about

  • Executing random code (as root or not)
    • 🪙Crypto mining/DDoS
    • 🐍Injecting malicious code into artifacts or code/privilege escalation

Executing as root probably makes it easier to poison the shared runners, but the worries are the same either way—either they poison the runner as root or they use the container to do something bad.

These worries won't be alleviated by limiting our base images, but an allowlist of base images is meant to make it harder to use CI to mine crypto or escalate privilege.

I see at least two solutions:


🛡️"Trusted" images

This is the status quo for current CI.

  • Limit docker images to our repo
  • Limit root execution within containers to apt install

Problems

  • Equally hard to create a crypto-miner as to use a new PHP version
  • Hard to model trust: who can add images? When do we add images?

👐 Make CI require less trust

FWIW, I really want ephemeral runners. I think this is how all CI systems work now.

  • Isolated ephemeral runners
  • CI minutes limit sustained crypto-mining/DDoSing

Problems

  • In the distant past, Nodepool didn't work well with WMCS
  • We've been talking about k8s CI runners for years, but have never been resourced to do it

None of the above is GitLab specific, and I don't want to hold up GitLab progress until we find a perfect solution. I also don't want to make it hard for people to get things done in CI (unless those people are crypto miners or would-be DDoSers).

WMCS, KDE, and our current CI all limit base images with the goal of ensuring trusted groups can use CI while adding roadblocks for bad people.

While KDE limits base images, their config is a little more open than ours:

allowed_images = ["kdeorg/*:*", "ubuntu:*", "debian:*", "fedora:*", "centos/*:*", "opensuse/*:*", "python:*", "ruby:*", "mcr.microsoft.com/powershell:*", "registry.gitlab.com/gitlab-org/security-products/analyzers/*:*"]

^ @brennen recently proposed doing something like this as a good compromise, and I think that makes sense
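
Concretely, that might look something like the following in the runners' config.toml; a sketch, not the actual proposed patterns:

  [runners.docker]
    allowed_images = [
      "docker-registry.wikimedia.org/*",
      "debian:*",
      "ubuntu:*",
      "fedora:*",
      "centos/*:*",
      "opensuse/*:*",
      "python:*",
      "ruby:*",
      "registry.gitlab.com/gitlab-org/*"
    ]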

Change 737801 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[operations/puppet@production] gitlab runners: define an allowlist for images

https://gerrit.wikimedia.org/r/737801

These worries won't be alleviated by limiting our base images, but an allowlist of base images is meant to make it harder to use CI to mine crypto or escalate privilege.

Agreed that different base images make it harder to escalate privileges, but I don't think it really makes any difference for crypto mining.

WMCS, KDE, and our current CI all limit base images with the goal of ensuring trusted groups can use CI while adding roadblocks for bad people.

While KDE limits base images, their config is a little more open than ours:

allowed_images = ["kdeorg/*:*", "ubuntu:*", "debian:*", "fedora:*", "centos/*:*", "opensuse/*:*", "python:*", "ruby:*", "mcr.microsoft.com/powershell:*", "registry.gitlab.com/gitlab-org/security-products/analyzers/*:*"]

^ @brennen recently proposed doing something like this as a good compromise, and I think that makes sense

This seems like a good idea to me. What will the criteria be for adding new upstream images? (My request is of course going to be for rust:* and rustlang/rust:nightly. I, and I think addshore, also use registry.gitlab.com/gitlab-org/release-cli:latest to have CI create new GitLab releases.)

Change 737801 abandoned by Brennen Bearnes:

[operations/puppet@production] gitlab runners: define an allowlist for images

Reason:

Rebasing 724472 instead.

https://gerrit.wikimedia.org/r/737801

What will the criteria be for adding new upstream images? (My request is of course going to be for rust:* and rustlang/rust:nightly. I, and I think addshore, also use registry.gitlab.com/gitlab-org/release-cli:latest to have CI create new GitLab releases.)

Essentially I think it's "seems reasonable and is actively maintained", but we could probably come up with a more detailed list.

First pass on https://gerrit.wikimedia.org/r/c/operations/puppet/+/724472 including those suggestions.

thcipriani changed the point value for this task from 5 to 2. May 4 2022, 4:58 PM

Talked about doing this as part of Release-Engineering-Team (GitLab-a-thon 🦊) next week. Changed story point value to reflect remaining work.

brennen changed the task status from Open to In Progress. May 10 2022, 2:41 PM
brennen claimed this task.
brennen triaged this task as High priority.

Tested against a shared runner in WMCS. Works as expected. Sample error message:

ERROR: The "hashicorp/terraform" image is not present on list of allowed images:
- docker-registry.wikimedia.org/*
- centos/*:*
- debian:*
- fedora:*
- opensuse/*:*
- ubuntu:*
- python:*
- ruby:*
- rust:*
- rustlang/rust:nightly
- registry.gitlab.com/gitlab-org/*
Please check runner's configuration:
    https://docs.gitlab.com/runner/configuration/advanced-configuration.html#restricting-docker-images-and-services
ERROR: Failed to remove network for build
ERROR: Preparation failed: disallowed image
Will be retried in 3s ...
ERROR: Job failed (system failure): disallowed image
brennen changed the task status from In Progress to Stalled. May 17 2022, 10:10 PM
brennen moved this task from Doing to Needs/Waiting Review on the User-brennen board.

Nice! Don't suppose there's a way to get it to spit out a custom message that says they're locked down intentionally at Wikimedia GitLab and point at a page where people can ask for a specific image to be added to the list if there's some special extra need? Otherwise people will be stranded without knowing exactly where to go. But the error is already pretty clear on what images are allowed.

Don't suppose there's a way to get it to spit out a custom message that says they're locked down intentionally at Wikimedia GitLab and point at a page where people can ask for a specific image to be added to the list if there's some special extra need?

That would definitely be nice to have. I don't see anything obvious to hook into in allowed_images.go. The dirty hack that leaps to mind is a fake docker-registry.wikimedia.org/WIKIMEDIA_IMAGES_DELIBERATELY_LOCKED_SEE_NAME_OF_HELPFUL_WIKI_PAGE or something.
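
i.e., something roughly like this in the runner config (sketch only), relying on the fact that every allowed_images entry gets echoed back in the disallowed-image error listing:

  allowed_images = [
    "docker-registry.wikimedia.org/*",
    # Never matches a real image; exists only so the name shows up in the error output:
    "docker-registry.wikimedia.org/WIKIMEDIA_IMAGES_DELIBERATELY_LOCKED_SEE_NAME_OF_HELPFUL_WIKI_PAGE"
  ]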

Change 724472 merged by Jelto:

[operations/puppet@production] gitlab runner: restrict docker images and services

https://gerrit.wikimedia.org/r/724472

brennen moved this task from Needs/Waiting Review to Done or Declined on the User-brennen board.

Noting from IRC:

<jelto> brennen: I merged gitlab runner: restrict docker images and services (https://gerrit.wikimedia.org/r/c/operations/puppet/+/724472) and re-registered Trusted Runners. I did not re-register WMCS runners

I re-registered the shared runners.

Thanks!