
Create a special-purpose Trusted Runner with Dockerfile frontend
Closed, Resolved (Public)

Description

In recent discussions (T351792, T356418), the need for GitLab Runners with the Dockerfile frontend enabled came up. Currently, Trusted Runners support blubber only.

We agreed that the Dockerfile frontend should not be generally available in production. For production images, builds based on blubber provide proven security settings that are hard to replicate with raw Dockerfiles, so most images should be built with blubber.

However, we identified a few use cases that need Dockerfile support, for example the build of base images and production images (currently handled by docker-pkg).
Also, some images used in the CI stack (like integration/config or buildkit) are based on a Dockerfile.

So for these use cases, and to unblock first tests with docker-pkg on GitLab, we decided to convert one of the Trusted Runners into a special-purpose Trusted Dockerfile Runner. We agreed the Runner should be available to the repos/releng and repos/sre GitLab groups. This clashes a bit with the current Trusted Runner policy, under which access has to be requested explicitly on a per-project basis. Making one Trusted Runner available to all RelEng and SRE projects would therefore change that policy.
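
The difference between the two access models can be sketched in a few lines of Python. This is only an illustration: the `RunnerPolicy` class, its fields, and the project paths `repos/sre/example-project` and `repos/releng-other/example-project` are hypothetical and do not reflect how gitlab-trusted-runner or GitLab itself store runner assignments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnerPolicy:
    """Hypothetical access policy for one class of Trusted Runner."""

    name: str
    allowed_projects: frozenset = frozenset()  # explicit per-project grants
    allowed_groups: tuple = ()  # group-wide grants, e.g. "repos/releng"

    def permits(self, project_path: str) -> bool:
        """Return True if the project may run jobs on this runner class."""
        if project_path in self.allowed_projects:
            return True
        # Match whole path segments so "repos/releng-other" is not "repos/releng".
        return any(
            project_path.startswith(group + "/") for group in self.allowed_groups
        )


# Current model: every project is granted access explicitly.
per_project = RunnerPolicy(
    "trusted", allowed_projects=frozenset({"repos/releng/buildkit"})
)

# Proposed model: all projects below two groups get access.
group_wide = RunnerPolicy(
    "trusted-dockerfile", allowed_groups=("repos/releng", "repos/sre")
)
```

Under the per-project model, `per_project.permits("repos/sre/example-project")` stays false until someone adds that project; under the group-wide model, `group_wide.permits(...)` is true for any project below repos/releng or repos/sre.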

Rough todos:

  • self-build docker.io/docker/dockerfile-upstream with the new Dockerfile Runner: docker-registry.discovery.wmnet/repos/releng/buildkit/dockerfile-frontend:experiment
  • decide if Trusted Dockerfile Runner should be available to all repos/releng and repos/sre or explicitly requested (similar to current Trusted Runners)
  • create a dedicated Trusted Runner type and get the token in repos/releng/gitlab-settings/runner-config.json
  • adapt automation in https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/ to make sure multiple classes of Trusted Runners can be managed
  • unregister one of the Trusted Runners
  • update profile::gitlab::runner::token with the new token for one of the Trusted Runners in private hiera
  • add Dockerfile frontend to profile::gitlab::runner::buildkitd_allowed_frontends and profile::gitlab::runner::buildkitd_allowed_gateway_sources
  • Run puppet on the new Trusted Runner
  • add https://gitlab.wikimedia.org/repos/releng/buildkit as a first project to the Trusted Dockerfile Runner and test
  • add a Dockerfile Runner to the test environment
  • update docs

For other, non-production use cases (like T351792) we could also discuss enabling the Dockerfile frontend on the Cloud/WMCS Runners in a dedicated task. But this would require a dedicated Docker registry as well as additional work and maintenance.

Details

| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| update dockerfile-frontend image tag | repos/releng/buildkit!61 | jelto | update-dockerfile-frontend | wmf/v0.12 |
| fix BUILD_VARIANT for dockerfile image build | repos/releng/buildkit!60 | jelto | fix-build-variant | wmf/v0.12 |
| fix syntax in buildkit Dockerfile | repos/releng/buildkit!59 | jelto | fix-dockerfile-syntax-buildkit | wmf/v0.12 |
| fix syntax in dockerfile-frontend Dockerfile | repos/releng/buildkit!58 | jelto | fix-dockerfile-syntax | wmf/v0.12 |
| manage dockerfile trusted runners in add-project.py script | repos/releng/gitlab-trusted-runner!66 | jelto | manage-dockerfile-runners | main |
| add Trusted Dockerfile Runner types to test and prod environment | repos/releng/gitlab-settings!59 | jelto | add-dockerfile-runners | main |
| add CI pipeline for dockerfile-frontend | repos/releng/buildkit!55 | jelto | add-ci-dockerfile | wmf/v0.12 |

Event Timeline

Jelto triaged this task as High priority. Feb 15 2024, 10:06 AM

Change #1013049 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: temporary allow dockerfile frontend on Trusted Runners

https://gerrit.wikimedia.org/r/1013049

Change #1013049 merged by Jelto:

[operations/puppet@production] gitlab: temporary allow dockerfile frontend on Trusted Runners

https://gerrit.wikimedia.org/r/1013049

Change #1013261 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "gitlab: temporary allow dockerfile frontend on Trusted Runners"

https://gerrit.wikimedia.org/r/1013261

Change #1013261 merged by Jelto:

[operations/puppet@production] Revert "gitlab: temporary allow dockerfile frontend on Trusted Runners"

https://gerrit.wikimedia.org/r/1013261

Change #1014005 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: unregister gitlab-runner2004 for dockerfile conversion

https://gerrit.wikimedia.org/r/1014005

Change #1014005 merged by Jelto:

[operations/puppet@production] gitlab_runner: unregister gitlab-runner2004 for dockerfile conversion

https://gerrit.wikimedia.org/r/1014005

Jelto updated the task description.

Change #1014485 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: allow dockerfile frontend on gitlab-runner2004

https://gerrit.wikimedia.org/r/1014485

Just a note that Toolforge has a collection of images that are built from Dockerfiles as well. There have been a few past discussions in the cloud-services-team, mostly very long ago now, about automating builds of these containers. I am interested to hear how this initial trusted Docker runner experiment works out.

Change #1014485 merged by Jelto:

[operations/puppet@production] gitlab_runner: allow dockerfile frontend on gitlab-runner2004

https://gerrit.wikimedia.org/r/1014485

The Trusted Dockerfile Runner gitlab-runner2004 is available now. The first project which is allowed to use this runner is buildkit. I merged the change above to build the dockerfile frontend image also in CI, which should be a good test.

@dancy are you okay to push a new tag for buildkit to trigger the image build pipeline?

If this works as expected, the only missing steps are to update docs about the Dockerfile Runner and install a Dockerfile Runner in our test environment as well.

> Just a note that Toolforge has a collection of images that are built from Dockerfiles as well. There have been a few past discussions in the cloud-services-team, mostly very long ago now, about automating builds of these containers. I am interested to hear how this initial trusted Docker runner experiment works out.

Thanks for bringing this topic up. Do you know in which registry these images live? I could not find them in production. I guess it's docker-registry.toolforge.org?
In theory we could set up another Dockerfile Runner in WMCS (or just re-brand one of the existing runners there) and allow Dockerfile builds for these toolforge projects. For the production registry we configured a jwt-authorizer to control which machines can push images to the registry. I guess that's also needed for the toolforge registry, so we would need something similar there as well.

> The Trusted Dockerfile Runner gitlab-runner2004 is available now. The first project which is allowed to use this runner is buildkit. I merged the change above to build the dockerfile frontend image also in CI, which should be a good test.
>
> @dancy are you okay to push a new tag for buildkit to trigger the image build pipeline?
>
> If this works as expected, the only missing steps are to update docs about the Dockerfile Runner and install a Dockerfile Runner in our test environment as well.

Two jobs failed with the error "'docker/dockerfile-upstream:master' is not an allowed gateway frontend":
https://gitlab.wikimedia.org/repos/releng/buildkit/-/jobs/235180
https://gitlab.wikimedia.org/repos/releng/buildkit/-/jobs/235181

>> Just a note that Toolforge has a collection of images that are built from Dockerfiles as well. There have been a few past discussions in the cloud-services-team, mostly very long ago now, about automating builds of these containers. I am interested to hear how this initial trusted Docker runner experiment works out.
>
> Thanks for bringing this topic up. Do you know in which registry these images live? I could not find them in production. I guess it's docker-registry.toolforge.org?

The backend repository for these images is hosted at docker-registry.tools.wmflabs.org, but yes these are the images documented by docker-registry.toolforge.org.

> In theory we could set up another Dockerfile Runner in WMCS (or just re-brand one of the existing runners there) and allow Dockerfile builds for these toolforge projects.

I don't know enough about the pros and cons of re-branding vs spinning up a new instance to choose, but a dedicated runner does seem like a good idea because of the various separation concerns.

> For the production registry we configured a jwt-authorizer to control which machines can push images to the registry. I guess that's also needed for the toolforge registry, so we would need something similar there as well.

Yes, we would definitely want to have some access control that keeps arbitrary projects from pushing to the docker-registry.tools.wmflabs.org backend. This registry is currently read-only to anyone outside the Toolforge admin team. We have an entirely separate registry for containers controlled by tool maintainers, which is part of our build service system.

@dcaro, @aborrero, and @taavi should be pulled into the planning process when folks have time and energy to work on this idea.

For that use case converting the repository to something like docker-pkg would be a better idea I think. I don't think we want to rebuild all the images for every single commit.

>> The Trusted Dockerfile Runner gitlab-runner2004 is available now. The first project which is allowed to use this runner is buildkit. I merged the change above to build the dockerfile frontend image also in CI, which should be a good test.
>>
>> @dancy are you okay to push a new tag for buildkit to trigger the image build pipeline?
>>
>> If this works as expected, the only missing steps are to update docs about the Dockerfile Runner and install a Dockerfile Runner in our test environment as well.
>
> Two jobs failed with the error "'docker/dockerfile-upstream:master' is not an allowed gateway frontend":
> https://gitlab.wikimedia.org/repos/releng/buildkit/-/jobs/235180
> https://gitlab.wikimedia.org/repos/releng/buildkit/-/jobs/235181

This issue should be fixed by using the correct image in the # syntax stanza.
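
For readers unfamiliar with the # syntax line: it is a parser directive at the top of a Dockerfile that names the frontend image BuildKit should use, and the runner only accepts frontends from its allow-list. The Python sketch below models that check in a simplified way. It is not BuildKit's implementation; the helper names and the contents of `ALLOWED_SOURCES` are assumptions, and so is comparing image names with the tag stripped.

```python
import re
from typing import Optional

# Parser directives look like "# key=value" and must come first in the file.
DIRECTIVE_RE = re.compile(r"^#\s*([A-Za-z]+)\s*=\s*(\S+)\s*$")

# Assumed allow-list; the real one lives in the runner's puppet configuration.
ALLOWED_SOURCES = {
    "docker-registry.wikimedia.org/repos/releng/buildkit/dockerfile-frontend",
}


def syntax_frontend(dockerfile_text: str) -> Optional[str]:
    """Return the image named by a leading '# syntax=' directive, if any."""
    for line in dockerfile_text.splitlines():
        match = DIRECTIVE_RE.match(line.strip())
        if match is None:
            return None  # first non-directive line ends the directive block
        if match.group(1).lower() == "syntax":
            return match.group(2)
    return None


def strip_tag(image: str) -> str:
    """Drop a trailing ':tag' or '@digest' from an image reference."""
    image = image.split("@", 1)[0]
    last_segment = image.rsplit("/", 1)[-1]
    if ":" in last_segment:
        image = image[: image.rindex(":")]
    return image


def is_allowed(dockerfile_text: str) -> bool:
    """True if the Dockerfile names a frontend that is on the allow-list.

    Simplification: a Dockerfile without a syntax directive counts as
    not allowed here.
    """
    frontend = syntax_frontend(dockerfile_text)
    return frontend is not None and strip_tag(frontend) in ALLOWED_SOURCES


upstream = "# syntax=docker/dockerfile-upstream:master\nFROM scratch\n"
wmf = (
    "# syntax=docker-registry.wikimedia.org/repos/releng/buildkit/"
    "dockerfile-frontend:experiment\nFROM scratch\n"
)
```

With these inputs, `is_allowed(upstream)` is false, which mirrors the failed jobs, while `is_allowed(wmf)` is true because the directive points at the self-built frontend image.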

I pushed another tag wmf-v0.12.5-10 but the build still fails. See
https://gitlab.wikimedia.org/repos/releng/buildkit/-/jobs/235485
https://gitlab.wikimedia.org/repos/releng/buildkit/-/jobs/235486

The error is:

error: failed to solve: exit code: 1

Buildkit process on the host logs:

Apr 03 07:50:32 gitlab-runner2004 docker[1356474]: runc run failed: unable to start container process: exec: "/run": permission denied
Apr 03 07:50:32 gitlab-runner2004 docker[1356474]: time="2024-04-03T07:50:32Z" level=error msg="/moby.buildkit.v1.frontend.LLBBridge/Solve returned error: rpc error: code = Unknown desc = exit code: 1"
Apr 03 07:50:32 gitlab-runner2004 docker[1356474]: time="2024-04-03T07:50:32Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = exit code: 1"

So I guess that's a problem with our self-built docker-registry.wikimedia.org/repos/releng/buildkit/dockerfile-frontend image? I also updated the experiment branch in the Buildkit project to make debugging this issue a bit easier. I'll do some more troubleshooting with the dockerfile-frontend image.

Regarding the toolforge image build: I'll create a dedicated task once the Trusted Dockerfile Runner is in place and working. That has a slightly different scope and, as far as I understand it, most likely needs another runner in WMCS.

Host rebooted by jelto@cumin1002 with reason: None

I've done some more research regarding self-building the dockerfile-frontend image. I compared the upstream image and the wmf image, and it's quite obvious that they are different images:

docker-registry.wikimedia.org/repos/releng/buildkit/dockerfile-frontend   experiment                     37dbb1928608   2 weeks ago     302MB
docker/dockerfile-upstream                                                latest                         ab56f6885c98   4 weeks ago     23.8MB

The Dockerfile is a multi-stage build. In the pipeline that self-builds the dockerfile-frontend image, BUILD_VARIANT is set to build (see). So the image that gets published is the golang image with all the build dependencies.

I tried setting BUILD_VARIANT to release locally, and it produced an image similar in size and behavior to docker/dockerfile-upstream. The release variant just copies the Go binary into a scratch image, so it's significantly smaller than the build variant.
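
The size gap between the two images listed above is a quick signal that the wrong stage was published. As a rough worked example, the Python snippet below converts the two sizes from the listing and compares them. The `parse_size` helper is hypothetical and assumes decimal units (1 MB = 10^6 bytes).

```python
def parse_size(text: str) -> float:
    """Convert a size string such as '302MB' to bytes, assuming decimal units."""
    units = {"GB": 1e9, "MB": 1e6, "kB": 1e3}
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * factor
    raise ValueError(f"unrecognised size: {text!r}")


wmf_build_variant = parse_size("302MB")  # self-built image, BUILD_VARIANT=build
upstream_release = parse_size("23.8MB")  # docker/dockerfile-upstream

ratio = wmf_build_variant / upstream_release
print(f"self-built image is about {ratio:.1f}x the upstream size")
```

The self-built image comes out at roughly 12.7 times the upstream size, which fits a published golang build stage rather than a scratch image containing a single binary.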

I'll try another round of building the dockerfile-frontend in the https://gitlab.wikimedia.org/repos/releng/buildkit pipeline with the correct variant.

Change #1018245 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004

https://gerrit.wikimedia.org/r/1018245

Change #1018245 merged by Jelto:

[operations/puppet@production] gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004

https://gerrit.wikimedia.org/r/1018245

Change #1018219 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] Revert "gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004"

https://gerrit.wikimedia.org/r/1018219

Change #1018219 merged by Jelto:

[operations/puppet@production] Revert "gitlab_runner: temporary allow dockerfile frontend on gitlab-runner2004"

https://gerrit.wikimedia.org/r/1018219

The rebuild of docker-registry.wikimedia.org/repos/releng/buildkit/dockerfile-frontend was successful, see pipelines. I also pushed a new tag wmf-v0.12.5-11, and the builds of the buildkit and dockerfile-frontend images were successful, see pipelines.

Change #1018944 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add dockerfile support for test runner in WMCS

https://gerrit.wikimedia.org/r/1018944

Change #1018944 merged by Jelto:

[operations/puppet@production] gitlab_runner: add dockerfile support for test runner in WMCS

https://gerrit.wikimedia.org/r/1018944

The Trusted Dockerfile Runner is available now and first tests with building the buildkit image were successful. I also adjusted the docs and added Dockerfile support to one of the test runners as well.

@bd808 @taavi regarding Dockerfile image builds for cloud: the general purpose Kubernetes Runners should have Dockerfile support already ("All untrusted runners support both Blubber and Dockerfile container configurations"). So you can also use these runners for first tests (just use the job tag kubernetes or cloud). Some kind of tooling is then needed to upload the resulting image to the correct toolforge registry.

So I'm closing this task. If follow-up work is needed (for example for the toolforge images), feel free to tag collaboration-services.