
Setup GitLab Runner in trusted environment
Closed, ResolvedPublic

Description

This task is for tracking the setup of GitLab Runners in a trusted environment.

In T286958 we discussed the long term requirements for GitLab Runners. One class of Runners should run in production environments (eqiad, codfw) and execute jobs which handle sensitive credentials and produce artifacts running in production. See also https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner#Specific_GitLab_Runners.

I would like to reuse the existing puppet code for the Shared Runners in WMCS. I think we could start with VMs on Ganeti and later order dedicated machines and/or migrate to some Kubernetes platform.
The Runners must not be used by arbitrary jobs but only by certain projects and branches. So these runners will be set up as Specific Runners, probably executing only jobs for protected branches.

Roughly the needed steps are:

  • set up dedicated Ganeti VMs in eqiad and codfw (gitlab-runner1001 and gitlab-runner2001)
  • adjust puppet code and install role on new ganeti VMs
  • register new runners as specific runners and run test job
  • add monitoring (Prometheus metrics and some Grafana dashboards)
  • validate GitLab application permission concept ⌛
    • runners should execute protected branches only
    • runners should be available only to allowed projects
    • non-privileged project pipelines (feature branches) cannot escalate privileges by altering the gitlab-ci.yml file
    • make sure Trusted Runners can only be used with protected branches
    • document permission and security concept
  • validate host security concept
    • runner jobs shouldn't be able to connect to other WMF services or hosts (except those explicitly permitted, like docker-registry, apt repos, chart museum)
      • harden firewall rules so no other services can be reached from GitLab Runner Docker containers
    • runners shouldn't be able to execute code with root privileges/escalate to root privileges
      • prevent privileged containers (privileged = false)
      • run gitlab-runner as non-root user
      • evaluate if containers can be executed as non-root (see T320411)
      • evaluate if certain capabilities can be dropped (see T320411)
  • create automation for managing and requesting access to Trusted Runners /repos/releng/gitlab-trusted-runner

Details

Repo                  Branch       Lines +/-
operations/puppet     production   +11 -1
operations/puppet     production   +1 -1
operations/puppet     production   +12 -0
operations/puppet     production   +6 -0
operations/cookbooks  master       +2 -6
operations/cookbooks  master       +49 -0
operations/puppet     production   +1 -0
operations/puppet     production   +24 -14
operations/puppet     production   +6 -2
operations/puppet     production   +3 -17
operations/puppet     production   +1 -0
operations/puppet     production   +1 -1
operations/puppet     production   +81 -26
operations/puppet     production   +2 -0
operations/puppet     production   +2 -0
operations/puppet     production   +2 -3
operations/puppet     production   +27 -13
operations/puppet     production   +28 -11
operations/puppet     production   +20 -4
operations/puppet     production   +7 -1
operations/puppet     production   +1 -1
operations/puppet     production   +33 -1
operations/puppet     production   +2 -0
operations/puppet     production   +1 -0
operations/puppet     production   +47 -51
operations/puppet     production   +1 -0
operations/puppet     production   +20 -11
operations/puppet     production   +8 -8
operations/puppet     production   +1 -0
operations/puppet     production   +4 -4
operations/puppet     production   +1 -1
operations/puppet     production   +69 -20
labs/private          master       +1 -0
operations/puppet     production   +1 -1
operations/puppet     production   +19 -0
operations/puppet     production   +0 -0
operations/puppet     production   +1 -1
operations/puppet     production   +10 -0
operations/puppet     production   +1 -1
operations/puppet     production   +13 -1
operations/puppet     production   +15 -3

Event Timeline


Change 747539 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: use config template for registering new runners

https://gerrit.wikimedia.org/r/747539

Change 748114 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: disable disk check for docker volumes

https://gerrit.wikimedia.org/r/748114

Change 748114 merged by Jelto:

[operations/puppet@production] gitlab_runner: disable disk check for docker volumes

https://gerrit.wikimedia.org/r/748114

I've done some more testing around managing the GitLab Runner configuration file config.toml using Puppet. The initial problem was that the gitlab-runner register command alters the config file at runtime (which conflicts with Puppet's workflow).

I see multiple options we could try to progress here:

  1. let puppet only create the config file but not alter it later (so that changes won't get overwritten by puppet)
  2. split the config and include a sub-config somehow (includes are not supported by the TOML specification)
  3. use a config template managed by puppet and use the --template-config parameter during registering (implemented in /operations/puppet/+/747539).
  4. don't register the runners with gitlab-runner register but with the Runner API, and have a static, puppet-managed config file
  5. Invent some more complex logic which merges multiple files during a puppet run

I implemented option 3 because, for me, it is the best compromise. Option 1 sounds misleading, because puppet code changes wouldn't immediately affect the actual production infrastructure. Option 5 sounds fragile and complex to implement. Option 3 explicitly creates a template, which is used during registration. It should be clear that the template affects the registration workflow only and has no effect at runtime. But I also think this is the major downside of this option: we would have to un-register the runners (ensure absent) and re-register them (ensure present) for config changes to take effect.
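As a rough illustration of option 3, the registration call could be assembled like this. This is only a sketch, not the code from the puppet change: the function name, the URL, the token placeholder, the template path and the tag names are hypothetical, and the docker executor may need further executor-specific flags, which can be passed via extra_args.

```python
import shlex


def build_register_command(url, token, template_path, tags, extra_args=()):
    """Build the argv for a non-interactive, template-based registration.

    All values passed in are placeholders chosen by the caller; nothing
    here is taken from the real puppet module.
    """
    cmd = [
        "gitlab-runner", "register",
        "--non-interactive",
        "--url", url,
        "--registration-token", token,
        # The template is merged into config.toml at registration time only.
        "--template-config", template_path,
        "--executor", "docker",
        "--tag-list", ",".join(tags),
        # Only pick up jobs that run on protected refs.
        "--access-level", "ref_protected",
        "--locked=true",
    ]
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    example = build_register_command(
        "https://gitlab.example.org/",              # hypothetical URL
        "REGISTRATION_TOKEN",                       # placeholder, not a real token
        "/etc/gitlab-runner/config-template.toml",  # hypothetical path
        ["trusted"],                                # hypothetical tag
    )
    print(shlex.join(example))
```

Because the template is only read during registration, changing it later means un-registering and re-registering the runner, which is the downside described above.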

I would favor option 4. and have all config files and tokens as static artifacts inside of puppet. But registering multiple runners and adding all of the tokens to private puppet would mean quite some work and we have to somehow automate this workflow.

I would like to discuss this issue with folks either here or in our weekly session.

I agree that option 1 sounds misleading and not great and option 5 sounds overly complex / brittle. Fully on the same page with you here.

My immediate thought was that 4 is the best but that I probably don't understand yet why that isn't so easy. Maybe let's talk some more about the process to get tokens into private puppet in an efficient way.

Change 751452 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] P:prometheus::ops: add prometheus job and ferm rules for gitlab_runner metrics

https://gerrit.wikimedia.org/r/751452

Change 747539 merged by Jelto:

[operations/puppet@production] gitlab_runner: use config template for registering new runners

https://gerrit.wikimedia.org/r/747539

Change 752137 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: fix missing url in registration command

https://gerrit.wikimedia.org/r/752137

Change 752137 merged by Jelto:

[operations/puppet@production] gitlab_runner: fix missing url in registration command

https://gerrit.wikimedia.org/r/752137

Change 752138 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: fix missing parameters in registration command

https://gerrit.wikimedia.org/r/752138

Change 752138 merged by Jelto:

[operations/puppet@production] gitlab_runner: fix missing parameters in registration command

https://gerrit.wikimedia.org/r/752138

Change 751452 merged by Jelto:

[operations/puppet@production] P:prometheus::ops: add prometheus job and ferm rules for gitlab_runner metrics

https://gerrit.wikimedia.org/r/751452

The GitLab Runners in eqiad and codfw now export Prometheus metrics.
I created a GitLab CI overview dashboard in Grafana. This dashboard also links to a new GitLab Runner detail dashboard.

I couldn't find a good GitLab Runner/CI dashboard online which is not outdated or doesn't require an additional exporter.

I've done some more evaluation and testing around the trusted GitLab Runners. For that I created a project under the /repos group to have access to the Shared Group runners in WMCS. I also explicitly allowed this project to use the Trusted Runners. I was able to execute un-reviewed changes on the Shared Runners in WMCS and reviewed changes on the Trusted Runners. I tried to document the current concept in GitLab/Gitlab_Runner#Access_and_permission_model (and also drew an overview diagram).

I also did some testing for special edge cases, such as changing the gitlab-ci.yml in a feature branch, forking a project, or removing certain rules/tags from CI jobs. For now it seems a proper separation of unreviewed and reviewed code changes/jobs is possible.

However there are two big implications when allowing projects access to the Trusted Runners:

  • Main branch has to be protected
  • People with maintainer permissions for that project can, to some extent, execute code on WMF infrastructure. So access to project maintainer permissions has to be controlled

For the first topic I also thought about some kind of check script, which automatically removes the trusted Runners from projects with unprotected main branches. There could also be a script to allow projects to the Trusted Runners, which could first protect the branch automatically.
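The decision logic of such a check script might look like the sketch below. It is purely illustrative: the function name and the shape of the project dictionaries are assumptions, the data would have to be fetched from the GitLab API first, and wildcard branch protections are ignored for simplicity.

```python
def runners_to_remove(projects, trusted_runner_ids):
    """Return (project_id, runner_id) pairs that should be unassigned.

    A project loses its Trusted Runners when its default branch is not
    in its list of protected branches. Each project is a dict with the
    hypothetical keys: id, default_branch, protected_branches, runner_ids.
    """
    trusted = set(trusted_runner_ids)
    removals = []
    for project in projects:
        if project["default_branch"] in set(project["protected_branches"]):
            continue  # main branch is protected, nothing to do
        for runner_id in project["runner_ids"]:
            if runner_id in trusted:
                removals.append((project["id"], runner_id))
    return removals


if __name__ == "__main__":
    sample = [
        {"id": 1, "default_branch": "main",
         "protected_branches": ["main"], "runner_ids": [10]},
        {"id": 2, "default_branch": "main",
         "protected_branches": [], "runner_ids": [10, 99]},
    ]
    # Project 2 has an unprotected main branch; runner 99 is not trusted.
    print(runners_to_remove(sample, trusted_runner_ids=[10]))
```

Keeping the decision separate from the API calls would make it easy to run the check in a dry-run mode before anything is actually removed.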

I'm going to do some more testing and also try to get the Security Team on board for a review once we have come up with a finished concept around the Trusted Runners.

Icinga downtime set by jelto@cumin1001 for 6:00:00 1 host(s) and their services with reason: move gitlab-runner1001 to new ganeti row

gitlab-runner1001.eqiad.wmnet

cookbooks.sre.hosts.decommission executed by jelto@cumin1001 for hosts: gitlab-runner1001.eqiad.wmnet

  • gitlab-runner1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

Change 757378 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: update MAC address of gitlab-runner1001

https://gerrit.wikimedia.org/r/757378

Change 757378 merged by Jelto:

[operations/puppet@production] install_server: update MAC address of gitlab-runner1001

https://gerrit.wikimedia.org/r/757378

Change 759254 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: execute gitlab-runner as non-root

https://gerrit.wikimedia.org/r/759254

Change 759254 merged by Jelto:

[operations/puppet@production] gitlab_runner: execute gitlab-runner as non-root

https://gerrit.wikimedia.org/r/759254

Change 768040 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: fix service definition

https://gerrit.wikimedia.org/r/768040

Change 768040 merged by Jelto:

[operations/puppet@production] gitlab_runner: fix service definition

https://gerrit.wikimedia.org/r/768040

Change 768683 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add gitlab-runner to docker group, change folder permissions

https://gerrit.wikimedia.org/r/768683

Change 768743 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] isystemd::sysuser: create option to add additional groups to user

https://gerrit.wikimedia.org/r/768743

Change 768743 merged by Jelto:

[operations/puppet@production] systemd::sysuser: create option to add additional groups to user

https://gerrit.wikimedia.org/r/768743

Change 768683 merged by Jelto:

[operations/puppet@production] gitlab_runner: add gitlab-runner to docker group, change folder permissions

https://gerrit.wikimedia.org/r/768683

Change 769065 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add dedicated service unit file

https://gerrit.wikimedia.org/r/769065

Change 769065 merged by Jelto:

[operations/puppet@production] gitlab_runner: add dedicated service unit file

https://gerrit.wikimedia.org/r/769065

Change 769737 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: cleanup service unit file

https://gerrit.wikimedia.org/r/769737

Change 769737 merged by Jelto:

[operations/puppet@production] gitlab_runner: cleanup service unit file

https://gerrit.wikimedia.org/r/769737

Change 769968 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: restrict docker traffic with additional ferm rules

https://gerrit.wikimedia.org/r/769968

Change 769968 merged by Jelto:

[operations/puppet@production] gitlab_runner: restrict docker traffic with additional ferm rules

https://gerrit.wikimedia.org/r/769968

Change 770891 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add missing hiera entry for WMCS

https://gerrit.wikimedia.org/r/770891

Change 770891 merged by Jelto:

[operations/puppet@production] gitlab_runner: add missing hiera entry for WMCS

https://gerrit.wikimedia.org/r/770891

Change 770893 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add missing hiera entry for WMCS

https://gerrit.wikimedia.org/r/770893

Change 770893 merged by Jelto:

[operations/puppet@production] gitlab_runner: add missing hiera entry for WMCS

https://gerrit.wikimedia.org/r/770893

Change 771633 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: remove duplicate ferm rule for AAAA

https://gerrit.wikimedia.org/r/771633

Change 771633 merged by Jelto:

[operations/puppet@production] gitlab_runner: remove duplicate ferm rule for AAAA

https://gerrit.wikimedia.org/r/771633

I mirrored wmf-sre-laptop to GitLab and created a very basic proof-of-concept CI to build the Debian package on Trusted Runners. The current implementation has limitations and is not complete. I created T304491 to further discuss the whole topic of Debian package builds on GitLab CI, as this is a bit out of scope for this task.

I'll proceed with polishing the documentation about the Trusted Runner setup and try to wrap my head around Docker image builds on Trusted Runners.

Change 773746 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add option to drop Docker capabilities

https://gerrit.wikimedia.org/r/773746

Change 775808 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: overwrite default service unit file

https://gerrit.wikimedia.org/r/775808

Change 775808 merged by Jelto:

[operations/puppet@production] gitlab_runner: overwrite default service unit file

https://gerrit.wikimedia.org/r/775808

Change 775815 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: override User of default service unit file

https://gerrit.wikimedia.org/r/775815

Change 775815 merged by Jelto:

[operations/puppet@production] gitlab_runner: override User of default service unit file

https://gerrit.wikimedia.org/r/775815

Change 775821 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: override ExecStart of default service unit file

https://gerrit.wikimedia.org/r/775821

Change 775821 merged by Jelto:

[operations/puppet@production] gitlab_runner: override ExecStart in service unit for non-root

https://gerrit.wikimedia.org/r/775821

Change 790369 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: move metrics listen_address to global section

https://gerrit.wikimedia.org/r/790369

Change 790369 merged by Dzahn:

[operations/puppet@production] gitlab_runner: move metrics listen_address to global section

https://gerrit.wikimedia.org/r/790369

This comment was removed by Jelto.

Trusted Runner automation and access request

I added a first version of a workflow for getting and managing access to Trusted Runners. I created /repos/releng/gitlab-trusted-runner.

This repo is used as a central project where Trusted Runners will get registered (they were registered on a private test project of mine before; I'll change the registration_token in private puppet soon). Furthermore, the repo also contains a configuration file and a script to automate access to the Trusted Runners.

projects.json contains a list of authorized projects:

{
  "182": {
    "name": "gitlab-runner-test",
    "reason": "To verify and test Trusted GitLab Runners"
  },
  "339": {
    "name": "gitlab-trusted-runner",
    "reason": "Root/registration project for Trusted Runners"
  }
}

If the script is executed, you can see a (color!) diff with the new configuration:

add-project.py diff

gitlab-runner2001.codfw.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

gitlab-runner1001.eqiad.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

gitlab-runner2002.codfw.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

gitlab-runner2003.codfw.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

gitlab-runner2004.codfw.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

gitlab-runner1002.eqiad.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

gitlab-runner1003.eqiad.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

gitlab-runner1004.eqiad.wmnet:
+project with id 339 repos / releng / Gitlab Trusted Runner
+project with id 182 repos / releng / Gitlab Runner Test

If that looks good, you can execute add-project.py apply to authorize the project.
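The diff step can be pictured roughly as follows. This is not the code of add-project.py; it is a minimal sketch that assumes the set of currently assigned project ids per runner has already been fetched, and all function names are made up for illustration.

```python
import json


def load_authorized(projects_json_text):
    """Parse projects.json content and return the authorized project ids."""
    return {int(project_id) for project_id in json.loads(projects_json_text)}


def diff_runner(authorized, assigned):
    """Compare authorized ids with the ids currently assigned to one runner."""
    to_add = sorted(authorized - assigned)
    to_remove = sorted(assigned - authorized)
    return to_add, to_remove


def render_diff(runner_name, to_add, to_remove):
    """Format the changes for one runner, one line per project."""
    lines = [f"{runner_name}:"]
    lines += [f"+project with id {project_id}" for project_id in to_add]
    lines += [f"-project with id {project_id}" for project_id in to_remove]
    return "\n".join(lines)


if __name__ == "__main__":
    config = '{"182": {"name": "gitlab-runner-test"}, "339": {"name": "gitlab-trusted-runner"}}'
    authorized = load_authorized(config)
    # A freshly registered runner has no projects assigned yet.
    added, removed = diff_runner(authorized, assigned=set())
    print(render_diff("gitlab-runner1001.eqiad.wmnet", added, removed))
```

Computing the additions and removals as set differences keeps the diff and apply steps consistent, since apply can act on exactly the lists that diff displayed.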

I'm planning to also create a CI, so the complete review and authorization workflow is inside of GitLab.

Further work

I'm planning to create non-blocking subtasks for

  • evaluate if containers can be executed as non-root
  • evaluate if certain capabilities can be dropped

This would be a useful addition to Trusted Runners but not strictly needed. But then we can move forward with making the Trusted Runners available for first projects.

I created CI jobs to assign authorized projects to Trusted Runners. This CI job uses the same script mentioned above. See /releng/gitlab-trusted-runner/gitlab-ci.yml.

The CI jobs need a global/personal access token, exposed in the variable $GITLAB_TOKEN. I tested the script with a project access token, but this token doesn't have enough permissions. I tested the job successfully with a temporary personal access token, which has since been revoked. I'm not sure what degree of automation we are aiming for. We can create an access token with global read/write permission and store it in a protected CI variable.

Another approach could be to not store the access token in GitLab. Then the token has to be pasted into GitLab every time we run the job. This can be done via the Run pipeline button. With this approach we wouldn't be able to use scheduled CI jobs.

@brennen @Dzahn @Arnoldokoth : I would like to get some feedback on what you think about putting GitLab admin credentials into automated CI jobs.

@Jelto - From a security perspective, as long as $GITLAB_TOKEN's value is never disclosed in any public, CI-related output and is only configurable by trusted Gitlab users, using it in this manner should be low risk. Inputting the value every time a CI job is triggered or run would likely only be marginally more secure, though would also be more prone to human error and obviously doesn't work for merge requests and pushes (but it sounds like the latter concern isn't that big of a deal in this case).
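One way to keep the token out of job output is sketched below. This is a hedged illustration, not part of the existing script: the helper names are hypothetical, and it only covers text the script itself prints.

```python
import os


def load_token(environ=os.environ, name="GITLAB_TOKEN"):
    """Read the access token from the environment and fail early if unset."""
    token = environ.get(name, "")
    if not token:
        # The error names the variable but never echoes a value.
        raise RuntimeError(f"{name} is not set; refusing to run")
    return token


def redact(text, token):
    """Mask any occurrence of the token before text is written to the log."""
    return text.replace(token, "[REDACTED]")


if __name__ == "__main__":
    fake_env = {"GITLAB_TOKEN": "example-token"}  # placeholder value
    token = load_token(fake_env)
    print(redact(f"request failed: token=example-token", token))
```

GitLab's masked and protected CI variable settings would still be the first line of defence; a helper like this only guards against the script printing the value by accident.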

Change 827456 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/cookbooks@master] sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners

https://gerrit.wikimedia.org/r/827456

Change 827456 merged by jenkins-bot:

[operations/cookbooks@master] sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners

https://gerrit.wikimedia.org/r/827456

Change 830189 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/cookbooks@master] sre.gitlab.reboot-runner: fix pre_scripts call

https://gerrit.wikimedia.org/r/830189

Change 830189 merged by jenkins-bot:

[operations/cookbooks@master] sre.gitlab.reboot-runner: fix pre_scripts call

https://gerrit.wikimedia.org/r/830189

Change 831481 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: allow Trusted Runners to access wikimedia docker-registry

https://gerrit.wikimedia.org/r/831481

Change 831481 merged by Jelto:

[operations/puppet@production] gitlab_runner: allow Trusted Runners to access wikimedia docker-registry

https://gerrit.wikimedia.org/r/831481

Some more explanation for the above edit:

Further security hardening of Docker daemon got a dedicated task T320411. Running Docker as non-root could enhance security even more but we should evaluate if that is needed at the moment.

I unchecked the firewall configuration because the currently configured firewall rules don't have the desired effect/are not working. This has to be fixed before we can proceed with opening the Trusted Runners even more.

I also added buildkit to the list as an option to build Docker images, because we are quite close to doing that (one test image has already been built and published).

Change 841910 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: restrict all internal traffic, not only TCP

https://gerrit.wikimedia.org/r/841910

Change 841912 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add webproxy to allowed_services

https://gerrit.wikimedia.org/r/841912

While doing more debugging around the firewall rules for Trusted Runners, I found out that the firewall mostly works as defined in puppet.

The Trusted Runners use a more restrictive firewall setting, which configures the additional ferm rule docker-default-reject. This rule consists of daddr 10.0.0.0/8 proto tcp REJECT;.

However, most of my test scripts used ICMP/ping instead of a TCP-based protocol. So it seemed the firewall did not have the desired effect of blocking access to the private network (10.0.0.0/8). TCP access to internal services is blocked properly with the existing rule (for example curling a svc.eqiad.wmnet service).

In the above change I extended the firewall rule to all protocols instead of TCP only: daddr 10.0.0.0/8 REJECT;. This should make verifying this rule easier and prevent any kind of ICMP/UDP port scans and exploration attacks.
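The difference between the two rules can be modelled in a few lines. This is a simplified sketch for reasoning about the rule only: it ignores rule ordering and the explicitly allowed services, and the function name and sample addresses are made up.

```python
import ipaddress

# Destination range used by the docker-default-reject rule.
PRIVATE_NET = ipaddress.ip_network("10.0.0.0/8")


def is_rejected(dst, proto, rule_proto=None):
    """Return True if a packet would match the reject rule.

    rule_proto="tcp" models `daddr 10.0.0.0/8 proto tcp REJECT;`
    rule_proto=None  models `daddr 10.0.0.0/8 REJECT;` (all protocols)
    """
    in_range = ipaddress.ip_address(dst) in PRIVATE_NET
    proto_matches = rule_proto is None or proto == rule_proto
    return in_range and proto_matches


if __name__ == "__main__":
    target = "10.2.2.1"  # arbitrary address inside 10.0.0.0/8
    # Old rule: a ping slips through, which made the firewall look broken.
    print(is_rejected(target, "icmp", rule_proto="tcp"))  # False
    # New rule: every protocol towards the private range is rejected.
    print(is_rejected(target, "icmp"))                    # True
```

This matches the observation above: TCP tests such as curl were already blocked, while ICMP-based tests only start failing once the protocol restriction is dropped from the rule.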

Change 841910 merged by Dzahn:

[operations/puppet@production] gitlab_runner: restrict all internal traffic, not only TCP

https://gerrit.wikimedia.org/r/841910

Change 841912 merged by Dzahn:

[operations/puppet@production] gitlab_runner: add webproxy to allowed_services

https://gerrit.wikimedia.org/r/841912

The firewall issues are solved with the patch above. So the Trusted Runners are functional now, including buildkitd support (T308271). A proof-of-concept Debian build and Docker image build were successful.

There is a related task for further hardening of the Docker daemon, T320411. Some work is also needed around more standardization, like T320730, T304491 and T286958. But that's outside the scope of this specific task, so I'm going to close this one.

Change 773746 abandoned by Jelto:

[operations/puppet@production] gitlab_runner: add option to drop Docker capabilities

Reason:

not needed at the moment

https://gerrit.wikimedia.org/r/773746