
Provision untrusted instance-wide GitLab job runners to handle user-level projects and merge requests from forks
Open, Medium, Public; 5 Estimated Story Points

Description

I'm filing this as a placeholder followup from T292094, where I mentioned:

After discussion today with SRE folks, we also expect to build:

  • Untrusted and variously constrained runners, probably on a 3rd-party host, to handle user-level projects and merge requests from forks.
    • This will be experimental, and a bunch of details will need to be worked out.

Open problems:

  • provision managed Kubernetes cluster on Digital Ocean
  • configure Kubernetes executor using gitlab/gitlab-runner helm chart
  • create ci-pipeline for both above (repos/releng/gitlab-cloud-runner/)
  • reduce timeout for CI jobs to 10m for Cloud Runners
  • reduce amount of CPU/MEM available to CI jobs
  • activate autoscaling for Kubernetes Node pool
  • create quota for available CI minutes (not possible in free tier)
  • ...?
  • create some kind of alerting or monitoring
  • restrict allowed images?
  • open Cloud Runners instance wide
  • announce Cloud Runners?

Event Timeline

brennen triaged this task as Medium priority.Dec 9 2021, 7:52 PM
brennen set the point value for this task to 5.

@thcipriani we discussed creating this task yesterday; I'd forgotten it already existed.

Also cc: @Jelto, per discussion at GitLab IC sync.

brennen renamed this task from Provision untrusted GitLab job runners to handle user-level projects and merge requests from forks to Provision untrusted instance-wide GitLab job runners to handle user-level projects and merge requests from forks.Feb 17 2022, 8:03 PM

We did some experimentation with how we'd like to run very untrusted workloads recently.

Basics of the current status quo

  • Workloads come from randos on the internet
  • Current (WMCS) runners (as of Feb 2022) are non-ephemeral and shared between jobs
  • Test jobs (pre-merge) share a network space
  • Workloads run in docker containers (mostly)

Problems

  • Workloads are untrusted and, worst case, could be DDoS attacks or cryptominers
  • Runners, worst case, could be poisoned by untrusted workloads and become part of a botnet to DDoS attack or mine crypto
  • Sharing a network namespace with untrusted workloads could cause network problems for trusted jobs, or make them targets of malicious workloads

The Idea

  • The need for trust should be lower
    • network isolate untrusted workloads
    • untrusted workloads should be time-boxed and run on ephemeral, use-once, boxen (k8s runners)

Workflow

Gitlab_new_volunteer_UML.png (664×1 px, 71 KB)

The diagram above shows the workflow:

  1. Unknown user forks
  2. Their tests run on ephemeral boxen that are network isolated from other workloads
  3. The test results are reported on their patch
  4. They send a merge request; after a trusted contributor merges the code, we trust it to run on a trusted builder

Experiments in progress

We have a k8s cluster on Digital Ocean that we're using to prove the viability of the model above. We talked it over with ServiceOps and WMCS, and that's a good path for the time being if everything works correctly. In future, we'll continually evaluate whether a third-party cloud is the right place to run this.


I wrote up some thoughts in GitLab/Gitlab_Runner/Cloud_Runners and started working on a GitLab project at https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner. The project has CI and Terraform code which provisions and configures the instance-wide Cloud Runners on Digital Ocean. I'm currently working on an additional CI job to also deploy the Kubernetes Executor automatically using helm. After that step and some configuration tweaks, Cloud Runners should be available.
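A pipeline like the one described might look roughly like this; the job names, images, and file names below are illustrative assumptions, not the actual contents of repos/releng/gitlab-cloud-runner:

```yaml
# Hypothetical .gitlab-ci.yml sketch: one stage provisions the cluster
# with Terraform, the next deploys the runner with Helm.
stages:
  - provision
  - deploy

provision-cluster:
  stage: provision
  image: hashicorp/terraform:1.1.6
  script:
    - terraform init
    - terraform apply -auto-approve   # creates the managed DO cluster

deploy-runner:
  stage: deploy
  image: alpine/helm:3.8.0
  script:
    - helm repo add gitlab https://charts.gitlab.io
    - >
      helm upgrade --install gitlab-runner gitlab/gitlab-runner
      --namespace gitlab-runner --create-namespace
      --values runner-values.yaml
```

Running both steps from CI rather than a local machine keeps the bootstrap process reproducible and self-documenting.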

With this approach, bootstrapping the Cloud Runners is easy and self-documenting because there is no need to run Terraform on your local machine. And it's quite neat that GitLab configures its own Runners.

@thcipriani besides your test Kubernetes cluster, there is now a cloud-runner cluster with autoscaling (currently min 1, max 2 nodes)

https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner now has CI for provisioning the managed Kubernetes cluster and setting up the Kubernetes Runner. That's mostly done using Terraform and Helm. So we have working Cloud Runners with autoscaling (min 1 and max 2 nodes).
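For reference, an autoscaling node pool of this shape can be expressed with the DigitalOcean Terraform provider roughly as follows; the names, region, node size, and version here are illustrative assumptions, not the repo's actual code:

```hcl
# Sketch of a managed DO Kubernetes cluster with an autoscaling
# node pool (min 1, max 2 nodes, matching the setup described above).
resource "digitalocean_kubernetes_cluster" "cloud_runner" {
  name    = "cloud-runner"    # assumed name
  region  = "fra1"            # assumed region
  version = "1.21.9-do.0"     # assumed k8s version

  node_pool {
    name       = "cloud-runner-pool"
    size       = "s-2vcpu-4gb"  # assumed droplet size
    auto_scale = true
    min_nodes  = 1
    max_nodes  = 2
  }
}
```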

@thcipriani Currently the Cloud Runners are limited to the gitlab-cloud-runner project (meaning effectively nobody can use them). Is there a plan for when instance-wide Cloud Runners should be available? Do you think it's reasonable to make them available during GitLab-a-thon as well?

I added more restrictive CPU and memory limits to the Cloud Runner configuration (0.1 CPU and 200Mi Memory). I also set the timeout for jobs to 300s which is the minimum.
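In the gitlab/gitlab-runner Helm chart, limits like these are typically passed through the runner's TOML config; the fragment below is an assumed sketch of that shape, not the actual deployed values:

```yaml
# Illustrative gitlab-runner Helm values enforcing the limits above.
concurrent: 10          # maximum parallel jobs across the runner
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        cpu_limit = "100m"      # 0.1 CPU per job
        memory_limit = "200Mi"  # 200Mi memory per job
```

The job timeout itself is configured on the GitLab side (runner or project settings) rather than in the executor config.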

I've done a quick test with ten parallel CI jobs running stress --cpu 4 (10 is the maximum concurrency).
All jobs together consumed about 30% of one node's CPU and were stopped after 10 minutes with:

$ stress --cpu 4
stress: info: [448] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
ERROR: Job failed: execution took longer than 10m0s seconds

These settings should make it harder to run resource-intensive jobs like mining or malware. Linting and builds should work. 200Mi memory may be a little small for certain builds, but we will see how Cloud Runners behave with these settings.

@thcipriani I added some more open topics to the description. Can you take a look? I would like to know what is needed from your perspective until Cloud Runners can be available instance wide.


Nice! I like the checklist.

I think we're close to being able to announce instance-wide runners—questions and thoughts below.

Quotas

Are CPU/Memory limits and job timeouts sufficient to give us confidence that miners won't find this environment useful?

@Jelto, your stress test seems to imply that they are. Is that true?

Monitoring/Alerting

This is the monitoring/alerting I think of as blocking opening the cluster to users.

  • Alerts
    • Cluster is down
    • Jobs are not running
    • Other actionable alerts?
  • Monitoring
    • sustained CPU load (traceable to job?)
    • sustained network use (traceable to job?)
  • Automated action: billing threshold exceeded should stop jobs/runners
    • We need to establish a billing threshold here; this is a blocker.

Per job/per repo run time, CPU/mem usage would be interesting to know, but should not be a blocker—a nice-to-have.

Anything I'm missing either as a blocker or a nice-to-have (@brennen or @Jelto)?

Images

@brennen took a crack at allowed images (https://gerrit.wikimedia.org/r/724472). Do we need a documented process for requesting a new image before announcing?