
Provision untrusted instance-wide GitLab job runners to handle user-level projects and merge requests from forks
Open, Medium, Public; 5 Estimated Story Points

Description

I'm filing this as a placeholder followup from T292094, where I mentioned:

After discussion today with SRE folks, we also expect to build:

  • Untrusted and variously constrained runners, probably on a 3rd-party host, to handle user-level projects and merge requests from forks.
    • This will be experimental, and a bunch of details will need to be worked out.

Open problems:

  • provision managed Kubernetes cluster on Digital Ocean
  • configure Kubernetes executor using gitlab/gitlab-runner helm chart
  • create ci-pipeline for both above (repos/releng/gitlab-cloud-runner/)
  • reduce timeout for CI jobs to 10m for Cloud Runners
  • reduce amount of CPU/MEM available to CI jobs
  • activate autoscaling for Kubernetes Node pool
  • create quota for available CI minutes (not possible in free tier)
  • ...?
  • create some kind of alerting or monitoring
  • restrict allowed images?
  • open Cloud Runners instance wide
  • announce Cloud Runners?

Event Timeline

brennen triaged this task as Medium priority.Dec 9 2021, 7:52 PM
brennen set the point value for this task to 5.

@thcipriani we discussed creating this task yesterday; I'd forgotten it already existed.

Also cc: @Jelto, per discussion at GitLab IC sync.

brennen renamed this task from Provision untrusted GitLab job runners to handle user-level projects and merge requests from forks to Provision untrusted instance-wide GitLab job runners to handle user-level projects and merge requests from forks.Feb 17 2022, 8:03 PM

We did some experimentation with how we'd like to run very untrusted workloads recently.

Basics of the current status quo

  • Workloads come from randos on the internet
  • Current (WMCS) runners (as of Feb 2022) are non-ephemeral and shared between jobs
  • Test jobs (pre-merge) share a network space
  • Workloads run in docker containers (mostly)

Problems

  • Workloads are untrusted and, worst case, could be DDoS attacks or cryptominers
  • Runners, worst case, could be poisoned by untrusted workloads and become part of a botnet to DDoS attack or mine crypto
  • Sharing a network namespace with untrusted workloads could cause network problems for trusted jobs, or make them targets of malicious workloads

The Idea

  • The need for trust should be lower
    • network isolate untrusted workloads
    • untrusted workloads should be time-boxed and run on ephemeral, use-once, boxen (k8s runners)

Workflow

Gitlab_new_volunteer_UML.png (664×1 px, 71 KB)

The diagram above shows the workflow:

  1. Unknown user forks
  2. Their tests run on ephemeral boxen that are network isolated from other workloads
  3. The test results are reported on their patch
  4. They send a merge request; after a trusted contributor merges the code, we trust it to run on a trusted builder

Experiments in progress

We have a k8s cluster on Digital Ocean that we're using to prove the viability of the model above. We talked it over with ServiceOps and WMCS, and that's a good path for the time being if everything works correctly. In future, we'll continually evaluate whether a third-party cloud is the right place to run this.


I wrote up some thoughts in GitLab/Gitlab_Runner/Cloud_Runners and started working on a GitLab project at https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner. The project has CI and Terraform code which provisions and configures the instance-wide Cloud Runners on Digital Ocean. I'm currently working on an additional CI job to also deploy the Kubernetes Executor automatically using helm. After that step and some configuration tweaks, Cloud Runners should be available.
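A pipeline like the one described might look roughly like this; the job names, images, and file names below are illustrative assumptions, not the actual contents of repos/releng/gitlab-cloud-runner:

```yaml
# Hypothetical .gitlab-ci.yml sketch: one stage provisions the cluster
# with Terraform, the next deploys the runner with Helm.
stages:
  - provision
  - deploy

provision-cluster:
  stage: provision
  image: hashicorp/terraform:1.1.6
  script:
    - terraform init
    - terraform apply -auto-approve   # creates the managed DO cluster

deploy-runner:
  stage: deploy
  image: alpine/helm:3.8.0
  script:
    - helm repo add gitlab https://charts.gitlab.io
    - >
      helm upgrade --install gitlab-runner gitlab/gitlab-runner
      --namespace gitlab-runner --create-namespace
      --values runner-values.yaml
```

Running both steps from CI rather than a local machine keeps the bootstrap process reproducible and self-documenting.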

With this approach, bootstrapping the Cloud Runners is easy and self-documenting because there is no need to run Terraform on your local machine. And it's quite neat that GitLab configures its own Runners.

@thcipriani besides your test Kubernetes cluster, there is now a cloud-runner cluster with autoscaling (currently min 1, max 2 nodes)

https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner now has CI for provisioning the managed Kubernetes cluster and setting up the Kubernetes Runner. That's mostly done using Terraform and Helm. So we have working Cloud Runners with autoscaling (min 1 and max 2 nodes).
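For reference, an autoscaling node pool of this shape can be expressed with the DigitalOcean Terraform provider roughly as follows; the names, region, node size, and version here are illustrative assumptions, not the repo's actual code:

```hcl
# Sketch of a managed DO Kubernetes cluster with an autoscaling
# node pool (min 1, max 2 nodes, matching the setup described above).
resource "digitalocean_kubernetes_cluster" "cloud_runner" {
  name    = "cloud-runner"    # assumed name
  region  = "fra1"            # assumed region
  version = "1.21.9-do.0"     # assumed k8s version

  node_pool {
    name       = "cloud-runner-pool"
    size       = "s-2vcpu-4gb"  # assumed droplet size
    auto_scale = true
    min_nodes  = 1
    max_nodes  = 2
  }
}
```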

@thcipriani Currently the Cloud Runners are limited to the gitlab-cloud-runner project (meaning effectively nobody can use them). Is there a plan for when instance-wide Cloud Runners should be available? Do you think it's reasonable to make them available during GitLab-a-thon as well?

I added more restrictive CPU and memory limits to the Cloud Runner configuration (0.1 CPU and 200Mi Memory). I also set the timeout for jobs to 300s which is the minimum.
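In the gitlab/gitlab-runner Helm chart, limits like these are typically passed through the runner's TOML config; the fragment below is an assumed sketch of that shape, not the actual deployed values:

```yaml
# Illustrative gitlab-runner Helm values enforcing the limits above.
concurrent: 10          # maximum parallel jobs across the runner
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        cpu_limit = "100m"      # 0.1 CPU per job
        memory_limit = "200Mi"  # 200Mi memory per job
```

The job timeout itself is configured on the GitLab side (runner or project settings) rather than in the executor config.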

I've done a quick test with ten parallel CI jobs running stress --cpu 4 (10 is the maximum concurrency).
All jobs together consumed about 30% of one node's CPU and were stopped after 10 minutes with:

$ stress --cpu 4
stress: info: [448] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
ERROR: Job failed: execution took longer than 10m0s seconds

These settings should make it harder to run resource-intensive jobs like mining or malware. Linting and builds should work. 200Mi memory may be a little small for certain builds, but we will see how Cloud Runners behave with these settings.

@thcipriani I added some more open topics to the description. Can you take a look? I would like to know what is needed from your perspective until Cloud Runners can be available instance wide.


Nice! I like the checklist.

I think we're close to being able to announce instance-wide runners—questions and thoughts below.

Quotas

Are CPU/Memory limits and job timeouts sufficient to give us confidence that miners won't find this environment useful?

@Jelto, your stress test seems to imply that they are. Is that true?

Monitoring/Alerting

This is the monitoring/alerting I think of as blocking opening the cluster to users.

  • Alerts
    • Cluster is down
    • Jobs are not running
    • Other actionable alerts?
  • Monitoring
    • sustained CPU load (traceable to job?)
    • sustained network use (traceable to job?)
  • Automated action: billing threshold exceeded should stop jobs/runners
    • We need to establish a billing threshold here; this is a blocker.

Per job/per repo run time, CPU/mem usage would be interesting to know, but should not be a blocker—a nice-to-have.

Anything I'm missing either as a blocker or a nice-to-have (@brennen or @Jelto)?

Images

@brennen took a crack at allowed images (https://gerrit.wikimedia.org/r/724472). Do we need a documented process for requesting a new image before announcing?