
Limit GitLab shared runners to trusted contributors
Closed, Resolved · Public · 8 Estimated Story Points

Description

We opened the GitLab instance to the world earlier today (T288162). On thinking about it, it doesn't seem like limiting CI to images from the Wikimedia registry (T291978) is going to be sufficient to prevent abuse here.

For the moment, I have the shared runners paused.

Background on our setup

  • We intend for everyone with a Wikimedia developer account to be able to sign up and create new projects under their user namespaces. Dev accounts can be freely created by anyone, pretty much.
  • We'd like for known-trusted individual users to be able to make use of shared runners for CI on projects under their user namespaces.
  • We're using groups under people/ to contain individual users; those groups are then added to project groups.
  • Project groups are currently in top-level namespaces like /releng or /security.

Possible solutions

I could use some help thinking through this one. The options I know about so far:

  • Figure out how to get the "User is validated and can use free CI minutes on shared runners." checkbox to do something meaningful. This might be close to ideal, but so far I haven't been able to find much about it outside of the GitLab SaaS platform.
  • Mark all new users as "External" and have a process for unchecking the box. In some ways this may be simplest, but I don't really like it - it's just a bad user experience, as you can't create new projects even under your own user namespace when marked external, and the last time I tried this it wasn't at all obvious why.
    • Worth noting that you can define a regex checked against e-mail addresses for exceptions to this setting, so people with certain domains could be exempt, but that really doesn't do much for anyone outside of a handful of orgs.
  • Limit runners to specific groups instead of providing instance-wide shared runners.
    • Doesn't really solve the problem of offering shared runners for personal-namespace projects.
  • Figure out if running pipelines can be made available only to members of certain people/* groups (as opposed to runners just being available to projects under specific groups), and define a people/trusted-contributors.
  • Do something with pipeline minutes quota
  • Pre-commit hooks may be another option here - seems like KDE is doing something along those lines.

Docs

Event Timeline

brennen set the point value for this task to 4.

Hey @brennen -

The Security-Team will have a chat about this at our upcoming clinic but just some initial thoughts on:

Do something with pipeline minutes quota

There are both a global setting and per-group settings:

https://docs.gitlab.com/ee/user/admin_area/settings/continuous_integration.html#shared-runners-pipeline-minutes-quota

This seems like an easy way to reduce risk for performance and Vuln-DoS issues with a hard means of enforcement right out of the box. The obvious question though is how best to manage the inevitability of certain repos/projects requiring more than some baseline allocation of runner minutes. And I'd imagine trying to work through some baseline numbers via existing gerrit/jenkins/zuul data might be rather fraught.
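
To make that concrete, here's a rough sketch of what the per-group setting might look like via the REST API (the group ID, token, and minutes value below are placeholders, and as far as I can tell shared_runners_minutes_limit needs admin rights):

  # Illustrative only: set a per-group shared-runner minutes quota.
  # The group ID, token, and quota value are made-up placeholders.
  import requests

  API = "https://gitlab.wikimedia.org/api/v4"
  HEADERS = {"PRIVATE-TOKEN": "<admin token>"}

  resp = requests.put(
      f"{API}/groups/123",                          # numeric ID of the group
      headers=HEADERS,
      data={"shared_runners_minutes_limit": 2000},  # minutes per month
      timeout=10,
  )
  resp.raise_for_status()
  print(resp.json().get("shared_runners_minutes_limit"))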

This seems like an absolute necessity, unfortunately; note that we already have this in place for Gerrit/Jenkins via the Zuul email allowlist. I do hope we can make this more discoverable and viral somehow.

Mark all new users as "External" and have a process for unchecking the box. In some ways this may be simplest, but I don't really like it - it's just a bad user experience, as you can't create new projects even under your own user namespace when marked external, and the last time I tried this it wasn't at all obvious why.

This might turn out to be necessary anyway - I've seen several discussions about public GitLab instances being very ripe for abuse by spammers. Apparently there are readily available tools that are very effective and thorough at spamming any publicly facing GitLab for fun and profit.

Here's one reference that was easy to find, but there are lots more:

  1. https://news.ycombinator.com/item?id=24921534

@thcipriani found something that looks like one potential hack-around for disabling CI: External pipeline validation:

You can use an external service to validate a pipeline before it’s created.
This is an experimental feature and subject to change without notice.

GitLab sends a POST request to the external service URL with the pipeline data as payload. The response code from the external service determines if GitLab should accept or reject the pipeline. If the response is:

  • 200: the pipeline is accepted.
  • 406: the pipeline is rejected.
  • Other codes: the pipeline is accepted and logged.

If there’s an error or the request times out, the pipeline is accepted.

We tested this by using the API to set external_pipeline_validation_service_url to a PHP endpoint which returned a 406 and prevented new pipelines from being created on gitlab-test. The payload the URL receives contains info about the user, including username and e-mail address. A hypothetical script could do a lookup to see if a user is in a people/trusted-contributors group, which could in turn contain both organization-based groups and individuals, and accept or reject on that basis. Could work if it really covers all cases of pipeline creation.
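
For illustration only (we only tested the PHP-returns-406 case above; the payload fields and the trusted-user set here are assumptions), the validation service could be something as small as:

  # Rough sketch of a validation endpoint for external_pipeline_validation_service_url.
  # The trusted-user set stands in for a lookup against people/trusted-contributors.
  import json
  from http.server import BaseHTTPRequestHandler, HTTPServer

  TRUSTED_USERNAMES = {"example-trusted-user"}  # placeholder for a real group lookup

  class PipelineValidator(BaseHTTPRequestHandler):
      def do_POST(self):
          length = int(self.headers.get("Content-Length", 0))
          payload = json.loads(self.rfile.read(length) or b"{}")
          username = payload.get("user", {}).get("username", "")
          # Per the docs quoted above: 200 accepts the pipeline, 406 rejects it.
          self.send_response(200 if username in TRUSTED_USERNAMES else 406)
          self.end_headers()

  if __name__ == "__main__":
      HTTPServer(("0.0.0.0", 8080), PipelineValidator).serve_forever()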

Feels hacky, seems like a stub feature that might get yanked at some point, but might be worth a proof of concept if nothing else seems to make sense.

If there’s an error or the request times out, the pipeline is accepted.

That default sure seems backwards though.

That default sure seems backwards though.

Yeah, agreed. I'd much prefer a fail closed kind of situation.

Proposal

After quite a bit of discussion, here's my proposal for the immediate future of shared job runner allocation:

  1. Create a new top-level group called /wikimedia
  2. Move existing "official" projects - those which already have top-level namespaces such as /releng or /security - to this group, becoming /wikimedia/releng/*, etc.
  3. We plan to migrate existing Gerrit repos to this group
    • For example, mediawiki/core → /wikimedia/mediawiki/core
  4. We assign the pool of runners in the gitlab-runners WMCS project to /wikimedia

Rationale

Our current goal is to achieve rough parity with the existing Gerrit/Zuul system for code review and CI.

While providing job runner compute for every member of the technical community with a (Wikimedia-relevant) project in their personal namespace is a desirable outcome, it's not a requirement for migrating our existing projects from Gerrit. Rather than have this block further progress, we should solve the immediate problem of providing shared runners for existing projects.

Additionally, the top-level group is likely to simplify various other access control, resource allocation, and data collection tasks.

Future work

It's a good idea to defer providing shared compute for all users of the GitLab instance, but I don't think we should give up on it entirely. We may well arrive at a solution that works for instance-wide shared runners once we have dedicated Kubernetes resources for GitLab. (See T286958 and some discussion on T291978.)

In the meantime, individual users can bring their own runners.

The bikeshedding part: Naming

/wikimedia was my first thought for a top-level name, and it's defensible, but it has the disadvantage of being a lot of characters, and a full URL like https://gitlab.wikimedia.org/wikimedia/mediawiki/core feels pretty redundant. We next thought of /projects, which seems ideal, but unfortunately that's on GitLab's list of reserved names.

Other suggestions have included:

  • repos
  • code
  • source
  • src

It's worth thinking about this a bit since we'll be looking at (and typing) these URLs for a long time to come. I'd welcome any other concise and semantically useful suggestions.

  1. We assign the pool of runners in the gitlab-runners WMCS project to /wikimedia

So...who will be allowed to create a project under "wikimedia"? If this is going to be the only way to get CI, I'm not sure what the point of "personal" projects is.

If I create a new account, and then submit a patch to some repo under this group, will CI automatically run for my patch? That seems like the scenario we actually want to prevent, and it's not clear it's being addressed by this proposal.

Rationale

Our current goal is to achieve rough parity with the existing Gerrit/Zuul system for code review and CI.

While providing job runner compute for every member of the technical community with a (Wikimedia-relevant) project in their personal namespace is a desirable outcome, it's not a requirement for migrating our existing projects from Gerrit. Rather than have this block further progress, we should solve the immediate problem of providing shared runners for existing projects.

I don't really get this; isn't the whole point of GitLab to move to something *better*, not just reach parity? https://www.mediawiki.org/wiki/GitLab_consultation#Why explicitly says the goal is to get people off of third-party code hosting, and that access to self-serve CI is part of that.

Future work

In the meantime, individual users can bring their own runners.

Will this just be something that requires applying a puppet role or is it going to be more involved? If the RelEng team is already figuring out the security requirements, why are we asking/expecting users to reinvent the wheel on that?

And...can I really bring any runner I have sitting in a VPS in some cloud or a box in my house? Or does it need to be in WMCS? And if so, what's the difference between this runner and the shared runner pool?

The bikeshedding part: Naming

/wikimedia was my first thought for a top-level name, and it's defensible, but it has the disadvantage of being a lot of characters, and a full URL like https://gitlab.wikimedia.org/wikimedia/mediawiki/core feels pretty redundant. We next thought of /projects, which seems ideal, but unfortunately that's on GitLab's list of reserved names.

Many people incorrectly conflate "Wikimedia" with "Wikimedia Foundation" and would be turned off by the latter. Some MediaWiki folks also don't like being under the term "Wikimedia". Personally I think the repetition in URLs is unnecessary, and something more neutral and shorter like "code" or "src" is more appealing.

It's worth thinking about this a bit since we'll be looking at (and typing) these URLs for a long time to come. I'd welcome any other concise and semantically useful suggestions.

I would also suggest "r", because then the clone URL is https://gitlab.wikimedia.org/r/mediawiki/core, which is the same as the Gerrit URL but with a different hostname. But I thought this would just be temporary and then we'd eventually be able to move these projects under a proper top-level name?

Rationale

Our current goal is to achieve rough parity with the existing Gerrit/Zuul system for code review and CI.

While providing job runner compute for every member of the technical community with a (Wikimedia-relevant) project in their personal namespace is a desirable outcome, it's not a requirement for migrating our existing projects from Gerrit. Rather than have this block further progress, we should solve the immediate problem of providing shared runners for existing projects.

I don't really get this; isn't the whole point of GitLab to move to something *better*, not just reach parity? https://www.mediawiki.org/wiki/GitLab_consultation#Why explicitly says the goal is to get people off of third-party code hosting, and that access to self-serve CI is part of that.

It is, but to ensure that we can complete this project without the work lingering on forever, we are going to take a staged approach. What Brennen is referring to here is the first stage. It's that "immediate problem" he's referencing. See "Future work" for the next steps.

The bikeshedding part: Naming

/wikimedia was my first thought for a top-level name, and it's defensible, but it has the disadvantage of being a lot of characters, and a full URL like https://gitlab.wikimedia.org/wikimedia/mediawiki/core feels pretty redundant. We next thought of /projects, which seems ideal, but unfortunately that's on GitLab's list of reserved names.

Other suggestions have included:

  • repos
  • code
  • source
  • src

It's worth thinking about this a bit since we'll be looking at (and typing) these URLs for a long time to come. I'd welcome any other concise and semantically useful suggestions.

Is part of the plan to have a few top-level groups for ACL reasons as well? See e.g. GNOME's GitLab install, which takes that approach, and the URLs they end up with, e.g. https://gitlab.gnome.org/GNOME/gnome-shell

They (GNOME) use the top-level groups to distinguish "official" vs "community" projects, which might be useful in our case; "wikimedia" vs "personal" might be better? YET MORE BIKESHEDDING!

I personally don't mind the repetition, as each time gnome (or wikimedia) is mentioned it is for a distinct purpose/meaning. And the example you've used is probably the worst offender there is (mediawiki core), which is really more a function of our inability to come up with unique names: https://www.mediawiki.org/w/index.php?title=Wikipmediawiki&redirect=no

Most other projects would be eg: /wikimedia/analytics/aqs/deploy which seems reasonable to me.

mediawiki/extensions/* and mediawiki/skins/* have plenty of projects which have nothing to do with Wikimedia (whether the foundation, the movement or the cluster) and apply completely different CI standards and permissions.

Some probably incomplete responses to T292094#7442614 below - I'm going to be AFK for a couple of days for family matters, but wanted to get these down while they're fresh in mind:

So...who will be allowed to create a project under "wikimedia"? If this is going to be the only way to get CI, I'm not sure what the point of "personal" projects is.

The same people currently allowed to request project creation under Gerrit, but it should be much more self-serve since we can allow teams to create projects within a sub-group. As a concrete example, members of people/wmf-team-releng will have access to /[top level group]/releng and can create projects there.
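
As a rough sketch of what that grant could look like (the IDs, token, and access level below are placeholders, not a finished process), sharing a project group with a people/ group via the API:

  # Illustrative sketch: share a project group with a people/ group at Developer
  # level so its members can create projects there. IDs and token are placeholders.
  import requests

  API = "https://gitlab.wikimedia.org/api/v4"
  HEADERS = {"PRIVATE-TOKEN": "<token with owner access to the project group>"}

  RELENG_GROUP_ID = 42   # numeric ID of /<top level group>/releng
  PEOPLE_GROUP_ID = 99   # numeric ID of people/wmf-team-releng
  DEVELOPER = 30         # GitLab access level for "Developer"

  resp = requests.post(
      f"{API}/groups/{RELENG_GROUP_ID}/share",
      headers=HEADERS,
      data={"group_id": PEOPLE_GROUP_ID, "group_access": DEVELOPER},
      timeout=10,
  )
  resp.raise_for_status()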

Personal projects (the ones that live under a user's username) are lower utility without a pool of pre-configured job runners available to them, but code hosting & review is still useful on its own.

If I create a new account, and then submit a patch to some repo under this group, will CI automatically run for my patch? That seems like the scenario we actually want to prevent, and it's not clear it's being addressed by this proposal.

My current understanding is that only users with Developer-level access to the project are allowed to run jobs:

https://docs.gitlab.com/ee/user/permissions.html#job-permissions

I think, however, I need to build a working example of exactly what happens here. I'll call that out as a next step here.

I don't really get this; isn't the whole point of GitLab to move to something *better*, not just reach parity? https://www.mediawiki.org/wiki/GitLab_consultation#Why explicitly says the goal is to get people off of third-party code hosting, and that access to self-serve CI is part of that.

There's a spectrum of self-serve-ness, for lack of a better term.

We've already established that we probably can't let everyone on the internet run whatever they feel like. If we can let known-trusted groups of people create projects at will, and configure CI in those repos without (for example) editing an 11k line YAML file in a different project, we're already well ahead of the Gerrit/Zuul status quo. As Greg says, this probably needs to be a staged process.

And...can I really bring any runner I have sitting in a VPS in some cloud or a box in my house? ... And if so, what's the difference between this runner and the shared runner pool?

Yeah. The difference is you're responsible for provisioning / maintaining it. Unless there turn out to be security consequences we don't know about yet, this seems like a reasonable facility to offer - and maybe even a useful way for people to donate compute. (I want to be clear that we were a bit surprised by this pattern, but it seems like a default feature of the system.)

I would also suggest "r", because then the clone URL is https://gitlab.wikimedia.org/r/mediawiki/core, which is the same as the Gerrit URL but with a different hostname. But I thought this would just be temporary and then we'd eventually be able to move these projects under a proper top-level name?

Yeah, a few people have suggested this. It's reasonable enough, and certainly concise, but I think we're probably better off with a name that has some clear meaning. I don't expect it to be temporary, since it's very likely to accumulate more uses than just having job runners allocated to it and I don't see much benefit in migrating paths once we have most of our projects living under it.

Bit of a dense book I have written here, so sorry for that

I have been playing around with runners on the WM gitlab install for some time now on the mwcli repo, with custom dedicated runners there.
Some things to note... (ignore me if these were all known)

  • Out of the box, CI on runners does NOT run for MRs from forks of a code repository. It seems this is a feature for a different GitLab tier.
  • It is possible to add "developers" to a project (and, I imagine, to a group) who can create branches and MRs from those branches, and thus have CI run, but who can't merge into protected branches (essentially +2).
  • We will 100% want shared caching set up between our shared runners. Most of the individual docker-based caching I have seen is not super smart and ends up wasting a whole bunch of disk.

As far as I can tell the proposal in T292094#7442275 sounds like it makes sense if we are worried about a larger pool of shared runners being DDoSed, etc.
This should still leave open the possibility to have a smaller pool of runners that would be available to all users for all projects?

If I create a new account, and then submit a patch to some repo under this group, will CI automatically run for my patch? That seems like the scenario we actually want to prevent, and it's not clear it's being addressed by this proposal.

At least from my experience with mwcli, I'd say no, CI wouldn't run.
I think this has something to do with access to the runners?
https://docs.gitlab.com/ee/ci/pipelines/merge_request_pipelines.html#run-pipelines-in-the-parent-project-for-merge-requests-from-a-forked-project
Forks of projects for cases like this need access to runners in order to run their CI.
Otherwise the only contribution path is to be added as a developer to the project or group as far as I can see.

Will this just be something that requires applying a puppet role or is it going to be more involved? If the RelEng team is already figuring out the security requirements, why are we asking/expecting users to reinvent the wheel on that?

IMO this is fairly trivial; for mwcli, which probably has a more complicated setup than most projects need, you can see a guide at https://gitlab.wikimedia.org/releng/cli/-/blob/main/CI.md
But also IMO this does seem like a path forward that would end up with us wasting a whole bunch of CPU and memory sitting out there in Cloud VPS land.
I think a better solution would be to have a shared global pool of runners, even if we also have dedicated resources for some other specific groups.

Many people incorrectly conflate "Wikimedia" with "Wikimedia Foundation" and would be turned off by the latter. Some MediaWiki folks also don't like being under the term "Wikimedia". Personally I think the repetition in URLs is unnecessary, and something more neutral and shorter like "code" or "src" is more appealing.

+1
I think wmf could make sense?
I expect we would also end up having a top level wmde perhaps?
I think I would be against r

Another thing worth noting is that, as far as I can tell, a single runner can only be attached to a single project / group / global pool.
But 1 runner != 1 virtual machine.
This means you can have a single virtual machine with multiple runners defined on it, connected to various runner groups etc. on GitLab.
That virtual machine can still enforce a global concurrency limit across all runners defined on it.
The TLDR of what I think this means for us is that there is no reason to only have a single top-level group with access to a single group of runners; we can be more freeform.
For example, in the wmf and wmde case, they could both share 100% of a single pool of virtual machines, but one group or the other could also add more runners specific to individual projects or groups.
I don't know how this plays with groups in a hierarchy, but for top-level groups it certainly sounds appealing.
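
For illustration, here's a made-up gitlab-runner config.toml along those lines (tokens and image are placeholders): one VM, two registered runners attached to different groups, all sharing one machine-wide concurrency limit.

  # Illustrative config.toml for a single runner VM; tokens and image are placeholders.
  concurrent = 4  # total jobs this machine will run at once, across all runners

  [[runners]]
    name = "shared-wmf"
    url = "https://gitlab.wikimedia.org/"
    token = "<registration token from the wmf group>"
    executor = "docker"
    [runners.docker]
      image = "<some image from docker-registry.wikimedia.org>"

  [[runners]]
    name = "shared-wmde"
    url = "https://gitlab.wikimedia.org/"
    token = "<registration token from the wmde group>"
    executor = "docker"
    [runners.docker]
      image = "<some image from docker-registry.wikimedia.org>"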

We may well arrive at a solution that works for instance-wide shared runners once we have dedicated Kubernetes resources for GitLab. (See T286958 and some discussion on T291978.)

I remember a conversation some time ago about using off-the-shelf compute resources for some CI use cases.
Might this be one of them?

My current understanding is that only users with Developer-level access to the project are allowed to run jobs:

https://docs.gitlab.com/ee/user/permissions.html#job-permissions

I think, however, I need to build a working example of exactly what happens here. I'll call that out as a next step here.

This seems to be the case indeed: https://gitlab.wikimedia.org/releng/ddd/-/merge_requests/7

Notice that even with developer access to the repository, it won't run CI jobs for the MRs submitted by @Mhurd. Unfortunately it doesn't make that clear; it just shows the job as forever pending, and the MR is effectively broken and cannot be merged.

Hi releng-team,

I hope this is the right venue to ask for a consultation; please let me know if you'd rather I open a dedicated phab (or touch base elsewhere).

A while ago I started a PoC using Gitlab (https://gitlab.wikimedia.org/gmodena/platform-airflow-dags). The codebase depends on a polyglot stack (jvm+python) with opinionated version requirements. I found that CI Pipelines and Registries (maven, pypi, generic) simplified our development lifecycle, and suited our requirements better than Gerrit.

I appreciate Gitlab is a work in progress (that was factored in when the PoC started), but the UX has been delightful and I'd like to keep using it to host our repo(s). I wonder if you could help me make an educated choice re how to move forward.

Build automation is a success criterion for our PoC. In the absence of CI Pipelines, we are considering the following interim path:

  1. Host our code base in Gitlab.
  2. Move the repo(s) from an individual user to a team when possible.
  3. Delegate builds to a third party system.
  4. Adjust our flow to include some manual work (publishing artifacts, checking MRs)

For 3: we experimented with mirroring the repo to GitHub and triggering a CI Action on push. It works, but I'd be open to a better approach. I'm aware that if we bring CI back to GitLab, we will likely have to provide an internal image.

Do you think this makes sense? Would this plan fit the CI Pipeline rollout timeline discussed in this phab?

If it helps I'd be happy to guinea-pig some of the solutions discussed in this thread.

I also think that wmf would make sense for the top-level name.

Slightly tangential to this question of shared runners, I have today filed a request for permission to create a group runner for our team: T295045: Allow a shared, protected runner for the data-engineering group in GitLab

I believe that this also crosses over with the request from @gmodena above, since the initial use case is also Airflow DAG CI and deployment.

...I'm aware that if we bring CI back to Gitlab, we will likely have to provide an internal image.

I'm not sure that CI is gone, per se, but rather that runners have been sporadically disabled for the gitlab.wikimedia.org installation. The Security-Team has been working to build an initial appsec-related CI platform (T289290) and our assumptions rest heavily upon the idea that GitLab's CI will very much be a reality once the issue of runners - particularly shared/group runners - has been resolved. We've also been working exclusively with images under docker-registry.wikimedia.org and attempting not to bloat them too much while the issues within T291978 are still under discussion.

Mentioned in SAL (#wikimedia-releng) [2021-11-09T19:36:30Z] <brennen> gitlab: creating top-level /repos group (T292094)

Do you think this makes sense? Would this plan fit the CI Pipeline rollout timeline discussed in this phab?

If it helps I'd be happy to guinea-pig some of the solutions discussed in this thread.

FWIW I was able to set up GitLab CI on https://gitlab.wikimedia.org/releng/ddd/ with a DigitalOcean instance running the gitlab-runner.

Also it would seem that wikimedia gitlab runners aren't too far from being usable. @brennen probably has a better idea of timelines than I do.

A quick update here: In the short term (i.e. today and this week), we're going the path of migrating projects to a top-level namespace and assigning the existing WMCS pool of runners there. After discussion today with SRE folks, we also expect to build:

  • Trusted runners in production for things producing an artifact that reaches production.
  • Untrusted and variously constrained runners, probably on a 3rd-party host, to handle user-level projects and merge requests from forks.
    • This will be experimental, and a bunch of details will need to be worked out.

For very specialized runners, I do suspect a bring-your-own approach is best. I'm happy to help folks migrate projects to groups within /repos and assign access as needed so those can be configured, although T295045 seems like it touches things I'm not qualified to comment on.

Mentioned in SAL (#wikimedia-releng) [2021-11-09T22:29:07Z] <brennen> gitlab: finished moving releng, security, data-engineering projects to /repos (T292094)

Mentioned in SAL (#wikimedia-releng) [2021-11-09T22:48:11Z] <brennen> gitlab-runners: unregistered instance-wide shared runners, re-registered to /repos (T292094)