
Quota increases for gitlab-runners
Closed, Resolved · Public

Description

Project Name: gitlab-runners
Type of quota increase requested: cpu/ram/disk/instance count/volumes/volume storage/floating ip
Reason: We are migrating gitlab-cloud-runner managed infra from DigitalOcean to Magnum on WMCS. See T416264#11659677 for calculations. Note that these quota increases are substantial due to the need to have the existing Docker-based runners in gitlab-runners and the new Magnum/k8s based runners in parallel for a time. The plan is to retire the former once the migration is complete and stable, at which point we can reduce the quotas.
Amounts to increase (the cpu/memory arithmetic is sketched after this list):

  • instances: +12 (10 worker nodes + 2 master nodes)
  • cpu: +84 (8x10 for worker nodes + 2x2 for master nodes)
  • memory: +328G (32x10 for worker nodes + 4x2 for master nodes)
  • volumes: +18
  • volume storage: +890G (500G of this is due to the need for faster/larger worker node volumes, which only seems possible in Magnum via cinder)
  • floating IP: +1 (for ingress gateway and its associated DNS records that are managed by externaldns in the cluster)
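
For reference, here is the arithmetic behind the cpu/memory figures, re-derived in a short sketch. The per-node sizes come from the parentheticals above, and the MB conversion matches the value later applied by the quota cookbook:

```python
# Re-derive the requested quota increments from the node plan above.
workers, masters = 10, 2

cpu = 8 * workers + 2 * masters      # 84 vCPUs
mem_gb = 32 * workers + 4 * masters  # 328G
mem_mb = mem_gb * 1024               # 335872, as applied in the SAL entries below

print(f"cpu=+{cpu} memory=+{mem_gb}G ({mem_mb} MB)")
```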

Event Timeline

Given the size of the request, this will need a discussion at the next WMCS team sync-up on Thursday. I'm not aware of any prior coordinated plans for this specific migration; if there were any, could you please link them here to add additional context?

@dduvall I'm happy to see that your experiments with Magnum went well (I remember we discussed this in SF back in 2023!) and you're planning to expand its usage. At the same time, Magnum is still in our "1 out of 3 stars" support level, and we are also planning a major driver upgrade (T393782: Investigate new Magnum drivers) that might cause some disruption.

I don't want to curb your enthusiasm, but before moving forward let's make sure both teams are aligned on the reliability and support expectations. :)

As @Volans wrote, we'll discuss this in the next WMCS meeting; we can also set up a cross-team meeting if that's useful.

/cc @Andrew

Sounds good to me. Our planned Zuul migration also involves using Magnum for untrusted workloads (see T396936 and related tasks), so yes let's have a cross-team meeting to sync up on expectations. cc @bd808, @thcipriani

The background here:

  • We've been running our DO cluster since 2021 or so, and we've been keen to move to a platform that is supported by SRE.
  • We talked with WMCS about this back in 2023, and discussion has been ongoing in various tickets since then.

Whenever we've talked about Magnum with WMCS folks, the messaging has been that it exists, but it's seldom used and there are some rough edges (which jibes with what the docs say, too).

Happy to coordinate more on this/meet with folks as needed—who should we talk to?

Hey folks, sorry about the not-very-coherent response on this. The bottom line is that compute+storage resources are not an issue, we can definitely provide what you need.

The thing that is in flux is our commitment to Magnum:

  1. In the interest of 'doing fewer things', our team is discussing whether or not to drop Magnum support and push teams over to a puppet/k3s solution. I don't know that we'll actually do that, but I suggest you not get invested in either system until we've had time to properly deliberate.
  2. We have a new rev of Magnum (using capi, T393782) that we have high hopes for, but it will require different templates from the current (Heat-based) Magnum drivers running in eqiad1. So that's another reason for you to not start building something on Magnum (w/Heat) today. The new capi drivers are already in place in codfw1dev, and I'd love it if someone wanted to give that a test-drive, so please let me know if you're interested in/willing to try that.

I am just now back from vacation and am also sick, and a new manager is starting in 10 days. So there is not a ton of brain-space to get you unstuck this week, but feel free to nag and re-nag over the coming weeks until we get you some clear answers.

Thanks for this answer @Andrew! We should talk more when you're not sick (shoo! :)) I'm interested in figuring out the right path here with the new Magnum and the vision of a puppet/k3s setup.

We've been supporting GitLab CI for years—half on WMCS and half on DO. The only reason for DO is the niceties of managed k8s, but there are so many reasons to prefer WMCS (especially when working with our infra) that it'd be great to figure this out together.

Thanks, @Andrew

> Hey folks, sorry about the not-very-coherent response on this. The bottom line is that compute+storage resources are not an issue, we can definitely provide what you need.

> The thing that is in flux is our commitment to Magnum:

Since the resources will be needed regardless of using Magnum (the volume requirements MAY be different without Magnum, not sure), can we go ahead with the quota increase and have the Magnum discussion in a different venue?

> 1. In the interest of 'doing fewer things', our team is discussing whether or not to drop Magnum support and push teams over to a puppet/k3s solution. I don't know that we'll actually do that, but I suggest you not get invested in either system until we've had time to properly deliberate.

I'm curious what persistent volume and load balancer support would look like using k3s and willing to contribute to discussions elsewhere if that's helpful. (It seems like the cinder CSI plugin is general, but no idea about a CNI for LBs.)
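
On the persistent-volume half of that question, here is a minimal sketch of registering a cinder-backed StorageClass via the Kubernetes Python client. It assumes the upstream cinder CSI driver from cloud-provider-openstack (provisioner cinder.csi.openstack.org) would be deployed on the k3s cluster; the class name and settings are illustrative, not a tested WMCS configuration:

```python
# Sketch only: assumes the cinder CSI driver (cloud-provider-openstack)
# is already running in the cluster and a local kubeconfig is set up.
from kubernetes import client, config

config.load_kube_config()

storage_class = client.V1StorageClass(
    api_version="storage.k8s.io/v1",
    kind="StorageClass",
    metadata=client.V1ObjectMeta(name="cinder-standard"),  # hypothetical name
    provisioner="cinder.csi.openstack.org",  # upstream cinder CSI provisioner
    reclaim_policy="Delete",
    # Delay volume creation until a pod is scheduled, so the volume lands
    # in the same availability zone as the node that mounts it.
    volume_binding_mode="WaitForFirstConsumer",
)

client.StorageV1Api().create_storage_class(storage_class)
```

The LoadBalancer side would need a separate component, which is probably the part worth raising in the cross-team discussion.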

> 2. We have a new rev of Magnum (using capi, T393782) that we have high hopes for, but it will require different templates from the current (Heat-based) Magnum drivers running in eqiad1. So that's another reason for you to not start building something on Magnum (w/Heat) today. The new capi drivers are already in place in codfw1dev, and I'd love it if someone wanted to give that a test-drive, so please let me know if you're interested in/willing to try that.

I'm hesitant to do that given the current pushback and the time I've already invested in testing out Magnum, but I'm always open to helping out if the situation changes.

Thanks, @Andrew

>> Hey folks, sorry about the not-very-coherent response on this. The bottom line is that compute+storage resources are not an issue, we can definitely provide what you need.
>>
>> The thing that is in flux is our commitment to Magnum:
>
> Since the resources will be needed regardless of using Magnum (the volume requirements MAY be different without Magnum, not sure), can we go ahead with the quota increase and have the Magnum discussion in a different venue?

Yes. Pinging @dcaro to fill the quota request since he's on clinic duty this week.

> Yes. Pinging @dcaro to fill the quota request since he's on clinic duty this week.

I'll take this as a +1 :)

dcaro changed the task status from Open to In Progress.Mar 10 2026, 11:10 AM
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Mentioned in SAL (#wikimedia-cloud-feed) [2026-03-10T11:12:28Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.openstack.quota_increase by 84 cores, 1 floating-ips, 890 gigabytes, 12 instances, 335872 ram, 18 volumes (T418813)

Mentioned in SAL (#wikimedia-cloud-feed) [2026-03-10T11:12:36Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) by 84 cores, 1 floating-ips, 890 gigabytes, 12 instances, 335872 ram, 18 volumes (T418813)

This should be done, let me know if you find any issues!
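
For anyone wanting to double-check the applied limits, here is a minimal sketch using openstacksdk. The cloud profile name is an assumption here (it refers to a clouds.yaml entry), and attribute names can vary slightly between SDK versions:

```python
# Sketch: assumes a clouds.yaml entry granting access to the
# gitlab-runners project; run after the quota_increase cookbook.
import openstack

conn = openstack.connect(cloud="gitlab-runners")  # hypothetical profile name

compute = conn.get_compute_quotas("gitlab-runners")
volume = conn.get_volume_quotas("gitlab-runners")

print("cores:", compute.cores)          # should reflect the +84
print("instances:", compute.instances)  # +12
print("ram (MB):", compute.ram)         # +335872 (i.e. +328G)
print("volumes:", volume.volumes)       # +18
print("gigabytes:", volume.gigabytes)   # +890
```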
