Request additional compute/storage/IP quota in gitlab-runners for Magnum
Open, In Progress, Medium, Public

Description

The Magnum-based deployment will have to coexist with the statically provisioned Docker-based runners for a period, so we'll need additional quota for at least a few nodes. We should look at max node usage in DigitalOcean over the last 6 months for a good baseline of the cluster's compute needs.

Event Timeline

I'm being a little squishy with the numbers here, but going by the second-highest peak of kube_node_status_allocatable from grafana.cloud.releng.team over the past 3 months, this is where we might want to start with _additional_ CPU/memory quotas (which will be the upper bound of our cluster):

cpu  | memory
74.9 | 289.8Gb

image.png (274×1 px, 35 KB)

image.png (274×1 px, 41 KB)

dduvall changed the task status from Open to In Progress. Feb 27 2026, 6:56 PM
dduvall claimed this task.
dduvall triaged this task as Medium priority.

Additional napkin math:

Instances

The g4.cores8.ram32.disk20 instance flavor seems like the best fit based on what we're running in DO. Dividing the cpu/memory numbers by that flavor's 8 cores and 32Gb of RAM gives 10 worker instances, and adding 2 master node instances (math.ceil(max(74.9 / 8, 289.8 / 32)) + 2 -> 12) gives a 12 instance increase for our new quota.
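The instance math above can be sketched as a small Python function. This is only a napkin-math helper: the constant and function names are made up for illustration, the 8 core / 32Gb figures are read off the flavor name, and the memory numbers are assumed to be in the same unit on both sides.

```python
import math

# Peak figures from the kube_node_status_allocatable numbers above.
PEAK_CPU = 74.9
PEAK_MEMORY = 289.8

# Assumed from the flavor name g4.cores8.ram32.disk20.
FLAVOR_CPU = 8
FLAVOR_MEMORY = 32

MASTER_NODES = 2


def instances_needed(cpu, memory, flavor_cpu, flavor_memory, masters):
    """Return (workers, total) instances needed to cover the cpu/memory peak."""
    # Whichever resource needs more instances sets the worker count.
    workers = math.ceil(max(cpu / flavor_cpu, memory / flavor_memory))
    return workers, workers + masters


workers, total = instances_needed(
    PEAK_CPU, PEAK_MEMORY, FLAVOR_CPU, FLAVOR_MEMORY, MASTER_NODES
)
print(workers, total)  # 10 12
```

CPU is the binding constraint here (74.9 / 8 is about 9.36 versus 289.8 / 32 at about 9.06), so a flavor with more cores per instance would be the first thing to revisit if the instance count needs to come down.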

Volumes

I think the most flexible approach to node storage is going to be volume based. We can use the newish fast-iops cinder volume type via the docker_volume_type Magnum label. So that's 10 volumes, one per worker instance.

We also have to consider the volumes managed by k8s for:

  • buildkitd (the metrics for the past 90 days show 6 peak replicas of buildkitd so that's 6 volumes)
  • reggie (1 volume)
  • dockerhub-mirror (1 volume)

That's 18 additional volumes needed (10 for the worker nodes plus 8 managed by k8s).
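For anyone rechecking the volume count, here is the same tally as a short Python sketch. The variable names are illustrative only, and it assumes one node storage volume per worker instance with none for the master nodes.

```python
# One cinder volume per worker instance (assumption: masters need none).
node_volumes = 10

# Volumes managed by k8s, per the list above.
k8s_volumes = {
    "buildkitd": 6,  # peak replicas over the past 90 days
    "reggie": 1,
    "dockerhub-mirror": 1,
}

total_volumes = node_volumes + sum(k8s_volumes.values())
print(total_volumes)  # 18
```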

Volume storage

Our DO cluster reports about 400-500Gi of total disk usage across all nodes at peak times. In addition to that, we need to account for the volumes managed by k8s (see above).

  • buildkitd (6 x 40Gi = 240Gi)
  • reggie (1 x 100Gi)
  • dockerhub-mirror (1 x 50Gi)

That's 890Gi of volume storage needed, taking the 500Gi upper end for node disk usage (500 + 240 + 100 + 50). This is hefty. If we used local instance storage instead, we could avoid 500Gi of this quota.
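The storage total can be reproduced with a few lines of Python. This is a rough sketch with illustrative names, and it assumes the 500Gi upper end of the reported node disk usage.

```python
# Upper end of the 400-500Gi peak node disk usage reported by the DO cluster.
node_storage_gi = 500

# (volume count, size in Gi) for each k8s-managed workload listed above.
k8s_volume_sizes = {
    "buildkitd": (6, 40),
    "reggie": (1, 100),
    "dockerhub-mirror": (1, 50),
}

k8s_storage_gi = sum(count * size for count, size in k8s_volume_sizes.values())
total_storage_gi = node_storage_gi + k8s_storage_gi

print(k8s_storage_gi)    # 390
print(total_storage_gi)  # 890

# With local instance storage for the nodes, only the k8s volumes need quota.
print(total_storage_gi - node_storage_gi)  # 390
```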

@Andrew is it possible to have an instance flavor with a 50Gi 4xiops root volume? If so, we might be able to avoid using cinder volumes for the Magnum node storage and just use local instance storage.

Floating IPs

We'll need at least 1 floating IP for the ingress gateway. We can rely on ssh tunneling to a bastion for access to the k8s endpoint from local systems.