The Magnum-based deployment will have to coexist with the statically provisioned Docker-based runners for a period, so we'll need additional quota for at least a few nodes. We should look at peak node usage in DigitalOcean over the last 6 months for a good baseline of the cluster's compute needs.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| In Progress | | None | T416256 [Epic] Migrate gitlab-cloud-runner to WMCS |
| In Progress | | dduvall | T416264 Request additional compute/storage/IP quota in gitlab-runners for Magnum |
| Resolved | | dcaro | T418813 Quota increases for gitlab-runners |
Event Timeline
I'm being a little squishy with the numbers here, but according to the second highest peak of kube_node_status_allocatable from grafana.cloud.releng.team over the past 3 months, this is where we might want to start with _additional_ CPU/memory quotas (which will be the upper bound of our cluster):
| cpu | memory |
|---|---|
| 74.9 | 289.8Gb |
Additional napkin math:
Instances
The g4.cores8.ram32.disk20 instance flavor seems like the best fit based on what we're running in DO. Dividing that into the cpu/memory numbers and adding 2 master node instances (math.ceil(max(74.9 / 8, 289.8 / 32)) + 2 -> 12) we get a 12 instance increase for our new quota.
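For anyone who wants to rerun the instance math with different peaks or a different flavor, here is a minimal Python sketch of the same calculation. The variable names are illustrative, and the per-instance figures (8 cores, 32 of RAM) are read off the flavor name rather than checked against the flavor definition.

```python
import math

# Peak figures from the table above (units as given there).
peak_cpu = 74.9
peak_memory = 289.8

# g4.cores8.ram32.disk20: assumed 8 cores and 32 RAM per instance,
# based only on the flavor name.
flavor_cpu = 8
flavor_memory = 32

master_nodes = 2

# The worker count is driven by whichever resource is the tighter constraint.
worker_nodes = math.ceil(max(peak_cpu / flavor_cpu, peak_memory / flavor_memory))
total_instances = worker_nodes + master_nodes

print(worker_nodes, total_instances)  # 10 12
```

CPU is the binding constraint here (74.9 / 8 ≈ 9.36 versus 289.8 / 32 ≈ 9.06), so a flavor with more cores per instance would be the first thing to try if we wanted fewer nodes.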
Volumes
I think the most flexible approach to node storage is going to be volume based. We can use the newish fast-iops cinder volume type via the docker_volume_type Magnum label. That's one volume per worker instance, so 10 volumes.
We also have to consider the volumes managed by k8s for:
- buildkitd (the metrics for the past 90 days show 6 peak replicas of buildkitd so that's 6 volumes)
- reggie (1 volume)
- dockerhub-mirror (1 volume)
That's 18 additional volumes needed.
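The volume count breaks down as follows; a quick Python tally, with illustrative names, assuming one node-storage volume for each of the 10 worker instances:

```python
# One cinder volume per worker instance for node storage.
worker_node_volumes = 10

# Volumes managed by k8s, per workload.
k8s_volumes = {
    "buildkitd": 6,         # peak replicas over the past 90 days
    "reggie": 1,
    "dockerhub-mirror": 1,
}

total_volumes = worker_node_volumes + sum(k8s_volumes.values())
print(total_volumes)  # 18
```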
Volume storage
Our DO cluster reports about 400-500Gi disk usage on all nodes at the peak times. In addition to that, we need to account for the volumes managed by k8s (see above).
- buildkitd (6 x 40Gi = 240Gi)
- reggie (1 x 100Gi)
- dockerhub-mirror (1 x 50Gi)
Taking the upper end of the node disk usage (500Gi), that's 890Gi of volume storage needed. This is hefty. If we used local instance storage for the nodes instead, we could avoid 500Gi of this quota.
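A small Python sketch of the storage total, taking the upper end of the 400-500Gi node usage range. The names are illustrative; the sizes are the ones listed above.

```python
# Upper end of the 400-500Gi peak disk usage reported by the DO cluster.
node_storage_gi = 500

# (volume count, size in Gi) for the volumes managed by k8s.
k8s_volumes = {
    "buildkitd": (6, 40),
    "reggie": (1, 100),
    "dockerhub-mirror": (1, 50),
}

k8s_storage_gi = sum(count * size for count, size in k8s_volumes.values())
total_storage_gi = node_storage_gi + k8s_storage_gi

print(k8s_storage_gi, total_storage_gi)  # 390 890
```

With local instance storage for the nodes, only the k8s-managed 390Gi would count against the volume storage quota.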
@Andrew is it possible to have an instance flavor with a 50Gi 4xiops root volume? If so, we might be able to avoid using cinder volumes for the Magnum node storage and just use local instance storage.
Floating IPs
We'll need at least 1 floating IP for the ingress gateway. We can rely on ssh tunneling to a bastion for access to the k8s endpoint from local systems.

