
Investigate WMCS Magnum for GitLab runners
Closed, Resolved · Public

Description

We run a Kubernetes cluster on DigitalOcean that runs testing workloads for GitLab CI on https://gitlab.wikimedia.org.

However, accessing WMF infrastructure from DigitalOcean (to call a MediaWiki API, Gerrit, or GitLab itself) is difficult because of rate limiting and bot protection. Users are running into this problem, too.

These problems go away if we use internal WMF infrastructure.

This is a task to investigate running the GitLab CI Kubernetes cluster on WMCS Magnum.

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Draft: wmcs: Support WMCS Magnum as a k8s provider | repos/releng/gitlab-cloud-runner!551 | dduvall | spike/wmcs-magnum | main

Event Timeline

Tofu seems to be the nicest way to deploy a Magnum cluster. The https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning config is the most recent config I have been working on that includes Magnum/Kubernetes things.

One of the things needed is a Developer account and credentials to act as the service account for running Tofu. https://ldap.toolforge.org/user/zuuldevopsbot is an example of such an account. Using the release-engineering.$SUBPROJECT@toolforge.org email pattern means that all of the fine folks who are maintainers of https://toolsadmin.wikimedia.org/tools/id/release-engineering will get an email when something emails the bot. See T396902: Create service account user for OpenTofu automation for a past task of this kind. T396247: Set up new project for Zuulv3+ pre-merge and non-image-build workloads and its subtasks may have other interesting bits of information.

With some moderate refactoring (see https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/551), gitlab-cloud-runner has been successfully provisioned using WMCS Magnum. From the commit message:

  1. buildkit: Allow configuration of storage class and whether autoscaling should be enabled (we won't need to autoscale on WMCS). Also remove the configuration of a s3 based cache which has never been utilized.
  2. gitlab: Remove node selector/tolerations and parameterize s3 cache server configuration.
  3. digitalocean: Define outputs for configuration that can vary between k8s providers such as k8s auth info, s3 server, and ingress cluster IP. Remove unused variables.
  4. wmcs: Introduce new module for WMCS Magnum based k8s provisioning. Outputs are all consistent with the digitalocean module so either can be used in the main cluster config.
  5. externaldns: Refactor to use the newer version.
  6. externaldns-designate: New module for DNS management via OpenStack Designate. This gives us host names for externally facing services (e.g. registry.gitlab-runners-staging.wmcloud.org).
  7. cluster: Select k8s provider module based on new cluster_provider variable and set a cluster local for provider module outputs. Refactor old references to digitalocean properties and use local.cluster instead. Move provider specific resources into wmcs.tf and digitalocean.tf.
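
The provider selection described in items 4 and 7 could look roughly like the following. This is only a sketch and is not taken from the merge request: the module paths, the `provider_modules` local, and the `ingress_ip` output name are assumptions for illustration. Only `cluster_provider` and `local.cluster` come from the commit message.

```hcl
# Illustrative sketch only. Module paths and the ingress_ip output name are
# assumptions, not copied from gitlab-cloud-runner.

variable "cluster_provider" {
  description = "Which k8s provider module backs the cluster."
  type        = string
  default     = "digitalocean"

  validation {
    condition     = contains(["digitalocean", "wmcs"], var.cluster_provider)
    error_message = "The cluster_provider value must be \"digitalocean\" or \"wmcs\"."
  }
}

# Only the selected provider module is instantiated; the other has count 0.
module "digitalocean" {
  source = "./modules/digitalocean" # assumed path
  count  = var.cluster_provider == "digitalocean" ? 1 : 0
}

module "wmcs" {
  source = "./modules/wmcs" # assumed path
  count  = var.cluster_provider == "wmcs" ? 1 : 0
}

locals {
  # one() yields the single module instance, or null when count is 0.
  provider_modules = {
    digitalocean = one(module.digitalocean[*])
    wmcs         = one(module.wmcs[*])
  }

  # The rest of the config reads provider outputs through local.cluster.
  cluster = local.provider_modules[var.cluster_provider]
}

output "ingress_ip" {
  value = local.cluster.ingress_ip # assumed output name
}
```

This only works if both modules export the same output names, which is why the commit message stresses that the wmcs outputs are consistent with the digitalocean module.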

Performance is comparable to the existing DO runners/buildkitd when using the new 4xiops volumes. We will also want node flavors with better IOPS in a production cluster. Multiple setup/teardown cycles show the Tofu-based provisioning to be fairly reliable, or at least as reliable as Tofu provisioning of DO resources.

Awesome work @dduvall, very excited to see this working! <3