Re-consider setting up a Kubernetes cluster on the Beta cluster
Open, Needs Triage, Public

Description

In T220235: Migrate Beta cluster services to use Kubernetes, it was decided to use plain Docker VMs for services that run on the production Kubernetes cluster. Given that the core MediaWiki application is moving to Kubernetes in the future (MW-on-K8s) and that MediaWiki depends on more and more services, I believe it's a good idea to at least revisit that decision.

Beta cluster currently runs some services in Docker containers on dedicated VMs (deployment-docker-*) and a few on the legacy deployment-sca[01-02] cluster, which runs Jessie and has Puppet totally broken due to production setup changes. New services are rarely deployed, and existing ones are rarely updated, if at all.

Resource- and implementation-wise, I am unfortunately not sure how feasible this is. The project is almost out of Cloud VPS quota (T257118), and setting up and maintaining a Kubernetes cluster and the services running in it requires a considerable amount of time; I'm not sure who would be able to put that time into this project.

Toolforge/PAWS Kubernetes clusters use HAProxy with keepalived for control plane load balancing. We could also set up prod-like LVS on beta (T196662), because that's what prod uses for Kubernetes control planes (afaik, please correct me if I'm wrong). Since I'd like to get LVS set up eventually, that would avoid creating dedicated HAProxy VMs for this, even if it would be more work to set up.
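For anyone unfamiliar with what the load balancer in front of the control plane actually does, here is a rough Python sketch of the idea: health-check each API server and rotate requests across the ones that answer. This is only an illustration, not how HAProxy or LVS is configured in Toolforge, PAWS, or prod. The hostnames are hypothetical, and port 6443 is assumed only because it is the conventional kube-apiserver port.

```python
import socket
from itertools import cycle
from typing import Callable, Iterator, List, Tuple

Backend = Tuple[str, int]


def tcp_check(backend: Backend, timeout: float = 1.0) -> bool:
    """Return True if a plain TCP connection to the backend succeeds."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        return False


def round_robin(
    backends: List[Backend],
    check: Callable[[Backend], bool] = tcp_check,
) -> Iterator[Backend]:
    """Yield healthy backends in rotation, checking each one before use."""
    pool = cycle(backends)
    while True:
        # Try every backend at most once per pick before giving up.
        for _ in range(len(backends)):
            candidate = next(pool)
            if check(candidate):
                yield candidate
                break
        else:
            raise RuntimeError("no healthy control plane backend")


# Hypothetical control plane nodes; these names are made up for illustration.
CONTROL_PLANE = [
    ("deployment-k8s-control-1.example", 6443),
    ("deployment-k8s-control-2.example", 6443),
    ("deployment-k8s-control-3.example", 6443),
]
```

Calling `next(round_robin(CONTROL_PLANE))` would return the first node that accepts a TCP connection. keepalived (or LVS) additionally solves the problem of the balancer itself being a single point of failure, which this sketch does not attempt to model.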

Event Timeline

There is some discussion at T215217#6610236 that is related to this.

I chatted about this on IRC with some people some time ago, the main takeaways basically were:

  • Something needs to be done at some point, since a) things will break when MW-on-K8s becomes reality and b) the current solution for services is not ideal
  • We're not sure if we should improve the current docker thing or make a k8s cluster and tooling for it
  • The overhead for a small cluster isn't that big, but that wouldn't be prod-like(TM)
    • <+bd808> it's pretty minimal. for PAWS we are running the etcd nodes and k8s control nodes collapsed together. So it's like 3 small instances (plus a proxy, can be just a haproxy node or two or a full-blown LVS setup) https://openstack-browser.toolforge.org/project/paws
  • Setting up a cluster isn't a problem; building and maintaining the tooling to semi-automatically maintain it and the services running in it is the problem
  • <Krenair> I suspect to make informed decisions someone might need to sit down and figure out how easy it is to take prod's k8s setup and apply the equivalent inside labs, and figure out what ongoing maintenance is needed, what is needed to keep up with prod, etc.

T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu resulted in https://gitlab.wikimedia.org/cloudvps-repos/deployment-prep/tofu-provisioning, which currently provisions a minimal Kubernetes cluster via OpenStack Magnum. Additional work is needed to figure out how to provide ingress to this cluster and to understand what changes would be needed to use https://wikitech.wikimedia.org/wiki/Kubernetes/Deployment_Charts to provision services there.