
Investigate moving PAWS to magnum
Closed, Resolved · Public

Description

Right now there are some custom bits of PAWS tied to our particular k8s setup, which keeps us stuck on our current deploy of k8s. Ideally we would be able to deploy PAWS to any vanilla k8s cluster. Investigate that.

Event Timeline

Chicocvenancio subscribed.

I think volume persistence is the biggest blocker here, right?

Yes, I believe that would be the largest obstacle, and it is the bulk of the code I was working on in
https://github.com/toolforge/paws/compare/T308873
The basic idea is that we would still use NFS, but k8s would do the mounting of it rather than the worker nodes. That makes us more k8s native, and we could, in concept, lift PAWS up and move it to a different k8s cluster. I got this working in a devstack deployment, though that doesn't really test the main detail: does this actually mount our Cloud VPS NFS? (I made a local NFS server, separate from k8s, to test against.) We're starting to tinker with Magnum in codfw1dev, but it is not working yet to test there.
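
For illustration only, the plain k8s-native way to have the cluster mount NFS (rather than the worker nodes mounting it) is an NFS-backed PersistentVolume plus a claim; the server address, export path, and sizes below are placeholders, not necessarily what the branch above actually uses:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: paws-nfs-home
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.svc.example.org   # placeholder for the Cloud VPS NFS server
    path: /srv/paws               # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: paws-nfs-home
  namespace: prod
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: paws-nfs-home
  resources:
    requests:
      storage: 100Gi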

There is a functional prototype in codfw1dev. You need to hack on your hosts file to get it working, though.

The prototype runs in Magnum, in parallel with production PAWS. The Magnum cluster was deployed with:
openstack coe cluster create rook1 --cluster-template core-34-k8s21-100g --master-count 1 --node-count 1 --floating-ip-disabled
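
Cluster creation takes a while; one way to watch for it to reach CREATE_COMPLETE before generating the kubeconfig (cluster name as above):

openstack coe cluster show rook1 -c status
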
kube config file generated with:
openstack coe cluster config rook1 --dir /tmp/
Access is limited to clients that can reach the cluster directly (or that pass --insecure-skip-tls-verify to kubectl). That is probably not a problem, since this can be managed from the bastion, and can certainly be managed from another node in the project.
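
Assuming the command above wrote the kubeconfig to /tmp/config, usage looks something like:

export KUBECONFIG=/tmp/config
kubectl get nodes
kubectl --insecure-skip-tls-verify get nodes   # only if the cert doesn't match the address in use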

nginx ingress is needed, as described in https://kubernetes.github.io/ingress-nginx/deploy/. That guide deploys a LoadBalancer-type service, which you can still use by putting the port in haproxy, though this isn't ideal; additional work is needed to identify how to move this to a native NodePort like the current cluster uses.
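
As a sketch of that remaining work (the manifest version pinned here is only an example; the linked guide has the current one), the LoadBalancer service the guide creates can be switched to NodePort after install:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.5.1/deploy/static/provider/cloud/deploy.yaml
kubectl -n ingress-nginx patch svc ingress-nginx-controller -p '{"spec": {"type": "NodePort"}}'
kubectl -n ingress-nginx get svc ingress-nginx-controller   # note the assigned node ports to plug into haproxy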

We will need to add the following annotation on various Ingress entries:
kubernetes.io/ingress.class: "nginx"
We might also need:

spec:
  ingressClassName: nginx
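
Put together, an Ingress entry would look roughly like this (the object and backend service names here are illustrative, not necessarily what the PAWS chart produces):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: paws-hub
  namespace: prod
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  ingressClassName: nginx
  rules:
    - host: hub.paws.wmcloud.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: proxy-public   # illustrative backend service name
                port:
                  number: 80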

At this point I pointed the extra floating IP in PAWS at a test haproxy instance, and the haproxy at the Magnum cluster. This seems to work (though it could easily be knocked over, it is very small) after editing the hosts file:

185.15.56.58 hub.paws.wmcloud.org
185.15.56.58 paws.wmcloud.org
185.15.56.58 paws.wmflabs.org
185.15.56.58 public.paws.wmcloud.org
185.15.56.58 paws-public.wmflabs.org

(Don't rely on this working; I remove it regularly for tinkering.)
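
As an alternative to editing the hosts file, a single request can be spot-checked with curl's --resolve flag, e.g.:

curl -sI --resolve hub.paws.wmcloud.org:443:185.15.56.58 https://hub.paws.wmcloud.org/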

The remaining notes are old but were used; additional research is needed on which of these steps are still required. Some parts are elaborated further in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Devstack_magnum/PAWS_dev_devstack

kubectl config set-context --current --namespace=prod
kubectl rollout restart -n kube-system daemonset.apps/kube-flannel-ds # not sure if this is needed any longer

cd paws
cd cloud-provider-openstack
kubectl create -f manifests/cinder-csi-plugin/csi-secret-cinderplugin.yaml # not sure if this is needed any longer
kubectl apply -f manifests/cinder-csi-plugin/  # not sure if this is needed any longer

cd ..
kubectl apply -f sc.yaml # not sure if this is needed any longer
cd <paws git directory>
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm dep up paws/
kubectl create namespace prod
helm install paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml --timeout=50m
kubectl apply -f manifests/psp.yaml
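
A quick sanity check after the install (not part of the original notes) is just to confirm the pods and Ingress objects come up:

kubectl --namespace prod get pods
kubectl --namespace prod get ingress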

So long as we like this solution as is, we would need to test a Magnum upgrade, for which we will need to be on OpenStack Yoga, as Xena appears to be limited to k8s 1.21.

cluster template created with:

openstack coe cluster template create core-34-k8s21-100g --image magnum-fedora-coreos-34 --external-network wan-transport-eqiad --fixed-network lan-flat-cloudinstances2b --fixed-subnet cloud-instances2-b-eqiad --dns-nameserver 8.8.8.8 --network-driver flannel --docker-storage-driver overlay2 --docker-volume-size 100 --master-flavor g3.cores1.ram2.disk20 --flavor g3.cores1.ram2.disk20 --coe kubernetes --labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true --public

Upgrading the cluster

openstack coe cluster upgrade rook3 core-34-k8s22-100g
This updates the control node but not the worker; the cluster status is marked UPDATE_FAILED, and retrying the upgrade command gives:
Upgrading a cluster when status is "UPDATE_FAILED" is not supported (HTTP 400) (Request-ID: req-5f1a6d9f-6331-4d12-8be9-d379e6cec237)
The following clears UPDATE_FAILED and returns the status to UPDATE_IN_PROGRESS:
openstack coe cluster resize <cluster name> <current number of worker nodes>
Eventually it is marked "UPDATE_COMPLETE", but only the control node has been updated, not the worker. Running the upgrade again doesn't seem to do anything. Scaling the cluster to 0 workers, then back to one, seems to get the whole cluster upgraded, though the haproxy then needs to be updated to point at the new worker nodes.
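
Putting that recovery procedure together as commands (cluster name and worker count are the ones from these notes; adjust to the cluster at hand):

openstack coe cluster show rook3 -c status    # confirm UPDATE_FAILED / UPDATE_IN_PROGRESS / UPDATE_COMPLETE
openstack coe cluster resize rook3 1          # clears UPDATE_FAILED, back to UPDATE_IN_PROGRESS
openstack coe cluster resize rook3 0          # drop the worker...
openstack coe cluster resize rook3 1          # ...and bring back an upgraded one, then repoint haproxy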

The value of this upgrade method is somewhat questionable compared to deploying a new cluster and redeploying PAWS to it, as the latter is fairly trivial. The new-cluster method also has no downtime (though in PAWS's case it may confuse existing auth sessions, requiring users to log out and back in), and it tests disaster recovery on each upgrade.

rook renamed this task from "Investigate moving PAWS to magnum or other k8s" to "Investigate moving PAWS to magnum". Jan 10 2023, 3:48 PM