Right now there are some custom bits of PAWS tied to our particular k8s deployment, which keeps us stuck on our current deploy of k8s. Ideally we would be able to deploy PAWS to any vanilla k8s cluster. Investigate that.
Description
Status | Assigned | Task
---|---|---
Resolved | rook | T280792 Investigate Openstack Magnum
Resolved | rook | T308873 Investigate moving PAWS to magnum
Resolved | rook | T321886 mount nfs directly into pods
Resolved | rook | T326258 k8s 1.22 magnum template for PAWS
Resolved | rook | T326264 Deploy paws to magnum
Resolved | rook | T326257 k8s 1.21 magnum template for PAWS
Resolved | rook | T326260 Normalize PAWS resource usage
Resolved | rook | T326262 Temporary increase of PAWS quota
Resolved | None | T325540 Nodeport for ingress-nginx
Resolved | rook | T325746 ingress-nginx
Resolved | rook | T325812 upgrade jupyterhub chart
Resolved | rook | T326268 New trove db for magnum
Resolved | rook | T326276 Deploy paws dev to codfw1dev
Resolved | Andrew | T326331 Deploy paws-dev trove db
Resolved | rook | T326588 open refine not loading in codfw1dev
Resolved | rook | T326631 Setup nfs for paws-dev
Resolved | rook | T326629 Setup DNS for paws-dev.codfw1dev.wmcloud.org.
Resolved | rook | T326723 env vars for nbserve and renderer requests
Event Timeline
Yes, I believe that would be the largest obstacle. The bulk of the code that I was working on is in
https://github.com/toolforge/paws/compare/T308873
The basic idea is that we would still use NFS, but k8s would do the mounting of it rather than the worker nodes. That makes us more k8s native, and we could, in principle, lift up and move to a different k8s cluster. I got this working in a devstack deployment, though that doesn't really test the main question: does this actually mount our Cloud VPS NFS? (I made a local NFS server, separate from k8s, to test against.) We're starting to tinker with Magnum in codfw1dev, but it is not yet working well enough to test there.
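As a rough sketch of what "k8s does the mounting" means, here is a minimal pod spec using an in-line `nfs` volume. The server address, export path, and all names below are placeholders for illustration, not the real Cloud VPS NFS details (PAWS itself wires this through its Helm chart rather than a bare pod):

```yaml
# Sketch only: Kubernetes mounts the NFS export itself, rather than the
# worker node pre-mounting it and the pod using a hostPath.
apiVersion: v1
kind: Pod
metadata:
  name: nfs-mount-test
spec:
  containers:
    - name: shell
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: paws-home
          mountPath: /home/paws
  volumes:
    - name: paws-home
      nfs:
        server: nfs.example.wmcloud.org  # placeholder, not the real server
        path: /srv/paws                  # placeholder export path
```

If a pod like this schedules and the mount succeeds on a vanilla cluster, the worker images no longer need any NFS-specific customization.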
There is a functional prototype in codfw1dev. You need to hack on your hosts file to get it working, though.
A prototype exists in Magnum, in parallel with prod PAWS. The Magnum cluster was deployed with:
openstack coe cluster create rook1 --cluster-template core-34-k8s21-100g --master-count 1 --node-count 1 --floating-ip-disabled
kube config file generated with:
openstack coe cluster config rook1 --dir /tmp/
Access is limited to things that can reach the cluster directly (or to kubectl requests made with --insecure-skip-tls-verify). That's probably not a problem, as the cluster can probably be managed from the bastion, and can certainly be managed from another node in the project.
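For reference, the TLS workaround lives in the generated kubeconfig's cluster entry. A sketch of what that entry looks like with verification disabled (the server address here is a placeholder; use the one `openstack coe cluster config` wrote out):

```yaml
# Sketch: kubeconfig cluster entry with TLS verification disabled.
# Only sensible when the API server cert doesn't match the address in use.
clusters:
  - name: rook1
    cluster:
      server: https://CLUSTER-API-ADDRESS:6443  # placeholder address
      insecure-skip-tls-verify: true
```

Managing the cluster from the bastion (which can reach the cluster directly) avoids needing this at all.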
nginx ingress is needed, as described in https://kubernetes.github.io/ingress-nginx/deploy/. That guide deploys a LoadBalancer-type service, which you can still use by putting the port in haproxy, but this isn't ideal; additional work is needed to identify how to move this to a native NodePort like the current cluster uses.
We will need to add the annotation
kubernetes.io/ingress.class: "nginx"
to various Ingress entries. We might also need to set spec.ingressClassName: nginx.
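To show where both settings would land, here is a sketch of an Ingress object carrying the annotation and the ingressClassName together. The object name, host, backend service name, and port are all placeholders (the real values come from the PAWS chart):

```yaml
# Sketch: placement of the ingress-class annotation and ingressClassName.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: paws-hub            # placeholder name
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  ingressClassName: nginx
  rules:
    - host: hub.paws.wmcloud.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hub   # placeholder service
                port:
                  number: 8081
```

The annotation is the older mechanism and ingressClassName the newer one; newer ingress-nginx releases honor ingressClassName, so setting both covers either behavior.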
At this point I pointed the extra floating IP in paws at a test haproxy instance, and the haproxy at the magnum cluster. This seems to work (though it could easily be knocked over; it is very small) after editing your hosts file:
185.15.56.58 hub.paws.wmcloud.org
185.15.56.58 paws.wmcloud.org
185.15.56.58 paws.wmflabs.org
185.15.56.58 public.paws.wmcloud.org
185.15.56.58 paws-public.wmflabs.org
(Don't rely on this working, I remove it regularly for tinkering.)
The remaining notes are old, but they were used; additional research is needed into which of these steps are still required. Some parts are elaborated further at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Devstack_magnum/PAWS_dev_devstack
kubectl config set-context --current --namespace=prod
kubectl rollout restart -n kube-system daemonset.apps/kube-flannel-ds # not sure if this is needed any longer
cd paws
cd cloud-provider-openstack
kubectl create -f manifests/cinder-csi-plugin/csi-secret-cinderplugin.yaml # not sure if this is needed any longer
kubectl apply -f manifests/cinder-csi-plugin/ # not sure if this is needed any longer
cd ..
kubectl apply -f sc.yaml # not sure if this is needed any longer
cd <paws git directory>
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm dep up paws/
kubectl create namespace prod
helm install paws --namespace prod ./paws -f paws/secrets.yaml -f paws/production.yaml --timeout=50m
kubectl apply -f manifests/psp.yaml
So long as we like this solution as-is, we will need to test a Magnum upgrade, for which we will need to be on OpenStack Yoga, as Xena appears to be limited to k8s 1.21.
The cluster template was created with:
openstack coe cluster template create core-34-k8s21-100g --image magnum-fedora-coreos-34 --external-network wan-transport-eqiad --fixed-network lan-flat-cloudinstances2b --fixed-subnet cloud-instances2-b-eqiad --dns-nameserver 8.8.8.8 --network-driver flannel --docker-storage-driver overlay2 --docker-volume-size 100 --master-flavor g3.cores1.ram2.disk20 --flavor g3.cores1.ram2.disk20 --coe kubernetes --labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true --public
Upgrading the cluster
openstack coe cluster upgrade rook3 core-34-k8s22-100g
This updates the control node but not the worker; the cluster status is marked UPDATE_FAILED, and retrying the upgrade command gives:
Upgrading a cluster when status is "UPDATE_FAILED" is not supported (HTTP 400) (Request-ID: req-5f1a6d9f-6331-4d12-8be9-d379e6cec237)
The following clears the failed state and returns the status to UPDATE_IN_PROGRESS:
openstack coe cluster resize <cluster name> <current number of worker nodes>
Eventually it is marked "UPDATE_COMPLETE", but only the control node has been updated, not the worker. Running the upgrade again doesn't seem to do anything. Scaling the cluster to 0 workers, then back up, seems to get the whole cluster upgraded. The haproxy then needs to be updated to point at the new worker nodes.
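For context on that last step, the haproxy side is just a backend listing the worker addresses and the ingress NodePort; after a resize replaces the workers, these server lines are what must be re-pointed. All names, addresses, and the port below are placeholders:

```
# Sketch: haproxy backend for the magnum workers' ingress NodePort.
# Replace the server address(es) after every cluster resize/upgrade.
backend paws_k8s
    mode http
    server worker-0 WORKER-IP:NODEPORT check  # placeholder address and port
```

This manual re-pointing is part of why a native NodePort setup with stable, discoverable workers (or automating the haproxy config) is worth the additional work mentioned above.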
The value of this upgrade method is questionable compared to deploying a new cluster and redeploying PAWS to it, as the latter is fairly trivial. Deploying a new cluster also involves no downtime (though in PAWS's case it may confuse existing auth sessions, requiring users to log out and back in), and it exercises disaster recovery on each upgrade.