
Upgrade PAWS to Kubernetes 1.21
Closed, Resolved · Public

Description

For Toolforge, see T282942.

https://etherpad.wikimedia.org/p/WMCS-2021-11-10-paws-k8s-upgrade

preparation

  • disable puppet on all control/worker/ingress nodes

from: cloud-cumin-04.cloudinfra.eqiad1.wikimedia.cloud

  • sudo cumin project:paws # check that it finds the expected hosts
  • sudo cumin project:paws 'puppet agent --disable "upgrading k8s rook"'
  • update the project-wide (plus control-specific) version hiera key

FROM:
profile::wmcs::kubeadm::component: 'thirdparty/kubeadm-k8s-1-20'
TO:
profile::wmcs::kubeadm::component: 'thirdparty/kubeadm-k8s-1-21'

  • update the topic on #wikimedia-cloud from "Status: Ok" to "Status: upgrading paws k8s"

first api server / paws-k8s-control-1.paws.eqiad1.wikimedia.cloud

  • kubectl drain paws-k8s-control-1 --ignore-daemonsets
  • run-puppet-agent --force && apt-get install kubeadm
  • kubeadm upgrade plan 1.21.8
  • kubeadm upgrade apply 1.21.8
  • apt-get install kubelet kubectl docker-ce containerd.io helm
  • systemctl restart kubelet.service docker.service
  • cp /etc/kubernetes/admin.conf /root/.kube/config
  • kubectl uncordon paws-k8s-control-1
  • Don't proceed until you've waited a minute or two and observed that the static pods on this control node are NOT crash-looping and are healthy; see the example check after this list.
    • if the scheduler hits a permission issue, reboot the node
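
A quick way to check the static pods, as a minimal sketch (assuming the kubeadm-default kube-system namespace; the grep pattern is just the node name):

kubectl get pods -n kube-system -o wide | grep paws-k8s-control-1
# kube-apiserver, kube-scheduler, kube-controller-manager and etcd should be Running with a stable RESTARTS count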

paws-k8s-control-2.paws.eqiad1.wikimedia.cloud

  • kubectl drain paws-k8s-control-2 --ignore-daemonsets
  • run-puppet-agent --force && apt-get install kubeadm
  • kubeadm upgrade node
  • apt-get install kubelet kubectl docker-ce containerd.io helm
  • systemctl restart kubelet.service docker.service
  • cp /etc/kubernetes/admin.conf /root/.kube/config
  • kubectl uncordon paws-k8s-control-2
  • Don't proceed until you've waited a minute or two and observed that the static pods on this control node are NOT crash-looping and are healthy. (Last time: scheduler permission issue; rebooted.)

paws-k8s-control-3.paws.eqiad1.wikimedia.cloud

  • kubectl drain paws-k8s-control-3 --ignore-daemonsets
  • run-puppet-agent --force && apt-get install kubeadm
  • kubeadm upgrade node
  • apt-get install kubelet kubectl docker-ce containerd.io helm
  • systemctl restart kubelet.service docker.service
  • cp /etc/kubernetes/admin.conf /root/.kube/config
  • kubectl uncordon paws-k8s-control-3
  • Don't proceed until you've waited a minute or two and observed that the static pods on this control node are NOT crash-looping and are healthy. (Last time: controller manager permission issue; rebooted.) Then confirm all three control nodes report the new version; see the check below.
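
A minimal sanity check before moving on to the ingress and worker nodes (plain kubectl, run from a control node with /root/.kube/config in place):

kubectl get nodes
# the three paws-k8s-control-* nodes should be Ready and show VERSION v1.21.8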

upgrade the ingress and worker nodes with:
user@laptop:~/stuff/wm-puppet.git $ modules/kubeadm/files/wmcs-k8s-node-upgrade.py --control paws-k8s-control-1 --project paws --file nodelist.txt --src 1.20.11 --dst 1.21.8
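
nodelist.txt is assumed to hold one hostname per line; a sketch for the worker pass, using the node names listed below (the ingress pass would list paws-k8s-ingress-4 and paws-k8s-ingress-3 instead, after the special procedure in the next section):

paws-k8s-worker-7
paws-k8s-worker-6
paws-k8s-worker-5
paws-k8s-worker-4
paws-k8s-worker-3
paws-k8s-worker-2
paws-k8s-worker-1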

ingress workers

(use the script only after following the special procedures in https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes#Ingress_nodes)
In this case, though, first scale the ingress controller down to a single replica:

kubectl -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=1

There are only 2 ingress nodes, so upgrading either the first or the second will cause some downtime: the remaining controller pod has to be killed in order to drain its node. The claim that we need to scale down like this is only rumor; it may be better not to scale at all. Once both nodes are done, scale back up (see below).
paws-k8s-ingress-4
paws-k8s-ingress-3
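
If the deployment was scaled down, scale it back once both ingress nodes are upgraded (assuming the original replica count was 2, one per ingress node):

kubectl -n ingress-nginx-gen2 scale deployment ingress-nginx-gen2-controller --replicas=2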

workers

paws-k8s-worker-7
paws-k8s-worker-6
paws-k8s-worker-5
paws-k8s-worker-4
paws-k8s-worker-3
paws-k8s-worker-2
paws-k8s-worker-1

  • based on last time, it seems safe to run two upgrade processes at once to speed things up
    • that was on Toolforge; unclear whether it also holds for PAWS
  • NOTE: ingress controller pods may get rescheduled onto normal workers during the upgrade and must be deleted so they land back on the ingress nodes; see the sketch below
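
A minimal sketch for spotting and fixing misplaced ingress pods (the pod name here is hypothetical; deleting the pod lets the deployment reschedule it):

kubectl -n ingress-nginx-gen2 get pods -o wide
# if a controller pod shows a paws-k8s-worker-* node, delete it:
kubectl -n ingress-nginx-gen2 delete pod ingress-nginx-gen2-controller-<hash>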

finishing touches

  • merge https://github.com/toolforge/paws/pull/148
  • re-enable puppet on paws nodes from cloud-cumin-04.cloudinfra.eqiad1.wikimedia.cloud
  • sudo cumin project:paws 'puppet agent --enable'
  • revert topic changes on -cloud
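
As a final check, confirm the whole cluster reports the new version (a minimal sketch):

kubectl get nodes -o wide
# every control, ingress, and worker node should be Ready on v1.21.8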
