
Upgrade PAWS k8s to 1.17
Closed, ResolvedPublic

Description

Now that the toolsbeta cluster is running 1.17 successfully, it seems like a good idea to run PAWS on it as well. I've run PAWS locally at version 1.19, so I'm quite sure it will work on 1.17, since both maintain_kubeusers and local PAWS do.

1.16 is EoL.

Warning on this: PAWS uses a stacked control plane, so we must be careful about etcd and not be too hasty.
https://v1-17.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes
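
Before touching the stacked control plane it is worth confirming that all three etcd members are healthy. A minimal sketch, assuming the kubeadm-default certificate paths and that the etcd image ships a shell and etcdctl:

    # run from a control node with admin credentials; cert paths are the kubeadm defaults
    kubectl -n kube-system exec etcd-paws-k8s-control-1 -- sh -c \
        'ETCDCTL_API=3 etcdctl \
            --endpoints=https://127.0.0.1:2379 \
            --cacert=/etc/kubernetes/pki/etcd/ca.crt \
            --cert=/etc/kubernetes/pki/etcd/server.crt \
            --key=/etc/kubernetes/pki/etcd/server.key \
            member list'

All three members should be listed as started before (and after) each control-plane node is upgraded.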

Event Timeline

Change 644202 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: kubeadm: refresh version defaults

https://gerrit.wikimedia.org/r/644202

Change 644202 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: kubeadm: refresh version defaults

https://gerrit.wikimedia.org/r/644202

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T12:49:14Z] <arturo> set hiera profile::wmcs::kubeadm::component: 'thirdparty/kubeadm-k8s-1-17' at project level (T268669)

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T12:49:47Z] <arturo> disable puppet in all k8s nodes to prepare for the upgrade (T268669)

aborrero triaged this task as Medium priority. Nov 30 2020, 3:40 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T15:49:56Z] <bstorm> draining paws-k8s-control-1 for upgrade T268669

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T15:53:17Z] <bstorm> proceeding with upgrade to 1.17 on paws-k8s-control-1 T268669

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T16:17:20Z] <bstorm> starting upgrade on paws-k8s-control-2 T268669 (first kubectl drain paws-k8s-control-2 --ignore-daemonsets)

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T16:31:44Z] <bstorm> upgrading pods on paws-k8s-control-3 T268669

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T17:14:05Z] <bstorm> updated the calico-kube-controllers deployment to use our internal registry to deal with docker-hub rate-limiting T268669 T269016

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T17:25:25Z] <bstorm> upgrading the worker nodes (this will likely kill services briefly when some pods are rescheduled) T268669

Mentioned in SAL (#wikimedia-cloud) [2020-11-30T18:22:00Z] <bstorm> 1.17 upgrade for kubernetes complete T268669

Bstorm claimed this task.

Saving the notes from the upgrade here for future reference: https://etherpad.wikimedia.org/p/WMCS-2020-11-30-paws-k8s-upgrade

= PAWS k8s upgrade 2020-11-30 =

Task: https://phabricator.wikimedia.org/T268669
Docs: https://v1-17.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
Docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes

== control plane ==

* {{done}} hiera change in project:
    from
    profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-16
    to
    profile::wmcs::kubeadm::component: 'thirdparty/kubeadm-k8s-1-17'

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/00a55cca66abc1ea05c73ec2eacd6645f979fa0e%5E%21/


* {{done}} run puppet agent to pick up the hiera change:

aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O{project:paws name:^paws-k8s-[iwc].*}' 'run-puppet-agent'

* {{done}} verify the kubeadm package version is right (output of apt-cache policy kubeadm):

kubeadm:
  Installed: 1.16.10-00
  Candidate: 1.17.13-00
  Version table:
     1.17.13-00 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/thirdparty/kubeadm-k8s-1-17 amd64 Packages
 *** 1.16.10-00 100
        100 /var/lib/dpkg/status
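
To confirm the candidate version flipped on every node, the same check can be run fleet-wide with cumin (a hypothetical one-liner, reusing the selector from the puppet run above):

    sudo cumin --force -x 'O{project:paws name:^paws-k8s-[iwc].*}' 'apt-cache policy kubeadm | grep Candidate'

Every node should report Candidate: 1.17.13-00 before any of them is upgraded.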

* install latest kubeadm version on the control nodes (as we go; it must not happen before `kubeadm upgrade plan`, or the kubelet freaks out in this upgrade)

* {done} kubectl drain paws-k8s-control-1 --ignore-daemonsets
* {done} kubeadm upgrade plan 1.17.13

root@paws-k8s-control-1:~# kubeadm upgrade plan 1.17.13
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.16.10
[upgrade/versions] kubeadm version: v1.16.10

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT            CURRENT    AVAILABLE
Kube Proxy           v1.16.10   1.17.13
CoreDNS              1.6.2      1.6.2
Etcd                 3.3.15     3.3.15-0

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply 1.17.13

Note: Before you can perform this upgrade, you have to update kubeadm to 1.17.13.

_____________________________________________________________________

* {done} install latest kubeadm version (this also upgrades the kubernetes-cni package, which is where things can get weird)

The following packages will be upgraded:
  kubeadm kubernetes-cni
2 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
Need to get 33.1 MB of archives.
After this operation, 20.9 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Get:1 http://apt.wikimedia.org/wikimedia buster-wikimedia/thirdparty/kubeadm-k8s-1-17 amd64 kubernetes-cni amd64 0.8.7-00 [25.0 MB]
Get:2 http://apt.wikimedia.org/wikimedia buster-wikimedia/thirdparty/kubeadm-k8s-1-17 amd64 kubeadm amd64 1.17.13-00 [8,066 kB]
Fetched 33.1 MB in 0s (68.3 MB/s)
(Reading database ... 67983 files and directories currently installed.)
Preparing to unpack .../kubernetes-cni_0.8.7-00_amd64.deb ...
Unpacking kubernetes-cni (0.8.7-00) over (0.7.5-00) ...
Preparing to unpack .../kubeadm_1.17.13-00_amd64.deb ...
Unpacking kubeadm (1.17.13-00) over (1.16.10-00) ...
Setting up kubernetes-cni (0.8.7-00) ...
Setting up kubeadm (1.17.13-00) ...
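
Since the kubernetes-cni bump is the part flagged as potentially weird, a quick sanity check that Calico is still healthy afterwards is cheap (a sketch; the k8s-app labels are the usual Calico conventions and may differ in this cluster):

    kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
    kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers
    kubectl get nodes   # all nodes should still report Ready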

* {done} kubeadm upgrade apply v1.17.13

root@paws-k8s-control-1:~# kubeadm upgrade apply 1.17.13
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/version] You have chosen to change the cluster version to "v1.17.13"
[upgrade/versions] Cluster version: v1.16.10
[upgrade/versions] kubeadm version: v1.17.13
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler etcd]
[upgrade/prepull] Prepulling image for component etcd.
[upgrade/prepull] Prepulling image for component kube-controller-manager.
[upgrade/prepull] Prepulling image for component kube-scheduler.
[upgrade/prepull] Prepulling image for component kube-apiserver.
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-apiserver
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-etcd
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-etcd
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[upgrade/prepull] Prepulled image for component etcd.
[upgrade/prepull] Prepulled image for component kube-scheduler.
[upgrade/prepull] Prepulled image for component kube-controller-manager.
[upgrade/prepull] Prepulled image for component kube-apiserver.
[upgrade/prepull] Successfully prepulled the images for all the control plane components
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.17.13"...
Static pod: kube-apiserver-paws-k8s-control-1 hash: 6239b2cd1110870c8bc691b23d72aad5
Static pod: kube-controller-manager-paws-k8s-control-1 hash: bf5c769e24ccb5e5b627eb183970be9e
Static pod: kube-scheduler-paws-k8s-control-1 hash: 1a94b0dae4fa71a906a01214e60bceb2
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-paws-k8s-control-1 hash: d3b8dc2699c62a9b4b2eb5b86c7da091
[upgrade/staticpods] Preparing for "etcd" upgrade
[upgrade/staticpods] Renewing etcd-server certificate
[upgrade/staticpods] Renewing etcd-peer certificate
[upgrade/staticpods] Renewing etcd-healthcheck-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-11-30-15-54-02/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: etcd-paws-k8s-control-1 hash: d3b8dc2699c62a9b4b2eb5b86c7da091
Static pod: etcd-paws-k8s-control-1 hash: d3b8dc2699c62a9b4b2eb5b86c7da091
Static pod: etcd-paws-k8s-control-1 hash: c4de4fd277e953a661b754ff11023a90
[apiclient] Found 3 Pods for label selector component=etcd
[upgrade/staticpods] Component "etcd" upgraded successfully!
[upgrade/etcd] Waiting for etcd to become available
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests567360152"
[controlplane] Adding extra host path mount "admission-config-dir" to "kube-apiserver"
W1130 15:54:09.791202    8629 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Renewing apiserver-etcd-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-11-30-15-54-02/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-paws-k8s-control-1 hash: 6239b2cd1110870c8bc691b23d72aad5
Static pod: kube-apiserver-paws-k8s-control-1 hash: 4c94e8edcfb82128c08bbdfb98212cc7
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-11-30-15-54-02/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-paws-k8s-control-1 hash: bf5c769e24ccb5e5b627eb183970be9e
Static pod: kube-controller-manager-paws-k8s-control-1 hash: df3fdce2d7afdd5a2a0db4bc88195953
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2020-11-30-15-54-02/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-scheduler-paws-k8s-control-1 hash: 1a94b0dae4fa71a906a01214e60bceb2
Static pod: kube-scheduler-paws-k8s-control-1 hash: 136f43f47ab20f6691db58d3b2a196da
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.17" in namespace kube-system with the configuration for the kubelets in the cluster
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.17" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons]: Migrating CoreDNS Corefile
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.17.13". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.

root@paws-k8s-control-1:~# kubectl get nodes
NAME                 STATUS                     ROLES     AGE    VERSION
paws-k8s-control-1   Ready,SchedulingDisabled   master    187d   v1.16.10
paws-k8s-control-2   Ready                      master    187d   v1.16.10
paws-k8s-control-3   Ready                      master    187d   v1.16.10
paws-k8s-ingress-1   Ready                      ingress   179d   v1.16.10
paws-k8s-ingress-2   Ready                      ingress   179d   v1.16.10
paws-k8s-worker-1    Ready                      <none>    187d   v1.16.10
paws-k8s-worker-2    Ready                      <none>    187d   v1.16.10
paws-k8s-worker-3    Ready                      <none>    187d   v1.16.10
paws-k8s-worker-4    Ready                      <none>    187d   v1.16.10
paws-k8s-worker-5    Ready                      <none>    157d   v1.16.10
paws-k8s-worker-6    Ready                      <none>    157d   v1.16.10
paws-k8s-worker-7    Ready                      <none>    157d   v1.16.10

The VERSION column won't change until the kubelet on each node is upgraded.

* {done} kubectl uncordon paws-k8s-control-1
* {done} cp /etc/kubernetes/admin.conf .kube/config (as root, of course, not as a regular user)
* {done} apt-get install kubelet kubectl (put this off until the very end of the control-plane upgrade, per upstream docs... but I'm unsure of the best place for it)
* {done} systemctl restart kubelet (this should happen from the install?)

* {done} kubectl drain paws-k8s-control-2 --ignore-daemonsets
* {done} apt install kubeadm
* {done} kubeadm upgrade node
* {done} kubectl uncordon paws-k8s-control-2
* {done} cp /etc/kubernetes/admin.conf .kube/config

* {done} kubectl drain paws-k8s-control-3 --ignore-daemonsets
* {done} apt install kubeadm
* {done} kubeadm upgrade node
* {done} cp /etc/kubernetes/admin.conf .kube/config
* {done} kubectl uncordon paws-k8s-control-3

* {done} kubectl, cri-tools and kubelet upgraded on all three nodes

root@paws-k8s-control-2:~# kubectl get nodes
NAME                 STATUS   ROLES     AGE    VERSION
paws-k8s-control-1   Ready    master    187d   v1.17.13
paws-k8s-control-2   Ready    master    187d   v1.17.13
paws-k8s-control-3   Ready    master    187d   v1.17.13
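
For future reference, the per-node sequence on the secondary control nodes (control-2 and control-3 above) boils down to the following sketch; paws-k8s-control-N is a placeholder:

    # from a node with admin credentials:
    kubectl drain paws-k8s-control-N --ignore-daemonsets

    # on paws-k8s-control-N itself, as root:
    apt-get install kubeadm
    kubeadm upgrade node
    apt-get install kubelet kubectl cri-tools
    systemctl restart kubelet

    # from a node with admin credentials:
    kubectl uncordon paws-k8s-control-N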


* {done} hiera change in project:
    from
    profile::wmcs::kubeadm::kubernetes_version: 1.16.10
    to
    profile::wmcs::kubeadm::kubernetes_version: 1.17.13

== data plane ==

user@laptop: ops/puppet.git $ cat nodelist.txt
paws-k8s-ingress-1
paws-k8s-ingress-2
paws-k8s-worker-1
paws-k8s-worker-2
paws-k8s-worker-3
paws-k8s-worker-4
paws-k8s-worker-5
paws-k8s-worker-6
paws-k8s-worker-7

user@laptop: ops/puppet.git $ modules/kubeadm/files/wmcs-k8s-node-upgrade.py --control paws-k8s-control-1 --project paws --domain eqiad1.wikimedia.cloud --file nodelist.txt
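
If the helper script is ever unavailable, a rough manual equivalent of the loop it automates would look something like this (a sketch only; the script's exact behaviour may differ, and the instance FQDN pattern is assumed from the --project/--domain arguments):

    # run from a host with admin credentials and SSH access to the nodes
    for node in $(cat nodelist.txt); do
        kubectl drain "$node" --ignore-daemonsets
        ssh "${node}.paws.eqiad1.wikimedia.cloud" 'sudo apt-get install -y kubeadm &&
            sudo kubeadm upgrade node &&
            sudo apt-get install -y kubelet kubectl &&
            sudo systemctl restart kubelet'
        kubectl uncordon "$node"
    done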

Note: during this we got:

    error when evicting pod "hub-6449d4d46c-cth9k" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    evicting pod "hub-6449d4d46c-cth9k"

which loops until the eviction succeeds. The resolution is to `kubectl delete` the pod.

The same happened with the jupyterhub proxy pod, proxy-5bc459dbd4-rpbpf. The real fix for both is to increase the replica count, so that a drain can evict one replica without violating the PodDisruptionBudget.
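
A sketch of the general pattern, should this come up again, assuming the PAWS pods live in a single namespace (written as <namespace> here) and that the blocked pods belong to ordinary Deployments ("hub" as a deployment name is inferred from the pod name):

    # see which PodDisruptionBudget is blocking the eviction
    kubectl get pdb --all-namespaces
    # workaround used above: delete the pod so its Deployment reschedules it elsewhere
    kubectl delete pod hub-6449d4d46c-cth9k -n <namespace>
    # longer term: raise the replica count (in practice via the chart/deployment
    # config rather than a one-off scale) so a drain can evict one copy
    # without violating the budget
    kubectl scale deployment hub --replicas=2 -n <namespace>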