
Deploy upgraded Kubernetes to toolsbeta
Closed, Resolved (Public)

Description

This is the epic for the first step of putting this up in beta before it goes live.

Details

Related Gerrit patches (repo, branch, lines +/-; patch subjects not captured in this view): several dozen changes to operations/puppet (production), plus changes on master to operations/software/tools-webservice, cloud/toolforge/ingress-admission-controller, operations/docker-images/toollabs-images, and labs/tools/maintain-kubeusers.

Related Objects

Subtasks (status and assignee; task titles not captured in this view): most are Resolved, assigned to Bstorm, aborrero, bd808, and dduvall; two were Declined, one is Open, and one is Stalled.

Event Timeline

There are a very large number of changes, so older changes are hidden.

We just tested the lifecycle again, and it seems to work:

root@toolsbeta-test-k8s-master-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...]
root@toolsbeta-test-k8s-master-1:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config
root@toolsbeta-test-k8s-master-1:~# kubectl apply -f /etc/kubernetes/calico.yaml
[...]

For other control plane nodes:

root@toolsbeta-test-k8s-master-1:~# kubeadm --config /etc/kubernetes/kubeadm-init.yaml init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
0e323a45a4212c78994e30f8f3b9a6f77a1b475e696e12e7bf5f7cbd72ea5871
root@toolsbeta-test-k8s-master-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
3637ded9d0ac4e45952214e43b3107055d090ea0c13a176c4607f907662034f1

root@toolsbeta-test-k8s-master-2:~# kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output> --experimental-control-plane --certificate-key <upload_certs_output>
[...]

For worker nodes:

aborrero@toolsbeta-test-k8s-worker-1:~ $ sudo kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output>

Note that:

  • deleting a node (e.g., when its VM is deleted) requires kubectl delete node <nodename>; adding a node requires the steps outlined above (see the sketch below)
  • we use puppet certs for the etcd client connection
  • we enforce client certs on the etcd server side
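
As a minimal sketch of that remove/re-add cycle for a worker (illustrative only; in practice a fresh bootstrap token would be generated rather than reusing the one above):

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# kubectl drain toolsbeta-test-k8s-worker-1 --ignore-daemonsets --delete-local-data
root@toolsbeta-test-k8s-master-1:~# kubectl delete node toolsbeta-test-k8s-worker-1
root@toolsbeta-test-k8s-master-1:~# kubeadm token create --print-join-command
kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token <new_token> --discovery-token-ca-cert-hash sha256:<openssl_output>
```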

I went ahead and tried this:

root@toolsbeta-test-k8s-master-1:~# kubeadm upgrade plan
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.0
[upgrade/versions] Latest stable version: v1.15.0
[upgrade/versions] Latest version in the v1.15 series: v1.15.0

Awesome, you're up-to-date! Enjoy!

So basically, we are still at the latest. The docs say kubeadm can be used to downgrade, but they provide no guidance and the tooling seems...not so good for that. If we want to test upgrading for whatever reason (which seems like a much more straightforward process than most of what we've done), we'd need to deploy a cluster with v1.14.4 and then upgrade it to v1.15.0. Kubeadm upgrade behaves differently in the 1.15 series, though: it refreshes all node certs as it upgrades, so that test would not necessarily predict how future upgrades will behave. I suspect we may be better off trying out an upgrade in beta when a new release happens (1.15.1).
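
For the record, rehearsing that would look roughly like the following. This is a sketch only: it assumes the cluster version is pinned via kubernetesVersion in our kubeadm-init.yaml (which appears to be the case), and the value shown is illustrative.

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# grep kubernetesVersion /etc/kubernetes/kubeadm-init.yaml
kubernetesVersion: v1.14.4
root@toolsbeta-test-k8s-master-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...]
root@toolsbeta-test-k8s-master-1:~# kubeadm upgrade plan
root@toolsbeta-test-k8s-master-1:~# kubeadm upgrade apply v1.15.0
```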

I say that partly because we have a lot of work to do to get this "toolforge ready" now that we've got a handle on a process for kubeadm itself.

Mentioned in SAL (#wikimedia-cloud) [2019-07-17T09:13:42Z] <arturo> create VM toolsbeta-test-k8s-master-4 (Debian Buster) T215531

Mentioned in SAL (#wikimedia-cloud) [2019-07-17T09:51:30Z] <arturo> re-create VM toolsbeta-test-k8s-worker-1 as Debian Buster T215531

Change 524281 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: include the kubeadm_docker_service

https://gerrit.wikimedia.org/r/524281

Change 524281 merged by Bstorm:
[operations/puppet@production] toolforge: include the kubeadm_docker_service

https://gerrit.wikimedia.org/r/524281

Ok, the cluster is now using PSP on init, and it works fine. I have no idea what caused our problem before, but a clean rebuild works great.

Since this works perfectly now (for whatever reason -- I have theories that don't ultimately matter much), the final form of the build process looks like this:


root@toolsbeta-test-k8s-master-1:~# kubeadm init --config /etc/kubernetes/kubeadm-init.yaml --upload-certs
[...]
root@toolsbeta-test-k8s-master-1:~# cp /etc/kubernetes/admin.conf $HOME/.kube/config

Right here, before calico, you need to run:

kubectl apply -f /etc/kubernetes/kubeadm-system-psp.yaml

That will bring the admin pods online and allow calico to spin up as well. No pods outside kube-system will be permitted until we add another manifest to handle the Toolforge pods; that's the topic of T227290, though.
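
An optional sanity check at this point (output omitted) is to confirm the policy exists and that the kube-system control plane pods are running:

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# kubectl get podsecuritypolicies
root@toolsbeta-test-k8s-master-1:~# kubectl get pods -n kube-system
```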

root@toolsbeta-test-k8s-master-1:~# kubectl apply -f /etc/kubernetes/calico.yaml
[...]

For other control plane nodes:
```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# kubeadm --config /etc/kubernetes/kubeadm-init.yaml init phase upload-certs --upload-certs
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
0e323a45a4212c78994e30f8f3b9a6f77a1b475e696e12e7bf5f7cbd72ea5871
root@toolsbeta-test-k8s-master-1:~# openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
3637ded9d0ac4e45952214e43b3107055d090ea0c13a176c4607f907662034f1

root@toolsbeta-test-k8s-master-2:~# kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output> --experimental-control-plane --certificate-key <upload_certs_output>
[...]
```

For worker nodes:

aborrero@toolsbeta-test-k8s-worker-1:~ $ sudo kubeadm join toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 --token m7uakr.ern5lmlpv7gnkacw --discovery-token-ca-cert-hash sha256:<openssl_output>

Note that:

  • deleting a node (e.g., when its VM is deleted) requires kubectl delete node <nodename>; adding a node requires the steps outlined above.
  • we use puppet certs for the etcd client connection
  • we enforce client certs on the etcd server side

Huge progress :)

Change 524310 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: remove class redeclaration

https://gerrit.wikimedia.org/r/524310

To explain this patch and the one where I changed the docker service class:
The docker service class was being left out of the master role since it was easy to forget. I made it an include at the module level (to make the module functional and internally consistent) instead of declaring it in class context in the profile. Separating things out like that is how we manage roles to keep them flexible (which I get), but doing it at the module level means modules require unusual quirks and insider knowledge just to work. Modules are usually developed with a primary init.pp gateway that accepts all options, with most everything else configured through that interface. I'm fine not using the init pattern in our modules, but I'd rather not make things more confusing by splitting them out too much either.

I'm open to discussion, but I am changing the node profile so that it works (my change broke the node profile, but not the master one, because the class had been forgotten there). That's just so it isn't left in a broken state because of how I changed it. I caught the missing material because of warnings during the init preflight phase about the docker config being missing -- so you don't think I'm just being picky or weird about it, @aborrero :)

Change 524310 merged by Bstorm:
[operations/puppet@production] toolforge: remove class redeclaration

https://gerrit.wikimedia.org/r/524310

Change 525112 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: k8s: kubadm: calico requires ipset

https://gerrit.wikimedia.org/r/525112

Change 525112 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: k8s: kubadm: calico requires ipset

https://gerrit.wikimedia.org/r/525112

Change 525339 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: set kubeadm to use internal registry for pause container

https://gerrit.wikimedia.org/r/525339

Mentioned in SAL (#wikimedia-cloud) [2019-07-24T20:48:19Z] <bstorm_> rebuilt toolsbeta-test cluster with the internal version of the pause container T228887 T215531

Change 525339 merged by Bstorm:
[operations/puppet@production] toolforge: set kubeadm to use internal registry for pause container

https://gerrit.wikimedia.org/r/525339

Change 525434 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: add internal pause container to all the other kubelets

https://gerrit.wikimedia.org/r/525434

Change 525434 merged by Bstorm:
[operations/puppet@production] toolforge: add internal pause container to all the other kubelets

https://gerrit.wikimedia.org/r/525434

Change 525436 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: fix typo kubelet file content

https://gerrit.wikimedia.org/r/525436

In the end this works; however, only the init config (and presumably a join config file) accepts the new pause container gracefully. The other control plane nodes, which cannot use a config, require the flag to be appended to the end of the kubelet argument mess. Luckily, later options override earlier ones, so as soon as the node reboots (or docker and kubelet restart) it works, even with two conflicting CLI args on the kubelet command line. This works, though, and it is consistent. The only design change we might make in the future is to use a join config for non-control-plane nodes.
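
Concretely, the appended flag on those nodes looks roughly like this (a sketch: the exact file and image path here are assumptions, not copied from a host):

```lang=shell-session
root@toolsbeta-test-k8s-worker-1:~# grep pod-infra-container-image /etc/default/kubelet
KUBELET_EXTRA_ARGS="--pod-infra-container-image=docker-registry.tools.wmflabs.org/pause:3.1"
```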

Change 525436 merged by Bstorm:
[operations/puppet@production] toolforge: fix typo kubelet file content

https://gerrit.wikimedia.org/r/525436

Ok, great news, we can try a kubeadm upgrade now.

# kubeadm upgrade plan
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.0
[upgrade/versions] Latest stable version: v1.15.1
[upgrade/versions] Latest version in the v1.15 series: v1.15.1

External components that should be upgraded manually before you upgrade the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT   AVAILABLE
Etcd        3.2.26    3.3.10

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       AVAILABLE
Kubelet     5 x v1.15.0   v1.15.1

Upgrade to the latest version in the v1.15 series:

COMPONENT            CURRENT   AVAILABLE
API Server           v1.15.0   v1.15.1
Controller Manager   v1.15.0   v1.15.1
Scheduler            v1.15.0   v1.15.1
Kube Proxy           v1.15.0   v1.15.1
CoreDNS              1.3.1     1.3.1

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply v1.15.1

Note: Before you can perform this upgrade, you have to update kubeadm to v1.15.1.

_____________________________________________________________________

We should not be required to upgrade etcd, but it will probably tell us about that any time we do this. Since this is a great testing opportunity, I'm running it.

Interestingly (but not surprisingly), it asks that we first upgrade kubeadm.

# kubeadm upgrade apply v1.15.1
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/version] You have chosen to change the cluster version to "v1.15.1"
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.0
[upgrade/version] FATAL: the --version argument is invalid due to these errors:

        - Specified version to upgrade to "v1.15.1" is higher than the kubeadm version "v1.15.0". Upgrade kubeadm first using the tool you used to install kubeadm

Can be bypassed if you pass the --force flag

To test the upgrade, we'll first have to pull the newer kubeadm package from upstream into our repo (though it might work with --force). As is, kubeadm init will still install Kubernetes 1.15.0 because of our config, even if we update kubeadm.
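
(For completeness, the bypass mentioned in the preflight error would be the following, though updating the package is the cleaner path:)

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# kubeadm upgrade apply v1.15.1 --force
```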

@aborrero if you are bored with fighting with the ingress for a bit and want to test this, we just have to update our repo from upstream...however that is done :) I presume that isn't terribly hard? It's not a requirement for this whole thing, but it would be very good to know how "bad" upgrades will be.

Mentioned in SAL (#wikimedia-operations) [2019-07-25T11:03:19Z] <arturo> update stretch-wikimedia/thirdparty/kubeadm-k8s on install1002 for T215531 (kubeadm 1.15.1)

@Bstorm here you go:

aborrero@toolsbeta-test-k8s-master-1:~$ apt-cache policy kubeadm
kubeadm:
  Installed: 1.15.0-00
  Candidate: 1.15.1-00
  Version table:
     1.15.1-00 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/kubeadm-k8s amd64 Packages
 *** 1.15.0-00 100
        100 /var/lib/dpkg/status

Just recording the process as I go here:

# apt install kubeadm
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following packages were automatically installed and are no longer required:
  libopts25 libpcsclite1 python3-debconf
Use 'apt autoremove' to remove them.
The following additional packages will be installed:
  cri-tools
The following packages will be upgraded:
  cri-tools kubeadm
2 upgraded, 0 newly installed, 0 to remove and 7 not upgraded.
Need to get 17.0 MB of archives.
After this operation, 2,250 kB disk space will be freed.
Do you want to continue? [Y/n] 
Get:1 http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/kubeadm-k8s amd64 cri-tools amd64 1.13.0-00 [8,776 kB]
Get:2 http://apt.wikimedia.org/wikimedia stretch-wikimedia/thirdparty/kubeadm-k8s amd64 kubeadm amd64 1.15.1-00 [8,247 kB]
Fetched 17.0 MB in 1s (32.4 MB/s)
(Reading database ... 57148 files and directories currently installed.)
Preparing to unpack .../cri-tools_1.13.0-00_amd64.deb ...
Unpacking cri-tools (1.13.0-00) over (1.12.0-00) ...
Preparing to unpack .../kubeadm_1.15.1-00_amd64.deb ...
Unpacking kubeadm (1.15.1-00) over (1.15.0-00) ...
Setting up cri-tools (1.13.0-00) ...
Setting up kubeadm (1.15.1-00) ...
# kubeadm upgrade apply v1.15.1
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/version] You have chosen to change the cluster version to "v1.15.1"
[upgrade/versions] Cluster version: v1.15.0
[upgrade/versions] kubeadm version: v1.15.1
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]:

And I confirmed:

[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/prepull] Prepulling image for component kube-scheduler.
[upgrade/prepull] Prepulling image for component kube-apiserver.
[upgrade/prepull] Prepulling image for component kube-controller-manager.
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-apiserver
[apiclient] Found 0 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-controller-manager
[apiclient] Found 3 Pods for label selector k8s-app=upgrade-prepull-kube-scheduler
[upgrade/prepull] Prepulled image for component kube-scheduler.
[upgrade/prepull] Prepulled image for component kube-controller-manager.
[upgrade/prepull] Prepulled image for component kube-apiserver.
[upgrade/prepull] Successfully prepulled the images for all the control plane components
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.15.1"...
Static pod: kube-apiserver-toolsbeta-test-k8s-master-1 hash: e7a689bf231e30af59efcb56690b440d
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-1 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-scheduler-toolsbeta-test-k8s-master-1 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests422342376"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-14-47-08/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-toolsbeta-test-k8s-master-1 hash: e7a689bf231e30af59efcb56690b440d
Static pod: kube-apiserver-toolsbeta-test-k8s-master-1 hash: 81e3015017da0b319ec4e8fce4116aae
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Renewing controller-manager.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-14-47-08/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-1 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-1 hash: 645e7a8519364c082c136bba3c26849b
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Renewing scheduler.conf certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-14-47-08/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-scheduler-toolsbeta-test-k8s-master-1 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
Static pod: kube-scheduler-toolsbeta-test-k8s-master-1 hash: ecae9d12d3610192347be3d1aa5aa552
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upload-config] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[kubelet] Creating a ConfigMap "kubelet-config-1.15" in namespace kube-system with the configuration for the kubelets in the cluster
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[bootstrap-token] configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstrap-token] configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstrap-token] configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy

[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.15.1". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.
root@toolsbeta-test-k8s-master-1:~#

After that, the kubelets are, of course, not yet upgraded:

# kubectl get nodes -o wide
NAME                          STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
toolsbeta-test-k8s-master-1   Ready    master   18h   v1.15.0   172.16.2.223   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-master-2   Ready    master   18h   v1.15.0   172.16.2.225   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-master-3   Ready    master   17h   v1.15.0   172.16.2.233   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-worker-1   Ready    <none>   18h   v1.15.0   172.16.2.227   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7
toolsbeta-test-k8s-worker-2   Ready    <none>   18h   v1.15.0   172.16.2.231   <none>        Debian GNU/Linux 10 (buster)   4.19.0-5-amd64   docker://18.9.7

And the effect of it:

root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

So we can see that it only updated the control plane node that it ran on.

For HA clusters it is necessary (and documented) to go to each of the other control plane nodes directly and run the following:

root@toolsbeta-test-k8s-master-2:~# kubeadm upgrade node 
[upgrade] Reading configuration from the cluster...
[upgrade] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Upgrading your Static Pod-hosted control plane instance to version "v1.15.1"...
Static pod: kube-apiserver-toolsbeta-test-k8s-master-2 hash: 7c5b672d7da21ab872a88c8feec039ea
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-2 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-scheduler-toolsbeta-test-k8s-master-2 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests191401975"
[upgrade/staticpods] Preparing for "kube-apiserver" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-15-07-48/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-toolsbeta-test-k8s-master-2 hash: 7c5b672d7da21ab872a88c8feec039ea
Static pod: kube-apiserver-toolsbeta-test-k8s-master-2 hash: 17c3be5ae16d141c9a5708dfc1a87b8e
[apiclient] Found 3 Pods for label selector component=kube-apiserver
[upgrade/staticpods] Component "kube-apiserver" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-controller-manager" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-controller-manager.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-15-07-48/kube-controller-manager.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-2 hash: 389fff2e2e6c803f828653a4f18c838f
Static pod: kube-controller-manager-toolsbeta-test-k8s-master-2 hash: 645e7a8519364c082c136bba3c26849b
[apiclient] Found 3 Pods for label selector component=kube-controller-manager
[upgrade/staticpods] Component "kube-controller-manager" upgraded successfully!
[upgrade/staticpods] Preparing for "kube-scheduler" upgrade
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-scheduler.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2019-07-25-15-07-48/kube-scheduler.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-scheduler-toolsbeta-test-k8s-master-2 hash: 31d9ee8b7fb12e797dc981a8686f6b2b
Static pod: kube-scheduler-toolsbeta-test-k8s-master-2 hash: ecae9d12d3610192347be3d1aa5aa552
[apiclient] Found 3 Pods for label selector component=kube-scheduler
[upgrade/staticpods] Component "kube-scheduler" upgraded successfully!
[upgrade] The control plane instance for this node was successfully updated!
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[upgrade] The configuration for this node was successfully updated!
[upgrade] Now you should go ahead and upgrade the kubelet package using your package manager.

Note that you no longer need to specify "control-plane" or "experimental-control-plane": in 1.15+ that is a default phase of the command, and if control plane pods are present it upgrades them.

Now upgrading the packages in general on the control plane nodes, one at a time. This brings up an interesting point: we should pin or hold the packages at a particular version until we are ready to upgrade, possibly keying off the value in our kubeadm config to set those things.
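
A low-effort sketch of the "hold" option, until something nicer is wired into puppet:

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# apt-mark hold kubeadm kubectl kubelet
kubeadm set on hold.
kubectl set on hold.
kubelet set on hold.
root@toolsbeta-test-k8s-master-1:~# apt-mark unhold kubeadm kubectl kubelet   # when we actually want to upgrade
```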

If the specific packages that are in our repo are manually controlled, perhaps there's no need to mess with it in puppet/apt, though 😁

Change 525569 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: Update the version string to match our software

https://gerrit.wikimedia.org/r/525569

root@toolsbeta-test-k8s-master-1:~# kubectl get nodes
NAME                          STATUS   ROLES    AGE   VERSION
toolsbeta-test-k8s-master-1   Ready    master   19h   v1.15.1
toolsbeta-test-k8s-master-2   Ready    master   19h   v1.15.1
toolsbeta-test-k8s-master-3   Ready    master   18h   v1.15.1
toolsbeta-test-k8s-worker-1   Ready    <none>   18h   v1.15.0
toolsbeta-test-k8s-worker-2   Ready    <none>   18h   v1.15.0
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
root@toolsbeta-test-k8s-master-1:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:09:21Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

After that, it's just kubelet upgrades on the worker nodes, which should be done with drains to minimize disruption. Overall, that makes for a procedure we can document. Naturally, the process for upgrading between major versions is more involved, but the upgrade steps in the official docs are remarkably similar to this procedure, which is good to see.
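
A rough per-worker sequence for that (a sketch; the package versions come from whatever is in our repo at the time):

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# kubectl drain toolsbeta-test-k8s-worker-1 --ignore-daemonsets
root@toolsbeta-test-k8s-worker-1:~# apt install kubelet kubectl
root@toolsbeta-test-k8s-worker-1:~# systemctl restart kubelet
root@toolsbeta-test-k8s-master-1:~# kubectl uncordon toolsbeta-test-k8s-worker-1
```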

Change 525569 merged by Bstorm:
[operations/puppet@production] toolforge: Update the version string to match our software

https://gerrit.wikimedia.org/r/525569

Funny thing: a lot of what is fixed in 1.15.1 is exactly the stuff that annoyed us about etcd and kubeadm for an HA stacked control plane: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.15.md#changelog-since-v1150

One notable thing about the upgrade process as well: it rotates the certificates so they don't expire. Renewing all the certs is an often-cited pain point; if we keep up with upgrades, we honestly will never have to worry about it.
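
If we ever want to double-check that between upgrades, kubeadm 1.15 has a subcommand for it (output omitted):

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# kubeadm alpha certs check-expiration
```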

If the specific packages that are in our repo are manually controlled, perhaps there's no need to mess with it in puppet/apt, though 😁

This approach may not work well once we have N clusters (toolsbeta, tools, anything in codfw that we might add for additional testing) and want to practice an upgrade on cluster A without needing to freeze apt upgrades or capacity expansion in clusters B through N. As long as we are using an apt repo that supports multiple versions of the same package (I think aptly has this restriction?), pinning or explicit versioning in the Puppet manifests should let us run version n+1 in a test cluster without breaking the use of version n in the other clusters.

This is true. We are using reprepro, not aptly, for packages, and I have no idea if it can support multiple versions of a package. The Kubernetes API version will not upgrade until told to via kubeadm, but the kubelet must be upgraded by hand (which is what the pinning affects -- the updates are not done by puppet, though a new node build would pick up package changes). As is, we have the version as a configurable field that can be set via hiera for kubeadm init. After init, it makes no difference unless we also use it to manage the package versions (the kubelet version isn't managed by kubeadm).

Overall, it boils down to the question: is it possible to have multiple versions in reprepro or not?

Overall, it boils down to the question: is it possible to have multiple versions in reprepro or not?

Yes! it is possible :-)

We have several ways of doing it, but the easiest, I would say, is to just create versioned repo components.

Currently we have:

  • stretch-wikimedia/thirdparty/kubeadm-k8s

We could move to:

  • stretch-wikimedia/thirdparty/kubeadm-k8s-1.15
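
On the client side, that would translate to a sources entry roughly like this (the file name is illustrative):

```lang=shell-session
root@toolsbeta-test-k8s-master-1:~# cat /etc/apt/sources.list.d/kubeadm-k8s.list
deb http://apt.wikimedia.org/wikimedia stretch-wikimedia thirdparty/kubeadm-k8s-1.15
```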

Anyway I suggest we create another task to discuss the details.

Change 519375 abandoned by Arturo Borrero Gonzalez:
k8s: kubelet: stop requiring ::k8s::infrastructure_config

Reason:
Not following this approach anymore.

https://gerrit.wikimedia.org/r/519375

Change 543815 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: refresh puppet code for the new k8s

https://gerrit.wikimedia.org/r/543815

Change 543815 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: refresh puppet code for the new k8s

https://gerrit.wikimedia.org/r/543815

Mentioned in SAL (#wikimedia-cloud) [2019-10-25T23:41:32Z] <bstorm_> Deployed custom webhook controllers for registry and ingress checking to toolsbeta-test kubernetes cluster T215531 T215678 T234231

Change 547668 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[labs/tools/maintain-kubeusers@master] deploy: prepare for deployment in toolsbeta

https://gerrit.wikimedia.org/r/547668

Change 547668 merged by Bstorm:
[labs/tools/maintain-kubeusers@master] deploy: prepare for deployment in toolsbeta

https://gerrit.wikimedia.org/r/547668

Mentioned in SAL (#wikimedia-cloud) [2019-11-05T22:50:56Z] <bstorm_> deployed the new maintain-kubeusers to toolsbeta T215531 T228499

Change 549108 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: Distribute the roles for toolforge users

https://gerrit.wikimedia.org/r/549108

Change 549108 merged by Bstorm:
[operations/puppet@production] toolforge: Distribute the roles for toolforge users

https://gerrit.wikimedia.org/r/549108

Redeployed maintain-kubeusers in toolsbeta:

root@toolsbeta-test-k8s-control-1:/home/bstorm/maintain-kubeusers# kubectl logs maintain-kubeusers-7b6bb8f79d-xc9qb -n maintain-kubeusers
starting a run
Homedir already exists for /data/project/toolschecker
Wrote config in /data/project/toolschecker/.kube/config
PodSecurityPolicy tool-toolschecker-psp already exists
Provisioned creds for tool toolschecker
Homedir already exists for /data/project/admin
Wrote config in /data/project/admin/.kube/config
PodSecurityPolicy tool-admin-psp already exists
Provisioned creds for tool admin
Homedir already exists for /data/project/test2
Wrote config in /data/project/test2/.kube/config
PodSecurityPolicy tool-test2-psp already exists
Provisioned creds for tool test2
Homedir already exists for /data/project/test
Wrote config in /data/project/test/.kube/config
PodSecurityPolicy tool-test-psp already exists
Provisioned creds for tool test
finished run, wrote 4 new accounts

Now, we have tools to test with!

It created working configs so far. Will try migrating a tool today in toolsbeta.

Change 549201 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/docker-images/toollabs-images@master] jessie fixes: port the fix from the base image to the jessie-sssd one

https://gerrit.wikimedia.org/r/549201

Change 549201 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] jessie fixes: port the fix from the base image to the jessie-sssd one

https://gerrit.wikimedia.org/r/549201

Mentioned in SAL (#wikimedia-cloud) [2019-11-06T21:33:29Z] <bstorm_> docker images needed for kubernetes cluster upgrade deployed T215531

Mentioned in SAL (#wikimedia-cloud) [2019-11-06T22:39:00Z] <bstorm_> upgraded repo version of toollabs-webservice in toolsbeta-stretch to 0.49 -- changes for the new k8s cluster T215531

Change 549613 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/software/tools-webservice@master] new k8s: Fix ingress object and enable toolsbeta ingress creation

https://gerrit.wikimedia.org/r/549613

Change 549616 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[cloud/toolforge/ingress-admission-controller@master] toolsbeta: allow the host toolsbeta.wmflabs.org

https://gerrit.wikimedia.org/r/549616

Change 549616 merged by Bstorm:
[cloud/toolforge/ingress-admission-controller@master] toolsbeta: allow the host toolsbeta.wmflabs.org

https://gerrit.wikimedia.org/r/549616

Mentioned in SAL (#wikimedia-cloud) [2019-11-07T21:55:15Z] <bstorm_> killed pods for ingress admission controller to upgrade to new image T215531

Change 549613 merged by Bstorm:
[operations/software/tools-webservice@master] new k8s: Fix ingress object and enable toolsbeta ingress creation

https://gerrit.wikimedia.org/r/549613

Change 549921 had a related patch set uploaded (by Phamhi; owner: Hieu Pham):
[operations/puppet@production] toolforge: Rename to toolforge-tool-role.yaml due to typo

https://gerrit.wikimedia.org/r/549921

Change 549921 merged by Phamhi:
[operations/puppet@production] toolforge: Rename to toolforge-tool-role.yaml due to typo

https://gerrit.wikimedia.org/r/549921

@aborrero I have noticed a strange behavior in the new proxy in toolsbeta. If I spin up new tools on the old cluster, they are sometimes unreachable over the flannel IP until I reboot the proxy server (!?!). Restarting flannel did not help, only reboot. I also saw it return when I took a service on the new cluster and put it back on the old cluster.

I'm not sure what I'm seeing -- whether there's some kind of caching going on or what. I did notice that I could see a webservice I had stopped as though it were still running against the ingress on the new cluster. I'm not really sure what was happening, but it is terribly weird, especially since *some* services (the admin tool) were still reachable over flannel on the old cluster, but new ones were not.

It makes me a little concerned about how things will act on deploy in tools if we cannot explain what is happening. I'll try to help troubleshoot this while I am on travel.

I may have found a reason for that behavior. I had stopped kube-proxy on the toolsbeta proxy because it was malfunctioning, but that would also stop it from updating the nat table in iptables. THAT would confuse the service lookup mechanism, which appears to be used by dynamicproxy. I have some opinions on that, but either way, it seems like getting that working might make the problem go away.

(Been kicking around k8s networking a lot here at KubeCon, and it made me realize I should check that)
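
A quick way to check that theory would be something like the following (the proxy hostname and service unit name here are assumptions; kube-proxy is what maintains the KUBE-SERVICES chain in the nat table):

```lang=shell-session
root@toolsbeta-proxy-1:~# systemctl status kube-proxy
root@toolsbeta-proxy-1:~# iptables -t nat -L KUBE-SERVICES -n | head
```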

@aborrero That did it! It works. That problem is not a problem and this process works to migrate to the new cluster (and back): https://wikitech.wikimedia.org/wiki/User:Bstorm/New_k8s_migration 💥

Proof:

toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice stop
Stopping webservice
toolsbeta.test@toolsbeta-sgebastion-04:~$ kubectl config use-context toolforge
switched to context "toolforge".
toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice --backend kubernetes python3.7 start
Starting webservice......
toolsbeta.test@toolsbeta-sgebastion-04:~$ /usr/bin/kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
test-85d69fb4f9-jxq68   1/1     Running   0          16s
toolsbeta.test@toolsbeta-sgebastion-04:~$ curl http://toolsbeta.wmflabs.org/test/
Hello World, from Toolsbeta!
toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice stop
Stopping webservice
toolsbeta.test@toolsbeta-sgebastion-04:~$ kubectl config use-context default
switched to context "default".
toolsbeta.test@toolsbeta-sgebastion-04:~$ webservice --backend kubernetes python3.7 start
# ** warning trimmed **
Starting webservice.....
toolsbeta.test@toolsbeta-sgebastion-04:~$ kubectl get pods
NAME                   READY     STATUS    RESTARTS   AGE
test-603267139-14lo9   1/1       Running   0          19s
toolsbeta.test@toolsbeta-sgebastion-04:~$ curl http://toolsbeta.wmflabs.org/test/
Hello World, from Toolsbeta!

It flows back and forth quite seamlessly and quickly (at the tiny scale of toolsbeta).

I think the upgraded k8s cluster in toolsbeta has been up and running stably for some time now. Resolving this task in the hope that we can better focus on the several subtasks we have prior to the final operations in the tools project.