
Upgrade the Toolforge Kubernetes cluster to v1.16
Closed, ResolvedPublic

Description

Per https://github.com/kubernetes/community/blob/master/contributors/design-proposals/release/versioning.md
v1.15 is going to fall off patch support soon. We need to upgrade to v1.16 ASAP.

There will be several subtasks for this because, to do the upgrade, we need to fix up webservice, fix a mistaken use of a deprecated API in maintain-kubeusers, and likely several other things.

This will start support for IPv6, though IPv6 support will be far better in 1.18 (now released).

Event Timeline

Bstorm triaged this task as High priority.Feb 25 2020, 4:37 PM
Bstorm created this task.
bd808 added a parent task: Restricted Task.Feb 25 2020, 5:26 PM
Bstorm updated the task description. (Show Details)

This deprecation will probably catch some hand-built deployments:

  • Deployment in the extensions/v1beta1, apps/v1beta1, and apps/v1beta2 API versions is no longer served
    • Migrate to use the apps/v1 API version, available since v1.9. Existing persisted data can be retrieved/updated via the new version.

Looks like I updated the example at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes#Example_deployment.yaml in January though, so at least our docs are in reasonable shape. Worth remembering for the eventual announcement. I think I read that these are difficult to find in a live cluster because the API does forward and backward conversions, meaning existing objects are returned under both the old and new API groups for as long as both are served by the cluster.
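As a rough sketch (assuming kubectl with the cluster-admin impersonation used elsewhere in this task), one way to list every Deployment explicitly through the apps/v1 endpoint is below; note this cannot tell you which apiVersion a manifest was originally written against, precisely because of the conversion described above:

kubectl --as-group=system:masters --as=admin get deployments.v1.apps --all-namespaces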

Change 598093 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-kubeadm: kubeadm 1.16 requires docker 18.09

https://gerrit.wikimedia.org/r/598093

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T09:30:44Z] <arturo> set profile::wmcs::kubeadm::component: 'thirdparty/kubeadm-k8s-1-16' at project level for trying T246122

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T09:56:02Z] <arturo> installing kubectl/kubeadm 1.16.9 on k8s control nodes (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T09:57:24Z] <arturo> installing kubectl/kubeadm 1.16.9 on k8s worker nodes (T246122)

NOTE: kubeadm suggests we should upgrade etcd, but 3.2 is what we have for now, in both Debian and the WMF repos.
External components that should be upgraded manually before you upgrade the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT   AVAILABLE
Etcd        3.2.26    3.3.10
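For context, that table comes from kubeadm's pre-upgrade check; a sketch of how to reproduce it on a control node, run as root, would be:

sudo -i kubeadm upgrade plan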

Mentioned in SAL (#wikimedia-operations) [2020-05-26T14:44:45Z] <arturo> upgrade packages in buster-wikimedia/thirdparty/kubeadm-k8s-1-16 (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T14:54:08Z] <arturo> bump installed version of kubeadm and kubectl to 1.16.10 (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T14:54:21Z] <arturo> aborrero@toolsbeta-test-k8s-control-1:~ $ sudo -i kubeadm upgrade apply v1.16.10 (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T15:02:04Z] <arturo> first k8s upgrade failed for yet-to-be-known reasons (T246122)

Change 598093 merged by Bstorm:
[operations/puppet@production] toolforge-kubeadm: kubeadm 1.16 requires docker 18.09

https://gerrit.wikimedia.org/r/598093

I see there were PSP changes around 1.16: https://github.com/kubernetes/kubernetes/pull/77792
That isn't likely to be our issue, but it is something to be aware of.

@aborrero I think I know what is wrong in Toolsbeta. It is the same thing that I saw just now on paws. There is an error in the kubeadm config (which becomes the kubeadm configmap). The name of the extra volume needed for encryption and some other important config for the apiserver is wrong. I must have done this by mistake somewhere during that very long security eval. I made changes in place instead of rebuilding clusters, so I never saw the discrepancy.

The name of the extra volume in the various kubeadm configs is currently invalid.
From the output of kubectl get cm -n kube-system kubeadm-config -o yaml:

apiServer:
  extraArgs:
    authorization-mode: Node,RBAC
    enable-admission-plugins: PodSecurityPolicy,PodPreset,NodeRestriction,EventRateLimit
    admission-control-config-file: /etc/kubernetes/admission/admission.yaml
    encryption-provider-config: /etc/kubernetes/admission/encryption-conf.yaml
    runtime-config: settings.k8s.io/v1alpha1=true
    tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
    profiling: "false"
  extraVolumes:
    - name: "/etc/kubernetes/admission"
      hostPath: "/etc/kubernetes/admission"
      mountPath: "/etc/kubernetes/admission"
      readOnly: true
      pathType: Directory

The error is - name: "/etc/kubernetes/admission"

That should read: - name: admission-config-dir

That is the correct value from the actual live manifest at /etc/kubernetes/manifests/api-server.yaml
Because I changed the api-server manifest directly when I did this, the error would never show up until we tried to upgrade the cluster with kubeadm, at which point the configmap is in use again. I'll change the value in puppet immediately, and then I will change it in kubeadm's configmaps.

To be clear, this would prevent the api-server pod from starting after upgrade. I suspect that's exactly what caused the error you saw (partly because it is very similar to my kubeadm init error and because the pod cannot start with that value for a volume name).
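For reference, a minimal sketch of fixing the live configmap by hand (the puppet patch below handles the on-disk kubeadm config the same way):

kubectl -n kube-system edit configmap kubeadm-config
# under apiServer.extraVolumes, change
#   - name: "/etc/kubernetes/admission"
# to
#   - name: admission-config-dir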

Change 598792 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] kubeadm: fix broken definition of extra volume

https://gerrit.wikimedia.org/r/598792

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T16:17:41Z] <bstorm_> fix incorrect volume name in kubeadm-config T246122

Mentioned in SAL (#wikimedia-cloud) [2020-05-26T16:20:24Z] <bstorm_> fix incorrect volume name in kubeadm-config configmap T246122

Change 598792 merged by Bstorm:
[operations/puppet@production] kubeadm: fix broken definition of extra volume

https://gerrit.wikimedia.org/r/598792

I think this should be unblocked, and the upgrade might work on the next try. We should probably depool control plane nodes before upgrading and then repool them, per https://v1-16.docs.kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/, since that is the newer procedure (in case it fixes anything my fix didn't; the thing I fixed would have stopped the upgrade no matter what). I don't think we need to worry about fussing with haproxy during the upgrade, because the tooling should all be compatible between the two versions. The big thing we must check before the tools upgrade is that all the objects created with old definitions still work on the upgraded cluster, presuming we get the upgrade rolling.
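A sketch of that depool/repool step from the linked kubeadm upgrade docs, using one of our control nodes as the example:

kubectl drain tools-k8s-control-1 --ignore-daemonsets
# ... run the kubeadm upgrade and kubelet package upgrade on that node ...
kubectl uncordon tools-k8s-control-1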

Mentioned in SAL (#wikimedia-cloud) [2020-05-27T10:58:08Z] <arturo> running aborrero@toolsbeta-test-k8s-control-1:~ $ sudo -i kubeadm upgrade apply v1.16.10 and this time it works! (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-27T10:58:52Z] <arturo> running aborrero@toolsbeta-test-k8s-control-1:~ $ sudo apt-get install kubelet -y in the 1.16 version from the component repo (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-27T11:02:58Z] <arturo> upgraded the rest of the k8s control plane nodes to 1.16.10 (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-27T11:05:13Z] <arturo> trying modules/kubeadm/files/wmcs-k8s-node-upgrade.py --control toolsbeta-test-k8s-control-1 --project toolsbeta --domain eqiad.wmflabs --src-version 1.15 --dst-version 1.16.10 -n toolsbeta-test-k8s-worker-1 -n toolsbeta-test-k8s-worker-2 -n toolsbeta-test-k8s-worker-3 (T246122)

Change 599003 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] kubeadm: fix some inconsistencies in the worker upgrade script

https://gerrit.wikimedia.org/r/599003

Mentioned in SAL (#wikimedia-cloud) [2020-05-27T12:02:37Z] <arturo> the k8s cluster is now running v1.16.10 (T246122)

Change 599003 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] kubeadm: fix some inconsistencies in the worker upgrade script

https://gerrit.wikimedia.org/r/599003

I'd forgotten to check for deprecated objects by the end of the day yesterday, but I checked this morning in Toolsbeta...and there may not be any there. As I recall I already replaced all the PSPs in tools and toolsbeta, and the deployments there have been replaced.

We should be ok, but if anyone's deployment stops working, webservice stop/start will replace it.
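For tool maintainers, that recovery is roughly the following (a sketch; the tool name is a placeholder, and any type or image arguments depend on the tool):

become <toolname>
webservice stop
webservice start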

Mentioned in SAL (#wikimedia-cloud) [2020-05-28T15:09:48Z] <arturo> upgrading tools-k8s-control-1 to 1.16.10 (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-28T15:17:39Z] <arturo> upgrading tools-k8s-control-2 to 1.16.10 (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-28T15:41:15Z] <arturo> upgrading tools-k8s-control-3 to 1.16.10 (T246122)

It's looking good after a short problem:

bstorm@tools-sgebastion-08:~$ kubectl --as-group=system:masters --as=admin get nodes
NAME                  STATUS   ROLES    AGE    VERSION
tools-k8s-control-1   Ready    master   204d   v1.16.10
tools-k8s-control-2   Ready    master   203d   v1.16.10
tools-k8s-control-3   Ready    master   203d   v1.16.10
tools-k8s-worker-1    Ready    <none>   203d   v1.15.6
tools-k8s-worker-10   Ready    <none>   142d   v1.15.6

Mentioned in SAL (#wikimedia-cloud) [2020-05-28T15:58:39Z] <arturo> upgrading tools-k8s-worker-[1..10] to 1.16.10 (T246122)

Mentioned in SAL (#wikimedia-cloud) [2020-05-28T16:01:28Z] <bstorm_> kubectl upgraded to 1.16.10 on all bastions T246122

We discovered that there is a bug in kubeadm < 1.17 that sets renew-certs to false on node upgrades. The control plane certs rotated fine, but the kubelet certs of the worker nodes did not: https://github.com/kubernetes/kubeadm/issues/1818. This is also referenced in the docs at https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/

Worker nodes 1-6 will need manual updates to their certs if we don't upgrade again before those certs expire.
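A sketch of how to spot-check a worker's kubelet client cert expiry (assuming the kubelet keeps its rotated client cert at the usual kubeadm path; on nodes that never rotated, the cert may instead be embedded in /etc/kubernetes/kubelet.conf):

sudo openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem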

Mentioned in SAL (#wikimedia-cloud) [2020-05-28T17:54:22Z] <bstorm_> upgraded tools-k8s-worker-[11..15] and starting on -21-29 now T246122

Looking deeper into things, I think kubeadm is confusingly documented (we knew that). In order to renew the client cert for the kubelet, we can simply set the kubelets to do it for us with a feature gate. The settings are here: https://kubernetes.io/docs/tasks/tls/certificate-rotation/
This is distinct from *serving certificate rotation*, which we deliberately avoided. I'll make another task and a patch to add the args to our kubelets.

I did confirm our control plane certs look right.

Never mind! The blasted config is the default on this version: RotateKubeletClientCertificate=true|false (BETA - default=true) from https://v1-16.docs.kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

We should watch the cert behavior when the test cluster in toolsbeta is due to renew its certs. If that fails, then we'll explicitly add the options, which are all marked as deprecated. K8s docs are fun, right? I believe this is why I didn't enable them during the design phase. It's hard to remember all these details.
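If we ever do have to set it explicitly, a sketch of the non-deprecated route would be the rotateCertificates field in the kubelet config file rather than the command-line flag (paths assume kubeadm defaults):

# add to /var/lib/kubelet/config.yaml (KubeletConfiguration):
#   rotateCertificates: true
sudo systemctl restart kubelet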

NOTE: I copied the admin.conf to .kube/config for the root account on each control plane node because I realized our upgrade renewed that cert :)
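Roughly, on each control plane node, that copy looks like (a sketch; paths are the kubeadm defaults):

sudo cp /etc/kubernetes/admin.conf /root/.kube/config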

Mentioned in SAL (#wikimedia-cloud) [2020-05-28T21:06:34Z] <bstorm_> upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 T246122

Change 599472 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge-k8s: proposing removing hostkey checking for the upgrades

https://gerrit.wikimedia.org/r/599472

Change 599472 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge-k8s: proposing removing hostkey checking for the upgrades

https://gerrit.wikimedia.org/r/599472

Bstorm claimed this task.
Bstorm updated the task description. (Show Details)

I think we are done with this one!