
Upgrade kubernetes clusters to a security supported (LTS) version
Open · Medium · Public · 53 Estimated Story Points

Description

We are currently at 1.12.9. This is no longer security supported as a release.

As of Kubernetes 1.19, bugfix support via patch releases for a Kubernetes minor release has increased from 9 months to 1 year.

But:

  • We are not able to go to 1.19 because calico 3.16 (the current version) only supports up to 1.18
  • We are not able to go > 1.16 because helm2 only supports up to 1.16
  • We can't upgrade to helm3 yet because it requires k8s >= 1.13
  • We can't stay on / upgrade to < 1.16 because calico 3.16 needs at least k8s 1.16

So our best bet currently is to update to k8s 1.16, which gives us ingress and CRD support. From there we need to migrate to helm3, and afterwards we are able to continue to k8s > 1.16 (whatever makes sense then).
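The constraint chain above can be sketched as a small check. This is purely illustrative (the helper is hypothetical, not a tool we run); the version bounds are copied from the bullet list:

```python
# Sketch of the constraint chain above. All version bounds are copied from
# the bullet list; the helper itself is hypothetical, for illustration only.

def admissible_k8s_targets():
    candidates = ["1.13", "1.14", "1.15", "1.16", "1.17", "1.18", "1.19"]

    def minor(v):
        return int(v.split(".")[1])

    ok = []
    for v in candidates:
        if minor(v) > 16:   # helm2 only supports up to k8s 1.16
            continue
        if minor(v) < 16:   # calico 3.16 needs at least k8s 1.16
            continue
        if minor(v) < 13:   # helm3 (the follow-up step) needs k8s >= 1.13
            continue
        ok.append(v)
    return ok

print(admissible_k8s_targets())  # -> ['1.16']
```

Every constraint except "k8s 1.16 exactly" rules something out, which is why 1.16 is the only admissible intermediate step.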

For the actual upgrade of clusters we have:

K8s upgrades

This is the reinitialize-k8s-cluster plan (i.e. don't really update stuff in place).

  • Add 1.19 to CI kubeyaml (T266032)
  • Build k8s 1.16 (T266766)
  • Read a lot of changelog
  • Set up the kubernetes codfw staging cluster with stretch (to at least keep the current docker version) + kernel 4.19 + k8s 1.16
    • Prove 1.16 is ok and all (use a more sophisticated wording here :P)
    • Do we test the /admin part of deployment-charts in CI? (we don't, T266670)
    • Watch out for changed things
      • renamed metrics (probably)
      • Kubernetes daemons log to logstash (logging format has probably changed)
    • Switch staging reference to point to codfw
  • Reinitialize codfw with 1.16
  • Reinitialize eqiad with 1.16
  • Migrate to helm3 (T251305)

Calico upgrades T207804

  • Prep work for moving the egress policy to charts has been done by the contractor
    • Probably double check that all rules are set up in the charts
  • Quite possibly go the full cluster-reinit way and jump straight to the latest version
    • Decide if we are going to be staying with direct access to etcd (version 3?) or try and switch to the kubernetes APIs (T266895)
  • Build the calico debs, the cni debs, and the calico-node docker image
  • Test in a staging cluster (probably during reinit as well?).


Event Timeline

akosiaris lowered the priority of this task from High to Medium.
akosiaris set the point value for this task to 53.

Mentioned in SAL (#wikimedia-operations) [2020-02-05T09:57:08Z] <akosiaris> upload kubernetes 1.13.12 to apt.wikimedia.org stretch-wikimedia/main T244335

Important release notes for 1.13.x that affect us

kube-apiserver
    The deprecated etcd2 storage backend has been removed. Before upgrading a kube-apiserver using --storage-backend=etcd2, etcd v2 data must be migrated to the v3 storage backend, and kube-apiserver invocations changed to use --storage-backend=etcd3.

We should be ok on that front, although we still need to upgrade the eqiad cluster to etcd3.

Use of the --node-labels flag to set labels under the kubernetes.io/ and k8s.io/ prefix will be subject to restriction by the NodeRestriction admission plugin in future releases

Although we do want to enable the NodeRestriction plugin at some point, we haven't yet, mostly due to the problems with managing the per node accounts.

On the plus side:

Include CRD for BGPConfigurations, needed for calico 2.x to 3.x upgrade.

\o/

UDP connections now support graceful termination in IPVS mode

but it looks like it's been reverted in 1.13.6 with

IPVS: Disable graceful termination for UDP traffic to solve issues with high number of UDP connections (DNS / syslog in particular) (#77802, @lbernail)
Add metrics-port to kube-proxy cmd flags

Interesting, we need to see what this exports.

[IPVS] Allow for transparent kube-proxy restarts

We don't use IPVS in kube-proxy yet, but we've been meaning to evaluate it.

Mentioned in SAL (#wikimedia-operations) [2020-02-05T10:24:46Z] <akosiaris> T244335 upgrade kubernetes-master on neon.eqiad.wmnet (staging)

Mentioned in SAL (#wikimedia-operations) [2020-02-05T10:50:01Z] <akosiaris> T244335 upgrade kubernetes-node on kubestage1002.eqiad.wmnet to 1.13.12

JMeybohm renamed this task from Upgrade production kubernetes clusters to a security supported version to Upgrade kubernetes clusters to a security supported (LTS) version. Oct 20 2020, 3:11 PM
JMeybohm updated the task description.

> We are not able to go 1.19 because of calico only supporting 1.18

Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

> We are not able to go > 1.16 because of helm2 only supporting 1.16
> We can't upgrade to helm3 because that requires k8s >= 1.13

Those two are our big issues right now.

> We can't stay on/upgrade to < 1.16 because calico needs at least 1.16

That's for calico 3.16. But it is true. No point in re-initializing a cluster if we are going to do anything less than 3.16.

> So best bet is currently to update to k8s 1.16 which gives us ingress and CRD support. From that we need to migrate to helm3 and afterwards we are able to continue to k8s > 1.16 (whatever makes sense then).

+1

Also +1 on the overall plan. Some more details:

> Quite possibly go the full cluster reinit way and go the latest version

Yeah, +1

> Decide if we are going to be staying with direct access to etcd (version 3?) or try and switch to the kubernetes APIs

The big issue we currently have with the etcd-backed datastore is that the information in there is not tracked anywhere. It is backed up but fully unsearchable. So we definitely want to at least test using the API.

>> We are not able to go 1.19 because of calico only supporting 1.18

> Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

Not sure about that. The commit is from Sept 2nd and never made it to the 3.16 release branch. I would guess it's for 3.17.

>> Decide if we are going to be staying with direct access to etcd (version 3?) or try and switch to the kubernetes APIs

> The big issue we currently have with the etcd-backed datastore is that the information in there is not tracked anywhere. It is backed up but fully unsearchable. So we definitely want to at least test using the API.

Agreed. Will copy to T266895.

>>> We are not able to go 1.19 because of calico only supporting 1.18

>> Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

> Not sure about that. The commit is from Sept 2nd and never made it to the 3.16 release branch. I would guess it's for 3.17.

Yes, per https://github.com/projectcalico/calico/pull/3963 it does look that way indeed. Disregard in that case.

Change 645041 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] k8s-codfw-staging: Add DNS RRs

https://gerrit.wikimedia.org/r/645041

Change 645048 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s: Remove default values for some parameters

https://gerrit.wikimedia.org/r/645048

Change 645049 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] profile::kubernetes: fold infrastructure_config to profile

https://gerrit.wikimedia.org/r/645049

Change 645050 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] profile::kubernetes::node: Remove toolforge customizations

https://gerrit.wikimedia.org/r/645050

Change 645052 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] deployment_server: Add k8s-staging-codfw

https://gerrit.wikimedia.org/r/645052

Change 645053 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] kube-apiserver: Use the infrastructure users file directly

https://gerrit.wikimedia.org/r/645053

Change 645054 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s:apiserver: Manage kube user/group

https://gerrit.wikimedia.org/r/645054

Change 645048 merged by Alexandros Kosiaris:
[operations/puppet@production] k8s: Remove default values for some parameters

https://gerrit.wikimedia.org/r/645048

Change 645049 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::kubernetes: fold infrastructure_config to profile

https://gerrit.wikimedia.org/r/645049

Change 645050 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::kubernetes::node: Remove toolforge customizations

https://gerrit.wikimedia.org/r/645050

Change 645053 merged by Alexandros Kosiaris:
[operations/puppet@production] kube-apiserver: Use the infrastructure users file directly

https://gerrit.wikimedia.org/r/645053

Change 645052 merged by Alexandros Kosiaris:
[operations/puppet@production] deployment_server: Add k8s-staging-codfw

https://gerrit.wikimedia.org/r/645052

Change 645041 merged by Alexandros Kosiaris:
[operations/dns@master] k8s-codfw-staging: Add DNS RRs

https://gerrit.wikimedia.org/r/645041

Change 645054 merged by Alexandros Kosiaris:
[operations/puppet@production] k8s:apiserver: Manage kube user/group

https://gerrit.wikimedia.org/r/645054

Change 645410 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] Order k8s_infrastructure_users by id

https://gerrit.wikimedia.org/r/645410

Change 645411 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] k8s_infrastructure_users: fix type of client-infrastructure

https://gerrit.wikimedia.org/r/645411

Change 645410 merged by JMeybohm:
[labs/private@master] Order k8s_infrastructure_users by id

https://gerrit.wikimedia.org/r/645410

Change 645411 merged by JMeybohm:
[labs/private@master] k8s_infrastructure_users: fix type of client-infrastructure

https://gerrit.wikimedia.org/r/645411

Change 648166 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Move non-common kubernetes staging values to DC specific files

https://gerrit.wikimedia.org/r/648166

Change 648166 abandoned by Alexandros Kosiaris:
[operations/puppet@production] Move non-common kubernetes staging values to DC specific files

Reason:
Squashed in Ia4dc53f2ec836e614f11ff57845ee80e771ce762. Good catch!

https://gerrit.wikimedia.org/r/648166

Change 648192 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Enable k8s-staging prometheus instance in codfw

https://gerrit.wikimedia.org/r/648192

Change 648193 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Add k8s-staging prometheus instance datasource

https://gerrit.wikimedia.org/r/648193

Change 648240 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Add wmf-node-authorization ClusterRoleBinding

https://gerrit.wikimedia.org/r/648240

Change 648240 merged by jenkins-bot:
[operations/deployment-charts@master] Add wmf-node-authorization ClusterRoleBinding

https://gerrit.wikimedia.org/r/648240

Discussion as of today: "We messed up looking at changelogs and figuring out dependencies."

We reimaged and rolled out kubestaging2* only to see the kubernetes 1.16 kubelet not working with our current docker version:
failed to run Kubelet: failed to create kubelet: docker API version is older than 1.26.0

According to https://docs.docker.com/engine/api/#api-version-matrix, API version 1.26 means at least docker 1.13.1; we're currently running 1.12.6.
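The failing check amounts to a simple API version comparison. A minimal sketch of that logic, with the API numbers taken from the linked version matrix (1.12.x speaks API 1.24, 1.13.1 speaks 1.26, 18.06 speaks 1.38):

```python
# Why the 1.16 kubelet rejects docker 1.12.6: it requires Docker API >= 1.26,
# and per the docker API version matrix 1.12.x only speaks API 1.24.
# The engine->API mapping below is copied from that matrix.

def api_tuple(v):
    """Parse an API version string like '1.26' into a comparable tuple."""
    major, minor = v.split(".")
    return (int(major), int(minor))

REQUIRED = api_tuple("1.26")  # minimum Docker API the 1.16 kubelet accepts

DOCKER_API = {"1.12.6": "1.24", "1.13.1": "1.26", "18.06": "1.38"}

for engine, api in DOCKER_API.items():
    compatible = api_tuple(api) >= REQUIRED
    print(engine, "->", "ok" if compatible else "too old")
```

So anything from 1.13.1 onward clears the kubelet's check, which matches the validated-versions list quoted below.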

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.14.md#external-dependencies

The list of validated docker versions has changed. 1.11.1 and 1.12.1 have been removed. The current list is 1.13.1, 17.03, 17.06, 17.09, 18.06, 18.09. (#72823, #72831)

We decided to go forward to docker 18.09 (that's what is in buster and what the WMCS clusters are using), but unfortunately that version no longer provides the devicemapper storage driver (https://github.com/docker/for-linux/issues/452), so we will have to import and use 18.06 (to stay as close as possible to an at-least-WMCS-tested docker version).

Change 648192 merged by JMeybohm:
[operations/puppet@production] Enable k8s-staging prometheus instance in codfw

https://gerrit.wikimedia.org/r/648192

Change 648193 merged by JMeybohm:
[operations/puppet@production] Add k8s-staging prometheus instance datasource

https://gerrit.wikimedia.org/r/648193

Change 649841 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/kubernetes@future] Bump debian/changelog for config file changes

https://gerrit.wikimedia.org/r/649841

Change 649841 merged by JMeybohm:
[operations/debs/kubernetes@future] Bump debian/changelog for config file changes

https://gerrit.wikimedia.org/r/649841

As of https://gerrit.wikimedia.org/r/c/operations/puppet/+/648356 we're now running staging-codfw with docker 18.06.3 and it looks good so far.

Most deployments are running ok in staging-codfw. The ones failing are changeprop and api-gateway (probably due to networking/firewall issues; nothing docker/k8s related, we assume).

changeprop/changeprop-jobqueue:

  • Has some environment-specific config (site: staging) that might not work well with staging now being in two DCs
  • Uses kafka broker and nutcracker server lists per DC, so staging currently points to the eqiad ones.

api-gateway:

  • Does not look bad to me. Containers are running (with some restarts for envoy, but nothing too bad).

I guess some connections (to kafka or nutcracker?) are not allowed from codfw to eqiad. For other services we will most likely see comparable situations, where staging deployments rely on the fact that staging is/was eqiad-only. Unfortunately this goes against our idea of being able to transparently switch staging from one DC to the other - at least without extending charts to know about that fact.
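One way to make charts DC-aware, sketched very roughly here: key values like broker lists by (environment, datacenter) instead of by environment alone. All names below (hostnames, the lookup helper) are invented for illustration and are not the actual chart values:

```python
# Hypothetical illustration of the staging-in-two-DCs problem: values like
# the kafka broker list were keyed only by "staging", which implicitly meant
# eqiad. Keying them by (environment, datacenter) lets staging move between
# DCs. Hostnames here are made up for the example.

BROKERS = {
    ("staging", "eqiad"): ["kafka-main1001.eqiad.wmnet"],
    ("staging", "codfw"): ["kafka-main2001.codfw.wmnet"],
}

def brokers_for(env, dc):
    """Return the broker list for an (environment, DC) pair, or fail loudly."""
    try:
        return BROKERS[(env, dc)]
    except KeyError:
        raise ValueError(f"no broker list defined for {env}/{dc}")
```

Failing loudly on an undefined pair would surface exactly the kind of implicit eqiad dependency described above, instead of silently pointing a codfw deployment at eqiad backends.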

JMeybohm closed subtask Restricted Task as Resolved.Mar 24 2021, 11:01 AM
aborrero closed subtask Restricted Task as Resolved.Mar 24 2021, 4:43 PM