
Upgrade kubernetes clusters to a security supported (LTS) version
Open · Medium · Public · 53 Estimated Story Points

Description

We are currently at 1.12.9. This is no longer security supported as a release.

As of Kubernetes 1.19, bugfix support via patch releases for a Kubernetes minor release has increased from 9 months to 1 year.

But:

  • We are not able to go to 1.19 because calico 3.16 (the current version) only supports up to 1.18
  • We are not able to go > 1.16 because helm2 only supports up to 1.16
  • We can't upgrade to helm3 yet because it requires k8s >= 1.13
  • We can't stay on / upgrade to < 1.16 because calico 3.16 needs at least k8s 1.16

So our best bet currently is to update to k8s 1.16, which gives us ingress and CRD support. From there we need to migrate to helm3, and afterwards we are able to continue to k8s > 1.16 (whatever makes sense then).
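The constraint chain above can be sketched as a small check. This is purely illustrative (the helper is hypothetical, not a tool we run); the version bounds are copied from the bullet list:

```python
# Sketch of the constraint chain above. All version bounds are copied from
# the bullet list; the helper itself is hypothetical, for illustration only.

def admissible_k8s_targets():
    candidates = ["1.13", "1.14", "1.15", "1.16", "1.17", "1.18", "1.19"]

    def minor(v):
        return int(v.split(".")[1])

    ok = []
    for v in candidates:
        if minor(v) > 16:   # helm2 only supports up to k8s 1.16
            continue
        if minor(v) < 16:   # calico 3.16 needs at least k8s 1.16
            continue
        if minor(v) < 13:   # helm3 (the follow-up step) needs k8s >= 1.13
            continue
        ok.append(v)
    return ok

print(admissible_k8s_targets())  # -> ['1.16']
```

Every constraint except "k8s 1.16 exactly" rules something out, which is why 1.16 is the only admissible intermediate step.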

For the actual upgrade of clusters we have:

K8s upgrades

This is the reinitialize-k8s-cluster plan (i.e. don't really update stuff in place).

  • Add 1.19 to CI kubeyaml (T266032)
  • Build k8s 1.16 (T266766)
  • Read a lot of changelog
  • Set up the kubernetes codfw staging cluster with stretch (to at least keep the current docker version) + kernel 4.19 + k8s 1.16
    • Prove 1.16 is ok and all (use a more sophisticated wording here :P)
    • Do we test the /admin part of deployment-charts in CI? (we don't, T266670)
    • Watch out for changed things
      • renamed metrics (probably)
      • Kubernetes daemons log to logstash (logging format has probably changed)
    • Switch staging reference to point to codfw
  • Reinitialize codfw with 1.16
  • Reinitialize eqiad with 1.16
  • Migrate to helm3 (T251305)

Calico upgrades T207804

  • Prep work for moving the egress policy to charts has been done by the contractor
    • Probably double check that all rules are set up in the charts
  • Quite possibly go the full cluster-reinit way and jump straight to the latest version
    • Decide if we are going to be staying with direct access to etcd (version 3?) or try and switch to the kubernetes APIs (T266895)
  • Build the calico debs, the cni debs, and the calico-node docker image
  • Test in a staging cluster (probably during reinit as well?).


Event Timeline

akosiaris lowered the priority of this task from High to Medium.
akosiaris set the point value for this task to 53.

Mentioned in SAL (#wikimedia-operations) [2020-02-05T09:57:08Z] <akosiaris> upload kubernetes 1.13.12 to apt.wikimedia.org stretch-wikimedia/main T244335

Important release notes for 1.13.x that affect us

kube-apiserver
    The deprecated etcd2 storage backend has been removed. Before upgrading a kube-apiserver using --storage-backend=etcd2, etcd v2 data must be migrated to the v3 storage backend, and kube-apiserver invocations changed to use --storage-backend=etcd3.

We should be ok on that front, although we still need to upgrade the eqiad cluster to etcd3.

Use of the --node-labels flag to set labels under the kubernetes.io/ and k8s.io/ prefix will be subject to restriction by the NodeRestriction admission plugin in future releases

Although we do want to enable the NodeRestriction plugin at some point, we haven't yet, mostly due to the problems with managing the per node accounts.

On the plus side:

Include CRD for BGPConfigurations, needed for calico 2.x to 3.x upgrade.

\o/

UDP connections now support graceful termination in IPVS mode

but it looks like it's been reverted in 1.13.6 with

IPVS: Disable graceful termination for UDP traffic to solve issues with high number of UDP connections (DNS / syslog in particular) (#77802, @lbernail)
Add metrics-port to kube-proxy cmd flags

Interesting, we need to see what this exports.

[IPVS] Allow for transparent kube-proxy restarts

We don't use IPVS in kube-proxy yet, but we've been meaning to evaluate it.

Mentioned in SAL (#wikimedia-operations) [2020-02-05T10:24:46Z] <akosiaris> T244335 upgrade kubernetes-master on neon.eqiad.wmnet (staging)

Mentioned in SAL (#wikimedia-operations) [2020-02-05T10:50:01Z] <akosiaris> T244335 upgrade kubernetes-node on kubestage1002.eqiad.wmnet to 1.13.12

JMeybohm renamed this task from Upgrade production kubernetes clusters to a security supported version to Upgrade kubernetes clusters to a security supported (LTS) version. Oct 20 2020, 3:11 PM
JMeybohm updated the task description.

> We are not able to go 1.19 because of calico only supporting 1.18

Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

> We are not able to go > 1.16 because of helm2 only supporting 1.16
> We can't upgrade to helm3 because that requires k8s >= 1.13

Those two are our big issues right now.

> We can't stay on/upgrade to < 1.16 because calico needs at least 1.16

That's for calico 3.16. But it is true. No point in re-initializing a cluster if we are going to do anything less than 3.16.

> So best bet is currently to update to k8s 1.16 which gives us ingress and CRD support. From that we need to migrate to helm3 and afterwards we are able to continue to k8s > 1.16 (whatever makes sense then).

+1

Also +1 on the overall plan. Some more details:

> Quite possibly go the full cluster reinit way and go the latest version

Yeah, +1

> Decide if we are going to be staying with direct access to etcd (version 3?) or try and switch to the kubernetes APIs

The big issue we currently have with the etcd-backed datastore is that the information in there is not tracked anywhere. It is backed up but fully unsearchable. So we definitely want to at least test using the API.

>> We are not able to go 1.19 because of calico only supporting 1.18

> Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

Not sure about that. The commit is from Sept 2nd and never made it to the 3.16 release branch. I would guess it's for 3.17.

>> Decide if we are going to be staying with direct access to etcd (version 3?) or try and switch to the kubernetes APIs

> The big issue we currently have with the etcd-backed datastore is that the information in there is not tracked anywhere. It is backed up but fully unsearchable. So we definitely want to at least test using the API.

Agreed. Will copy to T266895.

>>> We are not able to go 1.19 because of calico only supporting 1.18

>> Looks like this isn't true. Judging from https://github.com/projectcalico/calico/commit/21a45a4a141fff03b251fde2f1ab77fbb0c903ee#diff-f386c272afd3d855bf9f1d3609d1782962951258a58e2b298df60c70b16517ee, the calico 3.16 requirements page will be updated soon.

> Not sure about that. The commit is from Sept 2nd and never made it to the 3.16 release branch. I would guess it's for 3.17.

Yes, per https://github.com/projectcalico/calico/pull/3963 it does look that way indeed. Disregard in that case.

Change 645041 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] k8s-codfw-staging: Add DNS RRs

https://gerrit.wikimedia.org/r/645041

Change 645048 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s: Remove default values for some parameters

https://gerrit.wikimedia.org/r/645048

Change 645049 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] profile::kubernetes: fold infrastructure_config to profile

https://gerrit.wikimedia.org/r/645049

Change 645050 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] profile::kubernetes::node: Remove toolforge customizations

https://gerrit.wikimedia.org/r/645050

Change 645052 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] deployment_server: Add k8s-staging-codfw

https://gerrit.wikimedia.org/r/645052

Change 645053 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] kube-apiserver: Use the infrastructure users file directly

https://gerrit.wikimedia.org/r/645053

Change 645054 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s:apiserver: Manage kube user/group

https://gerrit.wikimedia.org/r/645054

Change 645048 merged by Alexandros Kosiaris:
[operations/puppet@production] k8s: Remove default values for some parameters

https://gerrit.wikimedia.org/r/645048

Change 645049 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::kubernetes: fold infrastructure_config to profile

https://gerrit.wikimedia.org/r/645049

Change 645050 merged by Alexandros Kosiaris:
[operations/puppet@production] profile::kubernetes::node: Remove toolforge customizations

https://gerrit.wikimedia.org/r/645050

Change 645053 merged by Alexandros Kosiaris:
[operations/puppet@production] kube-apiserver: Use the infrastructure users file directly

https://gerrit.wikimedia.org/r/645053

Change 645052 merged by Alexandros Kosiaris:
[operations/puppet@production] deployment_server: Add k8s-staging-codfw

https://gerrit.wikimedia.org/r/645052

Change 645041 merged by Alexandros Kosiaris:
[operations/dns@master] k8s-codfw-staging: Add DNS RRs

https://gerrit.wikimedia.org/r/645041

Change 645054 merged by Alexandros Kosiaris:
[operations/puppet@production] k8s:apiserver: Manage kube user/group

https://gerrit.wikimedia.org/r/645054

Change 645410 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] Order k8s_infrastructure_users by id

https://gerrit.wikimedia.org/r/645410

Change 645411 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] k8s_infrastructure_users: fix type of client-infrastructure

https://gerrit.wikimedia.org/r/645411

Change 645410 merged by JMeybohm:
[labs/private@master] Order k8s_infrastructure_users by id

https://gerrit.wikimedia.org/r/645410

Change 645411 merged by JMeybohm:
[labs/private@master] k8s_infrastructure_users: fix type of client-infrastructure

https://gerrit.wikimedia.org/r/645411

Change 648166 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Move non-common kubernetes staging values to DC specific files

https://gerrit.wikimedia.org/r/648166

Change 648166 abandoned by Alexandros Kosiaris:
[operations/puppet@production] Move non-common kubernetes staging values to DC specific files

Reason:
Squashed in Ia4dc53f2ec836e614f11ff57845ee80e771ce762. Good catch!

https://gerrit.wikimedia.org/r/648166

Change 648192 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Enable k8s-staging prometheus instance in codfw

https://gerrit.wikimedia.org/r/648192

Change 648193 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] Add k8s-staging prometheus instance datasource

https://gerrit.wikimedia.org/r/648193

Change 648240 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Add wmf-node-authorization ClusterRoleBinding

https://gerrit.wikimedia.org/r/648240

Change 648240 merged by jenkins-bot:
[operations/deployment-charts@master] Add wmf-node-authorization ClusterRoleBinding

https://gerrit.wikimedia.org/r/648240

Discussion as of today: "We messed up looking at changelogs and figuring out dependencies."

We reimaged and rolled out kubestaging2* only to see the kubernetes 1.16 kubelet not working with our current docker version:
failed to run Kubelet: failed to create kubelet: docker API version is older than 1.26.0

According to https://docs.docker.com/engine/api/#api-version-matrix, API version 1.26 means at least docker 1.13.1; we're currently running 1.12.6.
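The failing check amounts to a simple API version comparison. A minimal sketch of that logic, with the API numbers taken from the linked version matrix (1.12.x speaks API 1.24, 1.13.1 speaks 1.26, 18.06 speaks 1.38):

```python
# Why the 1.16 kubelet rejects docker 1.12.6: it requires Docker API >= 1.26,
# and per the docker API version matrix 1.12.x only speaks API 1.24.
# The engine->API mapping below is copied from that matrix.

def api_tuple(v):
    """Parse an API version string like '1.26' into a comparable tuple."""
    major, minor = v.split(".")
    return (int(major), int(minor))

REQUIRED = api_tuple("1.26")  # minimum Docker API the 1.16 kubelet accepts

DOCKER_API = {"1.12.6": "1.24", "1.13.1": "1.26", "18.06": "1.38"}

for engine, api in DOCKER_API.items():
    compatible = api_tuple(api) >= REQUIRED
    print(engine, "->", "ok" if compatible else "too old")
```

So anything from 1.13.1 onward clears the kubelet's check, which matches the validated-versions list quoted below.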

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.14.md#external-dependencies

The list of validated docker versions has changed. 1.11.1 and 1.12.1 have been removed. The current list is 1.13.1, 17.03, 17.06, 17.09, 18.06, 18.09. (#72823, #72831)

We decided to go forward to docker 18.09 (that's what is in buster and what the WMCS clusters are using), but unfortunately that version no longer provides the devicemapper storage driver (https://github.com/docker/for-linux/issues/452), so we will have to import and use 18.06 (to stay as close as possible to an at-least-WMCS-tested docker version).

Change 648192 merged by JMeybohm:
[operations/puppet@production] Enable k8s-staging prometheus instance in codfw

https://gerrit.wikimedia.org/r/648192

Change 648193 merged by JMeybohm:
[operations/puppet@production] Add k8s-staging prometheus instance datasource

https://gerrit.wikimedia.org/r/648193

Change 649841 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/kubernetes@future] Bump debian/changelog for config file changes

https://gerrit.wikimedia.org/r/649841

Change 649841 merged by JMeybohm:
[operations/debs/kubernetes@future] Bump debian/changelog for config file changes

https://gerrit.wikimedia.org/r/649841

As of https://gerrit.wikimedia.org/r/c/operations/puppet/+/648356 we're now running staging-codfw with docker 18.06.3 and it looks good so far.

Most deployments are running ok in staging-codfw. The ones failing are changeprop and api-gateway (probably due to networking/firewall issues; nothing docker/k8s related, we assume).

changeprop/changeprop-jobqueue:

  • Has some environment-specific config (site: staging) that might not work well with staging now being in two DCs
  • Uses kafka broker and nutcracker server lists per DC, so staging currently points to the eqiad ones.

api-gateway:

  • Does not look bad to me. Containers are running (with some restarts for envoy, but nothing too bad).

I guess some connections (to kafka or nutcracker?) are not allowed from codfw to eqiad. For other services we will most likely see comparable situations, where staging deployments rely on the fact that staging is/was eqiad-only. Unfortunately this goes against our idea of being able to transparently switch staging from one DC to the other - at least without extending charts to know about that fact.
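One way to make charts DC-aware, sketched very roughly here: key values like broker lists by (environment, datacenter) instead of by environment alone. All names below (hostnames, the lookup helper) are invented for illustration and are not the actual chart values:

```python
# Hypothetical illustration of the staging-in-two-DCs problem: values like
# the kafka broker list were keyed only by "staging", which implicitly meant
# eqiad. Keying them by (environment, datacenter) lets staging move between
# DCs. Hostnames here are made up for the example.

BROKERS = {
    ("staging", "eqiad"): ["kafka-main1001.eqiad.wmnet"],
    ("staging", "codfw"): ["kafka-main2001.codfw.wmnet"],
}

def brokers_for(env, dc):
    """Return the broker list for an (environment, DC) pair, or fail loudly."""
    try:
        return BROKERS[(env, dc)]
    except KeyError:
        raise ValueError(f"no broker list defined for {env}/{dc}")
```

Failing loudly on an undefined pair would surface exactly the kind of implicit eqiad dependency described above, instead of silently pointing a codfw deployment at eqiad backends.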

JMeybohm closed subtask Restricted Task as Resolved.Mar 24 2021, 11:01 AM
aborrero closed subtask Restricted Task as Resolved.Mar 24 2021, 4:43 PM