We need/want to upgrade our Kubernetes clusters to Kubernetes v1.23.
Kubernetes v1.23 was selected as a target:
- because it is the last version supporting dockershim (and we don't want to move away from that together with this update)
- v1.24 was only released on 2022-05-03, and we usually don't want to upgrade to a just-released major/minor version
Together with the Kubernetes update, we need to update the following other components (subtasks might be a good idea for each of them once more details are available):
- Calico to 3.23 (tested and supported with k8s v1.23)
- There was an issue with 3.17.1, see T271422
- Istio to v1.13 (tested and supported with k8s v1.23)
- cert-manager to v1.8 (tested and supported with k8s v1.23)
- Update cfssl-issuer dependencies, T310486
- kserve to 0.8 (see its recommended version matrix)
- K8s v1.22 (the matrix does not mention K8s v1.23)
- Istio v1.11,v1.12
- cert-manager >=1.3
- knative-serving to v1.0+
- I'm a bit confused about versioning here as they are at v1.4 already (as the only maintained version?) with 1.0 released November 2021.
- The ML team has 0.18 deployed, but upstream jumped from the 0.2x releases to 1.0 to (IIUC) stabilize the API that they support. In my opinion we should probably jump to something that is at least 1.x (Luca).
- https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/ says: "Kubernetes v1.22 or newer" - This is for version 1.4 IIUC, since they dropped support for older versions.
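The mismatch between our targets and the kserve 0.8 matrix can be made explicit with a few lines of Python. This is only a sketch: the version numbers are copied from the list above, while the data layout and the names `parse`, `TARGETS`, `KSERVE_0_8` and `outside_matrix` are made up for illustration and do not exist in any of our repositories.

```python
# Illustrative only: versions come from the component list in this task,
# everything else (names, structure) is a hypothetical helper.

def parse(version: str) -> tuple:
    """Turn 'v1.13' or '1.8' into a comparable tuple such as (1, 13)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


# Versions we plan to move to.
TARGETS = {"kubernetes": "v1.23", "istio": "v1.13", "cert-manager": "v1.8"}

# What the kserve 0.8 recommended version matrix lists, as noted above.
KSERVE_0_8 = {
    "kubernetes": lambda v: parse(v) == (1, 22),
    "istio": lambda v: parse(v) in {(1, 11), (1, 12)},
    "cert-manager": lambda v: parse(v) >= (1, 3),
}


def outside_matrix(targets: dict, matrix: dict) -> list:
    """Return the components whose target version the matrix does not cover."""
    return sorted(name for name, ok in matrix.items() if not ok(targets[name]))


print(outside_matrix(TARGETS, KSERVE_0_8))  # ['istio', 'kubernetes']
```

So with the targets above, only cert-manager falls inside what kserve 0.8 lists; Kubernetes and Istio would be newer than the matrix covers, which is worth verifying in the upgrade tests.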
Preparation for the Kubernetes update:
- Double-check that our Docker API version is supported with Kubernetes v1.23 (minimum version is still 1.26.0)
- Check if migration from command line flags to config files is required for some Kubernetes components: T300499
- Looks to me like it is not yet strictly needed. It might be for some specific flags, though. That we will figure out during upgrade tests I guess.
- Ensure all our charts are compatible with Kubernetes v1.23
- Our current validation with kubeyaml does not support this and getting support for 1.23 into kubeyaml will require time that could be better invested into replacing kubeyaml in deployment-charts CI with kubeconform: T306165
- Read Kubernetes changelogs at https://relnotes.k8s.io for the relevant versions (yellow/red flags are linked below each version; tick the box once all action-required items have been addressed)
- v1.17
- Action Required (or at least check)
- Kubeproxy config.BindAddress now defaults to 127.0.0.1, was 0.0.0.0: https://github.com/kubernetes/kubernetes/pull/83822 (is not what it sounds like)
- ✅ Kube-apiserver: The AdmissionConfiguration type accepted by --admission-control-config-file has been promoted to apiserver.config.k8s.io/v1 with no schema changes. (#85098, @liggitt) - change apiVersion in k8s/manifests/apiserver.pp
- ✅ Kubeadm now includes CoreDNS version 1.6.5 (#85108,#85109)
- ✅ If given an IPv6 bind-address, kube-apiserver will now advertise an IPv6 endpoint for the kubernetes.default service. (#84727, @danwinship)
- Note
- Critical pods can now be created in namespaces other than kube-system. To limit critical pods to the kube-system namespace, cluster admins should create an admission configuration file limiting critical pods by default, and a matching quota object in the kube-system namespace permitting critical pods in that namespace. See https://kubernetes.io/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default for details. (#76310, @ravisantoshgudimetla): https://phabricator.wikimedia.org/T310618
- Graduate TaintNodesByCondition to GA: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-nodes-by-condition
- Graduate ScheduleDaemonSetPods to GA: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#how-daemon-pods-are-scheduled
- Pod process namespace sharing is now GA: https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/
- The docker container runtime now enforces a 220 second timeout on container network operations. (#71653, @liucimin)
- Fix panic in kubelet when running IPv4/IPv6 dual-stack mode with a CNI plugin (#82508, @aanm)
- Reduce default NodeStatusReportFrequency to 5 minutes (from 1 minute). With this change, periodic node status updates will be sent every 5m if node status doesn't change (otherwise they are still sent with 10s). (#84007, @wojtek-t)
- v1.18
- Action Required
- Note
- The following features are unconditionally enabled and the corresponding --feature-gates flags have been removed: PodPriority, TaintNodesByCondition, ResourceQuotaScopeSelectors and ScheduleDaemonSetPods (#86210, @draveness)
- Kube-proxy: Added dual-stack IPv4/IPv6 support to the iptables proxier. (#82462, @vllry)
- Support server-side dry-run in kubectl with --dry-run=server for commands including apply, patch, create, run, annotate, label, set, autoscale, drain, rollout undo, and expose. (#87714, @julianvmodesto)
- v1.19
- Action Required
- seccomp graduates to GA, check if we need to migrate PSPs off the annotations https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#seccomp-graduates-to-general-availability (this only affects pod spec not PSPs, nothing to do here)
- Kube-apiserver: the componentstatus API is deprecated. This API provided status of etcd, kube-scheduler, and kube-controller-manager components, but only worked when those components were local to the API server, and when kube-scheduler and kube-controller-manager exposed unsecured health endpoints. Instead of this API, etcd health is included in the kube-apiserver health check and kube-scheduler/kube-controller-manager health checks can be made directly against those components' health endpoints. (#93570, @liggitt)
- ✅ Kubeadm now includes CoreDNS version v1.7.0.
- ✅ Kube-apiserver: The NodeRestriction admission plugin now restricts Node labels kubelets are permitted to set when creating a new Node to the --node-labels parameters accepted by kubelets in 1.16+. (#90307, @liggitt)
- Note
- EndpointSlices, be aware (during debugging) https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/
- Fix bug in reflector that couldn't recover from "Too large resource version" errors (#92537, @wojtek-t) [SIG API Machinery]
- Kubelet: add '--logging-format' flag to support structured logging (#91532, @afrouzMashaykhi)
- Add --logging-format flag for component-base. Defaults to "text" using unchanged klog. (#89683, @yuzhiquan)
- Kube-controller-manager: add '--logging-format' flag to support structured logging (#91521, @SataQiu)
- Kube-scheduler: add '--logging-format' flag to support structured logging (#91522, @SataQiu)
- The DefaultIngressClass feature is now GA. The --feature-gate parameter will be removed in 1.20. (#91957, @cmluciano)
- The kube-controller-manager managed signers can now have distinct signing certificates and keys. See the help about --cluster-signing-[signer-name]-{cert,key}-file. --cluster-signing-{cert,key}-file is still the default. (#90822, @deads2k)
- Kube-apiserver, kube-scheduler and kube-controller manager now use SO_REUSEPORT socket option when listening on address defined by --bind-address and --secure-port flags, when running on Unix systems (Windows is NOT supported). This allows to run multiple instances of those processes on a single host with the same configuration, which allows to update/restart them in a graceful way, without causing downtime. (#88893, @invidian)
- v1.20
- Action Required
- ✅ TokenRequest and TokenRequestProjection are now GA features. The following flags are required by the API server:
- ✅ --service-account-issuer, should be set to a URL identifying the API server that will be stable over the cluster lifetime.
- ✅ --service-account-key-file, set to one or more files containing one or more public keys used to verify tokens.
- ✅ --service-account-signing-key-file, set to a file containing a private key to use to sign service account tokens. Can be the same file given to kube-controller-manager with --service-account-private-key-file. (#95896, @zshihang)
- ✅ Resolves non-deterministic behavior of the garbage collection controller when ownerReferences with incorrect data are encountered. Events with a reason of OwnerRefInvalidNamespace are recorded when namespace mismatches between child and owner objects are detected. The kubectl-check-ownerreferences tool can be run prior to upgrading to locate existing objects with invalid ownerReferences: https://github.com/kubernetes-sigs/kubectl-check-ownerreferences
- ✅ In dual-stack bare-metal clusters, you can now pass dual-stack IPs to kubelet --node-ip. eg: kubelet --node-ip 10.1.0.5,fd01::0005
- ✅ In dual-stack clusters where nodes have dual-stack addresses, hostNetwork pods will now get dual-stack PodIPs.
- Note
- A bug was fixed in kubelet where exec probe timeouts were not respected. This may result in unexpected behavior since the default timeout (if not specified) is 1s which may be too small for some exec probes. Ensure that pods relying on this behavior are updated to correctly handle probe timeouts.
- Kubernetes 1.20 now enables API Priority and Fairness (APF) by default.
- IPv4/IPv6 dual-stack has been reimplemented for 1.20 to support dual-stack Services: https://docs.k8s.io/concepts/services-networking/dual-stack/
- On-demand metrics calculation is now available through /metrics/resources
- kubectl alpha debug graduates from alpha to beta in 1.20, becoming kubectl debug
- Support the node label node.kubernetes.io/exclude-from-external-load-balancers (might be an idea to exclude ganeti VM nodes from LVS?)
- v1.21
- Action Required
- ✅ Kubeadm now includes CoreDNS v1.8.0
- ✅ New admission controller DenyServiceExternalIPs is available. Clusters which do not need the Service externalIPs feature should enable this controller and be more secure. (#97395, @thockin)
- ✅ The pause image upgraded to v3.4.1 in kubelet and kubeadm for both Linux and Windows. (#98205, @pacoxu) T322920
- ✅ Update the latest validated version of Docker to 20.10
- ✅ Upgrades IPv6Dualstack to Beta and turns it on by default. New clusters or existing clusters are not affected until an actor starts adding secondary Pods and service CIDRs CLI flags as described here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/563-dual-stack (#98969, @khenidak)
- Note
- Pod with multiple containers can use kubectl.kubernetes.io/default-container annotation to have a container preselected for kubectl commands.
- Immutable Secrets and ConfigMaps graduates to GA. This feature allows users to specify that the contents of a particular Secret or ConfigMap is immutable for its object lifetime. For such instances, Kubelet will not watch/poll for changes and therefore reducing apiserver load. (Probably true for almost all of our configmap/secret objects as we roll-restart deployments on configmap changes anyways).
- ServiceNodeExclusion, NodeDisruptionExclusion and LegacyNodeRoleBehavior features have been promoted to GA. ServiceNodeExclusion and NodeDisruptionExclusion are now unconditionally enabled, while LegacyNodeRoleBehavior is unconditionally disabled.
- ✅ TokenRequest and TokenRequestProjection feature gates have been removed and are unconditionally enabled
- Kubelet Graceful Node Shutdown feature graduates to Beta and enabled by default.
- Namespace API objects now have a kubernetes.io/metadata.name label matching their metadata.name field to allow selecting any namespace by its name using a label selector.
- Kubectl: kubectl get will omit managed fields by default now. Users could set --show-managed-fields to true to show managedFields when the output format is either json or yaml. (#96878, @knight42)
- v1.22
- Action Required
- ✅ Various beta API removals. We're not affected as kubeconform would have given notice
- ✅ controller-manager changes:
- ✅ controller-manager MUST start with --authorization-kubeconfig and --authentication-kubeconfig correctly set to get authentication/authorization working
- ✅ Applications that fetch metrics from controller-manager should use a dedicated service account which is allowed to access nonResourceURLs /metrics. (#96216, @knight42)
- ✅ (don't think we use that) liveness/readiness probes to controller-manager MUST use HTTPS now, and the default port has been changed to 10257
- ✅ Updated pause image to version 3.5, which now runs per default as pseudo user and group 65535:65535. This does not have any effect on remote container runtimes like CRI-O and containerd, which setup the pod sandbox user and group on their own. (#100292, @saschagrunert) T322920
- Note
- As of now both system-node-critical and system-cluster-critical pods have -997 OOM score, making them one of the last processes to be OOMKilled. If the user wants to have the pod to be OOMKilled last and the pod has system-cluster-critical priority class, it has to be changed to system-node-critical priority class to preserve the existing behavior (#99729, @ravisantoshgudimetla)
- Server-side Apply is GA https://kubernetes.io/docs/reference/using-api/server-side-apply/
- Default/kubeadm etcd moves to version 3.5.0
- Data corruption issues with etcd 3.5.[0-2], use >=3.5.3
- Introducing Memory quality of service support with cgroups v2 (Alpha). The MemoryQoS feature is now in Alpha. This allows kubelet running with cgroups v2 to set memory QoS at container, pod and QoS level to protect and guarantee better memory quality. This feature can be enabled through feature gate Memory QoS. (#102970, @borgerli)
- Kube-apiserver: the alpha PodSecurity feature can be enabled by passing --feature-gates=PodSecurity=true, and enables controlling allowed pods using namespace labels. See https://git.k8s.io/enhancements/keps/sig-auth/2579-psp-replacement for more details. (#103099, @liggitt)
- The EmptyDir memory backed volumes are sized as the minimum of pod allocatable memory on a host and an optional explicit user provided value. (#101048, @dims)
- The NamespaceDefaultLabelName is promoted to GA in this release. All Namespace API objects have a kubernetes.io/metadata.name label matching their metadata.name field to allow selecting any namespace by its name using a label selector. (#101342, @rosenhouse)
- v1.23
- Action Required
- ✅ Deprecation of klog specific flags (we use a bunch): https://kubernetes.io/docs/concepts/cluster-administration/system-logs/#klog
- ✅ kube-scheduler MUST start with --authorization-kubeconfig and --authentication-kubeconfig correctly set to get authentication/authorization working.
- Note
- IPv4/IPv6 Dual-stack Networking graduates to GA https://github.com/kubernetes/enhancements/tree/master/keps/sig-network/563-dual-stack
- PodSecurity graduates to Beta (In 1.23, the PodSecurity feature gate is enabled by default.) https://kubernetes.io/docs/concepts/security/pod-security-admission/
- Structured logging graduates to Beta
- Log messages in JSON format are written to stderr by default now (same as text format) instead of stdout. Users who expected JSON output on stdout must now capture stderr instead or in addition to stdout. (#106146, @pohly) (we log to stderr anyway, so probably no change here)
- Support for the seccomp annotations seccomp.security.alpha.kubernetes.io/pod and container.seccomp.security.alpha.kubernetes.io/[name] has been deprecated since 1.19, will be dropped in 1.25. Transition to using the seccompProfile API field. (#104389, @saschagrunert)
- Ephemeral containers graduated to beta and are now available by default. (#105405, @verb)
- The TTLAfterFinished feature gate is now GA and enabled by default. (#105219, @sahilvv)
- Introduce a feature gate DisableKubeletCloudCredentialProviders which allows disabling the in-tree kubelet credential providers. (#102507, @ostrain)
- Read Calico changelogs
- v3.17 > 3.17.0
- All components that use Typha now use the same logic to discover Typha’s address. They lookup the endpoints of the service directly and connect to one at random. This avoids a dependency on kube-proxy. typha #466 (@fasaxc)
- kube-controllers runs a non-root by default kube-controllers #566 (@caseydavenport)
- v3.18
- Calico v3.18.0 supports advertising Service LoadBalancer IP. You can now use an external IP allocator for LoadBalancer type Services (for example, MetalLB) and Calico will advertise those addresses into your BGP infrastructure: https://docs.projectcalico.org/archive/v3.18/networking/advertise-service-ips#advertise-service-load-balancer-ip-addresses
- IPAM Prometheus Metrics: https://docs.projectcalico.org/archive/v3.18/reference/kube-controllers/prometheus
- Add helm v3 chart to GitHub release artifacts calico: https://github.com/projectcalico/calico/pull/4365
- v3.19
- Update iptables version to 1.8.4-15
- By default, limit each node to 20 IP address blocks. This value can be overridden through IPAM configuration.
- v3.20
- Service-based egress rules; Calico NetworkPolicy and GlobalNetworkPolicy now support egress rules which match on Kubernetes service names. Service matches in egress rules can be used to allow or deny access to in-cluster services, as well as services typically not backed by pods (for example, the Kubernetes API). Address and port information is learned from the individual endpoints within the service.
- Configurable BGP graceful restart timer; See the maxRestartTime configuration option in the BGPPeer API.
- calico/node marks nodes with NetworkUnavailable=true on shutdown node #993 (@song-jiang)
- Add IP address garbage collection to kube-controllers kube-controllers #744 (@caseydavenport)
- Calico will now release empty IPAM blocks from nodes that no longer need them so they can be used elsewhere. kube-controllers #799 (@caseydavenport)
- v3.21
- For users of BGP you can now view the status of your BGP routers, including session status, RIB / FIB contents, and agent health via the new CalicoNodeStatus API: https://docs.projectcalico.org/archive/v3.21/reference/resources/caliconodestatus
- Service-based ingress rules; In v3.20, we introduced egress policy rules that can match on Kubernetes services. In v3.21, we improved upon that in two ways. First, you can now use service matches in Calico NetworkPolicy and GlobalNetworkPolicy ingress rules. Second, you can now use service-based network policy rules on Windows nodes.
- Option to run Calico as non-privileged and non-root; https://docs.projectcalico.org/archive/v3.21/security/non-privileged
- ACTION REQUIRED: calico/node logs write to /var/log/calico within the container by default, in addition to stdout node #1133 (@song-jiang)
- v3.22
- None
- v3.23
- Update to CNI plugins v1.1.1 calico #5944 (@caseydavenport)
- New per-pool IPAM metrics added calico #5706 (@pasanw)
- Read Istio changelogs and import 1.15.3 (see T322193)
- Cookbook related to the WDQS Streaming Updater and kubernetes: https://phabricator.wikimedia.org/T293063
- Cookbook to set a k8s cluster in maintenance mode: https://phabricator.wikimedia.org/T277677
- Cookbook for depooling one or all services from one kubernetes cluster: https://phabricator.wikimedia.org/T260663
- Add kubernetes 1.17+ topology annotations (automatically via Puppet): https://phabricator.wikimedia.org/T270191
- Agree strategy for Kubernetes BGP peering to top-of-rack switches: https://phabricator.wikimedia.org/T306649
- Package Kubernetes v1.23
- Update CoreDNS to 1.8.7 (see T321159)
- Package Calico v3.23
- Update calico helm charts to v3.23
- Update helm to >= 3.8.0 for k8s 1.23 support (T317511)
- Re-initialize wikikube-staging-codfw: T326340
- Re-initialize wikikube-staging-eqiad: T327664
- Update grafana dashboards and alerts (find dashboards using a specific metric via misc/search-grafana-dashboards.js in operations/software)
- Re-initialize wikikube-codfw: T329664
- Re-initialize wikikube-eqiad: T331126
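For the Docker API version item in the preparation list, a rough check could look like the sketch below. It assumes the minimum of 1.26.0 quoted above and reads the daemon's API version via `docker version --format '{{.Server.APIVersion}}'`. The function names are hypothetical, and nothing here is part of our existing tooling.

```python
import subprocess

# Minimum Docker API version as quoted in the preparation list above.
MIN_DOCKER_API = "1.26.0"


def as_tuple(version: str, width: int = 3) -> tuple:
    """Parse '1.41' or '1.26.0' into a zero-padded tuple like (1, 41, 0)."""
    parts = [int(p) for p in version.strip().split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def api_supported(reported: str, minimum: str = MIN_DOCKER_API) -> bool:
    """True if the reported API version is at least the required minimum."""
    return as_tuple(reported) >= as_tuple(minimum)


def docker_api_version() -> str:
    """Ask the local Docker daemon for its API version (needs docker access)."""
    result = subprocess.run(
        ["docker", "version", "--format", "{{.Server.APIVersion}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

On a node this would be used as `api_supported(docker_api_version())`. The zero-padding matters because Docker reports two-component versions such as 1.41, while the minimum is written with three components.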
Additional things to do together with the re-init of clusters:
- Move to bigger Pod and Service IP range (T326617)
- Delete etcd v2 datastore of calico: $ etcdctl -C https://$(hostname -f):2379 rm -r /calico
- Reimage etcd clusters to bullseye (so the above is probably not needed)
- Enable the DenyServiceExternalIPs admission plugin
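Before turning on DenyServiceExternalIPs it may be worth checking whether anything sets `spec.externalIPs` on a Service, since (as far as I understand the plugin) new uses of that field get rejected once it is enabled. A minimal sketch, assuming the input is the JSON from `kubectl get services --all-namespaces -o json`; the function name and the sample objects are invented for illustration.

```python
import json


def services_with_external_ips(service_list: dict) -> list:
    """Return 'namespace/name' for every Service with a non-empty spec.externalIPs."""
    found = []
    for item in service_list.get("items", []):
        if item.get("spec", {}).get("externalIPs"):
            meta = item["metadata"]
            found.append(f'{meta["namespace"]}/{meta["name"]}')
    return sorted(found)


# Hypothetical sample data in the shape of a ServiceList.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "default", "name": "plain"},
     "spec": {"clusterIP": "10.0.0.1"}},
    {"metadata": {"namespace": "edge", "name": "ingress"},
     "spec": {"clusterIP": "10.0.0.2", "externalIPs": ["192.0.2.10"]}}
  ]
}
""")

print(services_with_external_ips(SAMPLE))  # ['edge/ingress']
```

An empty result against the real clusters would suggest the plugin can be enabled without touching existing charts.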