
🔷 Upgrade kubernetes from 1.21 to 1.22
Closed, ResolvedPublic8 Estimated Story Points

Description

We are currently using version 1.21.x, which is set for retirement/EOL on the 28th of this month.

We should upgrade this to run on a newer version locally and for our staging/production environments

This would involve resolving the deprecation of some of our v1beta API usage, mostly Ingresses it seems, but there could be a lot of hidden gems/problems that would only appear once we start doing this.

AC

  • Decide on a new target version (1.22)
  • Upgrade all environments to that new target version

Useful links:

Event Timeline

See https://cloud.google.com/kubernetes-engine/docs/release-notes for possible targets for us to aim for. I would suggest we aim for 1.22 as the Regular version

It looks like to move forward to 1.22 on GKE we are only blocked by two APIs we call that will no longer be around:


| API | User agent | Total calls (last 30 days) | Last called |
| --- | --- | --- | --- |
| /apis/networking.k8s.io/v1beta1/ingresses | nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format | 12698 | 14 Aug 2022, 05:04:00 |
| /apis/extensions/v1beta1/ingresses | Go-http-client/2.0 | 18 | 8 Aug 2022, 15:54:00 |

Looks to me like the only real thing required would be to update the version of the nginx-ingress charts. This, however, probably means we need to remove the "pinned" bitnami charts repository.
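
For reference, the core of the deprecation work is moving Ingress objects from extensions/v1beta1 / networking.k8s.io/v1beta1 to networking.k8s.io/v1. A minimal sketch of the v1 shape (resource name, host, service and port are made-up placeholders, not taken from our charts):

```
# Sketch only: the networking.k8s.io/v1 Ingress shape that 1.22 requires.
# All names below are illustrative placeholders.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1      # was extensions/v1beta1 or networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: example-ui
spec:
  rules:
    - host: example.invalid
      http:
        paths:
          - path: /
            pathType: Prefix          # pathType is mandatory in v1
            backend:
              service:                # v1beta1 used serviceName/servicePort instead
                name: example-ui
                port:
                  number: 80
EOF
```

The nginx-ingress chart bump should cover the controller side; the templates in our own charts need the same apiVersion change.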

Tarrow renamed this task from Upgrade kubernetes from 1.21 to Upgrade kubernetes from 1.21 to 1.22. Aug 25 2022, 1:17 PM
Tarrow set the point value for this task to 8. Aug 25 2022, 1:19 PM
Tarrow moved this task from Tech prioritized backlog to Ready to Pick Up on the Wikibase Cloud board.

I refactored the UI chart in this PR to use the new stable ingress API. More details here: https://github.com/wbstack/charts/pull/104

I created similar PRs for the API, QueryService, and QueryService UI charts as well.
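
A quick way to sanity-check those chart PRs locally is to render them and look at the apiVersion each Ingress ends up with. A rough sketch, assuming the charts are checked out locally (the chart path is illustrative):

```
# Render a chart and confirm its Ingress now uses networking.k8s.io/v1
# (apiVersion is the line immediately before "kind: Ingress" in rendered output).
helm template ./charts/ui | grep -B1 "kind: Ingress"
```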

Tarrow renamed this task from Upgrade kubernetes from 1.21 to 1.22 to 🔵 Upgrade kubernetes from 1.21 to 1.22. Nov 8 2022, 3:26 PM
Tarrow renamed this task from 🔵 Upgrade kubernetes from 1.21 to 1.22 to 🔷 Upgrade kubernetes from 1.21 to 1.22. Nov 8 2022, 3:28 PM
Rosalie_WMDE renamed this task from 🔷 Upgrade kubernetes from 1.21 to 1.22 to 🔷 Upgrade kubernetes from 1.21 to 1.25. Nov 9 2022, 11:24 AM
Rosalie_WMDE renamed this task from 🔷 Upgrade kubernetes from 1.21 to 1.25 to 🔷 Upgrade kubernetes from 1.21 to 1.22. Nov 9 2022, 11:39 AM

@Rosalie_WMDE and I had a chat. Right now we want to, at a minimum, upgrade to 1.22, since this is supported by GKE (https://cloud.google.com/kubernetes-engine/docs/release-notes#current_versions). GKE also supports 1.23 and 1.24, but not 1.25 yet. We probably want to move up the versions incrementally, because some services that we use (ingress-nginx/nginx-ingress for example) don't support a wide enough version range for us to make the jump in one step.
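
To see which versions GKE currently offers before each incremental hop, something like the following should work (region and output format here are assumptions, not taken from this task):

```
# List the GKE versions available per release channel in our region,
# to plan the incremental 1.22 -> 1.23 -> 1.24 path.
gcloud container get-server-config \
  --region europe-west3 \
  --format="yaml(channels)"
```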

Today we upgraded the staging cluster wbaas-2's control plane and node pools to 1.22.15-gke.1000. This was a 3-step process (control plane, then the oldest node pool medium-pool, then the other node pool) and was done via the Google Cloud Console UI (docs).
Note: In the future, if we activate auto-upgrade for the node pools, upgrading just the control plane will be enough.
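
For the record, the same 3-step process can also be driven from the CLI instead of the Console. A sketch assuming regional clusters in europe-west3 (cluster and pool names are the ones mentioned in this task; the region and exact flags should be double-checked against the gcloud docs):

```
# 1. Upgrade the control plane first
gcloud container clusters upgrade wbaas-2 --region europe-west3 \
  --master --cluster-version 1.22.15-gke.1000

# 2. Upgrade the node pools, oldest first (they default to the control plane version)
gcloud container clusters upgrade wbaas-2 --region europe-west3 --node-pool medium-pool
gcloud container clusters upgrade wbaas-2 --region europe-west3 --node-pool standard-pool

# Optional: enable auto-upgrade so only the control plane needs manual upgrades next time
gcloud container node-pools update medium-pool --cluster wbaas-2 \
  --region europe-west3 --enable-autoupgrade
```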

During the upgrade the following alerts fired:

Throughout the upgrade of the second node pool (standard-pool, 4 nodes), a message popped up in the Cloud Console indicating that the upgrades were a bit delayed, possibly because of Pod Disruption Budgets or grace periods:

The current upgrade delay of 7 minutes per node indicates your Pod Disruption Budgets may need attention. Learn more
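
If this comes up again, the PDBs that block node drains can be inspected directly (illustrative commands; name and namespace are placeholders):

```
# List all PodDisruptionBudgets and how many disruptions each currently allows
kubectl get poddisruptionbudgets --all-namespaces

# Inspect a specific PDB; "Allowed disruptions: 0" means the drain has to wait
kubectl describe pdb <pdb-name> -n <namespace>
```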

Everything was completed after roughly 1h20m.

The upgrade on wbaas-3 was completed today as well; it was pretty much the same experience as with staging yesterday.

What was noticeable is that some node upgrades were delayed again; this time I spotted the reason: the Elasticsearch availability constraint. The drain waited because only 1 pod was allowed to be unavailable at any time, and the time until one of them reports as ready is currently around ~40m after pod start.
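
So each node drain can effectively cost up to ~40m while the evicted Elasticsearch pod comes back. A rough way to watch this during future upgrades (namespace and label selector are assumptions):

```
# Check how many disruptions the Elasticsearch PDB currently allows
kubectl get pdb --all-namespaces | grep -i elastic

# Watch the evicted pod until it reports Ready again (can take ~40m)
kubectl get pods -n <namespace> -l app=elasticsearch --watch
```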

The production cluster is now running on 1.22.15-gke.1000