This ticket is for discussion of changing the k8s upgrade process. Please comment with any thoughts, opinions, or views on the subject.
Our current upgrade method has no rollback ability. We test the upgrade on toolsbeta (what are our tests?); if it looks good, we move on to toolforge, then paws. This works well and runs smoothly. However, if we were to miss anything in toolsbeta and the upgrade then failed, we could be in a failed state for an unknown length of time, during which k8s for toolforge would be down. Additionally, if we do have to upgrade to new VMs, the process is lengthy. The proposal is to move to an A/B style deploy, which should confer a few benefits.
Every upgrade would start from the beginning with a fresh deploy, so every upgrade would also test our disaster recovery process.
Additionally, should we find that an upgrade fails, the new cluster was never in production, so no one would notice.
Finally, should an upgrade be found to have failures after it is in production, we would only have to switch back to the old cluster to return to the state we were in before the upgrade.
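To make the proposed flow concrete for discussion, here is a rough Python sketch of the A/B switch logic described above. It is only an illustration, not our tooling: the `Cluster` and `ABDeploy` names, the version strings, and the idea of passing in a list of check functions are all assumptions made up for this example, and the real checks (whatever our tests turn out to be) would replace the placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cluster:
    """Hypothetical stand-in for a fully deployed k8s cluster."""
    name: str
    version: str


class ABDeploy:
    """Hypothetical model of an A/B deploy: one active cluster, one kept for rollback."""

    def __init__(self, active: Cluster) -> None:
        self.active = active
        self.previous: Optional[Cluster] = None

    def upgrade(self, candidate: Cluster,
                checks: List[Callable[[Cluster], bool]]) -> bool:
        # The candidate is built from scratch and validated while it is
        # NOT serving production, so a failure here is invisible to users.
        if not all(check(candidate) for check in checks):
            return False
        # Switch traffic to the candidate and keep the old cluster around.
        self.previous, self.active = self.active, candidate
        return True

    def rollback(self) -> None:
        # Rolling back is just switching to the cluster we kept.
        if self.previous is None:
            raise RuntimeError("no previous cluster to roll back to")
        self.active, self.previous = self.previous, self.active
```

The point of the sketch is that a failed check never touches the active cluster, and that a rollback is only a switch, provided the old cluster is kept until we are confident in the new one.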
Current upgrade method:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes
Deploy method:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying