
Proposal to move kubernetes upgrades to blue green deploy
Closed, Declined · Public

Description

This ticket is for discussion of changing the k8s upgrade process. Please comment with any thoughts, opinions, or views on the subject.

Our current upgrade method has no rollback ability. We test the upgrade on toolsbeta (what are our tests?); if it looks good we go on to toolforge, then paws. This works well and runs smoothly. However, if we were to miss anything in toolsbeta and the upgrade failed, we could be stuck in a failed state for an unknown amount of time, during which k8s for toolforge would be down. Additionally, if we have to upgrade to new VMs, the process is lengthy. The proposal is to move to a blue/green-style deploy, which should confer a few benefits:

  • Every upgrade would start from the beginning, so every upgrade would also exercise our disaster recovery process.
  • Should an upgrade fail, it would never have been in production, so no one would notice.
  • Should an upgrade be found to have failures after it is in production, we would only have to switch back to the old cluster to return to the state we were in before the upgrade.

Current upgrade method:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes

Deploy method:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying

Event Timeline

rook renamed this task from Proposal to move kubernetes upgrades to A/B deploy to Proposal to move kubernetes upgrades to blue green deploy. Oct 18 2021, 5:56 PM
rook updated the task description.

I have a bunch of questions, partly from lack of knowledge and partly about the high-level proposal itself, so I'll throw them here; anyone is welcome to reply to any of them (I'll do so too when I find the answers xd). These are not arguments for or against, just questions I'd like to clear up so we can discuss the solutions:

  • This means duplicating the amount of VMs during each deploy right?
    • If so, will this include automating the creation/addition of the new VMs? (we currently have only cookbooks to add an etcd node to an existing cluster, and a worker node to an existing cluster)
  • Will this include doing the blue/green deploy on toolsbeta too?
    • If not, will we remove toolsbeta?
      • If not, how will toolsbeta keep updated?
  • Is this for toolforge and paws, or just one of them?
  • If this is meant for toolforge too, how will we "redeploy" all the users projects?
    • Afaik they can manually edit the kubernetes configs, so a simple 'webservice start' will not work
    • Will we be dumping the etcd DB and restoring it in the new cluster?
    • Will we be dumping all the resources (deployments, etc.) and importing them in the new cluster?
    • Some of them are not meant to have more than one instance running at a time (access to same NFS files, etc.), how do we work around that if we start one instance on each version cluster?
  • Can you elaborate a bit more on the proposed process itself? (create new VMs, setup control nodes, dump resources from old version, ...)

Thanks for the task!

A few points:

We just got started with the whole automation thing. I think we selected our automation framework, spicerack, like 8 months ago (cc @dcaro). I think we all agree that for everything we do from now on we should try an automation-first approach. That would make us closer to actual SREs :-) And it will improve our services, of course.

Additional trivia: we don't have automation for deploying openstack (!!). I myself was the last person who bootstrapped an openstack deployment (eqiad1), and that was like 3 years ago. As opposed to k8s, we don't even have clear docs on the steps required. I think today this is way more concerning than our lack of automation for k8s. The cadence for updating openstack & kubernetes is similar (they release every 6 months), and the problems you described are relevant to openstack as well, with the additional detail that everything we do (including k8s) depends on openstack.

Indeed an interesting and important topic, although quite challenging as we don't really manage the workloads running inside the cluster.

In the task description, @mdipietro wrote:

We test the upgrade on toolsbeta (what are our tests?)

I've usually just tested stopping/starting pods and verified via manual curl commands that web requests get routed correctly through the ingress setup (which is lacking the front proxy part on toolsbeta). This area could indeed use some improvement.
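For what it's worth, that kind of check could be scripted. A minimal sketch in Python, assuming a hypothetical test tool reachable at test-tool.toolsbeta.wmflabs.org (the real hostname and tool would be whatever exists in toolsbeta):

```python
# Sketch of automating the manual curl check: does a web request reach the
# test tool through the ingress? The hostname below is a hypothetical placeholder.
import requests

INGRESS_URL = "https://test-tool.toolsbeta.wmflabs.org/"  # assumed test tool URL


def check_ingress_routing(timeout: int = 10) -> bool:
    """Return True if the test tool answers through the ingress with HTTP 200."""
    try:
        resp = requests.get(INGRESS_URL, timeout=timeout)
    except requests.RequestException as exc:
        print(f"ingress check failed: {exc}")
        return False
    print(f"ingress returned HTTP {resp.status_code}")
    return resp.status_code == 200


if __name__ == "__main__":
    raise SystemExit(0 if check_ingress_routing() else 1)
```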

The cadence for updating openstack & kubernetes is similar (they release every 6 months)

Kubernetes releases a new version every 4 months (and until recently every 3 months).

@dcaro

  • This means duplicating the amount of VMs during each deploy right?
    • If so, will this include automating the creation/addition of the new VMs? (we currently have only cookbooks to add an etcd node to an existing cluster, and a worker node to an existing cluster)
  • Will this include doing the blue/green deploy on toolsbeta too?

These are all true

  • Is this for toolforge and paws, or just one of them?

It would be for both. Much of the underlying code would be the same, with a little extra for the specifics of one cluster or the other.

  • If this is meant for toolforge too, how will we "redeploy" all the users projects?
    • Afaik they can manually edit the kubernetes configs, so a simple 'webservice start' will not work

This will make things tricky.

  • Will we be dumping the etcd DB and restoring it in the new cluster?
  • Will we be dumping all the resources (deployments, etc.) and importing them in the new cluster?

Ideally no; the hope is for the deploy itself to double as our disaster recovery process, so the new cluster would be built without relying on the old one.

  • Some of them are not meant to have more than one instance running at a time (access to same NFS files, etc.), how do we work around that if we start one instance on each version cluster?

This will make things tricky.

  • Can you elaborate a bit more on the proposed process itself? (create new VMs, setup control nodes, dump resources from old version, ...)

The way I've had success with this in the past is to deploy an entire new cluster starting from fresh VMs, deploy all the software to it once it is up, add it to the load balancer alongside the old cluster, and then pull down the old cluster.
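Not a real cookbook, but roughly the order of operations I mean, sketched in Python; every step below is a hypothetical placeholder that just prints what it would do (the real implementation would live in spicerack cookbooks):

```python
# Skeleton of the blue/green rollout order described above. Each step is a
# placeholder that only prints; cluster names and versions are illustrative.

def step(msg: str) -> None:
    print(f"[blue-green] {msg}")


def blue_green_upgrade(old_cluster: str, new_cluster: str, new_version: str) -> None:
    step(f"create fresh etcd/control/worker VMs for {new_cluster}")
    step(f"bootstrap kubernetes {new_version} on {new_cluster}")
    step(f"deploy the cluster software (ingress, maintain-kubeusers, ...) to {new_cluster}")
    step(f"run smoke tests against {new_cluster} -- a failure here affects nobody")
    step(f"add {new_cluster} to the load balancer alongside {old_cluster}")
    step(f"remove {old_cluster} from the load balancer, keeping it around for rollback")
    step(f"after a quiet period, tear down the {old_cluster} VMs")


if __name__ == "__main__":
    blue_green_upgrade("tools-k8s-old", "tools-k8s-new", "1.22")
```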

@aborrero

  • If so, that would be a major automation project. I mean, it would be 100% nice to have, but we should be mindful of https://xkcd.com/1205/

This is true; the time spent would never be repaid directly. The value is in potential uptime and in the consistency of deploys (removing manual steps and tweaks).

@Majavah

although quite challenging as we don't really manage the workloads running inside the cluster.

This is true

The primary potential benefits of this are in the situation where the k8s cluster goes down, whether due to an upgrade or some random disaster. Right now it would simply be down, and the time to get it back would be a while; I would guess about a day of downtime if we were trying to bring it back from nothing. That we don't control the workloads, and that some of them will not work well if more than one instance is running, are cogent points. The flexibility we offer is likely inversely proportional to the uptime we can offer. My perspective on this may not fit the situation. Would we agree that we are willing to accept a wider band of potential downtime because of the flexibility it allows us to offer in what kinds of services people can run? Which is to say: we don't want it to go down, but if it did, we are willing to accept a day of downtime and to send out a note asking people to restart any services that don't come back from webservice start?

By the way, the cluster state is stored in etcd.

We've had conversations in the past on how best to approach an etcd disaster, evaluating different backup strategies. We could revisit that to improve the overall resilience of Toolforge in an incremental way.

Also, a bit related: T285904: Migrate paws k8s to using separate local-disk etcd servers
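To make the backup-strategy part concrete, here is a minimal sketch of taking a periodic etcd snapshot with etcdctl, wrapped in Python; the endpoint and certificate paths are assumptions, not the actual Toolforge/PAWS paths:

```python
# Sketch: save an etcd snapshot that could later be restored with
# "etcdctl snapshot restore". Endpoint and cert paths are assumed values.
import os
import subprocess
from datetime import datetime, timezone

ENDPOINT = "https://127.0.0.1:2379"   # assumed: run on an etcd member
CERT_DIR = "/etc/etcd/ssl"            # assumed certificate location


def snapshot_etcd(dest_dir: str = "/srv/backups") -> str:
    """Save an etcd snapshot and return the path to the snapshot file."""
    dest = f"{dest_dir}/etcd-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.db"
    subprocess.run(
        [
            "etcdctl", "snapshot", "save", dest,
            "--endpoints", ENDPOINT,
            "--cacert", f"{CERT_DIR}/ca.pem",
            "--cert", f"{CERT_DIR}/client.pem",
            "--key", f"{CERT_DIR}/client-key.pem",
        ],
        check=True,
        env={**os.environ, "ETCDCTL_API": "3"},
    )
    return dest


if __name__ == "__main__":
    print(snapshot_etcd())
```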

if it were to we are willing to accept a day of downtime and sending out a note to please restart any services that wouldn't come back from webservice start

This is the current situation afaik xd
So we have no alternative but to accept that this is how it is right now (and hopefully improve it).

I'm 100% on board with the idea of being able to rebuild from scratch (I actually suggested a similar strategy a year ago when I joined), but it's not easy. Points that will help considerably here:

  • Moving to buildpacks (that way there's no NFS involved directly, with the goal that users won't need to directly mangle kubernetes objects, and thus we would be able to restart a service if needed).
  • Getting the automation to spin up/create the VMs (in the works; this will help with disaster recovery if we have the etcd backups mentioned here https://phabricator.wikimedia.org/T293675#7440735).

So given that stuff, is this something you are still willing to pursue?
Do you want to take on some of the tasks that will unblock this? (either buildpacks or automation of VM creation, though both are kinda long term)

Just an FYI, the testing was historically a matter of checking that all custom controllers continue working (by exercising them, like running a basic set of webservice commands or something) and ensuring that maintain-kubeusers is functioning. If there are more concerning changes in the upgrade, I'd also use the utility https://github.com/toolforge/toolsctl to create a toolsbeta tool to make sure everything got created in our automation toolchain without anything breaking. No way would we consider upgrading as often as y'all have been, because that's too much manual work for the time given. We'd been aiming in the past at a 6 month cycle, but we actually got behind because of other work going on.

This is presuming tests weren't already passed nicely locally or something. Most recent upgrades involved much of this work already being done beforehand. The toolsbeta environment is both a "test" for upgrades and a stand-in for a viable local development environment for the current state of toolforge (with an entire toolchain, including the grid, in order to validate other software used in toolforge like webservice). If you change the k8s upgrade process, you have a long way to go before you lose the usefulness of toolsbeta (random comment on something said above).

We had not yet reached a point where the checks for each part of the upgrade process (ensuring compatibility of all parts of the toolchain) were automated. If they were, and the upgrade process itself were therefore more automated, you'd have a much better idea of whether an upgrade was going to go alright than you do now. I'd suggest it might be a good idea to consider how to automate such checking as part of the upgrade process (with or without blue-green deployment); that would make things a lot easier to automate in general. Wouldn't it be better to find all those things first and sort out a way to make sure they are all checked? You could deploy between two clusters all you want now, and you'd have no idea either way unless you manually hunted things down anyway. User workloads are something you'd only hear about later regardless.
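As a rough illustration of what automating those checks could look like, here is a minimal sketch using the kubernetes Python client; the list of deployments to verify is an assumption (the real list would be whatever the toolchain actually requires):

```python
# Sketch of automated post-upgrade checks: all nodes Ready, and the critical
# toolchain deployments fully available. Deployment names/namespaces are
# hypothetical examples, not the actual Toolforge ones.
from kubernetes import client, config

CRITICAL_DEPLOYMENTS = [
    # (namespace, deployment) pairs -- assumed examples
    ("maintain-kubeusers", "maintain-kubeusers"),
    ("ingress-nginx", "ingress-nginx"),
]


def cluster_looks_healthy() -> bool:
    config.load_kube_config()  # or load_incluster_config() when run in-cluster
    ok = True

    # every node should report a Ready condition of "True"
    for node in client.CoreV1Api().list_node().items:
        ready = any(c.type == "Ready" and c.status == "True"
                    for c in node.status.conditions or [])
        if not ready:
            print(f"node {node.metadata.name} is not Ready")
            ok = False

    # the custom controllers / toolchain deployments should be fully available
    apps = client.AppsV1Api()
    for namespace, name in CRITICAL_DEPLOYMENTS:
        dep = apps.read_namespaced_deployment(name, namespace)
        if (dep.status.ready_replicas or 0) < (dep.spec.replicas or 0):
            print(f"{namespace}/{name}: "
                  f"{dep.status.ready_replicas or 0}/{dep.spec.replicas} replicas ready")
            ok = False

    return ok


if __name__ == "__main__":
    raise SystemExit(0 if cluster_looks_healthy() else 1)
```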

I have no opinion for or against blue-green deployment of the k8s component other than what has already been said (that it's a workload-focused concern that would require a lot of VMs, etc.). There's a lot to catch in a problematic upgrade, so it sounds cool if you can do it!

Hope that's useful.
/me sneaks away

Considering the flexibility we offer in how people run their workloads, I'm going to close this out as infeasible. Please reopen if you consider otherwise.