
charlie wiped cluster redeployment use-case
Closed, Resolved (Public)

Description

charlie was used for the first time in the context of a full cluster wipe and redeploy during T405703: Update wikikube eqiad to kubernetes 1.31. While it was very useful, I think we should consider implementing a dedicated workflow in charlie for this particular use case.

This would include:

  • Asking for confirmation at the beginning, and maybe checking that the cluster is indeed in a wiped state before allowing the full force-redeploy
  • Deploying any releases that contain a DaemonSet first, so they are sure to find all nodes empty and available (especially mw-mcrouter)
  • Adding a kind of priority system to releases so we can tag important releases that need to be deployed early to minimize their downtime (in this particular upgrade that would have been toolhub, kartotherian, tegola, thumbor)
  • Not showing diffs unless a first helmfile sync/apply attempt has been made and failed
  • Not asking for confirmation during the run unless a first attempt has been made and failed, perhaps asking once between stages (the DaemonSet stage, the priority stage, and the stage for everything else)
  • Finishing up by launching a scap deployment of MediaWiki

Thoughts? Additional features we would want for this use-case?
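A minimal sketch of the "quiet first, interactive only on failure" behaviour from the list above, in Python. Every name here (`deploy_release` and the `sync`, `show_diff`, and `confirm` callables) is a hypothetical stand-in, not charlie's actual interface; in practice the callables would wrap helmfile and the operator prompt.

```python
from typing import Callable


def deploy_release(
    release: str,
    sync: Callable[[str], bool],        # hypothetical: runs the sync, True on success
    show_diff: Callable[[str], None],   # hypothetical: prints the diff for a release
    confirm: Callable[[str], bool],     # hypothetical: asks the operator yes/no
) -> str:
    """Try a silent sync first; fall back to diff + prompt only on failure."""
    if sync(release):
        # Happy path: no diff shown, no question asked.
        return "deployed"

    # The first attempt failed, so now the operator gets the usual context.
    show_diff(release)
    if not confirm(f"First attempt failed for {release}. Retry?"):
        return "skipped"

    return "deployed" if sync(release) else "failed"
```

Injecting the three callables keeps the control flow testable without a cluster; whether charlie is structured this way is an assumption.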

Event Timeline

Thanks for this! I hadn't originally thought about using charlie this way. For my use case (applying the same diff to every service, like an Envoy upgrade), the "just deploy everything without asking me" feature is tempting but also an obviously terrible idea, which is why it's deliberately not supported. But for your use case it's perfect.

What's the best query to check the cluster is wiped? One option is just a command-line flag like --i_solemnly_swear_it_is_okay_to_obliterate_the_state_of_this_kubernetes_cluster. (It has the advantage that if charlie is interrupted mid-run it can be restarted in this mode even after recreating some objects. But that can be accomplished with a confirmation prompt too.) A kube api call or etcd read would be ideal -- I had a look at the wipe-cluster cookbook but nothing immediately jumped out at me. (CC @JMeybohm on this question too.)

On release ordering: For a quick first iteration, would you be happy with just an ordered list of priority services, and we can manually include mw-mcrouter and friends at the front of it for now? Then detecting those automatically is a straightforward followup patch.
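As a rough illustration of the "ordered list of priority services" idea, the sketch below moves listed releases to the front and leaves everything else in its original order. The function name and the sample release `some-other-service` are hypothetical; this is not how the `--priority` option is necessarily implemented.

```python
from typing import Iterable, List


def order_releases(releases: Iterable[str], priority: Iterable[str]) -> List[str]:
    """Return releases with priority entries first, in the order given."""
    # dict.fromkeys drops duplicates while preserving first-seen order.
    all_releases = list(dict.fromkeys(releases))
    known = set(all_releases)

    # Keep only priority names that actually exist in this deployment.
    front = [name for name in dict.fromkeys(priority) if name in known]
    placed = set(front)

    # Everything not prioritized keeps its original relative order.
    return front + [name for name in all_releases if name not in placed]


ordered = order_releases(
    ["tegola", "some-other-service", "mw-mcrouter", "thumbor"],
    ["mw-mcrouter", "thumbor", "toolhub"],
)
# ordered == ["mw-mcrouter", "thumbor", "tegola", "some-other-service"]
```

Priority names that are not deployed on the cluster (here `toolhub`) are silently ignored; a stricter variant might reject them as probable typos.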

I'm on the fence about adding scap at the end -- conceptually it feels like a different task from "just run helmfile a bunch of times" and I don't want this thing to get bloated. On the other hand, that would mean the instructions for humans are just "run charlie and then run scap" and at that point it ought to be scripted. So I'm leaning toward including it, as long as it doesn't add too much complexity, which it seems like it shouldn't.

What's the best query to check the cluster is wiped? One option is just a command-line flag like --i_solemnly_swear_it_is_okay_to_obliterate_the_state_of_this_kubernetes_cluster. (It has the advantage that if charlie is interrupted mid-run it can be restarted in this mode even after recreating some objects. But that can be accomplished with a confirmation prompt too.) A kube api call or etcd read would be ideal -- I had a look at the wipe-cluster cookbook but nothing immediately jumped out at me. (CC @JMeybohm on this question too.)

I think something tantamount to kubectl get namespace | wc -l being lower than some N (maybe N=5?) would be good enough here for charlie's purposes.

What's the best query to check the cluster is wiped? One option is just a command-line flag like --i_solemnly_swear_it_is_okay_to_obliterate_the_state_of_this_kubernetes_cluster. (It has the advantage that if charlie is interrupted mid-run it can be restarted in this mode even after recreating some objects. But that can be accomplished with a confirmation prompt too.) A kube api call or etcd read would be ideal -- I had a look at the wipe-cluster cookbook but nothing immediately jumped out at me. (CC @JMeybohm on this question too.)

Not really, because we would run charlie after having run the helmfile sync for admin_ng, which creates all the namespaces. I think the easiest may be to check that the only pods deployed on the cluster are in these namespaces:

cert-manager
default
istio-system
kube-node-lease
kube-public
kube-system
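The namespace check above could be sketched roughly as follows. It assumes the input is the output of `kubectl get pods --all-namespaces -o json` (a list object with an `items` array); the function name and the choice to pass the JSON in as a string are illustrative assumptions.

```python
import json
from typing import FrozenSet, List

# Namespaces where pods are expected even on a freshly wiped cluster,
# taken from the list above.
ALLOWED_NAMESPACES: FrozenSet[str] = frozenset({
    "cert-manager",
    "default",
    "istio-system",
    "kube-node-lease",
    "kube-public",
    "kube-system",
})


def namespaces_with_unexpected_pods(
    pod_list_json: str,
    allowed: FrozenSet[str] = ALLOWED_NAMESPACES,
) -> List[str]:
    """Return the sorted namespaces that hold pods but are not allowed."""
    items = json.loads(pod_list_json).get("items", [])
    seen = {item["metadata"]["namespace"] for item in items}
    return sorted(seen - allowed)
```

An empty result would mean the cluster looks wiped by this definition. The allow-list would need updating whenever admin_ng starts deploying pods into a new namespace.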

On release ordering: For a quick first iteration, would you be happy with just an ordered list of priority services, and we can manually include mw-mcrouter and friends at the front of it for now? Then detecting those automatically is a straightforward followup patch.

That's perfectly fine.

I'm on the fence about adding scap at the end -- conceptually it feels like a different task from "just run helmfile a bunch of times" and I don't want this thing to get bloated. On the other hand, that would mean the instructions for humans are just "run charlie and then run scap" and at that point it ought to be scripted. So I'm leaning toward including it, as long as it doesn't add too much complexity, which it seems like it shouldn't.

The scap call that we do manually isn't really much more than "just run helmfile a bunch of times" but with logstash checks and whatnot. I'm not too hung up on it being integrated into charlie tbh, we can also just run it manually at the end. I don't think it should be integrated into the cookbook, though, given that it is wikikube-specific.

Change #1196989 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add --priority to charlie

https://gerrit.wikimedia.org/r/1196989

Change #1196990 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add --dangerously_fast to charlie

https://gerrit.wikimedia.org/r/1196990

Change #1196989 merged by RLazarus:

[operations/puppet@production] deployment_server: Add --priority to charlie

https://gerrit.wikimedia.org/r/1196989

Change #1196990 merged by RLazarus:

[operations/puppet@production] deployment_server: Add --dangerously_fast to charlie

https://gerrit.wikimedia.org/r/1196990

What's the best query to check the cluster is wiped?

Sorry, late to the party due to PTO. I don't think there is a sufficiently future-proof way to detect that. So I would opt for a CLI flag for the sake of simplicity.
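For illustration, an explicit opt-in flag of this kind might look like the sketch below, using Python's standard `argparse`. The flag name `--assume-wiped`, the program name, and the helper function are all hypothetical and are not charlie's real options.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="charlie-sketch")  # hypothetical name
    parser.add_argument(
        "--assume-wiped",  # hypothetical flag, off by default
        action="store_true",
        help="Operator asserts the cluster was wiped; skip diffs and prompts.",
    )
    return parser


def fast_mode_requested(argv: List[str]) -> bool:
    """True only when the operator explicitly passed the opt-in flag."""
    return build_parser().parse_args(argv).assume_wiped
```

Because the flag defaults to off, the safe interactive behaviour stays the default, and an interrupted run can be restarted in the same mode by passing the flag again.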

jijiki triaged this task as High priority. Thu, Jan 22, 1:45 PM
jijiki moved this task from Inbox to Needs Info / Blocked on the ServiceOps new board.
jijiki added a project: ServiceOps new.
jijiki removed a project: serviceops.