charlie was used for the first time in the context of a full cluster wipe and redeploy during T405703: Update wikikube eqiad to kubernetes 1.31. While it was very useful, I think we should think about implementing a specific workflow in charlie for this particular use-case.
This would include:
- Asking for confirmation in the beginning, and maybe checking that the cluster is indeed in a wiped state before allowing for the full force-redeploy
- Deploying any DaemonSet containing releases first in the process so they are sure to find all nodes empty and available (especially mw-mcrouter)
- Addind a kind of priority system to releases so we can tag important releases that need to be deployed early to minimize their downtime (in this particular upgrade that would have been toolhub, kartotherian, tegola, thumbor)
- Not showing diffs unless a first try of a helmfile sync/apply has been made and failed
- Not asking for confirmation during the run unless a first try has been made and failed, perhaps asking once between each of DaemonSet stage, priority stage, all the rest stage
- Finishing up by launching a scap deployment of MediaWiki
Thoughts? Additional features we would want for this use-case?