Page MenuHomePhabricator

scap should optionally display helmfile diffs for review
Closed, ResolvedPublic

Description

Feature: imbue scap with the ability to collect and display helmfile diff output for review prior to deployment.

While this is not something we would want to enable by default, it would be quite useful when verifying infrastructure changes deployed with scap sync-world (i.e., the preferred path for changes that affect all MW deployments).

Details

TitleReferenceAuthorSource BranchDest Branch
Optionally collect, display, and confirm helmfile diffsrepos/releng/scap!295swfrenchwork/swfrench/T362717-k8s-confirm-diffsmaster
Customize query in GitLab

Event Timeline

I have a proof-of-concept patch for this, the result of which is shown in P60688 (using train-dev).

This front-loads review for all stages, which is preferable vs. per-stage so that we don't end up needing to cancel (i.e., upon an unexpected diff) mid-way through a deployment.

I think my main open question is just how early we want to insert review.

One reasonable option is right before deployment starts: evaluate diffs as close as possible in time to when we actually start running helmfile apply. This minimizes the window where an unexpected helmfile or chart change can sneak in (i.e., not part of a reviewed diff).

The downside of such an approach is that, when image builds are enabled, cancelling at that stage leaves behind the newly built image and the updated image tags in /etc/helmfile-defaults/mediawiki/release/....

We could expose some of the currently private revert logic from scap.kubernetes to clean up in this case, but that also brings a fair amount of complexity.

Given that, another reasonable option is to lift review as early as possible, say right after we acquire the scap lock [0]. This has the benefit of avoiding side-effects, but also extends the window unexpected changes can sneak in.

I'm going to ask around a bit to get opinions on this.

[0] https://gitlab.wikimedia.org/repos/releng/scap/-/blob/e88974a6b23c6d915075a84859ec139967ad26d1/scap/main.py#L106

If we're really worried about that race condition, is it plausible to do this?

  • Run helmfile diff early, right after getting the scap lock. Print the output for the operator to approve, but also store all of it.
  • Then run helmfile diff again, silently, right before helmfile apply.
    • If the output is the same as the stored output from earlier (should be true almost always), continue with the apply.
    • If the output has changed, either present it for re-approval or just scream and exit. (Optionally, clean up the build artifacts, but maybe now we've made it rare enough that we can live with it.)

This approach would fail if there are harmless inconsistencies, e.g. the same diffs presented in a different order -- I don't know offhand if helmfile diff is liable to do that kind of thing.

It might take a few seconds longer, and it might also be overengineered for how bad a problem this is.

Thanks, Reuven! I think it should be feasible to do something like that, yes.

From a quick glance, it looks like the helm-diff plugin used by helmfile should be consistent across invocations in terms of how it represents the before / after manifests for diffing.

It might take a few seconds longer, and it might also be overengineered for how bad a problem this is.

Agreed, it would definitely add a few seconds to each stage, though only when the review-the-diffs feature is on (which by default it won't be).

Also, yes - whether the risk / likelihood of a racing helmfile or chart change is great enough to worry about in the first place is one of the main questions on my mind ...

If it is something to worry about, then I wonder whether it's better to field a simple solution initially, while engineering effort is better invested in making deployments hermetic in this regard.

Hmmm ... I just realized that "ignoring" the image tag change (in the image-build case) when comparing the initial and pre-apply rounds of diffs will be challenging (or hacky).

We could get around that by doing the initial round of review after the build, but then we're back to dealing with cleanup on cancellation.

This points me back toward the simple solution: a single round of diffs, right after we take the lock.

While there is a race, it's still arguably better than manually running helmfile diff in a loop and reviewing before invoking scap sync-world.

Also, operators that would be interested in this feature are likely to be explicitly coordinating changes to MW-affecting charts / values (i.e., it's more guarding against latent diffs, and / or diffs that don't look like what we expect from first-principles).

And if we do later invest in making helmfile evaluation hermetic during deployments, then it "just works" without the caveat.

That sounds reasonable! Note for the future that helm diff has a --suppress-output-line-regex which does exactly what you'd like it to do, but it's not available in the version we're currently running.

Thanks, Reuven. Once that's available, it would be good to figure out whether it's straightforward to make the regex precise enough to suppress just the MW image change (vs. others in the same pod). I think that should be possible, e.g., as long as the image name remains stable.

Scott_French added a subscriber: dancy.

Many thanks to @dancy for the review.

I think I'm happy with this simple solution for now. The fact that the diffs may become outdated if concurrent chart or helmfile value changes are merged during sync is noted in the review dialog.

A couple of directions we could take this in the future:

  • Add a diff-of-diffs check like that described in T362717#9720685.
  • Make scap hermetic w.r.t. chart or helmfile changes (this would be rather involved).
  • Consider _always_ checking for diffs, and requiring review if they are non-empty (i.e., since "normal" MW deployments are generally expected not to pick up diffs other than their own image-version change).

I'll give these all some thought.