Page MenuHomePhabricator

Refactor scap's kubernetes DeploymentsConfig to support selection of image kinds
Closed, ResolvedPublic

Description

Here, an image "kind" (new term) represents the specific type of an image (e.g., the debug image) and is roughly analogous to an image build target.

In contrast, the image "flavour" (existing term) is a particular configuration-variant / instantiation of that type (i.e., a specific configuration of build-args).

Making "kind" configurable in the DeploymentsConfig would allow scap to deploy the new mediawiki-cli image to mw-cron, mw-script, etc.

Initial proposal:

Option 1 (validated enum)

Deprecate the current debug boolean in favor of a kind enum. kind would be one of:

  • production (default, used by all standard mw-on-k8s deployments)
  • debug (used by mw-debug)
  • cli (used by mw-cron and mw-script, as well as dumps)

During DeploymentsConfig processing, scap validates that the kind is one of the supported types.

Option 2 (opaque key)

Deprecate the current debug boolean in favor of a kind config field.

  • This corresponds directly to the name by which the image is referred in the build report provided by build-images.py - e.g., debug-image.
  • In scap, this field is treated as an opaque key used when inspecting the build report, similar to mw_flavour today.

After image build complete, but before mutating the files in /etc/helmfile-defaults/mediawiki/release, scap validates that all requested combinations of kind and flavour are known.

Decision: Option 2

Event Timeline

Taking a step back, there are a couple of ways we could go about this. IMO, the two most obvious are as follows:

One option is what we discussed in IRC, along the lines of the current task description: Deprecate the debug boolean in favor of an enum that is one of production (default), debug, or cli. Scap would then (1) validate that the config specifies a supported kind, (2) map said kind to the relevant image key in the build report, and (3) (as today) automatically map debug-image releases to the testservers stage.

The fact that this is more rigid - e.g., must validate the kind value is supported - has both up and downsides. The upside is that we can catch bad values early, well before we start building images. The downside is that adding a new kind requires both a build-images.py change and a scap change.

I'm not sure I would cast #3 (i.e., still supporting automatically mapping debug-image releases to testservers stage) as either an up or downside, but it does again rely on inspecting the kind.

Another option is to treat kind as a simple pass-through value similar to mw_flavour - i.e., don't validate at all, and just use it for build-report lookup (e.g., defaulting to plain "image" if None). This has the benefit of supporting new image kinds a one-step procedure, at a cost of not catching unsupported kinds until we've already built images when we inspect the build report, having possibly already committed changes to /etc/helmfile-defaults/mediawiki/releases.

If we go this route, I'd propose we still deprecate the debug boolean and remove the logic for #3 - i.e., it's up to the deployments config to properly map releases to stages, including the testservers stage.

@Clement_Goubert @dancy - Do either of you have opinions on these two options? Both introduce rakes to step on, but the first does so in a way that is easier to reason about and caught early in scap execution (at config load).

Scott_French renamed this task from Refactor scap's kubernetes DeploymentsConfig to Refactor scap's kubernetes DeploymentsConfig to support selection of image kinds.Mar 20 2025, 5:59 PM

I think the second option is sound and would greatly simplify adding new images.

If we go with this option, I think we could avoid committing bad or incomplete changes to /etc/helmfile-defaults/mediawiki/releases by moving the find_image_flavour check out of _update_helmfile_values_for and doing a global check before it. We would compile a list of expected images from DeploymentsConfig, check they're all present in the build report, and either error out if they are not, or proceed with updating the releases file.

There is also the option of validating the list of flavours and kinds passed to build-images.py there and erroring out in the build process if it doesn't know how to handle them.

I agree with deprecating the debug boolean.

Thanks, @Clement_Goubert.

So, I realized in retrospect that adding a new image kind of course already requires a scap change as things exist now, in order to add the appropriate flags to the build-images.py invocation (e.g., the supplying the image name). If we anticipate that being how things work for the foreseeable future, then that's also the natural point at which one would add the enum member and associated image key - i.e., there is no "extra" scap change involved.

Now, I'd optimistically say that we should aim to get rid of this coupling as much as possible, and adding to it does not help advance that goal.

With that in mind, I'd say that I'm weakly leaning toward the second option - i.e., treat the new mw_kind (or similarly named field) as an opaque value like mw_flavour.

As for detecting inconsistency with the build-report, I think the right compromise here is to (1) build the images, (2) validate and collect helmfile values from the build report, (3) apply helmfile values updates. That means that, in the event of a mw_kind misconfiguration, in the worst case we've wasted time building the images.

The reason this is preferable to validating before build is that otherwise, two scap runs are needed in order to use a new mw_kind (i.e., since it would need to already exist in the last build).

As you point out, an alternative might be to pass the kinds / flavours we depend on to build-images.py, and have it validate upfront that it understands them. That said, I do worry about adding more complexity to the interactions between scap and build-images.py.

Anyway, I think this all highlights that there are improvements that could be made in terms of how image build details flows between scap and build-images.py and back. What images (kinds x flavours) should be built and with what parameters, should ideally be in one place.

I'd just like to pick up on something that was mentioned in the description:

cli (used by mw-cron and mw-script, as well as dumps (not deployed by scap afaik))

Just to be clear, we are intending to integrate scap with the deployment of helmfile.d/dse-k8s-services/mediawiki-dumps-legacy and its main release.
(The release is actually called production at the moment, but we will rename it, for consistency.)

This ticket T388761: scap needs to be k8s-cluster aware was required to enable this deployment by scap to non-wikikube clusters.
Although we haven't yet added this dse-k8s-eqiad/mediawiki-dumps-legacy/mediawiki-cli deployment information to the list of mediawiki releases, we hope to be able to do this soon.

The only resource that will be in this release though, will be the suspended job template using a mediawiki pod template, so no actual processes.
We then retrieve this job template with Airflow at job execution time and use it to build the pod spec for dump tasks.

Change #1130683 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] [WIP] Configure a scap deployment of mediwiki-dumps-legacy

https://gerrit.wikimedia.org/r/1130683

Thanks, @BTullis - I was aware of Joe's recent scap change to support cluster_dir in DeploymentsConfig for this use case, but wasn't up to speed on the mechanics (e.g., updating a suspended job - in fact, I didn't know you could mutate the template of a suspended job).

Taking a quick look at your WIP patch (https://gerrit.wikimedia.org/r/1130683), you're going to want to switch mw_flavour to publish-81 (assuming you want PHP 8.1), add mw_kind set to cli-image (at least as I'm envisioning this at the moment), and presumably you'll need to configure cluster_dir in some way.

I'll try to get my proposed scap change posted later today or first thing tomorrow.

swfrench opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/702

Draft: kubernetes: support image "kind" and deprecate the debug boolean

Change #1131058 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Profile::Mediawiki_deployment: add mw_kind fields

https://gerrit.wikimedia.org/r/1131058

Change #1131059 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] hieradata: replace debug with mw_kind and stage in mw_releases

https://gerrit.wikimedia.org/r/1131059

Change #1131060 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Profile::Mediawiki_deployment: remove deprecated debug field

https://gerrit.wikimedia.org/r/1131060

The patch series starting at https://gerrit.wikimedia.org/r/1131058 is mainly for illustrative purposes at the moment, and will need to evolve if the proposal in https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/702 does. However, that should be sufficient to get our production configuration up to speed with the changes supported there.

A separate patch would be needed for train-dev before we can remove support for the debug boolean.

Here, an image "kind" represents the specific type of an image (e.g., the debug image), while a "flavour" is a particular configuration-variant / instantiation of that type (e.g., a specific configuration of build-args).

I am having a bit of trouble understanding exactly what these two "dimensions" actually are mapping to. Naming is hard as we all know. Ideally we would end up with names that make the correct model appear in people's heads. "Type" and "variant" or "kind" and "flavor" [there was a typo there ;)] or are there more concrete things like "base image" and "config"?

I am having a bit of trouble understanding exactly what these two "dimensions" actually are mapping to. Naming is hard as we all know. Ideally we would end up with names that make the correct model appear in people's heads. "Type" and "variant" or "kind" and "flavor" [there was a typo there ;)] or are there more concrete things like "base image" and "config"?

Entirely fair question :)

So, a possibly adequate analogy for the relationship between image "kind" (a new name for an existing concept that did not previously have a name) and image "flavour" (an existing name for an existing concept) would be that between a class definition and the specific set of constructor arguments that produce a concrete instantiation. So, the debug-image kind instantiated with the parameters defined by the publish-81 flavour is a debug image was configured to use PHP 8.1 at build time.

The abstractness of "kind" (or, alternatively, we were talking about "type" at one point) is intentional, because the implementation of these concepts - i.e., what build steps implement a kind, what parameters define a flavour - are encapsulated in make-container-image et al.

If we were to converge these details into a single tool, for example, then there's a lot of opportunity to make these more concrete. For now, though, we need something "shaped like" the existing boundary between these tools.

So, a possibly adequate analogy for the relationship between image "kind" (a new name for an existing concept that did not previously have a name) and image "flavour" (an existing name for an existing concept) would be that between a class definition and the specific set of constructor arguments that produce a concrete instantiation.

That's a really helpful analogy!

The terms I see most often in image build tool chains are build "arguments" and build "target". The latter is short for "target stage" when discussing Dockerfile inputs, and in Blubber/Kokkuri it's "target variant", but both are considered the "target" when actually building.

In the current make-container-image implementation, each named flavor is indeed just a predefined set of build arguments. From my understanding of what you want to support here, kind will likely map directly to a build target in make-container-image and the underlying Dockerfiles.

My 2 cents would be to canonicalize towards the terms "build target" and "build arguments" (or "build arguments set"), but I think so long as we establish and document the relationship between flavor/build-arg-set and kind/build-target, those terms are also totally workable.

Thanks, @dduvall!

The terms I see most often in image build tool chains are build "arguments" and build "target". The latter is short for "target stage" when discussing Dockerfile inputs, and in Blubber/Kokkuri it's "target variant", but both are considered the "target" when actually building.

Ah, I like that - "target" does a good job of grounding what we mean here by "kind" in more general build tool chain terminology.

In the current make-container-image implementation, each named flavor is indeed just a predefined set of build arguments. From my understanding of what you want to support here, kind will likely map directly to a build target in make-container-image and the underlying Dockerfiles.

Exactly, yeah: make-container-image now produces mediawiki multiversion images for three different kinds / targets: the default production image ("image") and then the two special-case images that derive from it ("debug-image" and "cli-image" - the latter recently added in T389484). Our goal here is to allow the DeploymentsConfig to refer to these kinds by name, and have scap look up the appropriate image for that kind from the build report (rather than only supporting the "image" vs. "debug-image" distinction).

My 2 cents would be to canonicalize towards the terms "build target" and "build arguments" (or "build arguments set"), but I think so long as we establish and document the relationship between flavor/build-arg-set and kind/build-target, those terms are also totally workable.

I think that sounds like a solid direction in which to evolve this. In the very near term, though, I might slightly prefer to go the "introduce kind, keep flavour, document their parallels to target and build args, respectively" approach if you're amenable to it.

Specifically, I'd like to think a little bit first about how the "API" between scap and make-container-image might evolve to reflect this before starting to evolve the nomenclature. Luckily, adopting "kind" for now should not complicate that (e.g., in contrast to "flavour" changing, introducing "kind" has no implications for the content of the build report as it exists now).

dancy merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/702

kubernetes: support image "kind" and deprecate the debug boolean

Alright, once a scap release is deployed that picks up https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/702, I'll move forward with the pending puppet patches to update mediawiki-deployments.yaml.

I'll also need to:

  1. Update the docs in https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#Add_a_mw-on-k8s_deployment
  2. Update the train-dev mediawiki-deployments.yaml to use the newer format as well
  3. Remove support for debug in scap

Change #1131058 merged by Scott French:

[operations/puppet@production] Profile::Mediawiki_deployment: add mw_kind fields

https://gerrit.wikimedia.org/r/1131058

Change #1131059 merged by Scott French:

[operations/puppet@production] hieradata: adopt mw_kind in mw_releases

https://gerrit.wikimedia.org/r/1131059

Mentioned in SAL (#wikimedia-operations) [2025-04-02T16:10:34Z] <swfrench-wmf> run-puppet-agent on deploy1003 to pick up mediawiki-deployments.yaml changes - T389499

Change #1133466 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Profile::Mediawiki_deployment: make debug Optional

https://gerrit.wikimedia.org/r/1133466

Change #1133466 merged by Scott French:

[operations/puppet@production] Profile::Mediawiki_deployment: make debug Optional

https://gerrit.wikimedia.org/r/1133466

Mentioned in SAL (#wikimedia-operations) [2025-04-02T16:23:00Z] <swfrench@deploy1003> Started scap sync-world: Deployment to pick up change in mediawiki-deployments.yaml - T389499

Mentioned in SAL (#wikimedia-operations) [2025-04-02T16:24:10Z] <swfrench@deploy1003> swfrench: Deployment to pick up change in mediawiki-deployments.yaml - T389499 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-02T16:26:01Z] <swfrench@deploy1003> Finished scap sync-world: Deployment to pick up change in mediawiki-deployments.yaml - T389499 (duration: 03m 21s)

Change #1131060 merged by Scott French:

[operations/puppet@production] Profile::Mediawiki_deployment: remove deprecated debug field

https://gerrit.wikimedia.org/r/1131060

Thanks, @BTullis - I was aware of Joe's recent scap change to support cluster_dir in DeploymentsConfig for this use case, but wasn't up to speed on the mechanics (e.g., updating a suspended job - in fact, I didn't know you could mutate the template of a suspended job).

Taking a quick look at your WIP patch (https://gerrit.wikimedia.org/r/1130683), you're going to want to switch mw_flavour to publish-81 (assuming you want PHP 8.1), add mw_kind set to cli-image (at least as I'm envisioning this at the moment), and presumably you'll need to configure cluster_dir in some way.

I'll try to get my proposed scap change posted later today or first thing tomorrow.

Thanks all for your efforts on this. I have modified my patch as per your guidance and removed the debug key.

However, I'm still not sure how to proceed with regard to getting support for the cluster_dir parameter, so that we can deploy to the dse-k8s cluster.
Should I be looking at adding this to modules/profile/types/mediawiki_deployment.pp myself, or would it better for your team to do that?

@Scott_French - since you mentioned that you are a little unclear on the mechanics of how we use this suspended job, I can fill you in a little.
We don't have to mutate the job template after deployment. What we do the following:

image.png (1×1 px, 343 KB)

Then we will sync each of the dumps files to the distribution servers clouddumps100[1-2], using additional chained Airflow tasks.
There is still a way to go before it is production-ready, but when this is finished it should allow us to decom (and reuse) snapshot101[0-7] and dumpsdata100[3-7].

Thanks for the additional details on the mechanics, @BTullis!

[ ... ]
Thanks all for your efforts on this. I have modified my patch as per your guidance and removed the debug key.

I just left a couple of additional comments on your patch. Happy to chat if you have any questions.

However, I'm still not sure how to proceed with regard to getting support for the cluster_dir parameter, so that we can deploy to the dse-k8s cluster.
Should I be looking at adding this to modules/profile/types/mediawiki_deployment.pp myself, or would it better for your team to do that?

I just sent a patch for that this morning: https://gerrit.wikimedia.org/r/1135464.

Once that's merged and you integrate the comments on your patch, you should be able to get a successful PCC run.

Scott_French triaged this task as Medium priority.

Since no further work / cleanup remains here, I'm going to resolve this.

@BTullis - We can follow up on the remaining configuration to make use of this on your https://gerrit.wikimedia.org/r/1130683 and / or T389786: Integrate mediawiki-dumps-legacy with the regular MW scap deployments.

Change #1130683 merged by Brouberol:

[operations/puppet@production] Configure a scap deployment of mediwiki-dumps-legacy

https://gerrit.wikimedia.org/r/1130683