
scap needs to be k8s-cluster aware
Closed, Resolved, Public

Description

We will soon need to deploy to multiple Kubernetes clusters from scap (as part of making dumps work there). This means that we need to change the structure of the DeploymentsConfig class, adding a specific kube_cluster_prefix to the path under which helmfile is run.
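
As a rough illustration of the idea (not scap's actual code), the sketch below shows how a cluster prefix could be folded into the directory under which helmfile is run. Only the kube_cluster_prefix name comes from the description above; the function name helmfile_dir, the base path, the "services" directory naming, and the example namespaces are assumptions made purely for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical base directory; the real layout is not stated in this task.
HELMFILE_BASE = PurePosixPath("/srv/deployment-charts/helmfile.d")


def helmfile_dir(namespace: str, kube_cluster_prefix: Optional[str] = None) -> PurePosixPath:
    """Return the directory under which helmfile would run for a namespace.

    Assumed layout (illustrative only):
      no prefix  -> <base>/services/<namespace>
      prefix "x" -> <base>/x-services/<namespace>
    """
    if not namespace:
        raise ValueError("namespace must be a non-empty string")
    subdir = f"{kube_cluster_prefix}-services" if kube_cluster_prefix else "services"
    return HELMFILE_BASE / subdir / namespace


if __name__ == "__main__":
    # Default cluster group: no prefix is applied.
    print(helmfile_dir("mw-web"))
    # A deployment living in a different cluster group.
    print(helmfile_dir("mediawiki-dumps-legacy", kube_cluster_prefix="dse-k8s"))
```

Keeping the prefix optional means existing deployments that do not set it would resolve to the same path as before, which is the property a change like this would presumably need to preserve.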

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Allow deployments to specify custom k8s clusters | repos/releng/scap!795 | swfrench | work/swfrench/T388761-deployment-clusters | master
Allow multiple kubernetes clusters to be used | repos/releng/scap!681 | oblivian | multiple_kube_clusters | master

Event Timeline

It looks like the Pretrain project will need this too, so that it can target the staging cluster with scap-managed deployments.

Mentioned in SAL (#wikimedia-operations) [2025-03-20T19:13:21Z] <dancy@deploy2002> Started scap sync-world: T388761

Mentioned in SAL (#wikimedia-operations) [2025-03-20T19:24:37Z] <dancy@deploy2002> Finished scap sync-world: T388761 (duration: 11m 15s)

Change #1135464 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Profile::Mediawiki_deployment: add 'dir' field

https://gerrit.wikimedia.org/r/1135464

Change #1135464 merged by Scott French:

[operations/puppet@production] Profile::Mediawiki_deployment: add 'dir' field

https://gerrit.wikimedia.org/r/1135464

Connecting some dots here that I forgot to add yesterday:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130683 as it currently exists, with deploy: false for the one extant release, should be fine to merge as is.

However, before deploy: false is removed - i.e., so that scap actually attempts to apply changes to mediawiki-dumps-legacy - we need to sort out customization of helmfile environments.

Specifically, as of now, scap will attempt to apply changes for the common set of environments defined in the k8s_clusters config key, which currently contains codfw and eqiad (from here). If I understand correctly, dse-k8s-eqiad is really the sole environment we want scap to operate on here.

I can take a closer look at this when I return the week of the 28th.

@brouberol FYI, since I know you've picked this up.

@Scott_French I'm not familiar with the scap codebase. Is this as simple as adding dse-k8s-eqiad under scap::k8s_deployments::clusters, or do we need to patch scap itself?

@brouberol - This would require changes to scap, specifically the ability to override the set of environments relevant to a particular deployment (rather than using the "defaults" provided by the k8s_clusters config key).
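
A minimal sketch of the override behaviour described above, assuming a deployment is represented as a plain dict and that the override lives under a "clusters" key. This is not scap's implementation; the function name clusters_for, the dict shape, and the DEFAULT_K8S_CLUSTERS stand-in for the k8s_clusters config key are hypothetical.

```python
from typing import Any, Dict, List

# Stand-in for the defaults provided by the k8s_clusters config key.
DEFAULT_K8S_CLUSTERS: List[str] = ["eqiad", "codfw"]


def clusters_for(deployment: Dict[str, Any], defaults: List[str] = DEFAULT_K8S_CLUSTERS) -> List[str]:
    """Return the clusters a deployment targets.

    If the deployment carries its own (hypothetical) "clusters" list, use it;
    otherwise fall back to the shared defaults.
    """
    override = deployment.get("clusters")
    if override is None:
        return list(defaults)
    if not isinstance(override, list) or not override:
        raise ValueError("'clusters' must be a non-empty list of cluster names")
    if not all(isinstance(name, str) for name in override):
        raise ValueError("'clusters' entries must be strings")
    return list(override)


if __name__ == "__main__":
    # No override: the common defaults apply.
    print(clusters_for({"namespace": "mw-web"}))
    # Override: only the single cluster this deployment should touch.
    print(clusters_for({"namespace": "mediawiki-dumps-legacy", "clusters": ["dse-k8s-eqiad"]}))
```

Treating an empty list as an error rather than as "deploy nowhere" is a design choice of this sketch; the deploy: false flag mentioned earlier in the thread already covers that case.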

This would not be terribly complicated outright, but it does run into some interesting naming consistency questions (e.g., "cluster" vs. "datacenter" in scap's code for managing k8s deployments).

In any case, I can take a quick look at this later today to see what the changes would entail. While I'm not quite sure what Joe had in mind for this aspect, this seems sufficiently straightforward that the approach described above likely does not diverge from that.

Following up, I did get a chance to sketch this out, and indeed (1) it's not all that complicated in practice but (2) it does run head-first into the naming consistency question I mentioned.

IMO, I think it makes sense to standardize on either "clusters" (i.e., in the k8s sense) or "environments" (i.e., in the helmfile sense) as a more-correct terminology, in both configuration and code, as it's more accurate than "datacenter" - i.e., k8s clusters / helmfile environments (which by convention we associate 1:1 for selecting cluster-specific configuration) do not map 1:1 with datacenters (e.g., eqiad the DC contains both the eqiad wikikube cluster and the dse-k8s-eqiad cluster).
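
To make the naming point concrete, here is a small sketch of why "datacenter" is not an accurate key. The mapping below uses only the cluster names mentioned in this thread; the variable and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Cluster (helmfile environment) name -> datacenter that hosts it.
CLUSTER_TO_DATACENTER: Dict[str, str] = {
    "eqiad": "eqiad",          # wikikube cluster in the eqiad DC
    "codfw": "codfw",          # wikikube cluster in the codfw DC
    "dse-k8s-eqiad": "eqiad",  # dse-k8s cluster, also in the eqiad DC
}


def clusters_by_datacenter(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """Group cluster names by the datacenter that hosts them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for cluster, datacenter in mapping.items():
        grouped[datacenter].append(cluster)
    return dict(grouped)


if __name__ == "__main__":
    for datacenter, clusters in clusters_by_datacenter(CLUSTER_TO_DATACENTER).items():
        print(datacenter, clusters)
```

Because one datacenter can host more than one cluster, a datacenter name alone cannot identify a deployment target, whereas a cluster name can.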

In any case, I see that Joe is going to be back next week, so at this point I might wait until then to confirm with him that he did not have a different solution in mind, before posting my MR for discussion.

IMO, I think it makes sense to standardize on either "clusters" (i.e., in the k8s sense) or "environments" (i.e., in the helmfile sense) as a more-correct terminology, in both configuration and code, as it's more accurate than "datacenter" - i.e., k8s clusters / helmfile environments (which by convention we associate 1:1 for selecting cluster-specific configuration) do not map 1:1 with datacenters (e.g., eqiad the DC contains both the eqiad wikikube cluster and the dse-k8s-eqiad cluster).

My preference would be to use "environment", as per helmfile, but feel free to change my mind. :-)

Finally had a chance to polish my MR a bit and post it today (draft). I'll take one more look on Monday before sending it for review for real-real, but folks are welcome to take a look before that if so inclined.

[ ... ]
My preference would be to use "environment", as per helmfile, but feel free to change my mind. :-)

I was very much on the fence between these two options. In short, what tipped me in the "cluster" direction is that, ultimately, our use of helmfile environments as a mechanism to select k8s-cluster-specific configuration values is an implementation detail, and really what we're referring to here are k8s cluster names. The description on [0] expands on this a bit.

[0] https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/795

Change #1148480 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/puppet@production] Profile::Mediawiki_deployment: add 'clusters' field

https://gerrit.wikimedia.org/r/1148480

Change #1148480 merged by Scott French:

[operations/puppet@production] Profile::Mediawiki_deployment: add 'clusters' field

https://gerrit.wikimedia.org/r/1148480

Mentioned in SAL (#wikimedia-operations) [2025-05-27T15:52:48Z] <swfrench@deploy1003> Started scap sync-world: Noop deployment to test scap 4.170.0 - T388761

Mentioned in SAL (#wikimedia-operations) [2025-05-27T15:56:21Z] <swfrench@deploy1003> Finished scap sync-world: Noop deployment to test scap 4.170.0 - T388761 (duration: 04m 03s)

Mentioned in SAL (#wikimedia-operations) [2025-06-03T17:19:46Z] <swfrench@deploy1003> Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - T388761 T389786

Although changes to mediawiki-dumps-legacy will be needed before this feature can actually be put to use there (details in T389786#10881115), we were still able to "successfully" test this functionality today, and indeed it appears to work as expected.

We can probably resolve this, unless we also want to use this to track refactoring of how k8s-cluster "groups" (e.g., dse-k8s) are associated with their helmfile.d subdir and cluster list in config (discussion on https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/795).

BTullis claimed this task.
BTullis moved this task from Backlog to Done on the Dumps-Generation board.

I think that we can call this done.

We're now automatically getting an updated mediawiki pod spec, containing the latest image spec, whenever scap deploys mediawiki.
This image and pod spec are then used for all subsequent dumps, without any further interaction on our part.


Many thanks for all of your help with this, @Scott_French and @Joe in particular.