
helmfile/scap does not reliably bootstrap mediawiki
Open, Medium, Public

Description

We could not deploy mediawiki to an empty cluster using scap sync-world --k8s-only -Dbuild_mw_container_image:False because ConfigMaps were missing (possibly due to the order in which objects are created). It worked when using helmfile sync rather than helmfile apply (which is what scap does).

Maybe we can make scap use helmfile sync if no existing mw releases can be found in the target cluster.


Update 2026-03-05 - See T397685#11678013 for a short summary of the current state.

Event Timeline

I was chatting with @dancy earlier today about what might have caused this, and it's kind of a puzzling one.

Super-naively, helmfile apply should(?) be doing the same thing as a helmfile sync if there is no prior state (and within a given release, helm should order object creations by type in a sensible way).

The one thing that came to mind as "definitely not supported" is sequencing of mediawiki helmfile releases vs. "support" releases in the same namespace (e.g., mediawiki-common, prometheus), since scap only updates the former (via label selectors).

However, I can't think of any examples off the top of my head where we would have a configmap dependency across releases, if the issue here was specifically missing configmaps (there are some interesting cases like network policies in the mediawiki-common case, though, where relevant).

A couple of hopefully quick follow-up questions:

  1. Working backward from the SAL, I suspect it was this sync-world that failed. Is that correct?
  2. If so, does anyone have more detail on what was missing? The scap log only shows that the mw-misc/main update timed out [0], with nothing else particularly insightful as far as I can tell. I do see that mw-debug succeeded, but that's suspiciously concurrent with this manual sync.
  3. The task description specifically mentions sync vs. apply. Did anyone happen to try a manual helmfile apply and find that it didn't work, before switching to sync?

[0] https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-7-2025.26?id=G170nJcBIhiCpMJyfTb1

This logstash search shows missing config files errors from php-fpm initialization.
The logs from the manual sync actually show that it's only creating the statsd-exporter release. My theory is that the dependent release is not being ordered properly, so the env vars required for the php-fpm configuration can't be filled with the statsd-related values, leading to the fpm init errors.

@Clement_Goubert - Ah, thanks for the additional details!

Yes, if a missing STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST results in a borked envvars.inc, and that blocks php-fpm startup, then that will do it. Scap does not know or care about non-mediawiki helmfile releases, so unless / until someone brings up the prometheus release out of band (which I think in the case of mw-debug, you did around 13:18), php-fpm startup will fail.
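For context, Kubernetes injects <SERVICE>_SERVICE_HOST variables only for Services that already exist when a pod's containers start, which is why the ordering matters here. A minimal sketch of the naming rule and a presence check follows. The Service name statsd-exporter-prometheus is an assumption inferred from the variable name above, and both helper functions are hypothetical.

```python
import os


def service_host_var(service_name: str) -> str:
    """Return the env var name Kubernetes derives for a Service's cluster IP."""
    # Kubernetes upper-cases the Service name and turns dashes into underscores.
    return service_name.upper().replace("-", "_") + "_SERVICE_HOST"


def missing_service_vars(service_names, environ=None):
    """List the expected *_SERVICE_HOST variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [v for v in map(service_host_var, service_names) if not env.get(v)]


# Hypothetical Service name, inferred from the variable discussed above.
print(service_host_var("statsd-exporter-prometheus"))
```

Running something like missing_service_vars inside a pod would show whether the pod started before the support release's Service existed.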

In that case, the question becomes whether it makes sense for scap to support helmfile operations on non-mediawiki releases (and what that might look like, in terms of prescribing sequencing).

Scott_French triaged this task as Medium priority.

Revisiting this, I believe we understand what happened. What remains to be decided is what we plan to do about it, if anything.

Updating (or creating) non-MediaWiki "support" releases like statsd-exporter doesn't feel like it should be scap's responsibility, particularly given their shared nature between MediaWiki releases in the same namespace (i.e., possibly mapping to different deployment stages).

My slight preference would be that we document the bootstrapping problem in the k8s upgrade procedure, with guidance to helmfile apply the support releases first if using scap to bootstrap MediaWiki is the desired approach.

@Clement_Goubert - What are your thoughts? If that sounds reasonable to you, I can do that if you can point me to the docs (assigning to myself for the time being).

Revisiting this, I believe we understand what happened. What remains to be decided is what we plan to do about it, if anything.

Updating (or creating) non-MediaWiki "support" releases like statsd-exporter doesn't feel like it should be scap's responsibility, particularly given their shared nature between MediaWiki releases in the same namespace (i.e., possibly mapping to different deployment stages).

Agreed, but maybe scap could check the support releases are available and error out with a helpful message if it's not the case?

My slight preference would be that we document the bootstrapping problem in the k8s upgrade procedure, with guidance to helmfile apply the support releases first if using scap to bootstrap MediaWiki is the desired approach.

@Clement_Goubert - What are your thoughts? If that sounds reasonable to you, I can do that if you can point me to the docs (assigning to myself for the time being).

I agree. There's no wikitech doc as far as I can tell, and there should probably be one, but the process is detailed in T405703: Update wikikube eqiad to kubernetes 1.31. We first need to copy this procedure to Wikitech. For the support releases we can either:

  1. Add a loop for helmfile -e $datacenter -l name=$support-releases sync before the scap deployment
  2. Add a facility to charlie to deploy them (trying to cut out the guesswork of what the support releases and mw-on-k8s deployments are)

I think 2 is a better long term investment.
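Option 1 could look roughly like the sketch below, which builds one helmfile sync invocation per support release and only prints them unless asked to execute. The function names, the eqiad environment value, and the release list are illustrative assumptions. It also assumes the commands run from the helmfile directory of the relevant namespace.

```python
import subprocess


def support_sync_commands(datacenter, support_releases):
    """Build one `helmfile ... sync` argv per support release, in order."""
    return [
        ["helmfile", "-e", datacenter, "-l", f"name={release}", "sync"]
        for release in support_releases
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the loop at the first failing release.
            subprocess.run(argv, check=True)


# Hypothetical values: adjust the environment and release list per namespace.
run_all(support_sync_commands("eqiad", ["statsd-exporter"]))
```

The weakness of this approach is the hard-coded release list, which is the guesswork that option 2 tries to remove.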

Alright, I've updated the task description on T405703: Update wikikube eqiad to kubernetes 1.31 to reflect two points:

  • During the "Deploy mediawiki" phase, the sequencing constraints we've discussed here, together with example commands for bringing up the support releases.
  • During the "Deploy all the services" phase, charlie in its current form will operate on all mediawiki services as well, which is probably not what we want in practice if we want to do that via scap (or, if we do want to operate on them in this phase, that should be possible if we move the support-release bring-up earlier to ensure it happens first).
  2. Add a facility to charlie to deploy them (trying to cut out the guesswork of what the support releases and mw-on-k8s deployments are)

I think 2 is a better long term investment.

I think doing this in some form makes sense, and indeed is probably a better investment than, e.g., just providing functionality in charlie to optionally exclude mediawiki services (so that they can be sequenced manually / externally).

One possibility is giving charlie the ability to sequence across releases within a namespace, rather than applying all in parallel. This isn't something we would want to use everywhere, but in these cases it would be necessary for reliable bootstrapping, and more generally, anything with a canary release might benefit from sequencing that first.
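To make the sequencing idea concrete, here is a minimal sketch that derives a deployment order from declared dependencies using a topological sort. It is not how charlie works today. The dependency map and release names are hypothetical.

```python
from graphlib import TopologicalSorter


def release_order(needs):
    """Return releases ordered so that each one follows everything it needs.

    `needs` maps a release name to the set of releases that must exist first.
    """
    return list(TopologicalSorter(needs).static_order())


# Hypothetical dependency map for one namespace (names are illustrative only).
needs = {
    "statsd-exporter": set(),
    "canary": {"statsd-exporter"},
    "main": {"canary"},
}
print(release_order(needs))
```

Releases with no dependency between them could still be applied in parallel. Only the declared edges force an order.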

Thoughts?

Aside: It's unlikely I'm going to have a chance to prioritize working on the above or pulling the updated procedure from T405703 into Wikitech any time soon. However, I'll hold onto this task for now until we converge on a plan-of-record.

Thank you all for untangling and documenting this!
I would like to suggest uncoupling this from the k8s upgrade procedure. It surfaced there, but it is actually a mediawiki bootstrapping problem that might bite us in disaster recovery or similar scenarios as well. I'm not totally sure about this, but if scap was capable of bootstrapping mediawiki in the past, shouldn't it still be able to do so? The comparison is probably bad since we were running support releases outside of scap's reach in the past (like statsd-exporter, for example), but it also feels off to have to maintain knowledge about what to do when (like the list of mw namespaces and support releases) in multiple places (scap and wikitech/charlie/...).

If I understand this correctly, the main problem is that scap operates on specific releases, which means helmfile dependencies can't be evaluated. Would it be an option to extend scap with a flag that changes its behavior from deploying particular releases to running helmfile without a release selector? The dependencies should really be built into the helmfile releases, so IMHO that would not require additional logic in scap.

Thanks, @JMeybohm!

So, I'd say the main problem is really that we've introduced tight coupling between releases, which makes bootstrapping challenging since it forces sequencing. Investing in loosening that coupling (best possible solution), or ensuring that the appropriate tooling understands those constraints, seems like the right path here.

From a DR perspective, if we're taking the tooling approach, I feel like we should mainly focus on charlie, which is more suited to broadly reconstituting the resources in a cluster from scratch and potentially becoming aware of details like sequencing (if indeed that's needed at all, assuming we can get the resource dependencies right, as you point out!).

One tricky aspect in delegating this to scap is that, at least in our current world, scap must be release-aware: that's how we implement canaries (which exist in the same namespace as the main release, and contribute endpoints to the main service). Which is to say, we can't solve this simply by removing the release selectors, and will need something a bit more involved.

On balance, we could explore adding preflight dependency checks to scap (i.e., "the necessary resources exist for me to deploy mediawiki here") along the lines of what @Clement_Goubert described. That's a much narrower problem to solve than making scap responsible for managing those releases.
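A preflight check of that kind might be as small as the sketch below, which inspects the JSON output of helm list -n <namespace> -o json and reports required support releases that are absent or not deployed. This is not scap code. The function name, the sample output, and the required-release list are assumptions for illustration.

```python
import json


def missing_support_releases(helm_list_json, required):
    """Return required releases that are absent or not in 'deployed' status.

    `helm_list_json` is the text produced by `helm list -n <namespace> -o json`.
    """
    deployed = {
        r["name"] for r in json.loads(helm_list_json) if r.get("status") == "deployed"
    }
    return sorted(set(required) - deployed)


# Hypothetical sample of helm output: only the main release exists so far.
sample = '[{"name": "main", "namespace": "mw-debug", "status": "deployed"}]'
missing = missing_support_releases(sample, ["statsd-exporter"])
if missing:
    print(f"refusing to deploy: missing support releases: {', '.join(missing)}")
```

Failing early with a message like this would have pointed directly at the missing support release rather than surfacing as a rollout timeout.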

Scott_French moved this task from In Progress to Backlog on the ServiceOps new board.

Following up here, it doesn't seem that we've fully landed on a plan of record.

Quick summary:

  • We believe we understand what happened - we introduced inter-release dependencies in a way that forces sequencing during bootstrapping. (T397685#10950885)
  • The ordering issue has been reflected in the description of T405703, which is currently the authoritative description of the bootstrapping procedure. (T397685#11577093)
  • [opinion] We believe that enabling charlie to operate on these support releases (while leaving MediaWiki releases untouched) during bootstrapping is a solid path forward, possibly together with preflight dependency checks in scap (T397685#11592552).

With that, since I'm unlikely to have time to work on this in the foreseeable future, I'm going to unassign (this will remain in backlog). We should consider picking this up for next quarter.