
Proposal: scap deploy-service
Open, Medium, Public, Feature Request

Description

During the WE5-WE6 offsite, we (mysql, @dancy, @brennen, and @jeena) discussed the idea of extending scap's functionality to support deploying Kubernetes services.

What?

We explored how scap could be extended to enable the deployment of Kubernetes services, allowing for a unified deployment experience across our infrastructure.

Why?

This work would improve developer confidence in deployments by providing a consistent, reliable process. More importantly, it would allow every deployer and SRE to deploy any service without needing to know all the service specifics beforehand, reducing friction and enabling faster, more confident deployments across the org.

Current Status

We only deploy MediaWiki via scap/spiderpig. One of scap’s key features is error-rate monitoring: metrics are checked during deployment to ensure error rates stay within safe thresholds, and the deploy can be automatically paused or rolled back if errors spike. This significantly increases deployer confidence, since the vast majority of production errors are caught before the change is rolled out everywhere.

When deploying other services, deployers have to run helmfile manually. Unless pods fail to start, sanity checks are performed manually by deployers, either through smoke tests and/or by checking bespoke dashboards. It gets even trickier when SREs need to redeploy all services in a Kubernetes cluster due to, for example, image upgrades such as Envoy.

How?

The integration of scap with MW-on-K8s has established many of the foundational components required for this initiative. Existing capabilities such as running helmfile, updating image versions, rolling out to canaries, and monitoring error rates are already in place.

Scap-specific work

TBA: more detailed input from Release-Engineering-Team

What would developer teams need to provide?

To onboard a service to scap deploy-service, development teams must provide the following:

  • Logstash expressions that clearly surface error rates
  • Smoke tests
  • Grafana dashboards that clearly demonstrate service health
  • Links to related dashboards
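
To make the list above concrete, the per-service metadata could take a shape like the sketch below. This is purely illustrative: every field name, value, and link here is invented, and the actual format scap would consume is still to be defined.

```python
# Hypothetical sketch of the onboarding metadata a team might provide.
# All field names and values are invented for illustration; the real
# format scap deploy-service would consume is TBD.
service_onboarding = {
    "service": "citoid",                                    # example service
    "logstash_error_query": 'app:"citoid" AND level:ERROR', # error-rate expression
    "smoke_tests": ["basic_request", "health_endpoint"],    # named smoke tests
    "health_dashboard": "grafana.wikimedia.org/d/citoid",   # invented link
    "related_dashboards": [],                               # links to related dashboards
}

# A deploy tool would presumably validate that required fields are present:
required = {"service", "logstash_error_query", "smoke_tests", "health_dashboard"}
missing = required - service_onboarding.keys()
```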

What will the deployer's UX be?

We are aiming for a streamlined deployment process (high level):

  • Run scap deploy-service <service-name> (from any directory)
  • Automatically roll out to canaries, if available, otherwise to a single datacenter
  • Automatically monitor and validate error rates
  • Execute helmfile post-upgrade hooks (various options here, TBD)
  • Complete the rollout upon successful validation
  • Print additional information, e.g. a link to a Grafana dashboard

Note: Given that MediaWiki interacts with most WikiKube services, we could consider adding an optional or mandatory check of MediaWiki's error rate too.
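
The rollout steps above can be sketched as a simple control loop. Everything in this sketch (function names, stage names, the threshold) is illustrative and is not actual scap code:

```python
# Illustrative control loop for the proposed deploy flow: roll out stage by
# stage (canaries first), check the error rate after each stage, and stop
# early if errors spike. Function names and stages are invented.

def service_deploy(service, stages, deploy, error_rate, threshold=0.01):
    """Deploy `service` to each stage in order, validating after each one."""
    deployed = []
    for stage in stages:
        deploy(service, stage)            # e.g. run helmfile for this stage
        deployed.append(stage)
        rate = error_rate(service, stage) # e.g. query Logstash/Prometheus
        if rate > threshold:
            # Abort the rollout; a real tool might also roll back here.
            return {"ok": False, "failed_at": stage, "deployed": deployed}
    return {"ok": True, "deployed": deployed}

# Simulated healthy rollout: canary is fine, so the deploy completes.
log = []
rates = {"canary": 0.001, "codfw": 0.002, "eqiad": 0.002}
result = service_deploy(
    "citoid", ["canary", "codfw", "eqiad"],
    deploy=lambda svc, st: log.append((svc, st)),
    error_rate=lambda svc, st: rates[st],
)

# Simulated error spike at the canary: the rollout stops immediately.
failed = service_deploy(
    "citoid", ["canary", "codfw", "eqiad"],
    deploy=lambda svc, st: None,
    error_rate=lambda svc, st: 0.5,
)
```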

🕷️ Weaving the Pig 🐷

The longer-term goal would be to enable one-push deployments via Spiderpig, allowing teams to trigger deployments with minimal manual intervention. This would substantially lower the bar to deployment, enabling even less experienced team members to confidently ship services to production.

Event Timeline

jijiki renamed this task from Proposal: `scap` deploy-service to Proposal: scap deploy-service. Dec 17 2025, 1:08 PM
JMeybohm edited projects, added ServiceOps new, Epic; removed serviceops.
JMeybohm moved this task from Inbox to Needs Info / Blocked on the ServiceOps new board.

Release-Engineering-Team could you please provide input on the Scap-specific work in the description?

Release-Engineering-Team could you please provide input on the Scap-specific work in the description?

I gave the description a fresh read today. Everything looks good.

It would be helpful if someone selected a representative service to use while developing scap deploy-service. Ideally something I can get running in train-dev.

Any of the Shellboxen will probably do, or citoid; no strong opinions here.

Data-Platform-SRE would be very interested to help out as well! We have 11 (soon 12) Airflow instances that we have to manually deploy one by one every time we make a change to charts/airflow, which is getting old.

Balthazar already mentioned our (as in Data-Platform-SRE) interest, but it would also be just as handy for OpenSearch on K8s as it would be for Airflow, for exactly the same reasons. We are happy to help experiment when the time is right.

MLechvien-WMF raised the priority of this task from Low to Medium. Tue, Feb 10, 8:55 AM

Thanks @brouberol and @bking .

To help make the case to prioritize this work, would you be able to estimate the toil of your current release process: total SRE time and savings we could hope to achieve with this automation?

I can think of a couple more services that might be good candidates for this:

  • eventgate currently has 4 services that are deployed from one chart.
  • Quite a few AQS services use the aqs-http-gateway
btullis@barracuda:~/wmf/deployment-charts/helmfile.d/services$ grep -R wmf-stable/aqs-http-gateway -l
media-analytics/helmfile.yaml
commons-impact-analytics/helmfile.yaml
geo-analytics/helmfile.yaml
page-analytics/helmfile.yaml
edit-analytics/helmfile.yaml
device-analytics/helmfile.yaml
editor-analytics/helmfile.yaml
image-suggestion/helmfile.yaml
data-gateway/helmfile.yaml

toil of your current release process: total SRE time and savings we could hope to achieve with this automation?

We probably spend about 20-30 minutes of manual deploy work every time we release a new airflow chart version (of which there have been 163 to date, though some of these versions were released at the same time). It's annoying, and more mental state to maintain, but not life-threatening.

MLechvien-WMF changed the subtype of this task from "Task" to "Feature Request". Fri, Feb 13, 9:10 AM

sanity checks are performed by deployers manually either through smoke tests and/or checking bespoke dashboards.

As someone who was onboarded to service deployment recently in the context of T399291: Epic: API Rate Limiting Architecture, I want to highlight the importance of this aspect. Manually testing a service on the staging cluster, or after full deployment on the live cluster, is time-consuming and error-prone. It would be extremely helpful to have a framework in place for automating this.

For my own purposes, I ended up writing a small framework for testing the gateway on the staging cluster during deployment. That reduces the time needed to verify that the deployment works as expected from at least 30 minutes of error-prone manual testing to about 2 minutes of running make check. I particularly like that the automated tests reliably protect against regressions.
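
To make that concrete, a minimal runner in the same spirit could look like the sketch below. All names and checks here are invented examples, not the actual framework; a real check would make HTTP requests against the staging cluster rather than inspect a canned response.

```python
# Minimal sketch of a post-deployment smoke-test runner. Names and checks
# are invented; a real check would hit the service over HTTP.

def run_smoke_tests(checks):
    """Run every named check and collect failures instead of stopping early."""
    failures = []
    for name, check in checks:
        try:
            check()
        except AssertionError as exc:
            failures.append((name, str(exc)))
    return failures

# Fake service response standing in for a real HTTP call.
response = {"status": 200, "headers": {"x-ratelimit-limit": "100"}}

def check_status():
    assert response["status"] == 200, "unexpected HTTP status"

def check_ratelimit_header():
    assert "x-ratelimit-limit" in response["headers"], "missing rate-limit header"

failures = run_smoke_tests([
    ("status", check_status),
    ("rate-limit header", check_ratelimit_header),
])
```

Returning a list of failures (rather than raising on the first one) means a single run reports everything that is broken, which suits a deploy-time report.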

I opted for make and Python for this purpose, since they are available on the deployment hosts. Containerizing this and invoking it through helm test would probably be a good idea. Some charts are already using helm test to run service-checker. This seems like a nice way to do smoke testing based on OpenAPI specs. It's not suitable for testing rate limits on a gateway, though.

There are also the rake-based tests used for service charts in CI, but afaik they are used primarily to generate before-and-after diffs to see the impact a template change will have on the generated k8s manifests. I don't know how easy it would be to extend them to allow for more sophisticated pre-deployment testing in CI.

In any case, I think that a framework for post-deployment tests will be key to making scap-service safe and useful. I hope that the work I did on this in the context of API rate limiting can serve as a good example.