During the WE5-WE6 offsite, we (myself, @dancy, @brennen, and @jeena) discussed the idea of extending scap's functionality to support deploying Kubernetes services.
What?
We explored how scap could be extended to enable the deployment of Kubernetes services, allowing for a unified deployment experience across our infrastructure.
Why?
This work would improve developer confidence in deployments by providing a consistent, reliable process. More importantly, it would allow every deployer and SRE to deploy any service without needing to know all the service specifics beforehand, reducing friction and enabling faster, more confident deployments across the org.
Current Status
Today, MediaWiki is the only service we deploy via scap/spiderpig. One of scap’s key features is error-rate monitoring: metrics are checked during deployment to ensure error rates stay within safe thresholds, and the deploy can be automatically paused or rolled back if errors spike. This significantly increases deployer confidence, since the vast majority of production errors are caught before the change is rolled out everywhere.
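To make the "pause or roll back" idea concrete, here is a minimal sketch of such a gate. It is not scap's actual logic: the function name, the `Verdict` type, and the threshold factors are all hypothetical and chosen only for illustration.

```python
from enum import Enum


class Verdict(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"
    ROLLBACK = "rollback"


def judge_error_rate(baseline: float, current: float,
                     pause_factor: float = 2.0,
                     rollback_factor: float = 10.0) -> Verdict:
    """Compare the post-deploy error rate with the pre-deploy baseline.

    The factors are illustrative assumptions, not values scap uses.
    """
    # Use a floor of 1.0 so a service with a near-zero baseline does not
    # trip the gate on a single stray error.
    ratio = current / max(baseline, 1.0)
    if ratio >= rollback_factor:
        return Verdict.ROLLBACK
    if ratio >= pause_factor:
        return Verdict.PAUSE
    return Verdict.CONTINUE
```

With these assumed factors, a baseline of 5 errors per minute rising to 12 would pause the deploy, while a jump to 60 would trigger a rollback.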
When deploying other services, deployers have to run helmfile manually. Unless pods fail to start, sanity checks are performed manually by the deployer, through smoke tests, bespoke dashboards, or both. It gets even trickier when SREs need to redeploy all services in a Kubernetes cluster, for example because of an image upgrade such as Envoy.
How?
The integration of scap with MW-on-K8s has established many of the foundational components required for this initiative. Existing capabilities such as running helmfile, updating image versions, rolling out to canaries, and monitoring error rates are already in place.
Scap Specific work
TBA: more detailed input from Release-Engineering-Team
What would developer teams need to provide?
To onboard a service to scap service-deploy, development teams must provide the following:
- Logstash expressions that clearly exhibit error rates
- Smoke tests
- Grafana dashboards that clearly demonstrate service health
- Links to related dashboards
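One way to picture the list above is as a per-service descriptor that the tooling could validate before accepting a service. The sketch below is purely illustrative: the class name, the field names, and the idea of a `missing()` check are assumptions, not an agreed format.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceOnboarding:
    """Hypothetical record of what a team supplies for one service."""
    name: str
    logstash_error_queries: list[str] = field(default_factory=list)
    smoke_tests: list[str] = field(default_factory=list)
    health_dashboards: list[str] = field(default_factory=list)
    related_dashboards: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Return the names of the items that have not been supplied yet."""
        provided = {
            "logstash_error_queries": self.logstash_error_queries,
            "smoke_tests": self.smoke_tests,
            "health_dashboards": self.health_dashboards,
            "related_dashboards": self.related_dashboards,
        }
        return [key for key, value in provided.items() if not value]


# Placeholder values only; none of these refer to a real service or query.
draft = ServiceOnboarding(
    name="example-service",
    smoke_tests=["placeholder smoke test command"],
)
print(draft.missing())
```

A check like this would let the tooling refuse to onboard a service until every item is present, rather than discovering a missing error query mid-deploy.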
What will the deployer's UX be?
We are aiming for a streamlined deployment process (high level):
- Run scap service-deploy <service-name> (from any directory)
- Automatically roll out to canaries, if available, otherwise to a single datacenter
- Automatically monitor and validate error rates
- Execute helmfile post-upgrade hooks (various options here, TBD)
- Complete the rollout upon successful validation
- Print additional information, e.g. a Grafana dashboard
Note: Given that MediaWiki interacts with most WikiKube services, we could consider adding a check of MediaWiki's error rate as well, either optional or mandatory.
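The steps above can be sketched as a small orchestration function. This is a rough outline under stated assumptions, not a design: every name here is hypothetical, the rollout, validation, and rollback actions are passed in as callables rather than implemented, and the helmfile post-upgrade hook step is left out because its shape is still TBD.

```python
from typing import Callable, Optional


def deploy_service(
    service: str,
    canaries_available: bool,
    rollout: Callable[[str, str], None],
    errors_ok: Callable[[str], bool],
    rollback: Callable[[str, str], None],
    dashboard: Optional[str] = None,
    log: Callable[[str], None] = print,
) -> bool:
    """Return True if the rollout completed, False if it was rolled back."""
    # Start with canaries when the service has them, else one datacenter.
    first_stage = "canaries" if canaries_available else "single-datacenter"
    rollout(service, first_stage)

    # errors_ok stands in for the error-rate validation; it could also
    # fold in a check of MediaWiki's error rate, as the note suggests.
    if not errors_ok(service):
        rollback(service, first_stage)
        log(f"{service}: error rate too high after {first_stage}, rolled back")
        return False

    # Validation passed, so complete the rollout.
    rollout(service, "everywhere")
    if dashboard:
        log(f"{service}: deployed, see {dashboard}")
    return True
```

Passing the actions in as callables keeps the sketch independent of how scap actually drives helmfile, and makes the control flow easy to exercise with stubs.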
🕷️ Weaving the Pig 🐷
The longer-term goal would be to enable one-push deployments via Spiderpig, allowing teams to trigger deployments with minimal manual intervention. This would substantially lower the bar to deployment, enabling even less experienced team members to confidently ship services to production.