Canaries
Canarying is a testing-in-production method in which we release a software change to a small percentage of our servers and monitor for potential issues. It has been our go-to method for testing and rolling out software at WMF.
MediaWiki
When we roll out software changes via scap, scap first syncs the changes to a small subset of servers (the canaries), monitors for error-rate increases for a set amount of time, and then syncs to the rest of the cluster. If errors increase, scap aborts the deployment.
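The scap flow described above can be sketched roughly as follows. This is a hedged illustration, not scap's actual code; the function names, the 2x error threshold, and the fixed wait window are all assumptions made for the example.

```python
# Sketch of a scap-style canary check (illustrative only, not scap's code):
# sync to the canaries, watch the error rate for a window, abort on regression.
import time


def canary_deploy(sync, error_rate, canaries, rest, wait_s=20, threshold=2.0):
    """Sync to canary servers first; abort if the error rate grows past
    `threshold` times the pre-deploy baseline (both values are assumptions)."""
    baseline = error_rate()
    sync(canaries)
    time.sleep(wait_s)  # let errors surface on the canaries
    if error_rate() > baseline * threshold:
        raise RuntimeError("canary check failed, aborting sync")
    sync(rest)  # canaries look healthy: sync to the rest of the cluster
```

In the happy path the canaries and then the whole cluster get synced; on an error-rate spike only the canaries have the new code, matching the "scap aborts" behavior above.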
Server software
For updates to the software surrounding MediaWiki (PHP, php-fpm, PHP extensions, underlying libraries, Apache, mcrouter, etc.), we again first try them out on the canaries before moving on to the rest of the cluster.
Hotpatching
Developers and SREs sometimes try out patches or software experiments on the fly on one or more canary servers, before committing any work to the relevant repositories. This is impossible to support with our current workflow, and we need to find alternative paths to address it. Nevertheless, it is beyond the scope of canaries.
Additional software for train releases/code updates
We may need an additional deployment tool to emulate scap’s current functionality of rolling out to canaries and monitoring them (T276487). This is beyond the scope of the current task.
Requirements
- Route part of production traffic towards specific pods
- Separate logging for “canary” pods tagged accordingly
- Separate metrics for “canary” pods tagged accordingly
- Need canaries for api, appservers, jobrunners, and parsoid
- Implemented in such a way that people do not step on each other’s toes.
- Support experiments both for mediawiki itself as well as supporting software (versions, configuration)
Possible implementations
All of the implementations below (discussed in T242861) rely on using different helm releases in the kubernetes realm to provide the functionality that satisfies the above requirements. That is by design: helm/helmfile is going to be our deployment tool and we want to build upon it.
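The "different helm releases" idea could look roughly like the following helmfile sketch: two releases of the same chart, with the canary layering an extra values file on top. The chart and namespace names are hypothetical placeholders, not our actual deployment-charts layout.

```yaml
# helmfile.yaml (sketch; chart/namespace names are hypothetical)
releases:
  - name: main
    namespace: mediawiki
    chart: wmf-stable/mediawiki
    values:
      - values.yaml
  - name: canary
    namespace: mediawiki
    chart: wmf-stable/mediawiki
    values:
      - values.yaml
      - values-canary.yaml   # overrides: fewer replicas, canary routing
</imports>
```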
Option 1
Use a new service label, set to either the release name or a special addressed_by value, and have the kubernetes Service match either the app or the release, depending on the .Values.service.address_other_releases setting. When a release’s pods are addressed by another release, we do not create a kubernetes Service for it.
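A values-file sketch of this option might look as follows; the field names come from the description above, but the exact values and file names are assumptions.

```yaml
# values-main.yaml (sketch): the main release addresses other releases' pods,
# so its Service selects on the app label rather than its own release name.
service:
  address_other_releases: true

# values-canary.yaml (sketch): this release's pods are addressed by the main
# release, so no kubernetes Service is created for it.
service:
  addressed_by: main
```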
Option 2
Adds a new routing_tag label that the kubernetes Service uses to select which pods it should route to. The value of routing_tag is arbitrary and defaults to .Release.Name, which causes the Service to only route to pods in its own release. To enable canary releases, we want the main production kubernetes Service to route to the main production release pods as well as the canary release pods. To do this, we set service.routing_tag in values(-main).yaml and values-canary.yaml to a common value shared by both the main production and canary releases. Since the main production release kubernetes Service will now route to the canary release pods, the canary release does not need a kubernetes Service resource deployed. This is accomplished by setting service.deployment: none in values-canary.yaml.
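As a sketch, the two values files for this option might look like this; the service.routing_tag and service.deployment keys come from the description above, while the shared tag value is a hypothetical placeholder.

```yaml
# values-main.yaml (sketch): shared tag so the main Service also selects canary pods
service:
  routing_tag: mediawiki-prod   # hypothetical shared value

# values-canary.yaml (sketch): same tag, and no Service object for the canary release
service:
  routing_tag: mediawiki-prod
  deployment: none
```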
Agreed implementation
When a release name has the -canary suffix, kubernetes will not create a Service object for this release, only a Deployment. Adding routed_via to the selector of the main release’s Service object routes traffic to the canary deployment too. The number of replicas in values-canary.yaml dictates, roughly, how much traffic we push through the canaries.
Examples:
services.yaml
```yaml
{{ if not (hasSuffix "canary" .Release.Name) }}
{{ include "tls.service" . }}
{{ if not .Values.tls.enabled }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: NodePort
  selector:
    app: {{ template "wmf.chartname" . }}
    #### (remove the release:)
    - release: {{ .Release.Name }}
    #### (add routed_via)
    + routed_via: {{ .Release.Name }}
    ####
  ports:
    - name: {{ .Values.service.port.name }}
      targetPort: {{ .Values.service.port.targetPort }}
      port: {{ .Values.service.port.port }}
      {{- if .Values.service.port.nodePort }}
      nodePort: {{ .Values.service.port.nodePort }}
      {{- end }}
{{- end }}
{{ if .Values.debug.enabled }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}-debug
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: NodePort
  selector:
    app: {{ template "wmf.chartname" . }}
    release: {{ .Release.Name }}
  ports:
    {{- range $port := .Values.debug.ports }}
    - name: {{ template "wmf.releasename" $ }}-debug-{{ $port }}
      targetPort: {{ $port }}
      port: {{ $port }}
    {{- end }}
{{- end }}
{{- end }}
{{- /* end hasSuffix */}}
```
deployment.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ template "wmf.releasename" . }}
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  selector:
    matchLabels:
      app: {{ template "wmf.chartname" . }}
      release: {{ .Release.Name }}
  replicas: {{ .Values.resources.replicas }}
  template:
    metadata:
      labels:
        app: {{ template "wmf.chartname" . }}
        release: {{ .Release.Name }}
        #### (add a routed_via label, defaulting to the release name)
        + routed_via: {{ .Values.routed_via | default .Release.Name }}
        ####
      annotations:
        checksum/secrets: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
        {{ if .Values.monitoring.enabled -}}
        checksum/prometheus-statsd: {{ .Files.Get "config/prometheus-statsd.conf" | sha256sum }}
        {{ end -}}
        prometheus.io/port: "9102"
        prometheus.io/scrape: "true"
        {{- include "tls.annotations" . | indent 8 }}
    spec:
      {{- if .Values.affinity }}
{{ toYaml .Values.affinity | indent 6 }}
      {{- end }}
      containers:
```
values.yaml
```yaml
resources:
  replicas: 6
```
values-canary.yaml
```yaml
routed_via: main
resources:
  replicas: 1
```
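Since the Service load-balances across all pods matching its selector, the replica counts above give a rough estimate of the canary traffic share. A small sketch of that arithmetic, assuming uniform load balancing across pods:

```python
# Rough estimate of the traffic share handled by canary pods, assuming the
# kubernetes Service load-balances uniformly across all selected pods.
def canary_traffic_share(main_replicas: int, canary_replicas: int) -> float:
    """Fraction of requests expected to land on canary pods."""
    total = main_replicas + canary_replicas
    return canary_replicas / total


# With 6 main replicas and 1 canary replica (per the values files above),
# roughly 1/7 of the traffic goes through the canary.
print(f"{canary_traffic_share(6, 1):.1%}")  # → 14.3%
```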
TBD: Monitoring/logging
Many thanks to @JMeybohm @akosiaris @Joe