
Support Canary releases on Kubernetes
Open, Needs Triage, Public

Description

Canaries

Canarying is a testing-in-production method where we release a software change to a small percentage of our servers and monitor for potential issues. It has been our go-to method of testing and rolling out software at WMF.

MediaWiki

When we roll out software changes via scap, scap syncs the changes to a small subset of servers (canaries), monitors for error rate increases for a specific amount of time, and then syncs to the rest of the cluster. In case of errors, scap aborts.

Server software

For updates to the software surrounding MediaWiki (PHP, php-fpm, PHP extensions, underlying libraries, Apache, mcrouter, etc.), we again first try them out on the canaries before moving on to the rest of the cluster.

Hotpatching

Developers and SREs sometimes try out patches or software experiments on the fly on one or more canary servers before committing any work to the relevant repositories. This is impossible to support with our current workflow, and we need to find alternative paths to address it. Nevertheless, it is beyond the scope of canaries.

Additional software for train releases/code updates

We may need an additional deployment tool to emulate scap’s current functionality with respect to rolling out to canaries and monitoring (T276487). This is beyond the scope of the current task.

Requirements

  • Route part of production traffic towards specific pods
  • Separate logging for “canary” pods tagged accordingly
  • Separate metrics for “canary” pods tagged accordingly
  • Need canaries for api, appservers, jobrunners, and parsoid
  • Implemented in such a way that people do not step on each other’s toes.
  • Support experiments both for MediaWiki itself and for supporting software (versions, configuration)

Possible implementations

All of the implementations below (discussed in T242861) rely on using different helm releases in the kubernetes realm to provide the functionality that satisfies the above requirements. That is by design, as helm/helmfile is going to be our deployment tool and we want to build upon it.
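For illustration, a minimal helmfile sketch of what "different helm releases" of the same chart could look like; the chart reference and file names here are placeholders rather than the actual deployment-charts layout:

releases:
  # stable production release
  - name: main
    chart: wmf-stable/mediawiki        # placeholder chart reference
    values:
      - values.yaml
  # canary release of the same chart, carrying canary-specific overrides
  - name: canary
    chart: wmf-stable/mediawiki
    values:
      - values.yaml
      - values-canary.yaml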

Option 1

Use a new service label, set to either the release name or a special addressed_by value, and have the kubernetes Service match either the app or the release, depending on the .Values.service.address_other_releases setting. When a release’s pods are addressed by another release, we do not create a kubernetes Service for it.
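A rough sketch of what Option 1 could look like; the key names (service.addressed_by, service.address_other_releases) mirror the description above and are illustrative, not actual chart values:

# pod template labels: a "service" label pointing at whichever release addresses these pods
service: {{ .Values.service.addressed_by | default .Release.Name }}

# Service selector: match only this release's pods unless it should address other releases too
selector:
  app: {{ template "wmf.chartname" . }}
  {{- if not .Values.service.address_other_releases }}
  service: {{ .Release.Name }}
  {{- end }}

# the whole Service template would additionally be guarded by
# {{ if not .Values.service.addressed_by }} ... {{ end }}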

Option 2

Add a new routing_tag label that the kubernetes Service uses to select which pods it should route to. The value of routing_tag is arbitrary and defaults to .Release.Name, which causes the Service to only route to pods in its release. To enable canary releases, we want the main production kubernetes Service to route to the main production release pods as well as the canary release pods. To do this, we set service.routing_tag in values(-main).yaml and values-canary.yaml to a common value shared by both main production and canary releases. Since the main production release kubernetes Service will now route to the canary release pods, the canary release does not need a kubernetes Service resource deployed. This is accomplished by setting service.deployment: none in values-canary.yaml.
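A rough sketch of Option 2; again, the key names (service.routing_tag, service.deployment) mirror the description above and are not the final chart values:

# pod template label, defaulting to this release's own name
routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}

# Service selector, matching every pod that carries the same routing_tag
selector:
  app: {{ template "wmf.chartname" . }}
  routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}

# values-canary.yaml: share the main release's routing_tag and render no Service object
service:
  routing_tag: main
  deployment: none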

Agreed implementation

When a release has the -canary suffix, the chart will not create a Service object for that release, only a Deployment. Adding a routed_via label to the selector of the Service object will route traffic to the canary deployment too. The number of replicas in values-canary.yaml dictates, roughly, how much traffic we push through the canaries.

Examples:

services.yaml

{{ if not (hasSuffix "canary" .Release.Name) }}

{{ include "tls.service" . }}
{{ if not .Values.tls.enabled }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: NodePort
  selector:
    app: {{ template "wmf.chartname" . }}
#### (remove the release:)
-    release: {{ .Release.Name }}
### (add routed_via)
+    routed_via: {{ .Release.Name }}
###
  ports:
    - name: {{ .Values.service.port.name }}
      targetPort: {{ .Values.service.port.targetPort }}
      port: {{ .Values.service.port.port }}
      {{- if .Values.service.port.nodePort }}
      nodePort: {{ .Values.service.port.nodePort }}
      {{- end }}
{{- end }}
{{ if .Values.debug.enabled }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}-debug
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: NodePort
  selector:
    app: {{ template "wmf.chartname" . }}
    release: {{ .Release.Name }}
  ports:
    {{- range $port := .Values.debug.ports }}
    - name: {{ template "wmf.releasename" $ }}-debug-{{ $port }}
      targetPort: {{ $port }}
      port: {{ $port }}
    {{- end }}
{{- end }}

{{- end }} {{- /* end hasSuffix */}}
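For a release named main (and assuming the chart name renders as mediawiki), the selector above would render roughly as:

  selector:
    app: mediawiki
    routed_via: main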

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ template "wmf.releasename" . }}
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  selector:
    matchLabels:
      app: {{ template "wmf.chartname" . }}
      release: {{ .Release.Name }}
  replicas: {{ .Values.resources.replicas }}
  template:
    metadata:
      labels:
        app: {{ template "wmf.chartname" . }}
        release: {{ .Release.Name }}
### (add a routed_via label, defaulting to the release name)
+        routed_via: {{ .Values.routed_via | default .Release.Name }}
###
      annotations:
        checksum/secrets: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
        {{ if .Values.monitoring.enabled -}}
        checksum/prometheus-statsd: {{ .Files.Get "config/prometheus-statsd.conf" | sha256sum }}
        {{ end -}}
        prometheus.io/port: "9102"
        prometheus.io/scrape: "true"
        {{- include "tls.annotations" . | indent 8}}
    spec:
      {{- if .Values.affinity }}
{{ toYaml .Values.affinity | indent 6 }}
      {{- end }}
      containers:

values.yaml

resources:
  replicas: 6

values-canary.yaml

routed_via: main
resources:
  replicas: 1
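
With the example values above, the main release's Service selects six main pods plus one canary pod, so roughly 1/7 (about 14%) of requests should land on the canary, assuming kube-proxy spreads connections evenly across all matching pods. Adjusting the replica counts shifts that ratio.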

TBD: Monitoring/logging
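
One possible direction for the metrics part (a sketch only, assuming a Prometheus pod-discovery scrape job; the job name is illustrative): canary pods still carry a distinct release label even though they share routed_via with the main release, so that label could be surfaced on scraped series via relabeling and used to filter dashboards and alerts:

scrape_configs:
  - job_name: k8s-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # copy the pod's "release" label onto every scraped series,
      # so canary series can be selected with {release="canary"}
      - source_labels: [__meta_kubernetes_pod_label_release]
        target_label: release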

Many thanks to @JMeybohm @akosiaris @Joe

Event Timeline

Change 685748 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] WIP: Add canary support in scaffolding

https://gerrit.wikimedia.org/r/685748

Change 685748 merged by jenkins-bot:

[operations/deployment-charts@master] Add canary support in scaffolding

https://gerrit.wikimedia.org/r/685748

I have some thoughts about labels and template defines we use. This might not be the right ticket for these questions, but since they do relate to how canary releases work, I'll ask here anyway.

Edit: moved to a dedicated ticket: T291848: Clarify common k8s label and service conventions in our helm charts

@jijiki I think this task can be closed?

I think what we are missing here is how to get prometheus metrics strictly for the canary deployment. I confess I have not dug deeper into this.