
Support Canary releases on Kubernetes
Open, Needs Triage, Public

Description

Canaries

Canarying is a testing-in-production method where we release a software change to a small percentage of our servers and monitor for potential issues. It has been our go-to method of testing and rolling out software at WMF.

MediaWiki

When we roll out software changes via scap, scap syncs the changes to a small subset of servers (canaries), monitors for error rate increases for a specific amount of time, and then syncs to the rest of the cluster. In case of errors, scap aborts.

Server software

For updates to the software surrounding MediaWiki (PHP, php-fpm, PHP extensions, underlying libraries, Apache, mcrouter, etc.), we again first try them out on the canaries before moving on to the rest of the cluster.

Hotpatching

Developers and SREs sometimes try out patches or software experiments on the fly on one or more canary servers before committing any work to the relevant repositories. This is impossible to support with our current workflow, and we need to find alternative paths to address it. Nevertheless, it is beyond the scope of canaries.

Additional software for train releases/code updates

We may need an additional deployment tool to emulate scap’s current functionality with respect to rolling out to canaries and monitoring (T276487). This is beyond the scope of the current task.

Requirements

  • Route part of production traffic towards specific pods
  • Separate logging for “canary” pods tagged accordingly
  • Separate metrics for “canary” pods tagged accordingly
  • Need canaries for api, appservers, jobrunners, and parsoid
  • Implemented in such a way that people do not step on each other’s toes.
  • Support experiments both for MediaWiki itself and for supporting software (versions, configuration)

Possible implementations

All of the implementations below (discussed in T242861) rely on using different helm releases in the kubernetes realm to provide the functionality that satisfies the above requirements. That is by design, as helm/helmfile is going to be our deployment tool and we want to build upon it.
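For illustration, a minimal helmfile sketch of what "different helm releases" of the same chart could look like; the chart reference and file names here are placeholders rather than the actual deployment-charts layout:

releases:
  # stable production release
  - name: main
    chart: wmf-stable/mediawiki        # placeholder chart reference
    values:
      - values.yaml
  # canary release of the same chart, carrying canary-specific overrides
  - name: canary
    chart: wmf-stable/mediawiki
    values:
      - values.yaml
      - values-canary.yaml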

Option 1

Use a new service label, set to either the release name or a special addressed_by value, and have the kubernetes Service match either the app or the release, depending on the .Values.service.address_other_releases setting. When a release’s pods are addressed by another release, we do not create a kubernetes Service for it.
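A rough sketch of what Option 1 could look like; the key names (service.addressed_by, service.address_other_releases) mirror the description above and are illustrative, not actual chart values:

# pod template labels: a "service" label pointing at whichever release addresses these pods
service: {{ .Values.service.addressed_by | default .Release.Name }}

# Service selector: match only this release's pods unless it should address other releases too
selector:
  app: {{ template "wmf.chartname" . }}
  {{- if not .Values.service.address_other_releases }}
  service: {{ .Release.Name }}
  {{- end }}

# the whole Service template would additionally be guarded by
# {{ if not .Values.service.addressed_by }} ... {{ end }}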

Option 2

Add a new routing_tag label that the kubernetes Service uses to select which pods it should route to. The value of routing_tag is arbitrary and defaults to .Release.Name, which causes the Service to only route to pods in its release. To enable canary releases, we want the main production kubernetes Service to route to the main production release pods as well as the canary release pods. To do this, we set service.routing_tag in values(-main).yaml and values-canary.yaml to a common value shared by both main production and canary releases. Since the main production release kubernetes Service will now route to the canary release pods, the canary release does not need a kubernetes Service resource deployed. This is accomplished by setting service.deployment: none in values-canary.yaml.
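A rough sketch of Option 2; again, the key names (service.routing_tag, service.deployment) mirror the description above and are not the final chart values:

# pod template label, defaulting to this release's own name
routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}

# Service selector, matching every pod that carries the same routing_tag
selector:
  app: {{ template "wmf.chartname" . }}
  routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}

# values-canary.yaml: share the main release's routing_tag and render no Service object
service:
  routing_tag: main
  deployment: none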

Agreed implementation

When a release has the -canary suffix, the chart will not create a Service object for that release, only a Deployment. Adding a routed_via label to the selector of the Service object will route traffic to the canary deployment too. The number of replicas in values-canary.yaml dictates, roughly, how much traffic we push through the canaries.

Examples:

services.yaml

{{ if not (hasSuffix "canary" .Release.Name) }}

{{ include "tls.service" . }}
{{ if not .Values.tls.enabled }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: NodePort
  selector:
    app: {{ template "wmf.chartname" . }}
#### (remove the release:)
-    release: {{ .Release.Name }}
### (add routed_via)
+    routed_via: {{ .Release.Name }}
###
  ports:
    - name: {{ .Values.service.port.name }}
      targetPort: {{ .Values.service.port.targetPort }}
      port: {{ .Values.service.port.port }}
      {{- if .Values.service.port.nodePort }}
      nodePort: {{ .Values.service.port.nodePort }}
      {{- end }}
{{- end }}
{{ if .Values.debug.enabled }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ template "wmf.releasename" . }}-debug
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  type: NodePort
  selector:
    app: {{ template "wmf.chartname" . }}
    release: {{ .Release.Name }}
  ports:
    {{- range $port := .Values.debug.ports }}
    - name: {{ template "wmf.releasename" $ }}-debug-{{ $port }}
      targetPort: {{ $port }}
      port: {{ $port }}
    {{- end }}
{{- end }}

{{- end }} {{- /* end hasSuffix */}}
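For a release named main (and assuming the chart name renders as mediawiki), the selector above would render roughly as:

  selector:
    app: mediawiki
    routed_via: main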

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ template "wmf.releasename" . }}
  labels:
    app: {{ template "wmf.chartname" . }}
    chart: {{ template "wmf.chartid" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
spec:
  selector:
    matchLabels:
      app: {{ template "wmf.chartname" . }}
      release: {{ .Release.Name }}
  replicas: {{ .Values.resources.replicas }}
  template:
    metadata:
      labels:
        app: {{ template "wmf.chartname" . }}
        release: {{ .Release.Name }}
### (add a routed_via label, defaulting to the release name)
+        routed_via: {{ .Values.routed_via | default .Release.Name }}
###
      annotations:
        checksum/secrets: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
        {{ if .Values.monitoring.enabled -}}
        checksum/prometheus-statsd: {{ .Files.Get "config/prometheus-statsd.conf" | sha256sum }}
        {{ end -}}
        prometheus.io/port: "9102"
        prometheus.io/scrape: "true"
        {{- include "tls.annotations" . | indent 8}}
    spec:
      {{- if .Values.affinity }}
{{ toYaml .Values.affinity | indent 6 }}
      {{- end }}
      containers:

values.yaml

resources:
  replicas: 6

values-canary.yaml

routed_via: main
resources:
  replicas: 1
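
With the example values above, the main release's Service selects six main pods plus one canary pod, so roughly 1/7 (about 14%) of requests should land on the canary, assuming kube-proxy spreads connections evenly across all matching pods. Adjusting the replica counts shifts that ratio.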

TBD: Monitoring/logging
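
One possible direction for the metrics part (a sketch only, assuming a Prometheus pod-discovery scrape job; the job name is illustrative): canary pods still carry a distinct release label even though they share routed_via with the main release, so that label could be surfaced on scraped series via relabeling and used to filter dashboards and alerts:

scrape_configs:
  - job_name: k8s-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # copy the pod's "release" label onto every scraped series,
      # so canary series can be selected with {release="canary"}
      - source_labels: [__meta_kubernetes_pod_label_release]
        target_label: release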

Many thanks to @JMeybohm @akosiaris @Joe

Event Timeline

Change 685748 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] WIP: Add canary support in scaffolding

https://gerrit.wikimedia.org/r/685748

Change 685748 merged by jenkins-bot:

[operations/deployment-charts@master] Add canary support in scaffolding

https://gerrit.wikimedia.org/r/685748

I have some thoughts about labels and template defines we use. This might not be the right ticket for these questions, but since they do relate to how canary releases work, I'll ask here anyway.

Edit: moved to a dedicated ticket: T291848: Clarify common k8s label and service conventions in our helm charts

@jijiki I think this task can be closed?

I think what we are missing here is how to get prometheus metrics strictly for the canary deployment. I confess I have not dug deeper into this.