Canaries canaries canaries
Open, Medium, Public

Assigned To

None

Authored By

	jijiki
	Nov 22 2018, 1:31 PM

Description

We are investigating how to extend the canary functionality we already have. Our goal is to catch errors and issues early enough that they don't affect the majority of our users. The problem with our current process is that we deploy changes to low-traffic wikis first, which is enough to catch some errors but not issues that only surface at certain levels of traffic.

What we have already in place is:

During deployment, scap deploys to a subset of servers, monitors error rates in the log files, waits for a few seconds, and then deploys to all related servers
We are able to deploy changes only to mwdebug* servers and test them by routing specific traffic towards them via our Chrome/Firefox extension
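
As an illustration of the first item, the deploy, wait, check, continue sequence could be sketched roughly as follows. This is not scap's actual code: the function name canary_deploy, the deploy and error_rate callbacks, and the threshold parameters are all hypothetical and only stand in for whatever scap and the log pipeline really provide.

```python
import time
from typing import Callable, Sequence


def canary_deploy(
    canaries: Sequence[str],
    all_servers: Sequence[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[Sequence[str]], float],
    wait_seconds: float = 20.0,
    max_increase: float = 2.0,
    min_allowed: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy to canaries, wait, and continue only if errors stayed low.

    All parameters are assumptions made for this sketch:
    deploy(host) pushes the change to one host, and error_rate(hosts)
    returns the current error rate observed for those hosts.
    """
    baseline = error_rate(canaries)  # error rate before the change
    for host in canaries:
        deploy(host)

    sleep(wait_seconds)  # give the canaries time to serve real traffic

    after = error_rate(canaries)
    allowed = max(baseline * max_increase, min_allowed)
    if after > allowed:
        return False  # abort: the remaining servers are left untouched

    for host in all_servers:
        if host not in canaries:
            deploy(host)
    return True
```

A fixed wait only catches errors that show up at the canaries' own traffic level within that window, which is the limitation described above.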

These processes could be made more effective by:

Deploying changes so that they affect a pre-specified share of traffic, and increasing that share in stages, i.e. starting with X% and ramping it up all the way to 100%.
Deploying changes so that they affect only specific groups of users, e.g. beta users or logged-in users
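
A minimal sketch of how the two ideas above might be expressed as a routing decision, assuming each request carries some stable user or session identifier. Nothing here exists in scap or MediaWiki today: the names bucket, routes_to_new_version and STAGES, the salt value, and the group labels are hypothetical.

```python
import hashlib
from typing import AbstractSet, Optional

# Hypothetical rollout stages, as percentages of traffic.
STAGES = (1, 10, 50, 100)


def bucket(user_id: str, salt: str = "canary") -> int:
    """Map an identifier to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def routes_to_new_version(
    user_id: str,
    percent: int,
    groups: AbstractSet[str] = frozenset(),
    target_groups: Optional[AbstractSet[str]] = None,
) -> bool:
    """Decide whether a request should be served by the new version.

    Members of a targeted group (e.g. "beta") always get the new version;
    everyone else is included once their bucket falls below `percent`.
    """
    if target_groups and (set(groups) & set(target_groups)):
        return True
    return bucket(user_id) < percent
```

Because the bucket is derived from a hash rather than drawn at random per request, a user who is included at one stage stays included at every later stage, so ramping up only ever adds users. Setting percent to 0 with a target group restricts the change to that group alone.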

Related: T213156

Related Objects

Status	Assigned	Task
Open	None	T213156 SRE FY2019 Q3:TEC6: First steps towards Canary Deployments
Open	None	T210143 Canaries canaries canaries
Open	None	T209881 Introduce state to Scap
Open	None	T218412 Define a mediawiki "version"
Resolved	hnowlan	T242023 Add alert for app servers in prod serving outdated MediaWiki branches
Resolved	fgiunchedi	T251942 Aggregate check_mw_versions alerts for each individual app server
Open	None	T212147 Allow scap sync to deploy gradually
Resolved	jijiki	T217924 Make canary wait time configurable
Open	None	T218328 Scap2 to use etcd for target servers
Resolved	jijiki	T216518 Improve Scap2 testing
Resolved	Clement_Goubert	T282148 Support Canary releases on Kubernetes