We are investigating how to extend the canary functionality we already have. Our goal is to catch errors and issues early enough that they don't affect the majority of our users. The problem with our current process is that we deploy changes to low-traffic wikis first, which is fine for some errors but not enough to catch issues that only surface above a certain amount of traffic.
What we already have in place:
- During deployment, scap deploys to some servers first, monitors error rates in the log files for a few seconds, and then deploys to all related servers
- We are able to deploy changes only to mwdebug* servers, and test them by routing specific traffic to them via our Chrome/Firefox extension
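For discussion, the deploy-monitor-wait-deploy flow described above can be sketched roughly as follows. This is only an illustration of the general idea and not scap's actual code: the names `canary_passes`, `deploy`, `push`, `error_rate`, and `wait`, as well as the threshold values, are all hypothetical.

```python
from typing import Callable, Sequence


def canary_passes(baseline_rate: float, canary_rate: float,
                  max_ratio: float = 10.0, min_rate: float = 1.0) -> bool:
    """Decide whether the post-deploy error rate on the canaries is acceptable.

    The thresholds are made-up defaults for illustration only.
    """
    # Very low absolute rates are treated as noise and always pass.
    if canary_rate <= min_rate:
        return True
    # A previously silent host that is now logging errors fails the check.
    if baseline_rate <= 0:
        return False
    return canary_rate / baseline_rate <= max_ratio


def deploy(servers: Sequence[str], canary_count: int,
           push: Callable[[str], None],
           error_rate: Callable[[Sequence[str]], float],
           wait: Callable[[], None]) -> bool:
    """Push to a few canaries, check their error rate, then push to the rest."""
    canaries, rest = servers[:canary_count], servers[canary_count:]
    baseline = error_rate(canaries)   # error rate before the change
    for host in canaries:
        push(host)
    wait()                            # let errors accumulate in the logs
    if not canary_passes(baseline, error_rate(canaries)):
        return False                  # stop here: remaining servers untouched
    for host in rest:
        push(host)
    return True
```

The limitation discussed in this task is visible in the sketch: the decision is made once, on a fixed handful of hosts, after a short wait, so problems that only appear under broader or different traffic can slip through.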
Those processes could be improved by:
- Deploying changes to a pre-specified share of traffic and increasing that share in stages, e.g. starting with X% and ramping it up all the way to 100%.
- Deploying changes only to specific groups of users, e.g. beta users or logged-in users
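One possible way to implement both ideas is deterministic hash bucketing, sketched below. This is an assumption about how it could work, not a description of anything that exists today: `bucket`, `in_rollout`, `use_new_version`, the `salt` parameter, and the group names are hypothetical.

```python
import hashlib
from typing import Iterable

BUCKETS = 10_000  # resolution of 0.01%


def bucket(key: str, salt: str = "") -> int:
    """Map a key (e.g. a user or session ID) to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(key: str, percent: float, salt: str = "") -> bool:
    """True if this key falls inside the first `percent` percent of buckets."""
    return bucket(key, salt) < percent * BUCKETS / 100


def use_new_version(key: str, groups: Iterable[str], percent: float,
                    target_groups: frozenset = frozenset()) -> bool:
    """Serve the new code to targeted groups, plus a percentage of everyone else."""
    if target_groups & set(groups):
        return True
    return in_rollout(key, percent)
```

Because the bucket for a given key never changes, raising the percentage only adds traffic to the new version: anyone included at 5% is still included at 25%. Using a different `salt` per deployment would avoid always exposing the same users first.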
Related: T213156