
Introduce state to Scap
Open, MediumPublic

Description

This is a suggestion on how we could start implementing canaries in a more quantitative manner.

Goals

  • Route any amount of traffic through specific app/api servers running release X.
  • Ability to increase this amount (gradually or not)
  • Have a view of how everything is going while we are at it

This way we could catch issues that are not visible when deploying to Group0 and Group1 (due to their low traffic), without affecting 100% of our users.

Problem
Scap blindly rsyncs to app/api servers and is not aware of what each server is running. Currently we can't tell scap to deploy an arbitrary version/revision of MediaWiki to a chosen number of servers.

Proposed solution
We could have scap keep a state file on each server, which may include:

  • the version of the last "stable" deployment (i.e. last week's train)
  • the version currently deployed on the server
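As a sketch of what such a state file might look like: the snippet below assumes a hypothetical JSON file per server (the path `/var/lib/scap/state.json` and the key names are illustrative, not an existing scap feature).

```python
import json

# Hypothetical per-server state file contents (illustrative only):
state = {
    "stable_version": "1.33.0-wmf.5",   # last "stable" deployment (last week's train)
    "current_version": "1.33.0-wmf.6",  # version currently deployed on this server
}

def write_state(path, state):
    """Persist the deployment state for this server as JSON."""
    with open(path, "w") as f:
        json.dump(state, f, indent=2)

def read_state(path):
    """Load this server's deployment state."""
    with open(path) as f:
        return json.load(f)
```

With a file like this on every server, scap could poll the fleet and decide which hosts still need the new version.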

Example workflow:

  1. $ scap sync --canary=10% --version=1.33.0-wmf.6
  2. Check dashboards, error rates on logs
  3. $ scap sync --canary=20% --version=1.33.0-wmf.6

On step 3, scap will poll all servers to check which ones are already running version 1.33.0-wmf.6, calculate how many more servers it needs to reach 20% of traffic, and move forward with deploying to the additional servers.
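The "how many more servers" calculation in step 3 could be as simple as the following sketch (the function name and signature are made up for illustration; this is not scap code):

```python
import math

def servers_to_add(total, already_on_version, target_percent):
    """Return how many additional servers must be upgraded so that at
    least target_percent of the pool runs the new version.

    total:              size of the server pool
    already_on_version: servers found (by polling) to run the version
    target_percent:     desired canary percentage, e.g. 20
    """
    # Round up so we never fall short of the requested percentage.
    target = math.ceil(total * target_percent / 100)
    return max(0, target - already_on_version)

# Example: a 100-server pool with 10 canaries, moving from 10% to 20%:
servers_to_add(100, 10, 20)  # -> 10 more servers to deploy to
```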

We already have parts of this functionality implemented in scap, which is great :)

Event Timeline

jijiki triaged this task as Medium priority.Nov 19 2018, 8:57 PM
jijiki created this task.

This sounds like a workable plan. I anticipate that this will be a medium-sized project given the current state of scap. That is, I think this can be accomplished in a quarter or so given appropriate resources.

Random thoughts

tl;dr

I'm excited to see this happen. I think it's possible to do. I would like it to support more than just the train. There are lots of things to decide; in fact, I'll ramble at length about a handful :)

Beyond initial train

The proposal seems to mostly discuss initial train deployment (i.e., the example mentions a --version=1.33.0-wmf.6 flag); however, configuration deployment is responsible for almost as many incidents as the MediaWiki core repository (and outpaces all extensions) (source: https://phabricator.wikimedia.org/phame/post/view/128/incident_documentation_an_unexpected_journey/). I think it would be good to try to make this useful for all aspects of deploying MediaWiki and extensions.

There are a few aspects of deploying MediaWiki that scap touches:

  1. mediawiki-config deployment
  2. portals deployment
  3. initial train deployment
  4. train backports

Version number

To support these use cases we'll need a version number for what's deployed in production (which we currently don't have). It is probably as easy as taking the md5 of a directory tarball, but there may be confounding factors here. I implemented this by flattening /srv/mediawiki into a single git repo; git filled the disk after a period on the order of months. This feature is currently gated behind the scap3_mediawiki configuration flag, which is False everywhere.
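A cheap alternative to the md5-of-a-tarball approach is hashing the tree directly, which avoids tarball metadata (timestamps, ownership) perturbing the digest. A minimal sketch, assuming we only care about file paths and contents (`tree_hash` is a hypothetical helper, not part of scap):

```python
import hashlib
import os

def tree_hash(root):
    """Compute a deterministic version identifier for a directory tree
    by hashing every relative file path and its contents, in sorted
    order. Unlike hashing a tarball, this ignores mtimes and ownership."""
    h = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make os.walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 16), b""):
                    h.update(chunk)
    return h.hexdigest()
```

The same tree always yields the same digest, and any content change yields a new one, so the digest can stand in for a "version number" of /srv/mediawiki without keeping a growing git history on disk.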

Percentages as fixed servers

Scap currently uses dsh groups on disk on deploy1001, e.g., mediawiki-api-canaries and mediawiki-app-canaries. These files are generated by puppet using pybal (that's my understanding; @Joe implemented this IIRC and could give more detail). Perhaps this functionality could be reused to build groups based on percentage? This is mostly thinking out loud, not a fully formed thought.
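To make the idea concrete: if the dsh group files are plain host lists, deriving a percentage-based group from one could look like the sketch below (the file format assumption — one hostname per line, `#` comments — and the function name are illustrative, not scap's actual behavior):

```python
def percentage_group(dsh_file, percent):
    """Read a dsh-style host list (one hostname per line, '#' starts a
    comment line) and return the first `percent` of hosts as a canary
    group. Always returns at least one host for a non-empty list."""
    with open(dsh_file) as f:
        hosts = [line.strip() for line in f
                 if line.strip() and not line.startswith("#")]
    count = max(1, round(len(hosts) * percent / 100))
    return hosts[:count]
```

Taking a fixed prefix rather than a random sample keeps the 10% group a subset of the 20% group, so ramping up only ever touches additional servers.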

Example UI

Scap has the --force flag to skip canary checks for MediaWiki. Perhaps the UI could be more interactive unless the --force flag is used (or we could implement a --non-interactive or --batch-mode flag). That is, a session could look like:

scap sync-file wmf-config/InitialiseSettings.php 'SWAT: [[gerrit:12345|Not a real patch]]'
...
00:00:00 == Syncing Canaries by Percentage ==
00:00:01 Started Sync 10%
00:00:02 sync-10%: 100% (ok: 10; fail: 0; left: 0)
00:00:03 10% of Production Traffic Synced: Continue [y/N]? Y
00:00:04 sync-20%: 100% (ok: 10; fail: 0; left: 0)
00:00:05 20% of Production Traffic Synced: Continue [y/N]? Y
...
etc
...

Supporting longer time periods via the --percentage flag seems worthwhile as well.

brennen moved this task from Backlog to Radar on the User-brennen board.
brennen added a subscriber: brennen.