
Make canary wait time configurable
Open, Medium, Public

Description

Introduce the --canary-wait-time flag, which sets the 'canary_wait_time' config variable. Setting a larger canary wait time when needed lets us avoid deploying changes that cause issues not evident within the current 20s wait.
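A minimal sketch of how such a flag could override the config value, assuming an argparse-based command line. Only the flag name, the 'canary_wait_time' key, and the 20s default come from this task; `DEFAULT_CONFIG`, `build_parser`, and `resolve_canary_wait_time` are hypothetical names for illustration and are not scap's actual code.

```python
import argparse

# Hypothetical default config; only the key name and the 20s value come from the task.
DEFAULT_CONFIG = {"canary_wait_time": 20}


def build_parser():
    """Build a toy parser that exposes only the --canary-wait-time flag."""
    parser = argparse.ArgumentParser(prog="sync-sketch")
    parser.add_argument(
        "--canary-wait-time",
        type=int,
        default=None,  # None means "flag not given, keep the config value"
        help="seconds to wait after syncing to canaries before checking them",
    )
    return parser


def resolve_canary_wait_time(argv, config=None):
    """Return the effective wait: the flag value if given, else the config default."""
    config = dict(DEFAULT_CONFIG if config is None else config)
    args = build_parser().parse_args(argv)
    if args.canary_wait_time is not None:
        if args.canary_wait_time < 0:
            raise ValueError("canary wait time must be non-negative")
        config["canary_wait_time"] = args.canary_wait_time
    return config["canary_wait_time"]
```

In this sketch, leaving the flag out keeps the 20s default, while passing `--canary-wait-time 60` raises the wait to 60s for that run only.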

Additionally, since we do not expect all deployers to be aware of this flag, we could prompt deployers to either increase the canary wait time or keep the default.
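One way such a prompt might look, as a rough sketch: `prompt_canary_wait_time` and its `ask` parameter are hypothetical names, not part of scap. The `ask` callable defaults to Python's built-in `input` so the function can be exercised without a terminal.

```python
def prompt_canary_wait_time(default=20, ask=input):
    """Ask the deployer for a wait time; empty or invalid input keeps the default."""
    answer = ask(f"Canary wait time in seconds [{default}]: ").strip()
    if not answer:
        # Pressing Enter keeps the default.
        return default
    try:
        value = int(answer)
    except ValueError:
        # Non-numeric input falls back to the default rather than aborting.
        return default
    return value if value >= 0 else default
```

Falling back to the default on bad input keeps the prompt from blocking a deploy; a stricter variant could ask again instead.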

Details

Related Gerrit Patches:
mediawiki/tools/scap : master : Add --canary-wait-time flag

Event Timeline

jijiki triaged this task as Medium priority. Mar 8 2019, 9:36 PM
jijiki created this task.
Restricted Application added a subscriber: Aklapper. Mar 8 2019, 9:36 PM

Change 495398 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[mediawiki/tools/scap@master] Add --canary-wait-time flag

https://gerrit.wikimedia.org/r/495398

Joe moved this task from Backlog to Externally Blocked on the serviceops board. Jun 21 2019, 9:10 AM
jijiki edited projects, added Operations; removed Patch-For-Review. Jan 15 2020, 6:24 PM
jijiki added a subscriber: mark.

@thcipriani as per our discussion, we can consider merging and testing first for syncing files and then on the train. How does that sound?

jijiki updated the task description. Jan 16 2020, 8:02 AM

@thcipriani ping! :)

pong!

@thcipriani as per our discussion, we can consider merging and testing first for syncing files and then on the train. How does that sound?

Is the idea to apply this first to scap sync-* and then to scap sync? That sounds fine in terms of rollout. In terms of the code, all these syncing mechanisms follow a similar pattern (from AbstractSync), so it may end up being tricky to apply this only to sync-file and not, e.g., sync-wikiversions.

Also, per our discussion, we should be able to use the utils.ask/utils.confirm functions to tell folks:

  1. It's sync'd on canaries
  2. We've run the smoke tests against canaries/checked logstash on the canaries

And then have folks continue with rollout from there.
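A rough sketch of that confirmation step, under stated assumptions: the `confirm` argument stands in for something like utils.confirm, whose real signature is not shown in this thread, and `confirm_canary_checks` and `console_confirm` are hypothetical names.

```python
def confirm_canary_checks(confirm):
    """Return True only if the deployer confirms every canary check."""
    checks = [
        "Changes are synced to the canaries",
        "Smoke tests and logstash checks on the canaries look fine",
    ]
    for check in checks:
        # Stop at the first check the deployer does not confirm.
        if not confirm(f"{check}. Continue?"):
            return False
    return True


def console_confirm(question):
    """Hypothetical stand-in for a confirm helper: yes/no on the console."""
    return input(f"{question} [y/N]: ").strip().lower() in ("y", "yes")
```

A caller would continue with the full rollout only when `confirm_canary_checks(console_confirm)` returns True.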

If it is a lot of work to limit when --canary-wait-time is available, we could do a graceful rollout by asking deployers (via utils.ask/utils.confirm) to try this flag on, say, scap sync-*. When we are happy with the results, we can move forward and change the message. It is not ideal, but it is a quick way to push this to prod. How does that sound?

Shall we move this forward?

Change 495398 merged by jenkins-bot:
[mediawiki/tools/scap@master] Add --canary-wait-time flag

https://gerrit.wikimedia.org/r/495398

jijiki added a comment. (Edited) Feb 28 2020, 2:41 PM

Great, we merged this patch! Do we have a plan for how we will communicate this to the deployers when we release scap, and for how to test that it is all good in production? Thank you!

I had a discussion with Tyler just now. We plan the following:

  • add to docs/ in the source tree; this ends up in doc.wikimedia.org
  • add to debian/changelog in the source tree
  • update the wikitech Scap page
  • notify ops list
  • notify wikitech-l list
  • have it mentioned in Scrum of Scrums

Further, we plan on me making a new release soon, so we can get this into use sooner rather than later, and to give me more chance to practice the scap release process.

Re testing that all is good in production? Can you expand on what you have in mind?

jijiki added a comment. (Edited) Feb 28 2020, 4:09 PM

I had a discussion with Tyler just now. We plan the following:

  • add to docs/ in the source tree; this ends up in doc.wikimedia.org
  • add to debian/changelog in the source tree
  • update the wikitech Scap page
  • notify ops list
  • notify wikitech-l list
  • have it mentioned in Scrum of Scrums

Great!

Further, we plan on me making a new release soon, so we can get this into use sooner rather than later, and to give me more chance to practice the scap release process.
Re testing that all is good in production? Can you expand on what you have in mind?

We have tested this in non-production environments. What I was going for was, before announcing to lists etc., to have a go when deploying minor changes with scap, and then possibly in a group 0 train release. If that is all good, we can go ahead and let deployers know they can start using it.

Thank you!

I'm thinking the following now:

  • make a release with --canary-wait-time -- hopefully this week
    • only --canary-wait-time and a build-time acceptance test suite as the only changes
  • SRE builds new package, installs on relevant servers
  • wait for train to have deployed with new version
  • if train has scap-related problems, roll back to previous scap release
    • either revert changes and make new release, or SRE downgrades, depending on which is easier for SRE
  • if train went well as far as scap goes, announce --canary-wait-time change to scap users

How does that sound?