Diffusion Scap 98247477db5d

Improve failure handling and rollback behavior

Authored by dduvall on Dec 20 2016, 12:10 AM.

Description

Tags: Release-Engineering-Team

Maniphest Tasks: T145460, T145512, T149008

Summary:
Refactored the ssh.Job run methods, removing the unused behavior of
running without a reporter, and adding a run_with_status method that
yields each host and its exit status as remote execution completes.
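
As a rough illustration of the "yield each host and exit status as it completes" pattern, here is a minimal sketch. It is not the actual ssh.Job code: the free function, the `execute` callable (a stand-in for the SSH invocation), and the `max_workers` parameter are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_status(hosts, execute, max_workers=4):
    """Yield (host, exit_status) pairs as each execution finishes.

    `execute` is a hypothetical callable standing in for the remote SSH
    command; it takes a host name and returns an integer exit status.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(execute, host): host for host in hosts}
        # as_completed hands back futures in completion order, not
        # submission order, so callers see results as soon as they exist.
        for future in as_completed(futures):
            yield futures[future], future.result()


def fake_execute(host):
    # Illustrative stand-in: pretend the command fails on one host only.
    return 1 if host == "web2" else 0


hosts = ["web1", "web2", "web3"]
failed = [h for h, status in run_with_status(hosts, fake_execute) if status != 0]
```

Because the results arrive as a stream, a caller can count failures while the remaining hosts are still running.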

Refactored targets.TargetList to return each deploy group as a new
first-class targets.DeployGroup object, which is no longer split up
according to the group's group_size configuration. This provides a
better interface for getting information about a group in its entirety
while still allowing easy iteration over the serially deployed
"subgroups" that are derived from its size.

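The relationship between a whole group and its subgroups might look roughly like the sketch below. The class body, constructor arguments, and `subgroups` method are assumptions for illustration, not the real targets.DeployGroup interface; only the 'canary' / 'canary1' naming comes from this commit message.

```python
class DeployGroup:
    """Hypothetical stand-in for a deploy group that knows its subgroups."""

    def __init__(self, name, targets, size=None):
        self.name = name
        self.targets = list(targets)
        # Assumption: with no size configured, the group deploys as one batch.
        self.size = size or len(self.targets) or 1

    def subgroups(self):
        """Yield (label, hosts) for each serially deployed slice of the group."""
        starts = range(0, len(self.targets), self.size)
        for index, start in enumerate(starts, 1):
            yield f"{self.name}{index}", self.targets[start:start + self.size]


group = DeployGroup("canary", ["a", "b", "c"], size=2)
labels = [label for label, _ in group.subgroups()]
```

Here `group.name` stays 'canary' for the group as a whole, while iteration produces the labels 'canary1' and 'canary2'.
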
Added a failure_limit config variable to allow greater control over
what rate of failure is acceptable. This limit is configurable globally
or per group and can be either an integer number of hosts or a string
percentage of hosts (e.g. 10 or '10%').
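
One way to interpret both forms of failure_limit is sketched below. The helper name and the choice to compare against an unrounded percentage are assumptions; the real implementation may round or validate differently.

```python
def exceeds_failure_limit(failure_limit, failures, total_hosts):
    """Return True when `failures` is above the configured limit.

    `failure_limit` is an integer host count (10) or a percentage
    string ('10%'). Hypothetical helper, not part of scap's API.
    """
    if isinstance(failure_limit, str) and failure_limit.strip().endswith("%"):
        percent = float(failure_limit.strip().rstrip("%"))
        # Assumption: the percentage applies to the host total, unrounded.
        allowed = total_hosts * percent / 100.0
    else:
        allowed = int(failure_limit)
    return failures > allowed


integer_case = exceeds_failure_limit(10, 11, 100)
percent_case = exceeds_failure_limit("10%", 2, 20)
```

With these inputs, 11 failures exceed an integer limit of 10, while 2 failures out of 20 hosts do not exceed '10%'.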

Refactored deploy module's group/stage execution methods to:

  1. Skip rollback on targets for which an SSH connection fails.
  2. Only trigger rollback when the failure_limit is exceeded.
  3. Roll back all deployed groups in reverse order.
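
The three rules above can be sketched as a small control loop. Everything here is hypothetical (the `deploy` function, the callables, the status strings), and the sketch assumes that unreachable hosts count toward the limit and that the limit is checked per group; the commit message does not specify either detail.

```python
def deploy(groups, deploy_group, rollback_group, failure_limit=0):
    """Deploy groups in order; roll back in reverse once the limit is exceeded.

    `deploy_group(name, hosts)` returns {host: "ok" | "failed" | "unreachable"}.
    """
    attempted = []       # groups touched so far, in deployment order
    unreachable = set()  # hosts whose SSH connection failed

    for name, hosts in groups:
        attempted.append((name, hosts))
        results = deploy_group(name, hosts)
        unreachable.update(h for h, s in results.items() if s == "unreachable")
        failures = sum(1 for s in results.values() if s != "ok")

        if failures > failure_limit:
            # Most recently deployed group first.
            for done_name, done_hosts in reversed(attempted):
                # Hosts that could not be reached are skipped.
                reachable = [h for h in done_hosts if h not in unreachable]
                if reachable:
                    rollback_group(done_name, reachable)
            return False
    return True


outcomes = {"a": "ok", "b": "failed", "c": "unreachable"}
rolled_back = []
succeeded = deploy(
    [("canary", ["a"]), ("default", ["b", "c"])],
    deploy_group=lambda name, hosts: {h: outcomes[h] for h in hosts},
    rollback_group=lambda name, hosts: rolled_back.append((name, hosts)),
    failure_limit=1,
)
```

In this run the 'default' group has two failures against a limit of one, so 'default' is rolled back first (without the unreachable host 'c') and 'canary' second.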

Some related behavior has also changed as a result of the refactoring:

  1. The user is no longer prompted after each "subgroup" deployment, only after each originally defined deploy group.
  2. The deploy group name passed to the remote deploy-local -g process is now (correctly) the original deploy group name, not the subgroup label (e.g. 'canary' not 'canary1').
  3. The finalize stage now executes after all groups have completed their primary stages.

Fixes T149008 T145512 T145460

Test Plan:
Simulate failure on a multi-group deployment and ensure items 1-3 above behave
as described.

Reviewers: demon, mobrovac, mmodell, Release-Engineering-Team, thcipriani

Reviewed By: mmodell, Release-Engineering-Team, thcipriani

Subscribers: jenkins

Differential Revision: https://phabricator.wikimedia.org/D490