Improve failure handling and rollback behavior
Refactored ssh.Job run methods, removing the unused behavior of run
with no reporter, and adding a run_with_status method that yields each
host and its exit status as remote execution completes.
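
A minimal sketch of the "yield each host and exit status as it completes" pattern, for illustration only. It is not the ssh.Job implementation: the run_on_host callable and the max_workers parameter are hypothetical, and a thread pool stands in for real SSH processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_status(hosts, run_on_host, max_workers=4):
    """Yield (host, exit_status) pairs in completion order.

    run_on_host is a hypothetical callable that executes the remote
    command on one host and returns its integer exit status.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_on_host, host): host for host in hosts}
        for future in as_completed(futures):
            yield futures[future], future.result()
```

Because results arrive as they finish, a caller can count failures and stop early instead of waiting for every host to report.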
Refactored targets.TargetList to return each deploy group as a new
first-class targets.DeployGroup object, which is no longer split up
based on the group's group_size configuration. This provides a better
interface for getting information about a group in its entirety
while still allowing easy iteration over the serially deployed
"subgroups" that are based on its size.
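
The relationship between a group and its subgroups can be sketched roughly as follows. This is an illustrative stand-in, not the targets.DeployGroup code: the subgroups function and its parameters are hypothetical, and the numbered labels simply follow the 'canary1' naming mentioned in this change.

```python
def subgroups(name, targets, size=None):
    """Yield (label, hosts) pairs for one deploy group.

    With no size (or a size covering the whole group) the group is
    yielded once under its own name; otherwise it is split into
    consecutive chunks labelled name1, name2, ...
    """
    targets = list(targets)
    if not size or size >= len(targets):
        yield name, targets
        return
    for start in range(0, len(targets), size):
        label = "{}{}".format(name, start // size + 1)
        yield label, targets[start:start + size]


canary = list(subgroups("canary", ["host1", "host2", "host3"], size=2))
# canary == [("canary1", ["host1", "host2"]), ("canary2", ["host3"])]
```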
Added failure_limit config variable to allow for greater control over
what rate of failure is acceptable. This rate is configurable globally
or per group and can be either an integer number of hosts or a string
percentage of hosts (e.g. 10 or '10%').
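
As a rough illustration of how the two accepted forms could be reduced to a host count, consider the sketch below. The function name is hypothetical, and rounding a percentage down is an assumption; this message does not state how fractional results are handled.

```python
import math


def resolve_failure_limit(limit, host_count):
    """Return the number of failed hosts tolerated for host_count hosts.

    limit is either an integer (10) or a percentage string ('10%').
    Assumption: percentages are rounded down to a whole host.
    """
    if isinstance(limit, str) and limit.endswith('%'):
        percent = float(limit[:-1])
        return math.floor(percent * host_count / 100)
    return int(limit)


resolve_failure_limit(10, 50)     # 10 hosts
resolve_failure_limit('10%', 50)  # 5 hosts
```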
Refactored deploy module's group/stage execution methods to:
- Skip rollback on targets for which an SSH connection fails.
- Only trigger rollback when the failure_limit is exceeded.
- Roll back all deployed groups in reverse order.
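
The three rules above can be sketched as a simplified control flow. This is not the deploy module's code: deploy_groups, the deploy and rollback callables, and the 'ok' / 'failed' / 'unreachable' status strings are all hypothetical. It also assumes that connection failures count toward the failure limit, which this message does not specify.

```python
def deploy_groups(groups, deploy, rollback, failure_limit=0):
    """Deploy groups in order; roll back if failures exceed the limit.

    groups is a list of (name, hosts) pairs. deploy(host) returns
    'ok', 'failed', or 'unreachable'. rollback(name, host) reverts
    a single host.
    """
    deployed = []
    for name, hosts in groups:
        results = {host: deploy(host) for host in hosts}
        # Hosts whose SSH connection failed are never rolled back.
        reachable = [h for h, s in results.items() if s != 'unreachable']
        deployed.append((name, reachable))
        failures = sum(1 for s in results.values() if s != 'ok')
        # Rollback triggers only when the limit is exceeded.
        if failures > failure_limit:
            # Most recently deployed group is reverted first.
            for done_name, done_hosts in reversed(deployed):
                for host in done_hosts:
                    rollback(done_name, host)
            return False
    return True
```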
Some related behavior has also changed as a result of the refactoring:
- The user is no longer prompted after each "subgroup" deployment, only after each originally defined deploy group.
- The deploy group name passed to the remote deploy-local -g process is now (correctly) the original deploy group name, not the subgroup label (e.g. 'canary' not 'canary1').
- The finalize stage now executes after all groups have completed their primary stages.
Simulate failure on a multi-group deployment and ensure items 1-3 above behave as described.
Reviewers: demon, mobrovac, mmodell, Release-Engineering-Team, thcipriani
Reviewed By: mmodell, Release-Engineering-Team, thcipriani
Differential Revision: https://phabricator.wikimedia.org/D490