
Improve failure handling and rollback behavior

Authored by dduvall on Dec 2 2016, 4:25 AM.



Refactored ssh.Job run methods, removing the unused behavior of run
with no reporter and adding a run_with_status method that yields each
host and exit status as remote execution completes.
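The new method can be pictured as a generator that hands results back incrementally; a minimal sketch (the signature, host names, and the statuses stand-in mapping are illustrative, not Scap's actual API):

```python
def run_with_status(hosts, statuses):
    """Yield (host, exit_status) pairs as each remote execution
    'completes', so the caller can react to failures incrementally
    instead of waiting for the whole job to finish."""
    for host in hosts:
        # A real implementation would dispatch the command over SSH
        # and collect results asynchronously; statuses is a stand-in.
        yield host, statuses.get(host, 0)
```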

Refactored targets.TargetList to return each deploy group as a new
first class targets.DeployGroup object which is no longer split up
based on the group's group_size configuration. This provides a better
interface for getting information about a group in its entirety while
still allowing easy iteration over the serially deployed
"subgroups" that are based on its size.

Added a failure_limit config variable to allow greater control over
what rate of failure is acceptable. This rate is configurable globally
or per group and can be either an integer number of hosts or a string
percentage of hosts (e.g. 10 or '10%').
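One plausible way to resolve such a setting into an absolute host count (a sketch; the function name and truncating rounding are assumptions, not Scap's actual code):

```python
def max_failed_hosts(failure_limit, group_size):
    """Resolve a failure_limit of either an int (10) or a percentage
    string ('10%') into a number of hosts for a given group size."""
    if isinstance(failure_limit, str) and failure_limit.endswith("%"):
        # Truncate toward zero so '10%' of 55 hosts allows 5 failures.
        return int(group_size * float(failure_limit[:-1]) / 100)
    return int(failure_limit)
```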

Refactored deploy module's group/stage execution methods to:

  1. Skip rollback on targets for which an SSH connection fails.
  2. Only trigger rollback when the failure_limit is exceeded.
  3. Rollback all deployed groups in reverse order.
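The control flow described above can be sketched roughly as follows (the function names and the shapes of run_stage and rollback are hypothetical):

```python
def deploy_all(groups, failure_limit, run_stage, rollback):
    """Deploy each group in order; if failures in any group exceed
    failure_limit, roll back every deployed group in reverse order."""
    deployed = []
    for group in groups:
        deployed.append(group)
        failed_hosts = run_stage(group)  # hosts that failed this group
        if len(failed_hosts) > failure_limit:
            for g in reversed(deployed):  # newest group rolls back first
                rollback(g)
            return False
    return True
```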

Some related behavior has also changed as a result of the refactoring:

  1. The user is no longer prompted after each "subgroup" deployment, only after each originally defined deploy group.
  2. The deploy group name passed to the remote deploy-local -g process is now (correctly) the original deploy group name, not the subgroup label (e.g. 'canary' not 'canary1').
  3. The finalize stage now executes after all groups have completed their primary stages.

Fixes T149008 T145512 T145460

Test Plan

Simulate failure on a multi-group deployment and ensure items 1-3 above behave
as described.

Diff Detail

rMSCA Scap
Lint Not Applicable
Tests Not Applicable

Event Timeline

dduvall retitled this revision from to Improve failure handling and rollback behavior.
dduvall updated this object.
dduvall edited the test plan for this revision. (Show Details)

This might seem a bit goofy but since execute_stage_on_group is now a generator function, you need to iterate over its result for it to do anything.
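In other words, a generator function does no work until it is consumed; a minimal illustration:

```python
executed = []

def execute_stage(hosts):
    """Generator version of a stage: nothing runs until iterated."""
    for host in hosts:
        executed.append(host)  # stand-in for the real side effect
        yield host

gen = execute_stage(["h1", "h2"])
before = list(executed)  # still empty: calling the function ran nothing
list(gen)                # draining the generator does the actual work
```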

In testing this locally, I noticed a new issue with the failure_limit functionality. If a target fails during one stage and the threshold for failure isn't reached, the next stage will execute on said target. This is wrong obviously, but how to proceed? There are two options I can think of.

  1. Execute immediate rollback for failed targets as the final stage of each group. If the failure is an SSH connection failure, go with the next option.
  2. Simply remove the failed host from the list of targets and roll on.
dduvall edited edge metadata.

Stage execution now correctly skips failed targets and rollback executes over all reachable target hosts

Idea: would it make sense to retry the failed nodes at the end of the group/deployment if it's not an SSH connection issue? A failure can arise during: (a) the git check out; or (b) during the service restart (and checks execution). In the former case, a forced git check out might be the cure, while the latter represents a rather bizarre situation (supposing that the other nodes go through this stage successfully).


Feature request: could this be made so that if failure_limit < 1 it is considered a percentage, but if failure_limit >= 1 it is interpreted as the number of hosts?


Sure thing. If we're going to allow both an integer and a percentage, perhaps it would be better for the two formats to be n and n%?

thcipriani edited edge metadata.

This works so well! Very nice.

Added a few nits inline.

A few overall things:

  1. Even though the docs said, "Percentage, as a float between 0 and 1" I still used 10.0 meaning 10% in testing – this point is probably moot considering the discussion about using n vs n% plus it's only 1 data point.
  2. failure_limit vs canary_failure_limit – it's a little confusing to me that failure_limit isn't a percentage of all groups and [group]_failure_limit isn't a percentage of that group. That is, it was surprising that failure_limit: 0.20 was triggered when 1/10 total hosts failed (because I have my group_size set to 3).
  3. It is difficult to set [group]_failure_limit if you have group_size set. If you set group_size you get dynamic group names that look like default1 -- which you can target with default1_failure_limit, but there's no way to say: I want the default group to have a failure limit of 0.20. In practice, I suppose, this can be worked around by saying failure_limit: 0.20 canary_failure_limit: 0, but it gets a little weird.

would be better for the two formats to be n and n%?

FWIW, I think that's neat :)


This is a good call.


hrm. I worry about other things that may return 255. This may not be a very real concern...


Does something like self.execute_stage_on_group(ROLLBACK, grp, targets).next() work?

Could probably use a comment regardless.



This revision is now accepted and ready to land. Dec 6 2016, 12:36 AM

This should only be an issue if deploy-local for some reason had an exit status of 255 which I don't think is a possibility.


You'd have to call next() until a StopIteration exception is thrown, which would be more verbose and probably just as confusing. But yeah, a comment is much needed here. I'll add that.

The absolute shortest syntax would be to pass the resulting generator object to list() but that's lame, too, I think. It's just going to have to remain a little weird. :)
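For comparison, the two draining styles under discussion (shown with generic Python 3 generators; in the Python 2 code under review the method is spelled .next()):

```python
gen = (n for n in [1, 2, 3])

# Verbose: call next() until StopIteration is raised.
drained = []
try:
    while True:
        drained.append(next(gen))
except StopIteration:
    pass

# Terse: hand the generator to list(), which drains it the same way.
also_drained = list(n for n in [1, 2, 3])
```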

dduvall edited edge metadata.
  • Refactored targets.TargetList to no longer split groups but return first-class instances of a new targets.DeployGroup class
  • Stage execution methods now consider the failure_limit for the entire original group, not individual subgroups
  • Some additional behavior changed slightly as a result of this refactoring (see the commit message)
This revision is now accepted and ready to land. Dec 14 2016, 11:11 PM
mmodell edited edge metadata.

nice work!

dduvall edited edge metadata.

Fixed a small bug in the rollback message for when failure_limit is exceeded

dduvall requested a review of this revision. Dec 15 2016, 7:11 PM
dduvall edited edge metadata.

Requesting re-review after some minor changes.

thcipriani edited edge metadata.

This change isn't actually rolling back all groups, but is executing rollbacks on all groups.

The problem is that the finalize stage runs before a rollback, leaving no .in-progress file on the target and nothing to roll back to.

I tested this by dividing my deploy into 2 groups:

server_groups: canary,default
canary_dsh_targets: mockbase_canary

Then I stopped ssh on a random target that is not in the canary group (so it fails). I then created a new commit and deployed. What happens is all stages run on the canary group (including finalize), then deploy happens on the default group — it hits the broken ssh server and prompts for rollback at the end of the default group, but the canary servers still point deploy-cache/current at the newest commit because .in-progress has already been removed from those machines.

I could have sworn this worked the first time I tested this, but maybe not :)

All the rest looks good — I like the inclusion of the DeployGroup object, probably needed at this point.

This revision now requires changes to proceed. Dec 17 2016, 6:52 PM
dduvall edited edge metadata.

Yet another refactoring: simplified the rollback invocation using the existing _execute_for_group method, and moved the finalize stage to execute after the primary stages for all groups, which ensures rollback can actually function as expected.
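The resulting ordering might look like this in outline (hypothetical function names, not Scap's actual code):

```python
def run_deployment(groups, run_stages, finalize):
    """Run primary stages for every group first; only finalize
    (e.g. remove .in-progress markers) once all groups succeed,
    so rollback stays possible until the very end."""
    for group in groups:
        run_stages(group)
    for group in groups:
        finalize(group)
```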

thcipriani edited edge metadata.

Nicely done!

I left a few thinking-out-loud-type comments inline — none are critical.

Phew! Rollback is hard to think about :)


Might check out utils.confirm here. @mmodell added it -- a little fancier.


hrm. I don't know if it matters, but there may be hosts in the group that haven't actually been deployed to yet (because of group_size) on which a rollback will be attempted.

I don't think this matters. It attempts to repoint a symlink at the revision it already points to and then runs promote, which is idempotent and exits early:

21:58:30 [scap-target-10] Rolling back from revision None to e1ec15e5df8b2f076004a20321a5f495fe55436d
21:58:30 [scap-target-10] /srv/deployment/mockbase/deploy-cache/revs/e1ec15e5df8b2f076004a20321a5f495fe55436d is already live (use --force to override)
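The idempotent early exit can be sketched like so (a simplification; Scap's actual promote logic lives in deploy-local and differs in detail):

```python
import os
import tempfile

def promote(rev_dir, current_link, force=False):
    """Point current_link at rev_dir; exit early if already live."""
    if (not force and os.path.islink(current_link)
            and os.readlink(current_link) == rev_dir):
        return False  # already live; nothing to do
    tmp = current_link + ".tmp"
    os.symlink(rev_dir, tmp)
    os.replace(tmp, current_link)  # atomic swap of the symlink
    return True

# Demonstration in a throwaway directory:
d = tempfile.mkdtemp()
rev = os.path.join(d, "rev1")
os.mkdir(rev)
cur = os.path.join(d, "current")
first = promote(rev, cur)
second = promote(rev, cur)  # already live, so this is a no-op
```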

good call. Didn't realize this was a magic number down in the function call until this moment :)


very nice way to do this.

This revision is now accepted and ready to land. Dec 21 2016, 10:23 PM
dduvall marked an inline comment as done.
dduvall edited edge metadata.

Replace use of utils.ask with utils.confirm

dduvall added inline comments.

Hmm, yeah that's a side effect introduced in this most recent refactoring of reusing _execute_for_group instead of having a specific rollback method. It wasn't intentional but I think I'm also ok with the tradeoff of connection overhead for now.



dduvall marked an inline comment as done.
dduvall updated this object.

Updating Diff summary to be consistent with most recent commit message. I'm probably doing this wrong! So weird.

This revision was automatically updated to reflect the committed changes.