Scap rollback fails after promote completes
Closed, Resolved, Public
Assigned To

Authored By

	Arlolra
	Oct 24 2016, 8:40 PM

Description

== DEFAULT4 ==
:* wtp1006.eqiad.wmnet
:* wtp1017.eqiad.wmnet
:* wtp2006.codfw.wmnet
:* wtp1014.eqiad.wmnet
:* wtp1005.eqiad.wmnet
:* wtp2017.codfw.wmnet
:* wtp1010.eqiad.wmnet
parsoid/deploy: fetch stage(s): 100% (ok: 7; fail: 0; left: 0)                  
parsoid/deploy: config_deploy stage(s): 100% (ok: 7; fail: 0; left: 0)          
20:29:29 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default4', 'promote', '--refresh-config'] on wtp2017.codfw.wmnet returned [1]: 
parsoid/deploy: promote and restart_service stage(s): 100% (ok: 6; fail: 1; left: 0)
20:29:29 1 targets had deploy errors
Stage 'promote' failed on group 'default4'. Perform rollback? [y]: y
parsoid/deploy: rollback stage(s): 100% (ok: 7; fail: 0; left: 0)               
20:33:03 Finished Deploy: parsoid/deploy (duration: 08m 21s)

This may be a duplicate of T149008, but this time the rollback didn't happen even though it was explicitly requested. The deploy finished, but the targets were confirmed to still be on the new commit.

Related Objects

Mentioned In: T150267: scap rollback behavior problems
T149115: Deploy failed on wtp2017.codfw.wmnet
T149008: Canary doesn't rollback if you don't continue
Mentioned Here: T145460: Rollback failed when target is down
T149008: Canary doesn't rollback if you don't continue

Event Timeline

Arlolra created this task. Oct 24 2016, 8:40 PM

Restricted Application added a subscriber: Aklapper. Oct 24 2016, 8:40 PM

This one seems different from T145460: Rollback failed when target is down, but it happens for the same reasons. As part of the promote stage, we remove the .in-progress symlink. Without an .in-progress symlink, scap doesn't realize it can roll back.
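The failure mode described above can be illustrated with a minimal sketch. It assumes a hypothetical target layout in which `.in-progress` and `.done` are symlinks inside a deploy directory, and the functions `can_roll_back` and `promote` are made up for illustration; none of this is scap's actual code.

```python
import os
import tempfile


def can_roll_back(deploy_dir):
    """Hypothetical check: rollback is only considered possible while
    an .in-progress symlink exists in the deploy directory."""
    return os.path.islink(os.path.join(deploy_dir, ".in-progress"))


def promote(deploy_dir):
    """Hypothetical promote: point .done at the new rev and remove
    .in-progress in the same step, mirroring the coupling described above."""
    in_progress = os.path.join(deploy_dir, ".in-progress")
    done = os.path.join(deploy_dir, ".done")
    new_rev = os.readlink(in_progress)
    if os.path.islink(done):
        os.unlink(done)
    os.symlink(new_rev, done)
    os.unlink(in_progress)


def demo():
    """Return whether rollback looks possible before and after promote."""
    with tempfile.TemporaryDirectory() as deploy_dir:
        # A dangling symlink is enough for the illustration.
        os.symlink("revs/new", os.path.join(deploy_dir, ".in-progress"))
        before = can_roll_back(deploy_dir)
        promote(deploy_dir)
        after = can_roll_back(deploy_dir)
        return before, after
```

In this sketch, `demo()` returns `(True, False)`: once promote has run on a target, nothing is left to tell a later rollback that there is anything to undo.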

We will get the rollback fixes on the roadmap; this behavior is bad and unexpected.

In the interim, you can work around this by deploying a specific revision with the --rev flag to roll back. It should work with things like scap deploy --rev HEAD^. This is not an ideal solution, especially in a panic rollback scenario, but it does allow moving back to a specific commit.

Adding @dduvall in case he has any thoughts about rollback behavior.

thcipriani mentioned this in T149008: Canary doesn't rollback if you don't continue. Oct 24 2016, 9:12 PM

Fair enough. Thanks @thcipriani

In T149012#2739787, @thcipriani wrote:

Adding @dduvall in case he has any thoughts about rollback behavior.

Oh state...

There are currently a few final operations tacked on to the end of the local target deploy process: removal of the .in-progress link, creation of the .done link, and cleanup of old rev directories. Coupling these operations with that of the last deploy stage means that the "done" state can reflect only the deploy result of each target on its own and not the overall deploy result across targets.

One possible refactoring I can think of would be to change this behavior by decoupling those state-keeping and cleanup operations into a first-class "finalize" or "cleanup" stage. It would add slightly more overhead—an additional SSH connection/execution per target host—but would make post-promote rollback possible with the current rollback design.
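A rough sketch of that refactoring, using an in-memory stand-in for the targets. The `Target` class, its fields, and the `deploy` function are all hypothetical names chosen for illustration and do not reflect scap's real structure. The point is only that state-keeping moves out of promote into a separate finalize step, which runs after every target has promoted successfully.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Target:
    """Hypothetical in-memory model of one deploy target."""
    name: str
    current: str                       # rev currently being served
    done: str                          # last rev marked as fully deployed
    in_progress: Optional[str] = None  # rev being deployed, if any
    fail_promote: bool = False         # simulate a failing promote

    def fetch(self, rev: str) -> None:
        self.in_progress = rev

    def promote(self) -> bool:
        # Promote switches the served rev but leaves in_progress alone.
        if self.fail_promote:
            return False
        self.current = self.in_progress
        return True

    def rollback(self) -> bool:
        # Rollback is only possible while a deploy is marked in progress.
        if self.in_progress is None:
            return False
        self.current = self.done
        self.in_progress = None
        return True

    def finalize(self) -> None:
        # State-keeping happens here, as its own stage.
        self.done = self.in_progress
        self.in_progress = None


def deploy(targets: List[Target], rev: str) -> bool:
    for t in targets:
        t.fetch(rev)
    results = [t.promote() for t in targets]
    if not all(results):
        # Every target still has in_progress set, so rollback works everywhere.
        for t in targets:
            t.rollback()
        return False
    for t in targets:
        t.finalize()
    return True
```

With one target set to `fail_promote=True`, `deploy` returns `False` and every target ends up back on its previous rev, including those whose promote succeeded.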

Another option might be to dumb down the rollback deploy-local process and just have the main deploy process tell it which rev to roll back to, kind of like the workaround you suggested. Though, as you mentioned, it's not ideal when you want to ensure the consistency of the rollback result to something proven to be safe, a repo state otherwise untouched by the current scap run.
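This second option could look roughly like the following, where the main process snapshots each target's rev before touching anything and hands that rev back on failure. The function name, the `targets` dict, and the `promote` callable are assumptions made for the sketch, not scap interfaces.

```python
def deploy_with_explicit_rollback(targets, new_rev, promote):
    """Deploy new_rev, rolling back to a recorded rev on any failure.

    targets: dict mapping hostname -> current rev (mutated in place).
    promote: hypothetical callable(host, rev) -> bool.
    """
    # Snapshot taken by the main process before any target changes.
    previous = dict(targets)
    failed = False
    for host in targets:
        if promote(host, new_rev):
            targets[host] = new_rev
        else:
            failed = True
    if failed:
        # Targets do not consult any local state; they are told the rev.
        for host, rev in previous.items():
            targets[host] = rev
    return not failed
```

The trade-off noted above still applies: the rollback lands on whatever rev the main process recorded, not on a state the target itself can vouch for.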

In T149012#2740375, @dduvall wrote:

One possible refactoring I can think of would be to change this behavior by decoupling those state-keeping and cleanup operations into a first-class "finalize" or "cleanup" stage. It would add slightly more overhead—an additional SSH connection/execution per target host—but would make post-promote rollback possible with the current rollback design.

This was definitely my first thought as well. Seems like the easiest way to make things behave as expected.

Arlolra mentioned this in T149115: Deploy failed on wtp2017.codfw.wmnet. Oct 25 2016, 8:29 PM

thcipriani mentioned this in T150267: scap rollback behavior problems. Nov 8 2016, 4:37 PM

Implemented in {D439}
