
Scap rollback fails after promote completes
Closed, ResolvedPublic

Description

== DEFAULT4 ==
:* wtp1006.eqiad.wmnet
:* wtp1017.eqiad.wmnet
:* wtp2006.codfw.wmnet
:* wtp1014.eqiad.wmnet
:* wtp1005.eqiad.wmnet
:* wtp2017.codfw.wmnet
:* wtp1010.eqiad.wmnet
parsoid/deploy: fetch stage(s): 100% (ok: 7; fail: 0; left: 0)                  
parsoid/deploy: config_deploy stage(s): 100% (ok: 7; fail: 0; left: 0)          
20:29:29 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'parsoid/deploy', '-g', 'default4', 'promote', '--refresh-config'] on wtp2017.codfw.wmnet returned [1]: 
parsoid/deploy: promote and restart_service stage(s): 100% (ok: 6; fail: 1; left: 0)
20:29:29 1 targets had deploy errors
Stage 'promote' failed on group 'default4'. Perform rollback? [y]: y
parsoid/deploy: rollback stage(s): 100% (ok: 7; fail: 0; left: 0)               
20:33:03 Finished Deploy: parsoid/deploy (duration: 08m 21s)

This may be a duplicate of T149008, but this time the rollback didn't happen when it was explicitly requested. The deploy finished but the targets were confirmed to still be on the new commit.

Event Timeline

thcipriani triaged this task as Medium priority. Oct 24 2016, 9:08 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.
thcipriani subscribed.

This one seems different from T145460: Rollback failed when target is down, but it fails for the same reason: as part of the promote stage we remove the .in-progress symlink, and without an .in-progress symlink scap doesn't realize it can roll back.
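A minimal sketch of the failure mode described above (the helper name and directory layout are hypothetical, not actual scap internals): rollback is only attempted while the .in-progress link exists, and promote's cleanup removes it.

```python
import os
import tempfile

def can_rollback(deploy_dir):
    # Hypothetical helper: the rollback path keys off the .in-progress
    # link; once it is gone, there is no recorded rollback point.
    return os.path.lexists(os.path.join(deploy_dir, ".in-progress"))

# Demonstrate: promote's cleanup removes the link, so a later
# rollback request finds nothing to roll back.
deploy_dir = tempfile.mkdtemp()
link = os.path.join(deploy_dir, ".in-progress")
os.symlink("revs/abc123", link)      # set up mid-deploy state
assert can_rollback(deploy_dir)      # mid-deploy: rollback possible
os.unlink(link)                      # promote's final cleanup step
assert not can_rollback(deploy_dir)  # post-promote: rollback is a no-op
```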

We will get the rollback fixes on the roadmap; this behavior is bad and unexpected.

In the interim, you can work around this by deploying a specific revision, using the --rev flag to roll back. It should work with something like scap deploy --rev HEAD^. Not an ideal solution, especially in a panic-rollback scenario, but it does allow moving back to a specific commit.
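The workaround amounts to picking the previous known-good rev by hand. A sketch of that idea (the function and list-based history are illustrative only; scap itself resolves HEAD^ through git):

```python
def rollback_rev(history):
    """Pick the rev to roll back to, given deployed revs oldest-first.
    Expresses the same intent as `scap deploy --rev HEAD^`: the commit
    just before the one currently deployed. (Illustrative only.)"""
    if len(history) < 2:
        raise ValueError("nothing to roll back to")
    return history[-2]

# e.g. after deploying c3 on top of c2:
print(rollback_rev(["c1", "c2", "c3"]))  # prints c2
```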

thcipriani renamed this task from Did not rollback? to Scap rollback fails after promote completes. Oct 24 2016, 9:08 PM
thcipriani added a subscriber: dduvall.

Adding @dduvall in case he has any thoughts about rollback behavior.

Oh state...

There are currently a few final operations tacked on to the end of the local target deploy process: removal of the .in-progress link, creation of the .done link, and cleanup of old rev directories. Coupling these operations with that of the last deploy stage means that the "done" state can reflect only the deploy result of each target on its own and not the overall deploy result across targets.
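Those three end-of-deploy operations might look roughly like this (a sketch under an assumed layout of .in-progress/.done links at the top level and rev directories under revs/; not actual scap code):

```python
import os
import shutil

def finalize_local(deploy_dir, rev, keep=5):
    # 1. Remove the .in-progress link. After this point the current
    #    rollback design can no longer find a rollback target.
    in_progress = os.path.join(deploy_dir, ".in-progress")
    if os.path.lexists(in_progress):
        os.unlink(in_progress)
    # 2. Create the .done link pointing at the newly promoted rev.
    done = os.path.join(deploy_dir, ".done")
    if os.path.lexists(done):
        os.unlink(done)
    os.symlink(os.path.join("revs", rev), done)
    # 3. Clean up old rev directories, keeping only the newest few.
    revs_dir = os.path.join(deploy_dir, "revs")
    for old in sorted(os.listdir(revs_dir))[:-keep]:
        shutil.rmtree(os.path.join(revs_dir, old))
```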

One possible refactoring I can think of would be to change this behavior by decoupling those state-keeping and cleanup operations into a first-class "finalize" or "cleanup" stage. It would add slightly more overhead—an additional SSH connection/execution per target host—but would make post-promote rollback possible with the current rollback design.
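Sketched in terms of the overall deploy loop (the structure is hypothetical; the point is that .in-progress survives until every target has promoted successfully):

```python
STAGES = ["fetch", "config_deploy", "promote"]

def run_deploy(targets, execute):
    # `execute(target, stage)` returns True on success. Hypothetical
    # driver showing the proposed split: per-stage work first, then a
    # dedicated finalize stage that removes .in-progress only once
    # *all* targets have promoted successfully.
    for stage in STAGES:
        if any(not execute(t, stage) for t in targets):
            # .in-progress still exists on every target, so the
            # existing rollback design works even after promote.
            for t in targets:
                execute(t, "rollback")
            return "rolled back"
    for t in targets:
        execute(t, "finalize")  # the extra SSH round-trip per target
    return "finalized"
```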

Another option might be to dumb down the rollback deploy-local process and just have the main deploy process tell it which rev to roll back to, kind of like the workaround you suggested. Though, as you mentioned, it's not ideal when you want to ensure the rollback lands on something proven safe: a repo state otherwise untouched by the current scap run.

> One possible refactoring I can think of would be to change this behavior by decoupling those state-keeping and cleanup operations into a first-class "finalize" or "cleanup" stage. It would add slightly more overhead—an additional SSH connection/execution per target host—but would make post-promote rollback possible with the current rollback design.

This was definitely my first thought as well. Seems like the easiest way to make things behave as expected.

dduvall claimed this task.

Implemented in {D439}