scap rollback behavior problems
Closed, Resolved (Public)

Description

Parsoid deployments have surfaced a few problems with the scap rollback behavior:

  1. T145460: Rollback failed when target is down
  2. T149012: Scap rollback fails after promote completes

Both of those issues are symptoms of 2 broader problems:

  1. Rollback only happens per group. With group_size: 7 and 40 hosts, a failure in group 5 rolls back only the hosts in group 5 to the previous revision. Groups 1-4 are left running the new version while groups 5-7 run the old version, and scap reports this state as a successful rollback.
  2. Rollback relies on two flags, .in-progress and .done, that are set on the host during the fetch and promote stages, respectively. If the promote stage fails on a single host within a group, the .in-progress flag will most likely have already been removed from the other hosts in the group, so those hosts have no previous revision to roll back to. Likewise (but less critically), if the fetch stage fails for a host, the .in-progress flag is never set for that host, so rollback fails for that host.
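
To make the first problem concrete, here is a minimal sketch of a deploy loop that rolls back every host touched so far rather than only the failing group. It is not scap's code: `chunk`, `deploy_in_groups`, and the `deploy`/`rollback` callables are hypothetical names used for illustration, and hosts are handled serially here for simplicity.

```python
from typing import Callable, List, Sequence, Tuple


def chunk(hosts: Sequence[str], group_size: int) -> List[List[str]]:
    """Split hosts into consecutive groups of at most group_size hosts."""
    return [list(hosts[i:i + group_size]) for i in range(0, len(hosts), group_size)]


def deploy_in_groups(
    hosts: Sequence[str],
    group_size: int,
    deploy: Callable[[str], None],    # hypothetical: raises on failure
    rollback: Callable[[str], None],  # hypothetical: raises if the host is unreachable
) -> Tuple[bool, List[str], List[str]]:
    """Deploy group by group; on failure, roll back every host touched so far.

    Returns (success, rolled_back_hosts, hosts_whose_rollback_failed).
    """
    touched: List[str] = []
    for group in chunk(hosts, group_size):
        try:
            for host in group:
                # Record the host before deploying so that a host which
                # fails halfway through is still included in the rollback.
                touched.append(host)
                deploy(host)
        except Exception:
            rolled_back: List[str] = []
            unreachable: List[str] = []
            for host in reversed(touched):
                try:
                    rollback(host)
                    rolled_back.append(host)
                except Exception:
                    # A down target must not abort the rollback of the rest.
                    unreachable.append(host)
            return False, rolled_back, unreachable
    return True, [], []
```

In this sketch, a rollback counts as complete only when the list of hosts whose rollback failed is empty, so a partially rolled-back fleet is never reported as a success.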

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani moved this task from Needs triage to Debt on the Scap board.
thcipriani added a revision: Restricted Differential Revision.

maybe the in_progress flag should be a log instead of a lock file?
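
A rough sketch of what a log could look like, assuming an append-only JSON-lines file per host. The file format and the `record` and `rollback_target` helpers are made up for illustration and are not part of scap. Since entries are only ever appended, the previous revision can still be found after a later stage fails or a flag would have been cleared.

```python
import json
from pathlib import Path
from typing import Optional


def record(journal: Path, stage: str, rev: str) -> None:
    """Append one completed stage (e.g. "fetch" or "promote") to the journal."""
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"stage": stage, "rev": rev}) + "\n")


def rollback_target(journal: Path, failed_rev: str) -> Optional[str]:
    """Return the most recently promoted revision other than failed_rev.

    Returns None when the journal is missing or holds no earlier promote.
    """
    if not journal.exists():
        return None
    target: Optional[str] = None
    with journal.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            if entry["stage"] == "promote" and entry["rev"] != failed_rev:
                target = entry["rev"]
    return target
```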

Because transactional transactions are tautological