There have been a few problems with the scap rollback behavior that have been noted by parsoid:
Both of those issues are symptoms of two broader problems:
- Rollback only happens per group. With group_size: 7 and 40 hosts, a failure in group 5 rolls back only the hosts in group 5 to the previous revision. Groups 1-4 keep running the new version, groups 5-7 run the old version, and scap reports this mixed state as a successful rollback.
- Rollback relies on two flags, .in-progress and .done, which are set on each host during the fetch and promote stages, respectively. If the promote stage fails on a single host within a group, the .in-progress flag will most likely already have been removed from the other hosts in the group, so those hosts have no previous revision to roll back to. Likewise (but less critically), if the fetch stage fails for a host, the .in-progress flag is never set for that host, so rollback fails for that host.
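
The two problems above can be illustrated with a small toy model. This is only a sketch, not scap's actual code: every function name is hypothetical, the hosts are simply split into seven groups to match the numbers in the example (how scap derives groups from group_size is not modeled), and a successful promote is assumed to replace .in-progress with .done, as the description implies.

```python
# Toy model of the rollback problems described above. NOT scap's code:
# all names are hypothetical and the grouping rule is an assumption.

def split_into_groups(hosts, n_groups):
    """Split hosts into n_groups contiguous groups of near-equal size."""
    base, extra = divmod(len(hosts), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < extra else 0)
        groups.append(hosts[start:start + size])
        start += size
    return groups


def deploy_by_group(groups, failing_group):
    """Deploy group by group; on failure, revert only the failing group."""
    state = {host: "old" for group in groups for host in group}
    for number, group in enumerate(groups, start=1):
        for host in group:
            state[host] = "new"
        if number == failing_group:
            # Per-group rollback: earlier groups are left on the new version.
            for host in group:
                state[host] = "old"
            break  # later groups are never deployed
    return state


def hosts_able_to_roll_back(group, failed_promote_host):
    """Return the hosts in a group that still carry the .in-progress flag."""
    # Assumption: fetch succeeded everywhere, so every host has .in-progress.
    flags = {host: {".in-progress"} for host in group}
    for host in group:
        if host != failed_promote_host:
            # Assumption: a successful promote swaps .in-progress for .done.
            flags[host] = {".done"}
    return [host for host in group if ".in-progress" in flags[host]]


if __name__ == "__main__":
    hosts = [f"host{i:02d}" for i in range(1, 41)]
    groups = split_into_groups(hosts, 7)
    state = deploy_by_group(groups, failing_group=5)
    print(sum(v == "new" for v in state.values()), "hosts on the new version")
    print(sum(v == "old" for v in state.values()), "hosts on the old version")
    print(hosts_able_to_roll_back(groups[4], failed_promote_host="host25"))
```

In this model the "rolled back" deployment ends with groups 1-4 on the new version and groups 5-7 on the old one, and within the failing group only the host whose promote failed still has the flag that rollback depends on.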