
Rollback failed when target is down
Closed, Resolved · Public


Before deploying, it was noted that wtp2019.codfw.wmnet was down, but since that host is in the backup cluster, it seemed fine to proceed. However, by mistake, the target was not removed before the deploy was attempted.

The canaries deployed cleanly, but the fetch stage failed for the first group, since the above node was part of it. When asked to roll back, the rollback failed for the same reason, and the deploy exited with the canaries still on the new commit.

See the paste P4029 for the logs.

Maybe scap should verify all the targets are up before attempting the deploy?
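As a rough illustration of that suggestion, here is a minimal sketch of a pre-flight check that splits targets into reachable and unreachable hosts before anything is deployed. This is not scap's code: `is_reachable` and `partition_targets` are hypothetical names, and probing TCP port 22 is an assumption about how "up" might be defined.

```python
import socket


def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumption; any port the deploy relies on would do.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def partition_targets(targets, probe=is_reachable):
    """Split targets into (up, down) lists, preserving the input order."""
    up, down = [], []
    for host in targets:
        (up if probe(host) else down).append(host)
    return up, down
```

A deploy could then refuse to start, or ask for confirmation, whenever `down` is non-empty. The `probe` argument is injectable so the check can be exercised without touching the network.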

Event Timeline

Arlolra created this task. Sep 12 2016, 9:14 PM
Restricted Application added a subscriber: Aklapper. Sep 12 2016, 9:14 PM

Mentioned in SAL [2016-09-12T21:17:28Z] <arlolra> For completeness, "back" in my last log is a mistake. I scap deployed the wrong --rev, but that was ultimately the version we wanted deployed anyways, so no harm no foul. (T145460)

thcipriani triaged this task as Medium priority. Sep 13 2016, 3:40 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.
thcipriani added subscribers: dduvall, thcipriani.


rollback stage(s): 100% (ok: 6; fail: 1; left: 0)

tl;dr: I'm mostly offloading working memory to ticket format.

How this works

Currently, we keep two symlinks in the /srv/deployment/parsoid/deploy-cache directory during a deployment:

  1. .in-progress
  2. .done

The .in-progress symlink is created after we fetch the new code to the machine, but before the new code is checked out or the service is restarted. It points to the revision currently being deployed in /srv/deployment/parsoid/deploy-cache/revs.

.done points to the last revision that was successfully deployed.

After the new code is deployed and all checks pass, the .in-progress symlink is moved to .done.

In the case of a rollback, we make sure that .done links to the same thing as current.
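The symlink lifecycle described above can be sketched roughly as follows. This is an illustration, not scap's implementation: the function names are hypothetical, and swapping links via a temporary name plus `os.replace` is my assumption about how to keep each step atomic.

```python
import os


def _atomic_symlink(target, link_path):
    """Point link_path at target by creating a temp link and renaming it."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(target, tmp)
    # Renaming a symlink replaces link_path itself; it does not follow it.
    os.replace(tmp, link_path)


def mark_in_progress(cache_dir, rev):
    """After the fetch: point .in-progress at the rev being deployed."""
    rev_dir = os.path.join(cache_dir, "revs", rev)
    _atomic_symlink(rev_dir, os.path.join(cache_dir, ".in-progress"))


def promote_to_done(cache_dir):
    """After all checks pass: move .in-progress to .done."""
    os.replace(
        os.path.join(cache_dir, ".in-progress"),
        os.path.join(cache_dir, ".done"),
    )
```

With this shape, a host where the fetch never happened has no .in-progress link at all, which is the situation the rollback ran into.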

Why it's broken

If we were not able to make an .in-progress symlink, then there is no point in checking the .done symlink.

It should maybe be logged as a soft failure, but not halt the rollback process.

Maybe we should never halt a rollback process on an individual machine's failure.
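To make that idea concrete, here is a minimal sketch of a rollback loop that records per-host failures and keeps going instead of aborting. It is a design illustration only: `rollback_all` and the `rollback_one` callable are hypothetical and do not correspond to scap's actual API.

```python
def rollback_all(targets, rollback_one):
    """Try to roll back every target; never stop on a single host's failure.

    rollback_one is a hypothetical callable that rolls back one host and
    raises on error. Returns (ok_hosts, failures), where failures maps each
    failed host to the exception it raised.
    """
    ok, failures = [], {}
    for host in targets:
        try:
            rollback_one(host)
        except Exception as exc:  # soft failure: record it and continue
            failures[host] = exc
        else:
            ok.append(host)
    return ok, failures
```

The caller can then log each entry in `failures` as a soft failure and still report the rollback as complete for the hosts in `ok`, so reachable hosts such as the canaries do not stay on the new commit because one target is down.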

@dduvall do you have any thoughts about the design here?

This task seems somewhat related to T145512: Allow failures for a percentage of targets.