
Avoid unfinished train deploys over holidays, weekends, or other stretches of no-deploy days
Status: Closed, Declined (Public)

Description

Motivation

What went poorly? […]

  • Patches from today's codebase had to be ported to the master state of 2 weeks ago and 3 weeks ago, because production was left in a multi-version state over both a weekend and a full no-deploy week. Managing multiple versions is inevitable to some extent, but in general Tue-Wed-Thu already feels like long enough to juggle two branches, never mind 3+ weeks.

Proposal

Adopt a policy that the train must not be left paused mid-way over one or more no-deploy days, such as holidays, weekends, and other days or weeks of no-deploy time.

By Friday (assuming a regular work week) we must either roll forward or roll back. Weekend incident investigation should never have to deal with a multi-version deployment. Even if the train is not blocked and there simply wasn't time to roll out completely, either roll back or roll forward.

Event Timeline

Do you mean that if by Friday, we haven't gotten to group2, we roll back all groups to the previous train version? And start over from testwikis and group0 on Monday?

> Do you mean that if by Friday, we haven't gotten to group2, we roll back all groups to the previous train version? And start over from testwikis and group0 on Monday?

Yes. Or even go straight back to group1 on Monday, or balance the risk the other way and move on to group2 on Thursday/Friday after all. As long as the train isn't left split over any no-deploy days.

Krinkle renamed this task from Adopt policy to not leave train deploys unfinished over holidays, weekends, or other stretches of no-deploy days. to Avoid unfinished train deploys over holidays, weekends, or other stretches of no-deploy days.Sep 10 2020, 3:53 PM

Having pondered this last week, while I was the train conductor, I have no strong opinions for or against.

I'm happy to leave this for others to decide.

brennen moved this task from Backlog to Radar on the User-brennen board.
brennen subscribed.

Sounds like a good idea to me. The only possible hiccup: if we end up rolling everything back more regularly, we need to be more mindful of any config changes that also need to be reverted.

Particularly if the change targeted a group0 wiki early in the week, so that by Friday the need to revert it is no longer at the front of anyone's mind.

I think that while we should try to avoid such a situation, mandating a roll-forward or rollback by policy would remove the ability of the people managing releases to make a judgement call, which is almost never a good idea. And that's not counting that, in some cases, rolling back after several days can itself be somewhat problematic.

So: I agree with making this a general recommendation, but I strongly oppose making this a mandatory policy.

I agree with Joe. In most cases it's not a good idea and needs constant attention, but we should allow RelEng to keep the train split in some cases, if warranted for any reason.

If the new version is already on group1, the churn would be somewhat disruptive for those wikis.

Mentioned in SAL (#wikimedia-operations) [2021-02-11T23:44:47Z] <twentyafterfour> Train status for wmf.30 (T271344) is blocked until monday. leaving wmf.30 on group1 and wmf.27 on group2 in spite of T260401

After talking this through in the Release-Engineering-Team meeting I think we're going to decline this one.

I think it's reasonable to say that train state is sometimes confusing -- it's easy to check https://versions.toolforge.org/, but an educated guess about the current train status is often wrong. The desire for more predictability is reasonable; however, additional predictability must not come at the expense of stability.

For example, the incident to which this task is a response:

Patches from today's codebase had to be ported to the master state of 2 weeks ago and 3 weeks ago, because production was left in a multi-version state over both a weekend and a full no-deploy week.

If we had rolled back completely we'd still have to backport that code and juggle the branches. One of those branches wouldn't have been serving production traffic (which is good); however, the fact that a branch is not serving production traffic also creates the risk that it may not get the backport at all. That is, the fix gets backported to the version from 2 weeks ago (since it's the only version that's live) and merged in mainline branches, but not backported to the version we'd intend to roll forward after the weekend or holiday -- creating a different type of confusion.

This proposal also limits the time a branch spends in production. Time on group1, even in a rolled-back state, provides valuable information about the train and production. Often, bugs surface in production only when users exercise unusual code paths. Any mandate that limits the time a branch spends in production comes at the expense of the information gleaned in that additional time.

I think that there are two laudable goals in this task:

  1. Avoid creating situations where we're running code from 3 weeks ago
  2. Keep the branches currently in production predictable and easy to reason about

I'd be open to different ideas that address them.