Merge to deployed branches instead of cutting a new deployment branch every week.
Closed, Declined (Public)

Description

Instead of creating a new deployment branch every Tuesday, I'd like to maintain 3 longer-lived deployment branches, one per group of wikis: group0, group1 and group2.

To continue the same release cadence that we currently have in place, we would do this instead:

  • Merge from master into group0 on Tuesday.
  • Merge from group0 to group1 on Wednesday.
  • Merge from group1 to group2 on Thursday.

What has to change

Current practices during SWAT often involve either committing the same change twice (once on master, once on the deployment branch) or cherry-picking the commit from master to the deployment branch. Neither practice is workable for long-lived branches. Since master will be merged into the deployment branch, the same change ends up being applied twice, which can result in a merge conflict. Cherry-picking introduces the same change without git tracking the relationship between the two commits, so it causes merge conflicts too.

So the solution is for hotfixes to be prepared as follows:

  1. Branch from master -> new topic branch
  2. Make the change and commit
  3. Merge your topic branch to master
  4. During SWAT, merge that same topic branch into the deployment branch.

This will ensure clean merges and sane branch history.
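
The four steps above can be tried out in a throwaway repository. This is only a minimal sketch: the branch names group0 and hotfix_xyz and the file name are placeholders, and it assumes the deployment branch already contains the commit that the topic branch starts from.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make the initial branch name explicit

echo base > file.txt
git add file.txt
git commit -q -m 'Initial commit'
git branch group0                         # placeholder long-lived deployment branch

# Steps 1 and 2: branch from master, make the change and commit.
git checkout -q -b hotfix_xyz master
echo fix >> file.txt
git commit -q -am 'Hotfix xyz'

# Step 3: merge the topic branch to master.
git checkout -q master
git merge -q --no-ff --no-edit hotfix_xyz

# Step 4: during SWAT, merge that same topic branch into the deployment branch.
git checkout -q group0
git merge -q --no-ff --no-edit hotfix_xyz

# The next train merge (master -> group0) sees the hotfix as shared history.
git merge -q --no-edit master

hotfix_branches=$(git branch --contains hotfix_xyz --format='%(refname:short)' | sort | xargs)
echo "branches containing the hotfix commit: $hotfix_branches"
```

Because both branches contain the very same commit (not a copy of it), git branch --contains can answer "where has this fix landed?" directly.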

Historical rambling follows:

The way we deploy MediaWiki is horribly convoluted, tedious and error prone. The process has grown a lot of cruft over the years and very little has gotten cleaned up. Those who have had to deal with the complexity have been more or less content to live with it / too busy to do anything about it.

As the new guy saddled with the weekly 'train deployment' responsibilities, I am not numb to the problems and I'm not content to continue with a system that is so badly broken.

Problems with the current system

For those who are not fully familiar with the process, see Train_deploys. Even after multiple deploys it's still very difficult to follow without making any mistakes.

  • Way too many steps - it's time consuming, tedious and error-prone
  • Steep learning curve, low bus-factor
  • Very fragile, lots of opportunities to kill entire groups of wikis.
    • Missing a step easily leads to breaking production wikis with no warning and no immediate indication that anything went wrong
    • Worst case scenario: a single mistyped command could bring down all wikipedia.org wikis.
    • Don't take my word for it, read this: P4469 (IBrokeWikipediaList; original).
  • Every week we create a full clone of MediaWiki core, plus one for each of the deployed extensions.
    • This is slow and wastes a lot of storage/bandwidth.
    • Even worse, this stresses gerrit and in turn lowers everyone's productivity by delaying CI test results, slowing developer commits, pulls, code reviews, merges...
  • We create a new branch on every deployed extension, then proceed to pin them to a specific commit (rather than a branch) via submodules. Branches are not necessary or even appropriate. We are just creating lots of digital garbage that won't ever be collected. Tags would be much more appropriate for marking weekly release milestones.
  • There is no automated clean-up of old data, so removing old branches and related cached files must be performed manually (yet another error prone and easily overlooked task)
  • Security patches aren't automatically carried forward from one week to the next; they must be manually applied each time we cut a new branch.

Proposed improvements

  • Use git-new-workdir instead of cloning the entire remote repo each time we push a new release
  • Use tags instead of a new branch for each weekly revision. We should only branch for a new 1.x version number; a weekly milestone hardly justifies an entirely new branch.
  • We need to deal with the multiple versions of MediaWiki in production symbolically instead of referring to specific versions.
  • The deployment process should be one or two commands at most, not a whole series of complex and interdependent commands interrupted by gerrit submissions, rollbacks, +2ing of one's own patches, etc.
  • Security patches that aren't merged in gerrit should be carried forward automatically, no intervention required.
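
As a rough illustration of the first two bullets, here is a sketch that marks weekly milestones with tags and checks out two versions side by side from a single repository. Note the substitution: it uses the built-in git worktree command (available since git 2.5) in place of the contrib git-new-workdir script named above, and the tag names, directory names and file are made up for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

top=$(mktemp -d)
git init -q "$top/core"
cd "$top/core"

echo 'release 1' > version.txt
git add version.txt
git commit -q -m 'Week 1'
git tag 1.25wmf1                  # a tag, not a branch, marks the weekly milestone

echo 'release 2' > version.txt
git commit -q -am 'Week 2'
git tag 1.25wmf2

# One extra working directory per deployed version, all sharing one object store,
# so no second clone of the remote repository is needed.
git worktree add --detach "$top/php-1.25wmf1" 1.25wmf1 >/dev/null 2>&1
git worktree add --detach "$top/php-1.25wmf2" 1.25wmf2 >/dev/null 2>&1

v1=$(cat "$top/php-1.25wmf1/version.txt")
v2=$(cat "$top/php-1.25wmf2/version.txt")
echo "checked out side by side: $v1, $v2"
```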

Branching

For a history lesson in how we got to where we are, see this mailing list thread: MediaWiki core deployments starting in April, and how that might work

@RobLa-WMF wrote:

One plan would be to have a "wmf" branch that does not trail far
behind the master. The extensions we deploy to the cluster can be
included as submodules for that given branch. The process for
deployment at that point will be "merge from master" or "update
submodule reference" on the wmf branch. Then on fenari, you will git
pull and git submodule update before scapping like you're currently
used to. The downside of this approach is that there's not an obvious
way to have multiple production branches in play (heterogeneous
deploy). Seems solvable (e.g wmf1, wmf2, etc), but that also seems
messy.

...

Another possible plan would be to have something somewhat closer to
what we have today, with new branches off of trunk for each
deployment, and deployments happening as frequently as weekly.
master
├── 1.20wmf01
├── 1.20wmf02
├── 1.20wmf03
...
├── 1.20wmf11
├── 1.20wmf12
├── REL1_20
├── 1.21wmf01
├── 1.21wmf02
├── 1.21wmf03
...

The decision was to go with option #2. Honestly, I'm not sure this was the right choice. I think we should have 3 "release" branches instead of constantly making new ones, and use tags for the release pointers.

One branch represents each 'staged' group of wikis, referred to as group 0, 1 and 2 in the current system.

  • release-staging (Group 0)
  • release-next (Group 1)
  • release-stable (Group 2)

Assuming we were to maintain the current release schedule as-is, the process would look something like this:

Tuesday

  • merge: staging -> next
    • tag the head of next with a 1.25wmfXX-next release tag
    • move group 1 wikis to the new tag

Wednesday

  • merge: next -> stable
    • tag the head of stable with a 1.25wmfXX-stable release tag
    • move group 2 wikis to the new stable tag
  • merge: master -> staging
    • tag the head of staging with a 1.25wmfXX-staging release tag
    • move group 0 wikis to the new tag

Pictures really are worth 1000 words (or at least a few hundred)

This is essentially the same thing as git-flow, but using 2 staging branches.

Static Files

Static files are currently served from static-1.25wmfXX versioned directories, with Varnish in front caching them for up to 30 days. This necessitates keeping around several complete copies of the core repo and all extensions, one for each revision that has been deployed within the past 30 days.

These are currently handled somewhat manually, along with the php-1.25wmfXX code checkouts. One elegant solution would be to serve the static files directly from a bare git repository. The first path component of the URL could represent a git tag, with the remaining path matching a file within the repository at the revision that tag points to.
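
The lookup itself needs nothing beyond plain git. The following is only a sketch of the URL-to-blob mapping, not a web server: serve_static is a hypothetical helper, and the tag names and file path are invented for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

work=$(mktemp -d)
cd "$work"
git init -q src
cd src
mkdir resources
echo 'logo v1' > resources/logo.txt
git add resources
git commit -q -m 'First release'
git tag 1.25wmf1
echo 'logo v2' > resources/logo.txt
git commit -q -am 'Second release'
git tag 1.25wmf2
cd ..
git clone -q --bare src static.git     # bare repository: objects only, no checkout

# Hypothetical helper: map /<tag>/<path> to the file contents at that tag.
serve_static() {
    url=${1#/}        # strip the leading slash
    tag=${url%%/*}    # first path component is the release tag
    path=${url#*/}    # the rest is the path inside the repository
    # Resolving under refs/tags/ means branch names and raw hashes are not accepted.
    git --git-dir="$work/static.git" show "refs/tags/$tag:$path"
}

old=$(serve_static /1.25wmf1/resources/logo.txt)
new=$(serve_static /1.25wmf2/resources/logo.txt)
echo "1.25wmf1 -> $old, 1.25wmf2 -> $new"
```

Every tagged revision stays addressable without keeping a separate static-1.25wmfXX directory on disk for each one.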

mod_git

We could probably implement this using an Apache module that calls libgit2. mod_git looks like it would work, with just a few modifications:

  1. mod_git uses a cookie to supply the git tag; we would want to use part of the URL.
  2. Don't serve PHP files (return 500 errors?)
  3. Only allow release tags to be specified? mod_git currently will serve any branch, tag, or specific commit hash.
  4. We would probably need to perform a security audit and possibly some performance optimization.

This would drastically simplify deployment of static files and it would eliminate the need to periodically clean up a bunch of stale static files that get left behind from old deploys.

php-git2

Another option would be to serve the files from git using php-git2, the libgit2 bindings for PHP 5.3.

Related tasks

There are a bunch of tasks related to overhauling the deployment systems and processes. Here are a few relevant links: T97068, T94620, T93428, T95375

Event Timeline

Yurik added a subscriber: Yurik. Apr 28 2015, 11:38 AM

Let's call this the PHP deployment process. Our services deployment may also need rethinking - it takes nearly forever to deploy any new service -- whereas, as we remember from the Hackathon, Lyft can do it in two weeks.

I agree it feels very much like MediaWiki/PHP deployment, though some points might overlap with T93428.

mmodell renamed this task from Rethinking our deployment process to Rethinking mediawiki deployment process. May 9 2015, 5:47 AM
mmodell updated the task description. (Show Details)

OK, I updated the title to clarify that this is about MediaWiki deployment and, even more specifically, about the process rather than the tooling (tooling has its own task).

hashar raised the priority of this task from Normal to High. May 29 2015, 4:27 PM
hashar moved this task from INBOX to In-progress on the Release-Engineering-Team board.
mmodell lowered the priority of this task from High to Low. Jun 18 2015, 8:07 PM
mmodell renamed this task from Rethinking mediawiki deployment process to Rethinking mediawiki deployment branches and release process. Jul 16 2015, 9:08 AM

What is the expected timeline on using long-lived/re-used branches in production?

Please beware that moving to that system must be blocked on a solution for T99096, as otherwise our weekly bugs about static deployments not working will be eternal. @ori and I have a few ideas to solve it but aren't prioritising it at this point; we can move it up based on the timeline for this. See further at T99096.

@Krinkle: I don't know a specific timeline, but I'd like to hear about your ideas for solving T99096: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime

mmodell added a comment. Edited Jul 16 2015, 4:49 PM

In {T104398#1456723} I elaborate a bit about how the long-lived branches might be managed.

mmodell renamed this task from Rethinking mediawiki deployment branches and release process to Merge to deployed branches instead of a new deployment branch every week. Aug 13 2015, 3:44 PM
mmodell updated the task description. (Show Details)

How are we planning to tag commits and when they go into production if we only have one branch? Is the plan to still do this on a weekly all-at-once cycle, or deploy-as-merged-plus-a-window, or what?

There would be 3 branches, wmf/group0, wmf/group1, and wmf/group2

The deployment train process/schedule should look something like this:

Early Tuesday   Tag HEAD with 1.26.0-wmf.n and run integration tests against the resulting tag
Later Tuesday   Merge 1.26.0-wmf.n -> wmf/group0
Wednesday       Merge wmf/group0 -> wmf/group1
Thursday        Merge wmf/group1 -> wmf/group2
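
One week of that schedule, run against a scratch repository, might look roughly like this. It is a sketch under assumptions: n is taken to be 1, the tag is a lightweight one, the integration test run is left out, and all three group branches start from last week's commit.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo v1 > code.txt
git add code.txt
git commit -q -m 'Last week'
for group in 0 1 2; do git branch "wmf/group$group"; done

echo v2 > code.txt
git commit -q -am 'This week on master'

# Early Tuesday: tag HEAD; integration tests would run against this tag.
git tag 1.26.0-wmf.1

# Later Tuesday: merge the tag into wmf/group0.
git checkout -q wmf/group0
git merge -q --no-ff --no-edit 1.26.0-wmf.1

# Wednesday: wmf/group0 -> wmf/group1.
git checkout -q wmf/group1
git merge -q --no-ff --no-edit wmf/group0

# Thursday: wmf/group1 -> wmf/group2.
git checkout -q wmf/group2
git merge -q --no-ff --no-edit wmf/group1

deployed=$(git branch --contains 1.26.0-wmf.1 --format='%(refname:short)' | sort | xargs)
echo "branches containing the release tag: $deployed"
```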

For swat, ideally a hotfix would look like this:

Branch/Merge  From        To
Branch        master      hotfix_xyz
Commit
Merge         hotfix_xyz  master
Merge         hotfix_xyz  wmf/group0
Merge         hotfix_xyz  wmf/group1
...
mmodell renamed this task from Merge to deployed branches instead of a new deployment branch every week. to Merge to deployed branches instead of cutting a new deployment branch every week. Aug 20 2015, 4:04 AM
mmodell updated the task description. (Show Details)

This means that we will lose the feature of ForrestReleaseTaggerBot telling us to which versions patches were back-ported.

Worked example:

  • Production is at 1.26-0-wmf20 for group 0 and 1.26-0-wmf19 for groups 1 and 2
  • Patch is written, reviewed and merged -> goes into master, and will eventually be part of the tag for 1.26-0-wmf21, so ForrestBot tags it as such.
  • Patch is back-ported to production, but the 1.26-0-wmf20 tag is already locked in without it (unlike a branch).
  • If you scour Phabricator, it's clear the patch was back-ported, but it's not tagged as such so it doesn't stand out (currently it's really obvious if a patch has multiple release tags that it was back-ported).

Options I can think of:

  1. Accept this loss of functionality
  2. Have sub-versions of tags, so 1.26-0-wmf20 -> 1.26-0-wmf20a -> 1.26-0-wmf20b -> …
  3. Make the SWAT process more heavy-weight, repudiating and updating the main tag
  4. Have ReleaseTaggerBot (or whatever) tag tasks as "in group0", "in group1", "in group2" when they hit those groups. This solves the issue of needing to cross-reference the Server Admin Log to know whether something went to all production, but it's not great.

FWIW, Option #4 makes the most sense to me.

"but it's not great"

What is not-great about it? To me that seems much more straightforward and it conveys the important information, which is, where a change has been deployed and where it hasn't.

I'm also willing to consider #2 but that would result in a lot of tags so I'm not really a huge fan of that idea.

I also think option #4 is what we want. It's way better information than what we get now from @ReleaseTaggerBot. Instead of just knowing which release it's in (which you then have to cross-reference with something like [[wikitech:Deployments]] or [[mw:MediaWiki_1.27/Roadmap]]), you automatically know which group of wikis has it. And after you learn which wikis are in which groups, it's really straightforward.

greg added a comment. Mar 11 2016, 10:22 PM

This is basically the "Train 2.0" idea that we came up with during our annual planning exercise. I've created a wiki page about that project so that I can better plan out what work is needed (and then can figure out which quarter to schedule it in). See: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Train2.0

This comment is mostly just FYI. I haven't yet started trying to rip info out of this task into that page :)

For swat, ideally a hotfix would look like this:

Branch/Merge  From        To
Branch        master      hotfix_xyz
Commit
Merge         hotfix_xyz  master
Merge         hotfix_xyz  wmf/group0
Merge         hotfix_xyz  wmf/group1
...

This doesn't account for the fact that merging also brings in all other commits in the shared history.

If you don't want to involve cherry-pick, and have the benefit of native git references (for lookup of whether a commit is in a branch), then it seems the only way to do that is to ensure people rewind their local master back to at least wmf/group1 when creating the hotfix. Since people primarily write patches based on master, this seems impractical.
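
To make the concern concrete, here is a small sketch in a scratch repository where production lags one commit behind master. The branch and file names are placeholders; the point is only what git counts as "incoming" when the hotfix was branched from the tip of master.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo base > base.txt
git add base.txt
git commit -q -m 'Deployed last week'
git branch wmf/group1                  # production lags behind master

echo feature > feature.txt
git add feature.txt
git commit -q -m 'Unrelated feature, not yet deployed'

git checkout -q -b hotfix_xyz master   # the usual habit: branch from master
echo fix > fix.txt
git add fix.txt
git commit -q -m 'Hotfix xyz'

# Everything reachable from the hotfix but not yet in wmf/group1 comes along.
incoming=$(git rev-list --count wmf/group1..hotfix_xyz)
git log --format='%s' wmf/group1..hotfix_xyz

git checkout -q wmf/group1
git merge -q --no-ff --no-edit hotfix_xyz
ls
```

The merge reports no conflict and no warning, yet feature.txt is now on the production branch alongside the hotfix.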

hashar added a subscriber: hashar. Edited Mar 12 2016, 12:13 AM

(sorry wall of text, but I was really curious if we could keep the hotfix commit sha1 across branches. Turns out we can!)

If we wanted to keep the same commit across branches and do a merge, we would first have to find the best common ancestor of all four branches using git merge-base. Build the hotfix against that ancestor, then propose four merge patches, one against each of the branches.

Taking mediawiki/core as it is right now and master / current wmf branches:

# Figure out the common ancestor:
$ git merge-base --octopus origin/master origin/wmf/1.27.0-wmf.{15,16}
f2d8fee03d484dc766d467dc631b5cc4bef1c510

# Create our hotfix based on that ancestor
$ git checkout -b hotfix f2d8fee0
$ git commit --allow-empty -m 'My hotfix'

You get your hotfix tested and reviewed.

Then merge it in the various branches:

$ git checkout master && git merge --no-edit --no-ff hotfix
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
Already up-to-date!
Merge made by the 'recursive' strategy.
$ git checkout wmf/1.27.0-wmf.15 && git merge --no-edit --no-ff hotfix
Switched to branch 'wmf/1.27.0-wmf.15'
Your branch is up-to-date with 'origin/wmf/1.27.0-wmf.15'.
Already up-to-date!
Merge made by the 'recursive' strategy.
$ git checkout wmf/1.27.0-wmf.16 && git merge --no-edit --no-ff hotfix
Switched to branch 'wmf/1.27.0-wmf.16'
Your branch is up-to-date with 'origin/wmf/1.27.0-wmf.16'.
Already up-to-date!
Merge made by the 'recursive' strategy.

Confirm only your hotfix is merged in, all branches should just be ahead by two commits (the hotfix + the merge commit):

$ git branch --contains hotfix -vv
  hotfix            63b3a3a My hotfix
  master            f72a2a2 [origin/master: ahead 2] Merge branch 'hotfix'
  wmf/1.27.0-wmf.15 d7bfc71 [origin/wmf/1.27.0-wmf.15: ahead 2] Merge branch 'hotfix' into wmf/1.27.0-wmf.15
* wmf/1.27.0-wmf.16 9844463 [origin/wmf/1.27.0-wmf.16: ahead 2] Merge branch 'hotfix' into wmf/1.27.0-wmf.16

Send for review, wait for CI/QA, submit and happy end.


PRO

Definitely doable, and the merge commit can be used to fix a conflict if needed. It is mathematically more respectful and gives us a nice topology.

git branch --contains hotfix lets you find the branches that have the fix, since the same commit has been merged into each branch.

Useful for hotfix branches that have several commits.

BUT

I consider myself fairly fluent in git arcana and basic graph theory; still, it took me 1+ hour to write this. The whole trick is really git merge-base and especially its --octopus option.

The checkout / merge --no-ff has to be done locally for each branch and then sent for review.

Gerrit has the cherry-pick button that automates all the mess (just spam click).

With cherry-picks, you can still find out whether a branch has a copy of the hotfix using git cherry <upstream> <head>.
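
For comparison, here is roughly what that lookup looks like when the hotfix was cherry-picked rather than merged. This is a sketch in a scratch repository with placeholder branch and file names; git cherry compares patch contents, so it marks a master commit with "-" when an equivalent change already exists on the deployment branch and with "+" when it does not.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo base > base.txt
git add base.txt
git commit -q -m 'Deployed last week'
git branch wmf/group1

echo feature > feature.txt
git add feature.txt
git commit -q -m 'Unrelated feature'

echo fix > fix.txt
git add fix.txt
git commit -q -m 'Hotfix xyz'

# Backport only the hotfix (the tip of master) by cherry-picking it.
git checkout -q wmf/group1
git cherry-pick master >/dev/null

# upstream = wmf/group1, head = master
git cherry -v wmf/group1 master
backported=$(git cherry wmf/group1 master | grep -c '^-')
pending=$(git cherry wmf/group1 master | grep -c '^[+]')
echo "backported: $backported, not yet deployed: $pending"
```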

Food for later:

  • git-merge lets you use a custom strategy (a script really) that could rely on the git merge-base --octopus option to automate the mess
  • we could have a hotfix tool that cherry-picks the patch made on master onto the common ancestor, crafts the merge commits and proposes them to the group[0-2] branches for us

@hashar: You're my hero. This is awesome

Krinkle removed a subscriber: Krinkle. Mar 29 2016, 12:35 AM

Even worse, this stresses gerrit and in turn lowers everyone's productivity by delaying CI test results, slowing developer commits, pulls, code reviews, merges...

Most of that is not true. The exception is git remote update on a slow link, and by a quick estimate, even there the branch count we currently have is so small that the changed content dominates the bandwidth use.

But there is quite a bit of content, along with its history, that exists only in deployment branches. Whether splitting that off into another repo is a significant enough saving to be worth it would need to be tested. This can be done independently of this ticket. People who want to can then use two remotes, deployment and core, in the same clone. However, Gerrit will lose its ability to show which deployment branches contain a given core commit, as AFAIK that feature is not cross-repo. Diffusion might cope in that regard but might have a problem choosing where to link a commit that appears in 2 repos.

Anyway, if you are concerned about the above, the solution in this ticket will prevent us from just moving older wmf deployment branches to a historical repo, i.e. it will make things worse.

Security patches aren't automatically carried forward from one week to the next; they must be manually applied each time we cut a new branch.

Is getting such a merge as proposed in this task right easier than 12 cherry-picks? Not running our CI after security patches is already a problem. Not running our CI after a merge and not running it anymore for new deployment branches would make that even worse.

Since master will be merged into deployment, we end up with the same change happening twice and that results in a merge conflict.

This is not necessary, as the git merge strategy "ours" can make a merge that results in the content being that of exactly a specified branch involved in the merge. An example can be seen in ed7a17415825264c2e45bd5b9cb32457f76253ba. The downside of that is that you need to reapply security patches.
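
A sketch of one way such a merge can be made; this is my reading of the technique, not necessarily how the commit referenced above was produced. git merge -s ours keeps the tree of the branch you are on, so here the merge is made on a temporary branch at master and the deployment branch is then moved to it. The names deploy and train are placeholders.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org

repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master

echo v1 > code.txt
git add code.txt
git commit -q -m 'Base'

git checkout -q -b deploy
echo 'security patch' > patch.txt
git add patch.txt
git commit -q -m 'Deployment-only security patch'
old_deploy=$(git rev-parse deploy)

git checkout -q master
echo v2 > code.txt
git commit -q -am 'New work on master'

# Merge on a temporary branch at master: the "ours" strategy records deploy
# as a parent but keeps the content exactly as it is on master.
git checkout -q -b train master
git merge -q -s ours --no-edit deploy
git branch -f deploy train      # a fast-forward: old deploy is an ancestor of train

git ls-tree --name-only deploy
```

The listing shows only code.txt: the old deployment history is preserved, but the content of patch.txt is gone, which is the "reapply security patches" downside.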

I was really curious if we could keep the hotfix commit sha1 across branches. Turns out we can!

Excellent comment! In practice, patches will be merged in master before someone knows that they will need to be backported.

Ltrlg added a subscriber: Ltrlg. Jul 9 2016, 7:39 AM
mmodell raised the priority of this task from Normal to High. Jul 11 2016, 9:50 PM
Krinkle updated the task description. (Show Details) Nov 18 2016, 3:10 AM

For swat, ideally a hotfix would look like this:

Branch/Merge  From        To
Branch        master      hotfix_xyz
Commit
Merge         hotfix_xyz  master
Merge         hotfix_xyz  wmf/group0
Merge         hotfix_xyz  wmf/group1
...

This doesn't account for the fact that merging also brings in all other commits in the shared history.
If you don't want to involve cherry-pick, and have the benefit of native git references (for lookup of whether a commit is in a branch), then it seems the only way to do that is to ensure people rewind their local master back to at least wmf/group1 when creating the hotfix.

That seems reasonable to me.

Since people primarily write patches based on master, this seems impractical.

It's impractical to prescribe a preferred git workflow and ask committers to submit against the appropriate branch?

Yes. People write patches against master. And from what I read, we don't plan to change this, right? Which means we'd start a fragile trend of using git-merge to apply patches from master to wmf branches. That will silently bring in unrelated commits, which is hard to detect, avoid or verify. Hard to do manually. Even harder (or impossible) within Gerrit or Phabricator.

Unless we start a trend where the only way to land a patch in master is to commit and backport via the oldest current wmf branch. Except if the commit is not meant to go to prod directly. But then again, this decision is not always known ahead of time. I don't see how this plan would eliminate even half of our cherry-picks, let alone all.

I think this model could work well for our release maintenance branches. But I think wmf branches and master are too close to each other for this to be practical.

I empathise with the current set of problems and fragilities, and I agree we should solve them, soon. But we should do so with those problems in mind. Cleanliness of our Git history is not one of those problems and should not be used as the main justification. I think we do have a pretty clean Git history today. Being more strict about component prefixes and avoiding merge commits would make it even cleaner. This plan, however, hardly changes the history. We'll still see the same commit messages in the master and wmf branches, with the same number of merge commits. It's just that the commit hashes aren't always the same, and now (sometimes) would be.

Which of the problems is solved by adopting this backport process? The steps for SWAT remain unchanged (except that git cherry-pick becomes git merge). The steps for a new train branch are unchanged and equally complicated, too.

Re-using deployment directories would reduce complexity, but that doesn't require re-using branch names. We could easily check out the next branch in the same directories each week. This would already solve the problem of large bandwidth use from gerrit to tin and from tin to app servers.

Not having many wmf branches would make the repo cleaner and allow git-gc to trash the old cherry-picks. That can be accomplished by removing old branches more frequently in today's model, too.

Downsides:

  • Two ways to make a backport for SWAT, not one. Depending on if, when, and with what parent, the commit was merged in master – choosing the wrong strategy for a commit would silently roll out master commits to production without beta testing.
  • Two ways to find whether a change was backported, not one.
  • Different strategy for MediaWiki release branches than for WMF deployment branches.

Note that Node.js has adopted the same process as we currently have (sans the merge commits). They cherry-pick from master to maintenance branches. I believe this is the de facto Git workflow that newcomers expect when they have non-zero Git and open-source experience.

I think we could adopt the traditional "git/git" workflow (git workflow) for MediaWiki releases (e.g. we'd have a maint branch that tracks the oldest release currently supported, always merge important bug fixes there first and then git-merge upwards). However even if we do this for release branches, I'd argue it's still not practical to apply to WMF deployment branches.


@Krinkle: Thanks for the detailed response. I agree with much of what you've written and I'll have more to say after letting it sink in a bit (and reading the references you've linked to.)

I also want to point out the most recent work related to reducing the number of wmf branches: T147478: Flatten MediaWiki config, all MediaWiki versions, and extensions into a unified git repo which is based on @bd808's idea for syncing deployment branches with git instead of rsync. @thcipriani has built a prototype in D429: Flatten MedaWiki deploy into a single git repo and it looks like the most promising path forward for reducing deployment complexity.

I feel like we are finally starting to get the complexity under control thanks to:

Once all these pieces come together and we hopefully streamline the backports and SWAT processes, I think we'll be in really good shape.

mmodell closed this task as Declined. Feb 14 2017, 2:02 PM

Looks like this simply isn't going to happen.

hashar removed a subscriber: hashar. Mar 14 2017, 3:38 PM
Imarlier removed a subscriber: Imarlier.