[EPIC] Create a formal release process for MobileFrontend/Gather
Closed, Resolved, Public

Description

As a Web developer I want a formal release process so that I can feel confident about the quality of the features that I'm releasing and to more easily keep track of what is being released and when.

Related work: T104324 [EPIC] Create a development staging environment and branch.

Desired outcomes
  • The release process is set up, documented on the wiki (where code lives and how it moves through the different stages: dev, review, sign-off, pre-prod, prod), and publicly communicated.
  • When a release is made a changelog is generated (something like https://www.mediawiki.org/wiki/MediaWiki_1.26/wmf11) and sent to the mailing list for exposure.

Event Timeline

phuedx raised the priority of this task to Needs Triage.
phuedx updated the task description.
phuedx added subscribers: Aklapper, phuedx.
phuedx set Security to None.
phuedx updated the task description.

@greg: Would it be terrible for y'all to cut the release branches from the stable branch rather than the master branch for MobileFrontend, WikiGrok, and Gather?

What is deployed on Beta Cluster?

@demon @mmodell @thcipriani @hashar: thoughts?

What is deployed on Beta Cluster?

master, which I reckon we'll need to run browser tests against.

The idea here is to add another (hopefully minor) step in releasing something proper, regardless of what process we land on. What I think's pretty darn neat about having a stable branch is that we have a bit more control over how long we have to actually QA a feature, which, recent history has told us, might be longer than we're currently given.

But where will stable be (browser|manual) tested?

On Beta Cluster 2, of course?! Production?! I'm not sure. Maybe running browser tests locally against master and stable gets deployed to Beta Cluster too.

@Jhernandez, @Jdlrobson: Your thoughts'd be appreciated too.

I am very happy to see this being raised by the mobile team toward enhancing our development / release workflow. That is really similar to the git branching model at http://nvie.com/posts/a-successful-git-branching-model/ which we linked to extensively during the svn/git migration. The first picture is the summary. It makes total sense with the faster cadence of deployment we are introducing (and which will be shortened even further to daily if not hourly deploys).

A limit is that CI, Beta Cluster, and the deployment tooling all assume master to be the reference branch. Adding an exception to use stable for a few repos is going to be an implementation nightmare. As a workaround, you could create a dev branch and merge it to master once stable. That is just a semantic change for you guys but should have zero impact on our side (as long as you don't change the default HEAD).

So in short, I am not a fan of introducing this solely for a few mobile extensions. I would rather see our whole community come up with a written-in-stone workflow that applies to everyone. That is RFC worthy. Then we can work on migrating CI / beta / deployment to the model.

Can someone please create an epic task that will block this one? Then we can reach out to wikitech-l and architects.

I'm going to preempt @phuedx here and say that I don't think this is the way forward they want (after talking with him on IRC, but it is now post-work time for him), so stand down, @hashar ;)

@greg @hashar I'm going to try to explain what we actually want, and why we are asking for the branch (I'm smelling http://xyproblem.info/ here 😝)


How we develop, and when and where our code lives:

In mobile-web and gather we use scrum, and each sprint we have a board with columns like this one: https://phabricator.wikimedia.org/tag/gather_sprint_help!/

In the following table you can see the columns through which a task of ours can move (each task can have 1+ patches associated with it, in different states (merged, -1d, etc)):

| | In development | Needs development | Code Review | Ready for sign off | Done |
| --- | --- | --- | --- | --- | --- |
| What's here? | WIP patches | Patches come here from Code Review / Sign off. Can be either tasks with -1d patches that need work or tasks that need more patches to be feature complete | Task that has code to be reviewed by fellow developers | Task that is reviewable by design/product/tech lead on beta-cluster+ | Verified task on beta-cluster+ |
| Status | Incomplete | Broken/Incomplete | OK/Incomplete/Broken | Technically OK | OK |
| Where can this code be? | Local dev branches | Local branches / Master (Beta cluster or Production) | Local branches / Master (Beta cluster or Production) | Master (Beta cluster or Production) | Master (Beta cluster or Production) |

As you can see, we can have patches that have been cut to a production release but they haven't been signed off by product management/design/tech lead (tasks with patches can dance between in-development and ready-for-signoff and some of the patches can be in master even if not signed off).

This is because we only have master, for both our development and releasing needs.

Why is this happening to us (now) and what is the problem

Given recent changes in staffing and reorgs, we've found ourselves short of staff to keep the "Ready for signoff" column cleared out. Several times the production branch has been cut from master with patches that had not been reviewed by design/product/tech lead. Those patches had quality problems in those areas and were in production for a week or more before the issues were detected.

Right now our development needs (having code on staging ready for signoff) and our releasing needs (having releases cut and deployed to production) are completely coupled, and are causing issues that we shouldn't have. We've avoided this in the past since we were well staffed on PMs and we deployed less. This is not the case now. We have little to no PMs and we usually deploy more often/faster (+100 on deployments, it's the right move and deploying more is better).

What and how to fix it (ideas welcome, this is where this task comes from)

What we want is to avoid having code automatically pushed to production since that usually means (now) that a bunch of patches go to the wikis without the proper quality checks, but we still want to be able to have those patches on staging (beta cluster) so that they can be tested by product, etc.

I agree with @hashar that ideally we would use a development model based on branches like that git flow, but we're far from that and gerrit doesn't help with it. Also, there is fear around rebasing hell from some devs, so we want a lighter weight approach.

We want to decouple the branch for deployments and the branch for development. We want to manually merge our dev branch into the deployment branch.

Ideally this is what we would do:

| Branch | Development | Master |
| --- | --- | --- |
| Where | Auto to Beta cluster | Rides train with releng to prod |
| Code from | Gerrit patches | Merged from Development |

There are three concerns here:

  • Where Gerrit merges our patches to.
  • Where Beta cluster re-deploys our patches from.
  • Where Production gets cut from.

Currently all of these happen in master. We want to separate them into two branches. Call them whatever you like; we don't care about the names, we care about the separation:

  • Development:
    • Where Gerrit merges our patches to.
    • Where Beta cluster re-deploys our patches from.
  • Production:
    • Where Production gets cut from.

We're suggesting creating a stable branch and using that to cut production from, because it seems like the least amount of disruption/work given those three concerns. (Development = master, Production = stable)

If it is feasible or you guys (@hashar @greg) think it's worth doing a development branch and changing both where gerrit merges and where beta-cluster feeds from instead of changing where the deployments happen from, we're all fine with it. That way it would be (Development = dev, Production = master) (which IMO makes more sense, but it may be too much work).

We'd really appreciate your help on this, we're trying to improve our workflows and practices now that we are in a transition time and we've been given the opportunity to do so.

If I've missed something or not answered any questions please ask/correct me

We want to manually merge our dev branch into the deployment branch.

How would that work?

Developers write and review patch1, patch2 and patch3, which are then merged to dev. Product signs off on patch1 and patch3 but declines patch2. How do you update the deployment branch?

  • Cherry-pick the accepted patches? You now have a "stable" branch, quite possibly the result of manual conflict resolution, that no one ever saw running.
  • Revert patch2 from dev, re-test, then force push? That's a lot of overhead and revert spam.
  • Fast-forward deployment to dev@patch1? Patch3 gets delayed a week, even though it is perfectly fine. And you still deploy a state that has not been heavily tested (there is a small but nonzero chance that patch1 broke something but patch2 or patch3 unbroke it so no one noticed).
  • Don't update at all, delaying patch1 and patch3 one week because patch2 is not ready?
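
To make the four options concrete, here is a minimal sketch that models the dev branch as an ordered list of merged patches with a sign-off flag and shows what each strategy would put on the deployment branch. The `Patch` class and the function names are hypothetical, invented for illustration; this is a toy model and does not touch git or Gerrit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    name: str
    signed_off: bool


def cherry_pick(dev):
    """Option 1: take only the accepted patches, in merge order.

    The result can be a combination that never ran anywhere as-is.
    """
    return [p.name for p in dev if p.signed_off]


def revert_declined(dev):
    """Option 2: keep the whole history and append a revert per declined patch."""
    reverts = [f"Revert {p.name}" for p in dev if not p.signed_off]
    return [p.name for p in dev] + reverts


def fast_forward(dev):
    """Option 3: advance only up to the commit before the first declined patch."""
    accepted = []
    for p in dev:
        if not p.signed_off:
            break
        accepted.append(p.name)
    return accepted


def all_or_nothing(dev):
    """Option 4: update only when every merged patch has been signed off."""
    return [p.name for p in dev] if all(p.signed_off for p in dev) else []


# The scenario from the comment above: patch2 is declined.
dev = [Patch("patch1", True), Patch("patch2", False), Patch("patch3", True)]

print(cherry_pick(dev))      # ['patch1', 'patch3']
print(revert_declined(dev))  # ['patch1', 'patch2', 'patch3', 'Revert patch2']
print(fast_forward(dev))     # ['patch1']
print(all_or_nothing(dev))   # []
```

Only the fast-forward result corresponds to a state that actually existed on dev, which is the trade-off the list above is pointing at.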

IMO the better solution for this is to create a testing environment for individual unmerged patches and do design/product review before merge.

Having a stable branch for QA review would make sense (after all features have been accepted and merged, create a frozen branch, review/hotfix and then deploy), but the group0 wikis already serve that purpose. And we don't have resources for a QA review anyway.

ideally we would use a development model based on branches like that git flow, but we're far from that and gerrit doesn't help with it

I think it does. Gerrit kind of simulates feature branches for every pending commit: git-review -d <id> has the same effect as git checkout <branchname>; git pull would in a feature-branch-based setup. We could use that to implement something like T76245.

Sorry for the delay.

If I understand correctly, you want Beta Cluster to be running the dev branch while we push out to production the master branch? (ignore the rest if that's wrong ;) )

If that's correct: no :)

We have a hard rule that whatever will be deployed to production is on Beta Cluster. That is our integration environment that we use to test code before production.

I'm fine with you all having whatever test instance you maintain running your dev version for your sign off process, but master is what we deploy to BC and what we deploy to production.

We should use a magic script to generate a changeset log for each branch cut and mail it to the Reading list on Tuesdays.

Florianschmidtwelzow already creates release notes for wmf branches on the wiki: https://www.mediawiki.org/wiki/MediaWiki_1.26/wmf8 . They are crafted via the make-deploy-notes scripts under mediawiki/tools/release.git . Feel free to file a task for Release-Engineering-Team and propose a patch :-)
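
For a sense of what the "magic script" could boil down to, here is a minimal sketch of the formatting step only. It is not the make-deploy-notes script and makes no claim about how that script works; the function name `format_changelog`, the tuple layout, the commit subjects, and the wikitext layout (including the `phab:` link prefix) are all assumptions made for illustration. Collecting the commits for a branch cut and mailing the result are left out.

```python
from collections import defaultdict


def format_changelog(branch, commits):
    """Render a wikitext-style changelog for one branch cut.

    commits: iterable of (repo, subject, task) tuples, where task is a
    Phabricator task ID such as "T100296", or None if there is none.
    """
    by_repo = defaultdict(list)
    for repo, subject, task in commits:
        by_repo[repo].append((subject, task))

    lines = [f"== Changes in {branch} =="]
    for repo in sorted(by_repo):
        lines.append(f"=== {repo} ===")
        for subject, task in by_repo[repo]:
            suffix = f" ([[phab:{task}|{task}]])" if task else ""
            lines.append(f"* {subject}{suffix}")
    return "\n".join(lines)


# Hypothetical input: the commit subjects are made up for the example.
example_commits = [
    ("MobileFrontend", "Document the release process", "T100296"),
    ("Gather", "Tidy up collection styles", None),
]
print(format_changelog("1.26wmf11", example_commits))
```

Grouping by repository keeps the MobileFrontend and Gather entries apart, so the same output could be pasted on the wiki or sent to the mailing list.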

Also @Legoktm has set up a bot that tags Phabricator tasks with the branch they are going out on, which helps with that.

Sorry for the delay.

If I understand correctly, you want Beta Cluster to be running the dev branch while we push out to production the master branch? (ignore the rest if that's wrong ;) )

If that's correct: no :)

So @Jhernandez has written a great summary of what we want https://phabricator.wikimedia.org/T100296#1325052 and I'm worried it has been misinterpreted somehow?

/me throws hat into ring :-)
What's the point of test.wikipedia.org then?
Beta cluster from what I can see is primarily a tool to communicate to designers and product what has been implemented. For developers it is a tool to clarify whether tests are passing or not. It's not a very good tool for deciding what goes to production.

My vision of how this /should/ work is that the beta cluster should run a dev branch at all times. test wikipedia should run the master branch. Developers should be able to say exactly what they want on test wikipedia (e.g. what's on the master branch) and this is what gets pushed out to all users.

What scares you about such a model?
If we trust our tests, wouldn't it be sufficient in this situation for release engineering to say we will not deploy your extension X if your browser tests are failing on test wiki?

Happy to talk about this @greg in the office over a coffee in case there are some crossed wires/communication issue. I know that Airbnb for instance have a similar model to this - any developer can push a button that changes the master branch. After CI has kicked in and confirmed everything is okay it gets pushed to production.

We have a hard rule that whatever will be deployed to production is on Beta Cluster. That is our integration environment that we use to test code before production.

So... let me have a stable

I'm fine with you all having whatever test instance you maintain running your dev version for your sign off process, but master is what we deploy to BC and what we deploy to production.

I think I boiled it down to the nitty gritty with that ^ :)

Basically, yes, I want that as well. I think that's the right model. Have Beta Cluster be "master all the time from everyone" for testing/PM stuff and then a staging cluster that has what will be going out to production next. Those *could* be different branches for different teams but many will probably just choose to use master everywhere. Until we get there this is what we have and what is the rule for getting code to production. Creating yet-another-beta-cluster is not easy (we took 3 months, with Yuvi's help, and didn't get across the finish line, and not because we were slacking, we had 2-3 releng'ers on the project).

Summary: yes, I want what you want (mostly, I think, at least the broad strokes), just going from here to there requires time and effort.

Good news: we're working to get there, actually. It might not seem like it from the "other side" (sorry, really, not trying to "other") but your desires are actually the end game for all of the work we're doing and planning every day.

<semi-offtopic rant>
Now, if only we didn't have to do everything else to get there like making isolated CI instances (which everyone also wants and is another step on the 'dependent path'), make our deployment tooling sane, help craft the Apps CI infrastructure, and migrate from gerrit to differential. All of those things everyone wants yesterday. We have a team that is smaller than the mobile team doing it all. We have to prioritize; that means not doing everything.

Sorry for the rant, it's quarterly planning season ;)
</rant>

We want to help here. I guess the question is how can we best pool our resources to get to the same place or closer to it together? Couldn't test be the staging cluster? Or does it have some other purpose that I don't quite understand?

Note: Another option mobile web may want to explore is to use a dev branch and run a labs instance that syncs this (this has been done before by Flow for their frontend rewrite). Our designers and PMs can be pointed to that and officially sign things off for the master branch. It's relatively easy to change the default branch that git review pushes to.

@greg: What do you think of @Jdlrobson's suggestion above? We certainly have the tools to allow us to do that – Labs-Vagrant would make it easy, no?

As I've written in the description, this is about the Reading Web team wanting more confidence in the code that's on master. Our tooling, infrastructure, and process encourage us to merge early in order to test a feature in a production-like environment. Re-reading this task, it's clear to me how much weight I've given to the first two and not the latter, e.g. we could get more confidence in our code by:

  • requiring that all changes must be reviewed by at least two people, or
  • requiring that all changes must be made to a feature branch, which can only be merged after functional review (QA), or
  • have a consistent development environment and corpus of data for testing edge cases locally

I think that having a development branch will decouple our workflow from the release process. It'll give us the room we need to reflect and iterate on our workflow, if necessary, and it won't require anything from the already oversubscribed Release-Engineering-Team.

Yeah, I think the Staging cluster will provide a large part of what is needed here, process-wise. Isolated CI will also provide a ton as well but slightly less than Staging. I think to get to where we all want to go both are prereqs.

For now I think you all can get away with a labs host that you use to deploy dev to, but warning that you'll be the ones maintaining that :)

One hard request: Please don't do a ton of work in dev over a week then merge it into master right before the new train release; that's basically skipping all of the browser tests/etc that our CI provides. So, please do have everything you want to go out with the next train merged before Sunday evening (our browser tests run once/day spread out throughout the day).

btw/context: Here are the RelEng team's goals for next quarter (about 95% sure): https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/201516Q1

<semi-offtopic rant>
We have a team that is smaller than the mobile team doing it all. We have to prioritize; that means not doing everything.
</rant>

I just realized I forgot that releng is smaller than *either* of the mobile teams (apps has 7, web has 7, releng has 6). :P

Anywho...

Jon and I chatted yesterday over coffee (always good, Jon). Here's some of the thinking post macchiato:

Short term plan

  1. The MFE team is welcome to spin up their own labs instance where they deploy dev and use that to do their sign-offs
  2. The MFE team is welcome to write their own bot that listens for Gerrit changes and deploys them to that instance then kicks off some local (quick/smoke) browser tests
  3. That bot would ideally run for every patchset and report back to gerrit. It won't block merge, it'll just report back if the tests passed/failed
  4. This could be an interesting proof of concept for the idea of "do browser tests, and by extension code quality, improve when developers get quicker feedback on them in a more controlled environment than Beta Cluster?" (I think we all know the answer is probably "yes", but, it's great to validate the idea).

(note: the maintenance burden of the labs instance and the bot falls on the MFE team)
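
As a rough sketch of the "listens for Gerrit changes" part of steps 2 and 3, the function below inspects one line of Gerrit's `stream-events` JSON output and decides whether it is a new patchset the bot should deploy. The watched branch name `dev`, the choice of MobileFrontend as the watched project, and the sample event (including its change number) are assumptions for illustration. Fetching the ref, deploying it to the labs instance, running the smoke tests, and reporting back to Gerrit are not shown.

```python
import json

# Assumptions for this sketch: which project and branch the bot watches.
WATCHED_PROJECT = "mediawiki/extensions/MobileFrontend"
WATCHED_BRANCH = "dev"  # hypothetical development branch


def ref_to_deploy(event_line, project=WATCHED_PROJECT, branch=WATCHED_BRANCH):
    """Return the git ref of a new patchset on the watched branch, else None."""
    try:
        event = json.loads(event_line)
    except json.JSONDecodeError:
        return None  # ignore anything that is not valid JSON
    if not isinstance(event, dict) or event.get("type") != "patchset-created":
        return None
    change = event.get("change", {})
    if change.get("project") != project or change.get("branch") != branch:
        return None
    return event.get("patchSet", {}).get("ref")


# A made-up event, trimmed to the fields the function reads.
sample_event = json.dumps({
    "type": "patchset-created",
    "change": {"project": WATCHED_PROJECT, "branch": "dev", "number": 123456},
    "patchSet": {"number": 1, "ref": "refs/changes/56/123456/1"},
})

print(ref_to_deploy(sample_event))  # refs/changes/56/123456/1
```

In a real bot each line read from the event stream would go through this filter, and every non-None ref would trigger a deploy plus a smoke-test run, with the result posted as a non-blocking comment.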

Long term plan

  1. The RelEng team will continue to work on Isolated CI
    • Additional help here appreciated, but understandable if you all don't have the required skill set
    • This will give us the ability to run (quick/smoke) browser tests for every patchset for everyone
  2. The RelEng team will prioritize Staging as soon as possible (not Q1, unless we work magic on our other goals)
    • This will give us/you the ability to run your dev branch on Beta Cluster with all the other deployed extensions/etc and master on Staging (aka, "pre-prod" or however you want to think about it)

(sorry I'm using "MFE team" throughout, I know y'all are merged together now, but it's just clearer to me who is actually doing it; let me know if I should change how I refer to/think about this.)
(let me know if any of the above is wrong, @Jdlrobson, I can edit my comment)

Thanks for keeping the conversation going.

That seems like a nice summary @greg. If everything goes well we'll get some time to set this up at the beginning of the quarter and we'll document it and be really public about the process and workflows we're doing, and the pains/benefits we encounter along the way.

Jhernandez moved this task from Q1 Goals to Backlog on the Reading-Web-Planning board.
Jhernandez claimed this task.

I suggest we do a different task, they are free, and resolving tasks is awesome for morale.