
Take heat off day before the weekly branch-cut?
Open, Needs Triage, Public

Description

Mondays and Tuesday mornings (SF time) are, in my opinion, quite "hot" in terms of code-review stress.

Unlike on any other day, code merged into master during this 24-hour period will go live in production very soon, without much time for developers to discover friction in their local environments and/or for QA to find issues on the Beta Cluster.

The current reality is that our automated systems and code-review process are not good enough to give us high enough confidence for an immediate production rollout from master. We need testing in Beta, and we need to allow some time for post-merge review by the maintainers of specific areas who are subscribed to merges in Gerrit. This is in part due to the wide hand-out of CR+2 rights on mediawiki/* repositories and the monolithic codebase of MediaWiki core.

As such, I propose that we change the way we cut branches and deploy them. I'm not sure about the specifics, but the end goal of this particular task is to ensure all commits receive at least 2-3 days of exposure to other developers and the Beta Cluster before they end up running in a production environment, regardless of whether that is just mediawiki.org and the test wikis. And even ignoring test wikis, something merged Monday evening or Tuesday morning still hits production wikis like Wiktionary and Commons on Wednesday, which is 1 to 1.5 days post-merge, not 3 days.
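To make the timing concrete, here is a rough sketch of the arithmetic. The cut and deploy times are assumptions picked for illustration (a Tuesday-morning cut and a Wednesday deploy in one particular week), and `exposure_before_prod` and `meets_goal` are hypothetical helper names, not part of any existing tooling.

```python
from datetime import datetime, timedelta

# Assumed schedule for one example week (illustrative values only):
# branch cut on Tuesday morning, deploy to wikis like Commons on Wednesday.
BRANCH_CUT = datetime(2015, 11, 10, 9, 0)    # Tuesday 09:00
PROD_DEPLOY = datetime(2015, 11, 11, 11, 0)  # Wednesday 11:00


def exposure_before_prod(merged_at, branch_cut=BRANCH_CUT, deploy=PROD_DEPLOY):
    """Return how long a commit sat on master before reaching production.

    Commits merged after the cut miss this week's branch, so None is returned.
    """
    if merged_at > branch_cut:
        return None
    return deploy - merged_at


def meets_goal(merged_at, minimum=timedelta(days=2)):
    """Check a merge against the proposed minimum exposure (2 days here)."""
    exposure = exposure_before_prod(merged_at)
    return exposure is not None and exposure >= minimum


if __name__ == "__main__":
    for label, merged in [
        ("Friday noon", datetime(2015, 11, 6, 12, 0)),
        ("Monday late evening", datetime(2015, 11, 9, 23, 0)),
    ]:
        print(label, exposure_before_prod(merged), meets_goal(merged))
```

Under these assumed times, a Monday late-evening merge gets about a day and a half of exposure and fails the 2-day minimum, while a Friday merge passes comfortably.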

One initial idea:

  • Cut branch on Thursday instead of Monday.

This means new features and changes merged on Friday and Monday no longer carry the heat of going live very soon without any exposure on the Beta Cluster or in developers' local environments. The downside is that we have to remember to backport any bug fixes we merge between Thursday and Monday. This gets harder because we won't notice a missing fix in Beta: with Beta running master, the fix will already be live there.

As a general habit, however, I think we should be clearer about distinguishing bug fixes from enhancements and always backport fixes to the latest wmf branch, no matter what day it is. So perhaps that's not an issue.

Alternatively, we could consider having the Beta Cluster always run on the wmf+1 branch (which will be a non-prod branch between Thursday and Tuesday, a semi-prod branch on Tuesday evening and Wednesday, and on Thursday it'll be the first to receive the next branch).

Event Timeline

Krinkle created this task. Nov 9 2015, 10:38 PM
Krinkle raised the priority of this task to Needs Triage.
Krinkle updated the task description.
Krinkle added a subscriber: Krinkle.
Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 9 2015, 10:38 PM

The proposed end goal might cause more SWAT deploys to be used instead, working against the original goal.

I am also not sure that staying longer on the Beta Cluster will help. Maybe there are ways to make catching issues on the Beta Cluster more efficient, so that everything that would be caught is caught as soon as possible, say, within 24 hours.

I support cutting the branch earlier before deployment, preferably at a regular, known time, so people can easily plan for it.

greg added a subscriber: greg. Nov 18 2015, 5:50 PM

We can easily cut the next branch before we deploy it (say, on Friday for the upcoming Tuesday train start) but we'll need yet-another-Beta-Cluster to test it on since we'll want the current Beta Cluster to continue to get all new code updates from master as they happen (modulo 10 minutes).

In fact, this was a planned goal for the "staging" cluster: to be a place where we could, e.g., freeze code the day before a deploy, run all of our integration/browser tests there, and then have a good feel for what we're deploying.

> We can easily cut the next branch before we deploy it (say, on Friday for the upcoming Tuesday train start) but we'll need yet-another-Beta-Cluster to test it on since we'll want the current Beta Cluster to continue to get all new code updates from master as they happen (modulo 10 minutes).
> In fact, this was a planned goal for the "staging" cluster: to be a place where we could, e.g., freeze code the day before a deploy, run all of our integration/browser tests there, and then have a good feel for what we're deploying.

Just an idea, but if we're still too far away from sharing enough Puppet code between prod and beta (and as such, farther away from easily creating more clusters), we could re-use the existing Beta Cluster in the meantime.

Like with prod, we could use multiversion in beta too. Requests for *.beta.wmflabs.org and *.staging.wmflabs.org would route through the same apaches and pick a different MediaWiki version based on the hostname.
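As a rough illustration of the hostname-based selection, here is a minimal sketch. It is not how multiversion is actually implemented; `VERSION_BY_SUFFIX`, `pick_version`, and the version labels are all hypothetical names made up for this example.

```python
# Hypothetical mapping from hostname suffix to the code version to serve.
# The version labels are placeholders, not real branch names.
VERSION_BY_SUFFIX = {
    "beta.wmflabs.org": "master",
    "staging.wmflabs.org": "next-wmf-branch",
}


def pick_version(host, versions=VERSION_BY_SUFFIX):
    """Choose a code version for a request based on its Host header."""
    # Drop an optional port and normalise case before matching.
    name = host.split(":", 1)[0].lower().rstrip(".")
    for suffix, version in versions.items():
        if name.endswith("." + suffix):
            return version
    raise LookupError(f"no version configured for {host!r}")


if __name__ == "__main__":
    print(pick_version("en.wikipedia.beta.wmflabs.org"))
    print(pick_version("en.wikipedia.staging.wmflabs.org"))
```

The point of the sketch is only that both host families can share the same web servers, with the version decided per request.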

greg added a comment. Mar 22 2016, 8:49 PM

> Like with prod, we could use multiversion in beta too. Requests for *.beta.wmflabs.org and *.staging.wmflabs.org would route through the same apaches and pick a different MediaWiki version based on the hostname.

Yeah, that's a possibility as well.

Krinkle updated the task description. Jul 15 2019, 11:41 PM
Krinkle added a comment. Edited Jul 15 2019, 11:52 PM

Another idea for the short-to-mid term could be to have Beta Cluster only run master Tuesday-Thursday, but branch+1 from Friday onwards. Thus essentially using Beta Cluster as a "group -1" to mature the branch through the Friday and Monday workdays.

This would have two benefits: 1) a sort of "feature freeze" for two workdays during which only bug fixes are backported, and 2) a way for QA to increase their confidence in the upcoming branch and to know when fixes haven't been backported, because they'd see the issue on the Beta Cluster, knowing that that state will go live as-is.
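The weekly schedule described here could be sketched roughly as follows. `beta_ref` is a hypothetical helper and `"next-wmf-branch"` is a placeholder label for branch+1, so treat this as an illustration of the idea rather than a deployment script.

```python
from datetime import date

# Python's date.weekday(): Monday is 0, so Tuesday=1, Wednesday=2, Thursday=3.
MASTER_DAYS = {1, 2, 3}


def beta_ref(day, next_branch="next-wmf-branch"):
    """Return what the Beta Cluster would run on a given day.

    Tuesday through Thursday: master. Friday through Monday: branch+1,
    so the upcoming branch matures there before the next train.
    """
    if day.weekday() in MASTER_DAYS:
        return "master"
    return next_branch


if __name__ == "__main__":
    for d in (date(2019, 7, 16), date(2019, 7, 19), date(2019, 7, 22)):
        print(d.strftime("%A"), beta_ref(d))
```

Weekend days fall into the branch+1 window here simply because the text says "from Friday onwards"; whether that matters in practice is an open question.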

Today, the only state of the Beta Cluster that is representative of what goes live is the one during the minutes leading up to the branch cut on Tuesday. And of course, this is about more than just quality assurance. It's also about performance and production security, especially with our growing development community and merge access. In my opinion, our current setup is a recipe for a perfect storm.

I'd like to see the costs and benefits weighed so we can make an informed short-term decision. I can probably carve out some time to help with implementation if needed as well.