
Deploy MW+Extensions by percentage of users (instead of by domain/wiki)
Open, Medium, Public

Description

(I'm surprised I can't find a task about this already since we've talked about this forever; if I just duplicated one, let me know.)

As a deployer, I want to release new code to a small subset of users across all wiki projects (e.g. 5%) and gradually increase that percentage, watching for an increase in any bad metric (fatals, warnings, timeouts, page load time, etc.), until it reaches 100%. (NB: probably just doing 5%, 10%, then 100% is good enough.)
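
A rough sketch of what that gated ramp-up could look like; the sync-to-percentage, error-rate-ok, and rollback commands below are hypothetical placeholders, not existing tooling:

# Hypothetical ramp-up loop; sync-to-percentage, error-rate-ok, and rollback
# are placeholder commands, not existing scap features.
for pct in 5 10 100; do
    sync-to-percentage "$pct"                  # route $pct% of users to the new code
    sleep 900                                  # let metrics accumulate
    error-rate-ok || { rollback; exit 1; }     # bail out if fatals/latency regress
done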


Problems

  • Subtle cache poisoning
    • Deploying directly to a small percentage of Wikipedias could cause subtle, undetected rendering bugs to poison the cache for a long time before anyone notices.
  • Which MW version should be used to run a maintenance script? Right now that can be determined by mwscript's --wiki argument plus wikiversions.json (see the lookup sketch after this list).
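
For reference, a minimal sketch of how that lookup works today, assuming wikiversions.json is a flat map from dbname to version; the path and exact format here are from memory and may differ:

# Resolve the MediaWiki version for a given wiki dbname (illustrative only;
# the real routing logic lives in multiversion, and the path is an assumption).
jq -r '.enwiki' /srv/mediawiki/wikiversions.json
# => e.g. "php-1.27.0-wmf.9"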

Ideas

  • Probably a good idea to deploy smaller batches of commits more frequently
    • Changes to weekly branching to make this easier
    • Elimination/Bypassing of deploy groups
  • Deploy to canary servers
    • Canary servers run the newest version for all wikis -- no cache?
    • Different canary servers from those used for SWAT
      • Would have to not serve prod traffic except for group0 (maybe dedicated group0)?
    • Opt-in beta testing to a group of servers via cookie or header
    • Automated testing
  • Get rid of group0
    • Replace with canary servers + tests
    • Gradual rollout to group1 when tests pass


Event Timeline

greg raised the priority of this task to Needs Triage.
greg updated the task description. (Show Details)
greg added a project: Deployments.
greg subscribed.
mmodell triaged this task as Medium priority. Jul 6 2015, 6:37 PM
mmodell subscribed.

This seems epic; not sure how actionable this is at the moment. Backlogging.

@greg: when you get back, feel free to re-triage this if you feel it's higher priority

Normal prio is fine; we should just be able to give an answer (with our hands waving a bit) when asked "what's needed to do this?" Timeline to have a hand-wavy answer: end of quarter-ish? I foresee tons of people asking for this in the next year.

The multiversion router could certainly be configured to key off of something other than the Host: header. It would not take very much effort to implement, although the code in multiversion may contain live bobcats; or perhaps that's more "probably" than "may".

Some possible criteria for routing a given request to a given branch:

  • A hash of the client IP address (a rough sketch of this kind of bucketing follows below).
  • A cookie that routes users to the unstable branch; we could direct volunteer testers to a unique URL that sets the cookie and ask them to be on the lookout for problems.
  • We could (and probably should) send all Wikimedia staff to the unstable branch when they are logged in.
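
A minimal sketch of IP-hash bucketing, purely to illustrate the idea; the 5% threshold, branch names, and use of md5sum are assumptions, not anything that exists in multiversion today:

# Bucket a client IP into 0-99 and send a small slice to the unstable branch.
client_ip="$1"                                    # e.g. 203.0.113.42
bucket=$(( 0x$(printf '%s' "$client_ip" | md5sum | cut -c1-8) % 100 ))
if [ "$bucket" -lt 5 ]; then
    echo "wmf/unstable"    # ~5% of clients
else
    echo "wmf/stable"      # everyone else
fi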

There are some problems with sending random requests to the unstable branch, especially anonymous requests, due to our aggressive cache layer. If we cache a broken page, it can potentially leak beyond our 'small sample group', and the brokenness could persist in cache for quite a long time.

On IRC, @BBlack suggested that we might consider lowering our cache TTL to something like 2 weeks instead of 30 days, coinciding with changes to the release process. I'm sure there are other things we could do to mitigate the problems but I don't know of any brilliant solutions right now.

I suspect it's possible we can drop our maximum TTL without a huge loss in cache hitrate, but that's just random speculation without data. It's entirely possible that doing so would cause an unacceptable plunge in hitrate. The whole reason that I brought that up was in the context of the idea of serial interoperability as a way to solve several related problems with deployments.

The general idea is that you'd pick a count of serial releases as a rolling interoperability window. For N=5, no change in a new release can break basic compatibility with any of the previous 4. Another way to think of that: it should be perfectly ok to run the cluster with all 5 of the most recent releases at 20% each. We might not want to do that because it would be confusing for users to see 5 versions of a feature, but fundamentally nothing would be broken for the site or the users. When there's a desire to make a breaking change (for example, deploying new javascript code to browsers which only works with matching new server-side code), one has to plan the phase-in of that feature such that the serial compatibility window is maintained.

Even if we only strictly enforced this rule for N=2, it would greatly improve the reliability of the deployment process at the appserver level (fast cluster-wide deployment to paper over incompatibilities would no longer matter...), as well as our ability to cleanly roll backwards when necessary. But if we extend the window a bit, we can solve similar cache-level problems with it as well.

I'm oversimplifying by ignoring the dimension of the problem where we already run 3 releases for different wikis here, but: for instance, if we deploy 3 new versions every week, and we maintain serial compatibility for a window of 6 releases, then there can never be a compatibility issue caused by cached content with a lifetime of 2 weeks in our caches. That's the sort of context in which I think it's worth investigating the impact of cache lifetime reduction. If reducing from 30d to 14d only causes a small hitrate loss, but lowers the serial compatibility window from "unreasonable" to "practical", it might be worth it.
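
To make that arithmetic explicit, a back-of-the-envelope check using the numbers assumed above (3 releases per week, a 2-week cache TTL):

# Any cached page can have been rendered by a release at most
# (releases per week) x (cache TTL in weeks) releases behind the newest,
# so the serial-compatibility window must be at least that large.
releases_per_week=3
cache_ttl_weeks=2
echo $(( releases_per_week * cache_ttl_weeks ))   # => 6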

@BBlack: I think we should do essentially what you suggest, but instead of creating new branches each week we'd keep 2 or 3 long-lived branches and merge changes into them sequentially, e.g.

  • Monday:
    • we accept proposed changes into the release tree (and only accept changes that are intended to maintain the compatibility window, as you described above).
git checkout wmf/group0
git merge $commit   # repeated for each accepted change
# (or git merge master, if we wanna be crazy)
# then... run thorough tests Monday night.
  • Tuesday:
    • If everything looks good, we create a new tag and push it to some percentage of users.
    • The groups can be defined either by using the current host-based grouping (group0.dblist, group1.dblist, etc) or some new groupings based on client IP, rand(), ...
# on Tuesday:
git checkout wmf/group0
TAG="wmf/group0/{date or sequence number}"
git tag -a "$TAG" -m "$TAG"
sync "$TAG"   # i.e. push the tagged code out to the chosen group
  • Wednesday:
    • Assuming things still look stable, we accept all the commits from group0 into group1
    • At this point we can start merging more changes into group0 (SWAT 2.0?)
git checkout wmf/group1
git merge wmf/group0
TAG="wmf/group1/{date or sequence number}"
git tag -a "$TAG" -m "$TAG"
sync "$TAG"

Repeat for as many parallel branches as we want to maintain; I think 2 would be a good start, maybe 3. A generalized sketch of the promote step follows.
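
Assuming the wmf/groupN branch naming above and using a date for the tag placeholder, the Wednesday promote step could be generalized roughly like this (sync remains the same placeholder as above):

# Promote each group branch from the one below it, tag, and sync.
for n in 1 2; do
    git checkout "wmf/group$n"
    git merge "wmf/group$((n - 1))"
    TAG="wmf/group$n/$(date +%Y%m%d)"
    git tag -a "$TAG" -m "$TAG"
    sync "$TAG"
done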

thcipriani renamed this task from Investigate what changes are needed to deploy MW+Extensions by percentage of users (instead of by domain/wiki) to Deploy MW+Extensions by percentage of users (instead of by domain/wiki). Dec 21 2016, 6:10 PM

I think the main points from a recent discussion were: in order to deploy to a percentage of traffic, we need to be able to do so quickly, and in order to move quickly, we need to spot subtle cache-poisoning problems before they enter the deployment pipeline.