
Deploy MW+Extensions by percentage of users (instead of by domain/wiki)
Open, Medium, Public

Description

(I'm surprised I can't find a task about this already since we've talked about this forever; if I just duplicated one, let me know.)

As a deployer, I want to release new code to a small subset of users across all wiki projects (e.g. 5%) and gradually increase that percentage, watching for an increase in any bad metric (fatals, warnings, timeouts, page load time, etc.), until it reaches 100%. (NB: probably just doing 5%, 10%, then 100% is good enough.)
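
A rough sketch of what that gated ramp-up could look like; the sync-to-percentage, error-rate-ok, and rollback commands below are hypothetical placeholders, not existing tooling:

# Hypothetical ramp-up loop; sync-to-percentage, error-rate-ok, and rollback
# are placeholder commands, not existing scap features.
for pct in 5 10 100; do
    sync-to-percentage "$pct"                  # route $pct% of users to the new code
    sleep 900                                  # let metrics accumulate
    error-rate-ok || { rollback; exit 1; }     # bail out if fatals/latency regress
done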


Problems

  • Subtle cache poisoning
    • Deploying directly to a small percentage of Wikipedias could cause subtle, undetected rendering bugs to poison the cache for a long time before anyone notices.
  • Which MW version should be used to run a maintenance script? Right now that can be determined by mwscript's --wiki argument plus wikiversions.json (see the lookup sketch after this list).
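
For reference, a minimal sketch of how that lookup works today, assuming wikiversions.json is a flat map from dbname to version; the path and exact format here are from memory and may differ:

# Resolve the MediaWiki version for a given wiki dbname (illustrative only;
# the real routing logic lives in multiversion, and the path is an assumption).
jq -r '.enwiki' /srv/mediawiki/wikiversions.json
# => e.g. "php-1.27.0-wmf.9"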

Ideas

  • Probably a good idea to deploy smaller batches of commits more frequently
    • Changes to weekly branching to make this easier
    • Elimination/Bypassing of deploy groups
  • Deploy to canary servers
    • Canary servers run the newest version for all wikis -- no cache?
    • Different canary servers from those used for SWAT
      • Would have to not serve prod traffic except for group0 (maybe dedicated group0)?
    • Opt-in beta testing to a group of servers via cookie or header
    • Automated testing
  • Get rid of group0
    • Replace with canary servers + tests
    • Gradual rollout to group1 when tests pass


Event Timeline

greg raised the priority of this task to Needs Triage.
greg updated the task description. (Show Details)
greg added a project: Deployments.
greg subscribed.
mmodell triaged this task as Medium priority. Jul 6 2015, 6:37 PM
mmodell subscribed.

This seems epic; not sure how actionable this is at the moment. Backlogging.

@greg: when you get back, feel free to re-triage this if you feel it's higher priority

Normal prio is fine; we should just be able to give an answer (with our hands waving a bit) when asked "what's needed to do this?" Timeline to have a hand-wavy answer: end of quarter-ish? I foresee tons of people asking for this in the next year.

The multiversion router could certainly be configured to key off of something other than the Host: header. It would not take very much effort to implement, although the code in multiversion may contain live bobcats; or perhaps that's more "probably" than "may".

Some possible criteria for routing a given request to a given branch:

  • A hash of the client IP address (a rough sketch of this kind of bucketing follows below).
  • A cookie that routes users to the unstable branch; we could direct volunteer testers to a unique URL that sets the cookie and ask them to be on the lookout for problems.
  • We could (and probably should) send all Wikimedia staff to the unstable branch when they are logged in.
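
A minimal sketch of IP-hash bucketing, purely to illustrate the idea; the 5% threshold, branch names, and use of md5sum are assumptions, not anything that exists in multiversion today:

# Bucket a client IP into 0-99 and send a small slice to the unstable branch.
client_ip="$1"                                    # e.g. 203.0.113.42
bucket=$(( 0x$(printf '%s' "$client_ip" | md5sum | cut -c1-8) % 100 ))
if [ "$bucket" -lt 5 ]; then
    echo "wmf/unstable"    # ~5% of clients
else
    echo "wmf/stable"      # everyone else
fi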

There are some problems with sending random requests to the unstable branch, especially anonymous requests, due to our aggressive cache layer. If we cache a broken page, it can potentially leak beyond our 'small sample group', and the brokenness could persist in cache for quite a long time.

On IRC, @BBlack suggested that we might consider lowering our cache TTL to something like 2 weeks instead of 30 days, coinciding with changes to the release process. I'm sure there are other things we could do to mitigate the problems but I don't know of any brilliant solutions right now.

I suspect it's possible we can drop our maximum TTL without a huge loss in cache hitrate, but that's just random speculation without data. It's entirely possible that doing so would cause an unacceptable plunge in hitrate. The whole reason that I brought that up was in the context of the idea of serial interoperability as a way to solve several related problems with deployments.

The general idea is that you'd pick a count of serial releases as a rolling interoperability window. For N=5, no change in a new release can break basic compatibility with any of the previous 4. Another way to think of that: it should be perfectly ok to run the cluster with all 5 of the most recent releases at 20% each. We might not want to do that because it would be confusing for users to see 5 versions of a feature, but fundamentally nothing would be broken for the site or the users. When there's a desire to make a breaking change (for example, deploying new javascript code to browsers which only works with matching new server-side code), one has to plan the phase-in of that feature such that the serial compatibility window is maintained.

Even if we only strictly enforced this rule for N=2, it would greatly improve the reliability of the deployment process at the appserver level (fast cluster-wide deployment to paper over incompatibilities would no longer matter...), as well as our ability to cleanly roll backwards when necessary. But if we extend the window a bit, we can solve similar cache-level problems with it as well.

I'm oversimplifying by ignoring the dimension of the problem where we already run 3 releases for different wikis here, but: for instance, if we deploy 3 new versions every week, and we maintain serial compatibility for a window of 6 releases, then there can never be a compatibility issue caused by cached content with a lifetime of 2 weeks in our caches. That's the sort of context in which I think it's worth investigating the impact of cache lifetime reduction. If reducing from 30d to 14d only causes a small hitrate loss, but lowers the serial compatibility window from "unreasonable" to "practical", it might be worth it.
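
To make that arithmetic explicit, a back-of-the-envelope check using the numbers assumed above (3 releases per week, a 2-week cache TTL):

# Any cached page can have been rendered by a release at most
# (releases per week) x (cache TTL in weeks) releases behind the newest,
# so the serial-compatibility window must be at least that large.
releases_per_week=3
cache_ttl_weeks=2
echo $(( releases_per_week * cache_ttl_weeks ))   # => 6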

@BBlack: I think we should do essentially what you suggest, but instead of creating new branches each week we'd keep 2 or 3 long-lived branches and merge changes into them sequentially, e.g.

  • Monday:
    • we accept proposed changes into the release tree (and only accept changes that are intended to maintain the compatibility window, as you described above).
git checkout wmf/group0
git merge $commit   # repeated for each accepted change
# (or git merge master, if we wanna be crazy)
# then... run thorough tests Monday night.
  • Tuesday:
    • If everything looks good, we create a new tag and push it to some percentage of users.
    • The groups can be defined either by using the current host-based grouping (group0.dblist, group1.dblist, etc) or some new groupings based on client IP, rand(), ...
# on Tuesday:
git checkout wmf/group0
TAG="wmf/group0/{date or sequence number}"
git tag -a "$TAG" -m "$TAG"
sync "$TAG"   # i.e. push the tagged code out to the chosen group
  • Wednesday:
    • Assuming things still look stable, we accept all the commits from group0 into group1
    • At this point we can start merging more changes into group0 (SWAT 2.0?)
git checkout wmf/group1
git merge wmf/group0
TAG="wmf/group1/{date or sequence number}"
git tag -a "$TAG" -m "$TAG"
sync "$TAG"

Repeat for as many parallel branches as we want to maintain; I think 2 would be a good start, maybe 3. A generalized sketch of the promote step follows.
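
Assuming the wmf/groupN branch naming above and using a date for the tag placeholder, the Wednesday promote step could be generalized roughly like this (sync remains the same placeholder as above):

# Promote each group branch from the one below it, tag, and sync.
for n in 1 2; do
    git checkout "wmf/group$n"
    git merge "wmf/group$((n - 1))"
    TAG="wmf/group$n/$(date +%Y%m%d)"
    git tag -a "$TAG" -m "$TAG"
    sync "$TAG"
done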

thcipriani renamed this task from Investigate what changes are needed to deploy MW+Extensions by percentage of users (instead of by domain/wiki) to Deploy MW+Extensions by percentage of users (instead of by domain/wiki). Dec 21 2016, 6:10 PM

I think the main points from a recent discussion were: in order to deploy to a percentage of traffic, we need to be able to do so quickly, and in order to move quickly, we need to spot subtle cache-poisoning problems before they enter the deployment pipeline.