Optimize prod's resource domains for SPDY/HTTP2
Closed, ResolvedPublic

Description

Now that SPDY has been enabled across our projects (T35890), it's time to start thinking about various optimizations we could do to take more advantage of it.

Right now, connecting to production means establishing many separate connections to different domains & IPs, which wastes RTTs on TCP & TLS handshakes and fails to take advantage of SPDY's multiplexing.
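
As a rough illustration of the per-connection cost, here is a minimal sketch. It assumes a full TLS handshake taking two round trips (no session resumption or False Start) on top of the one round trip of the TCP handshake; the function name and the RTT value in the usage line are hypothetical and not measurements from production.

```python
def setup_delay_ms(rtt_ms: float, tls_round_trips: int = 2) -> float:
    """Estimate the delay before the first request can be sent on a new connection.

    Assumption: 1 round trip for the TCP handshake plus `tls_round_trips`
    for a full TLS handshake (2 is typical without resumption or False Start).
    """
    tcp_round_trips = 1
    return (tcp_round_trips + tls_round_trips) * rtt_ms


# Hypothetical client with a 100 ms round-trip time to the edge.
print(setup_delay_ms(100))
```

Every additional domain that cannot reuse an existing session pays this setup delay again before its first request goes out, even when the handshakes run in parallel.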

It should be noted that:

  • SPDY has a feature called "IP pooling", in which UAs opportunistically multiplex over the same session requests to different domains that a) resolve to the same IP and b) are valid according to the TLS/X.509 certificate presented. The latter is not the case for projects vs. wikimedia.org and would need new certificates, with a non-trivial cost attached to them.
  • Completely undoing domain sharding will have an impact on non-SPDY clients, which would need to be weighed as a tradeoff.
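
The two IP pooling conditions above can be sketched as a small check. This is a simplified illustration, not how any particular UA implements it: the function names are hypothetical, the IP addresses are documentation-range placeholders, and wildcard matching is reduced to the common rule that `*` covers exactly one leftmost label.

```python
from ipaddress import ip_address


def san_matches(pattern: str, hostname: str) -> bool:
    """Match a certificate dNSName entry against a hostname.

    Simplification: a leading "*." matches exactly one leftmost label.
    """
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname
    head, sep, tail = hostname.partition(".")
    return bool(head) and sep == "." and tail == pattern[2:]


def can_coalesce(new_host, new_host_ips, connection_ip, cert_sans):
    """Return True if requests for new_host could reuse an existing session.

    a) new_host resolves to the IP the session is connected to, and
    b) the certificate presented on that session is valid for new_host.
    """
    same_ip = ip_address(connection_ip) in {ip_address(ip) for ip in new_host_ips}
    cert_ok = any(san_matches(p, new_host) for p in cert_sans)
    return same_ip and cert_ok


# Hypothetical session to a project domain; 198.51.100.7 is a placeholder IP.
projects_only = ["*.wikipedia.org", "wikipedia.org"]
unified = projects_only + ["*.wikimedia.org", "wikimedia.org"]

print(can_coalesce("login.wikimedia.org", ["198.51.100.7"], "198.51.100.7", projects_only))
print(can_coalesce("login.wikimedia.org", ["198.51.100.7"], "198.51.100.7", unified))
```

With the projects-only name list the check fails on condition b) even though the IP matches, which is the certificate problem described in the first bullet; adding the wikimedia.org names makes the same host eligible.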

So far we have:

  1. bits (T95448) — we should probably fold bits into text. This can happen at least in the edge caching layer (which we'd probably want to do anyway). bits is also sort of a hack MediaWiki-wise with Varnish rewriting URLs, so we could possibly undo all of it with the caveat of undoing an optimization for non-SPDY UAs.
  2. upload — problematic for multiple reasons: at least security separation for third-party data and possibly Wikipedia Zero IP-based traffic separation. upload is also much heavier in traffic than text and the distinction is now an implicit load-balancing technique. A potential merge would present scalability problems with at least LVS & Varnish.
  3. login.wikimedia.org (for checkLoggedIn), meta.wikimedia.org (CentralNotice, gadgets) — theoretically good candidates for SPDY IP pooling, with the caveat of the certificate troubles mentioned above. Those two should already use the same SPDY connection now, but in quick tests I've done they do not appear to do so, although I see a common connection between login/commons. This needs further debugging.
  4. www.mediawiki.org, commons.wikimedia.org, <random projects> (gadgets) — already a performance issue, not much we can do about them; hopefully Gadgets 2.0 can.

Event Timeline

faidon raised the priority of this task from to Needs Triage.
faidon updated the task description.
faidon added subscribers: faidon, ori, BBlack.

I've been looking forward to folding bits back into text for a while now anyways. On many levels, it's an appropriate move at this point in time regardless of SPDY, IMHO. It reduces varnish config and mgmt complexity (it's the only prod 1-layer cluster layout), reduces cluster count, makes better use of current CPU/mem hardware (if we buy a few SSDs and fold bits machines back into other clusters), etc. It's also probably the easiest of all of these cases to solve for SPDY.

Also, I tend to agree that upload should be left alone for now as there are too many other considerations there. If we ever get down to just two connections for upload and everything else, then maybe we can revisit it at that time and decide whether it's worth messing with it. For now there's other lower-hanging fruit to go after.

Andrew triaged this task as Medium priority. Apr 6 2015, 4:30 PM
Andrew set Security to None.

Any pending tasks here or is this resolved?

Well this basically got solved along the way while doing other things. We've flipped back to using a unified cert that covers all the projects (including *.wp.o + *.wm.o), and so we get coalesce for all of the IPs that map to text-lb already, which includes bits now.

We're missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs, but that's a trickier problem and we're not ready to move on any solution there. Arguably (a) it can wait and it's not that critical and (b) solving it too early hurts perf for non-SPDY clients, too. I think we can go ahead and close this ticket in favor of a possible new one about eventually looking at the specific upload problem.

I should note that mobile probably still has some coalescing to gain, but it's not the driver for those solutions anyways, which will get addressed as part of: T109286

Dzahn claimed this task.
Dzahn subscribed.

Well this basically got solved along the way while doing other things. ... I think we can go ahead and close this ticket in favor of a possible new one about eventually looking at the specific upload problem.

T116132