Support HTTP/2
Closed, Resolved · Public

Description

[Note the ticket has morphed a bit, and older comments at the top may not be very relevant anymore!]

We should support HTTP2, as this is the true standard that replaces the experimental SPDY and brings mostly the same benefits we've enjoyed so far with SPDY. Note that our current production TLS terminators already support ALPN w/ SPDY, so ALPN-vs-NPN isn't an issue anymore.

Our current termination software is nginx 1.9.4 with a few extra local patches in support of parallel RSA+ECDSA certs. Upstream nginx introduced HTTP2 support with their release of version 1.9.5. However, they made an implementation decision to replace SPDY/3 support with HTTP2 support, rather than support both protocols side-by-side. We believe that was a very poor decision, as it would have been easy to support both in the patches, and client statistics indicate the real world still has a fair number of clients out there which are SPDY- but not HTTP2- capable.

Our preliminary stats from sampling live TLS ClientHello data from our traffic (this was a fairly small sample, but I wouldn't expect too much change from a larger one) were:

| none | 48.312% |
| spdy+h2 | 25.426% |
| spdy-only | 24.913% |
| h2-only | 1.350% |

On a practical level, SPDY/3 and HTTP2 are both doing the same job in terms of effects on client and server performance and such. If the stats above are broadly-accurate and we dropped SPDY/3 support for HTTP2 support today, we'd fall from ~50% of client connections on SPDY to only ~27% of client connections on HTTP2.
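
The ~50% vs ~27% comparison follows directly from the sample table. A minimal Python sketch of that arithmetic is below; the dictionary keys and the function name are illustrative only and do not come from any tooling used in this task.

```python
# Share of sampled ClientHellos by advertised protocol support (percent),
# copied from the table above.
SAMPLE = {
    "none": 48.312,
    "spdy+h2": 25.426,
    "spdy-only": 24.913,
    "h2-only": 1.350,
}


def multiplexed_share(sample, server_protocol):
    """Percent of clients that could negotiate `server_protocol`
    ("spdy" or "h2") if the server offered only that protocol."""
    only_key = f"{server_protocol}-only"
    # Dual-capable clients can use whichever protocol the server offers.
    return sample["spdy+h2"] + sample[only_key]


spdy_today = multiplexed_share(SAMPLE, "spdy")  # SPDY/3-only server
h2_only = multiplexed_share(SAMPLE, "h2")       # HTTP/2-only server
print(f"SPDY-only server: {spdy_today:.1f}%, HTTP/2-only server: {h2_only:.1f}%")
```

Clients that advertise both protocols count toward either outcome, so the difference between the two results comes entirely from the single-protocol clients.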

For an extra kick, this now also blocks us from upgrading to nginx 1.9.5 or higher in general, should we want to do so to apply fixes and improvements to other unrelated things. We have a few basic categorical options here:

  1. Re-work nginx's HTTP2 patch such that it doesn't remove SPDY/3 support, and then sort out making that into a reasonable diff against current nginx 1.9.x code. Not completely trivial, but not all that difficult either given the patch history / diffs available at http://nginx.org/patches/http2/ . Kind of awful in the sense of moving further away from upstream and having to deal with more local code maintenance burden.
  2. Convince upstream nginx to do something similar on their own.
  3. Do all the work of option 1 ourselves, and submit the patches and get them included in upstream, removing the latter half of the problem in option 1.
  4. Move to a different TLS termination software altogether, which ideally supports SPDY/3 + HTTP2, or at least supports SPDY/3 and has future plans to introduce HTTP2 alongside it (so that at least we can continue tracking upstream on unrelated bugs/improvements, unlike the situation with nginx today). Apache might be an option here, but there are probably others that fit the bill as well. Some googling and evaluating is in order.

Details

Related Gerrit Patches:
| operations/puppet : production | remove do_spdy hieradata, all h2, 2/2 |
| operations/puppet : production | remove do_spdy conditional, all h2, 1/2 |
| operations/puppet : production | cache_text HTTP/2 switch |
| operations/puppet : production | cache_upload HTTP/2 switch |
| operations/software/nginx : wmf-1.10.0-1 | nginx (1.10.0-1+wmf1) jessie-wikimedia; urgency=medium |
| operations/software/nginx : wmf-1.10.0-1 | Remove --automatic-dbgsym on dyn mod dh_strip |
| operations/software/nginx : wmf-1.10.0-1 | multicert + libssl1.0.2 patches for 1.10.0 |
| operations/software/nginx : wmf-1.9.14-1 | multicert + libssl1.0.2 patches for 1.9.15 |
| operations/software/nginx : wmf-1.9.14-1 | Remove --automatic-dbgsym on dyn mod dh_strip |
| operations/software/nginx : wmf-1.9.14-1 | nginx (1.9.15-1+wmf1) jessie-wikimedia; urgency=medium |
| operations/software/nginx : wmf-1.9.12-1 | nginx (1.9.12-1+wmf1) jessie-wikimedia; urgency=medium |
| operations/software/nginx : wmf-1.9.12-1 | multicert + libssl1.0.2 patches for 1.9.12 |
| operations/software/nginx : wmf-1.9.12-1 | Import Upstream version 1.9.12 |
| operations/puppet : production | configure cp1008 for http2 |
| operations/puppet : production | tlsproxy: use do_spdy to control http2-vs-spdy |
| operations/puppet : production | Add h2_spdy_stats.stp |
| operations/software/nginx : wmf-1.9.3-1-h2 | HTTP/2 alpha patch v2 |

Event Timeline

BBlack added a comment. Edited · Mar 24 2016, 1:20 PM

We still don't have a broad, long sample, but using @ema's new systemtap stuff (which is way better than the sniffer-based solution), a 10-minute sample on cp1048 just now (upload eqiad) gives:

total: 141066
both: 54089
npn_http1: 17949
alpn_spdy: 6308
npn_spdy: 4112
h2: 2196

Which makes a box like the earlier ones that looks like:

| Protocol | Percentage |
| h2 only | 1.6% |
| spdy3 only | 7.4% |
| spdy3+h2 | 38.3% |
| none | 52.7% |

Given the different time of day and different continent, we'd expect some variation from previous runs. Still, it's hard not to conclude that some browsers out there recently dumped spdy3 for http2-only and/or http1-only. Turning the above into a net tradeoff: with only spdy3 server-side, we get 45.7% spdy3, and with only http2 server-side, we get 39.9% http2. I'd say we're in striking range, and we should at least get an updated http2-only nginx package ready soon so that it can be deployed easily. We could target sometime in April.
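
For reference, the table and the 45.7% / 39.9% tradeoff can be reproduced from the raw counters. The sketch below is only an illustration in Python; it assumes the field interpretation spelled out later in this thread (`both`, `h2`, `alpn_spdy` and `npn_spdy` are mutually exclusive, and everything else in `total` is HTTP/1-only), and the function names are made up for the example.

```python
# Raw counters from the 10-minute cp1048 sample quoted above.
RAW = {
    "total": 141066,
    "both": 54089,
    "npn_http1": 17949,
    "alpn_spdy": 6308,
    "npn_spdy": 4112,
    "h2": 2196,
}


def pct(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


def summarize(raw):
    """Collapse raw counters into the four client categories (percent)."""
    total = raw["total"]
    spdy_only = raw["alpn_spdy"] + raw["npn_spdy"]
    multiplex_capable = raw["both"] + raw["h2"] + spdy_only
    return {
        "h2_only": pct(raw["h2"], total),
        "spdy3_only": pct(spdy_only, total),
        "both": pct(raw["both"], total),
        "none": pct(total - multiplex_capable, total),
    }


shares = summarize(RAW)
spdy_server = shares["both"] + shares["spdy3_only"]  # server offers only spdy3
h2_server = shares["both"] + shares["h2_only"]       # server offers only http2
print(f"spdy3-only server: {spdy_server:.1f}%, http2-only server: {h2_server:.1f}%")
```

Note that `npn_http1` is not subtracted separately: those clients sent NPN but can only do HTTP/1, so they fall into the "none" bucket along with clients that sent neither extension.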

On looking into the spdy/3 stats drop, my suspicion is something has changed with IE11 (that it has dropped SPDY/3 for http/[12]), but I still haven't found a good source confirming this.

The basic plan for moving forward is to start from debian's current 1.9.11 packaging, update it to 1.9.12 (because .12 upstreams our temporary fix for openssl "shutdown while in init" log spam, and contains an http/2 socket leak bugfix), and forward-port our multi-cert patches from our custom 1.9.4 package to a new custom 1.9.12 package. I'm hoping we'll have the package ready for testing on pinkunicorn by mid next week if the forward-porting goes smoothly, or the week after if not. If functional testing works out, we'll then be in a position to deploy this when we decide it's appropriate to make the switch. We still do need to take some broader and longer samples with the systemtap method to confirm overall stats.

1.9.12 porting/packaging was fairly trivial so far. Testing on cp1008/pinkunicorn to happen later today or tomorrow.

Change 279989 had a related patch set uploaded (by BBlack):
Import Upstream version 1.9.12

https://gerrit.wikimedia.org/r/279989

Change 279990 had a related patch set uploaded (by BBlack):
multicert + libssl1.0.2 patches for 1.9.12

https://gerrit.wikimedia.org/r/279990

Change 279991 had a related patch set uploaded (by BBlack):
nginx (1.9.12-1+wmf1) jessie-wikimedia; urgency=medium

https://gerrit.wikimedia.org/r/279991

Mentioned in SAL [2016-03-28T21:26:41Z] <bblack> http/2 enabled on pinkunicorn.wikimedia.org for testing - T96848

Change 280148 had a related patch set uploaded (by BBlack):
tlsproxy: use do_spdy to control http2-vs-spdy

https://gerrit.wikimedia.org/r/280148

Change 280149 had a related patch set uploaded (by BBlack):
configure cp1008 for http2

https://gerrit.wikimedia.org/r/280149

Change 280148 merged by BBlack:
tlsproxy: use do_spdy to control http2-vs-spdy

https://gerrit.wikimedia.org/r/280148

Change 280149 merged by BBlack:
configure cp1008 for http2

https://gerrit.wikimedia.org/r/280149

ema moved this task from Triage to Up Next on the Traffic board. Apr 1 2016, 1:16 PM
ema added a comment. Apr 4 2016, 8:25 AM

Mostly out of curiosity, I've checked which protocols are supported by other top-10 websites by looking at NPN responses:

| google.com / youtube.com | h2, spdy/3.1, http/1.1 |
| facebook.com | h2, h2-14, spdy/3.1-fb-0.5, spdy/3.1, spdy/3, http/1.1 |
| baidu.com | http/1.1 |
| yahoo.com | h2, h2-14, spdy/3.1, spdy/3, http/1.1, http/1.0 |
| amazon.com | http/1.1 |
| twitter.com | h2, spdy/3.1, http/1.1 |

BBlack added a comment. Apr 5 2016, 2:17 PM

re: nginx upstream+debian: debian's "master" branch is still at 1.9.10-1, but their "dyn" branch has work beyond that up through 1.9.13 and not yet released, which also re-structures packaging and whatnot for dynamic modules. Need to look into whether this plays well as an upgrade on jessie or not.

ori added a comment. Edited · Apr 12 2016, 5:41 PM

Coloring in some additional details. I noticed a regression in first paint time over the past three months and found a correlated slump in the percent of client connections using SPDY. There appears to be a major slump in SPDY penetration in early January, followed by a slow, partial recovery, followed by an additional slump in early April.


(If you want to examine the graph in Graphite, you can import its URL).

I am not able to associate these slumps with a change to the suite of supported protocols in a major browser. What's more, the first paint regression that correlates with the slumps in SPDY penetration applies to both desktop and mobile. This suggests to me that they were not caused by a change in client behavior, but instead by a change to our infrastructure.

Making the above analysis harder for anyone else looking: that's not a percentage of client connections using SPDY, it's a percentage of client requests using SPDY. There are many unknown variables on how many requests occur per connection, some of which are directly correlated with whether a client is using SPDY and some of which are not...
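
To illustrate why a per-request percentage can diverge from a per-connection percentage, here is a small Python sketch. The requests-per-connection figures are purely hypothetical and chosen only to show the effect; nothing here is measured data.

```python
def request_share(conn_share, reqs_per_spdy_conn, reqs_per_h1_conn):
    """Fraction of requests arriving over SPDY, given the fraction of
    connections that are SPDY and the mean requests per connection for
    each protocol (all inputs are hypothetical)."""
    spdy_reqs = conn_share * reqs_per_spdy_conn
    h1_reqs = (1.0 - conn_share) * reqs_per_h1_conn
    return spdy_reqs / (spdy_reqs + h1_reqs)


# Same 40% connection share, different (made-up) request densities.
equal_density = request_share(0.40, 5, 5)
spdy_heavier = request_share(0.40, 10, 3)
print(f"equal density: {equal_density:.1%}, SPDY carries more per conn: {spdy_heavier:.1%}")
```

With equal request density the two metrics agree; once one SPDY connection carries more requests than one HTTP/1 connection, the per-request figure overstates the share of SPDY connections, and any shift in that density moves the graph even if the client mix is unchanged.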

Notable: there's an ongoing report of 1.9.14 causing an HTTP/2 proto error in Chrome. We may need to be wary and stick with .13 or wait for .15: http://mailman.nginx.org/pipermail/nginx-devel/2016-April/008143.html

ori added a comment. Apr 15 2016, 7:13 PM

> Notable: there's an ongoing report of 1.9.14 causing an HTTP/2 proto error in Chrome. We may need to be wary and stick with .13 or wait for .15: http://mailman.nginx.org/pipermail/nginx-devel/2016-April/008143.html

It looks like the problem is with Chrome, not Nginx, though a workaround has been committed to the Nginx source tree.

https://bugs.chromium.org/p/chromium/issues/detail?id=603182

Change 279989 abandoned by BBlack:
Import Upstream version 1.9.12

https://gerrit.wikimedia.org/r/279989

Change 279990 abandoned by BBlack:
multicert + libssl1.0.2 patches for 1.9.12

https://gerrit.wikimedia.org/r/279990

Change 279991 abandoned by BBlack:
nginx (1.9.12-1+wmf1) jessie-wikimedia; urgency=medium

https://gerrit.wikimedia.org/r/279991

Change 284075 had a related patch set uploaded (by BBlack):
multicert + libssl1.0.2 patches for 1.9.14

https://gerrit.wikimedia.org/r/284075

Change 284077 had a related patch set uploaded (by BBlack):
nginx (1.9.14-1+wmf1) jessie-wikimedia; urgency=medium

https://gerrit.wikimedia.org/r/284077

^ The 1.9.14 commits have the chrome http/2 fix in place, too. I haven't built or tested these yet (pinkunicorn still on the 1.9.12 patches).

Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 1:59 PM

Nginx 1.9.15 Changes:

*) Bugfix: "recv() failed" errors might occur when using HHVM as a
   FastCGI server.

*) Bugfix: when using HTTP/2 and the "limit_req" or "auth_request"
   directives a timeout or a "client violated flow control" error might
   occur while reading client request body; the bug had appeared in
   1.9.14.

*) Workaround: a response might not be shown by some browsers if HTTP/2
   was used and client request body was not fully read; the bug had
   appeared in 1.9.14.

*) Bugfix: connections might hang when using the "aio threads"
   directive.
   Thanks to Mindaugas Rasiukevicius.

Probably we'll see a lot of HTTP/2 fixes from now on since the protocol's implementation is really new, but it's worth mentioning anyway :)

I backported some of that to our tentative 1.9.14-1+wmf1, but yeah we'll want the rest of the HTTP/2 fixes that have landed since in 1.9.15. Note debian now has their git repo pushing a 1.9.14-1 to unstable (which means the 'dyn' work for nginx dso stuff is done), and the current 1.9.14-1+wmf1 of ours is rebased onto that. There's some testing to do about how the new packages will work on jessie with some modules split out, etc.

Change 284920 had a related patch set uploaded (by BBlack):
Remove --automatic-dbgsym on dyn mod dh_strip

https://gerrit.wikimedia.org/r/284920

Packaging patches updated to 1.9.15-1+wmf1 (which is still in branch wmf-1.9.14-1, as we're still based on that release from debian upstream unstable/testing, and then includes the full 1.9.14..1.9.15 diffs from upstream as an additional local commit). Builds correctly, installed locally on pinkunicorn for testing, seems to work!

ori added a comment. Apr 22 2016, 5:18 PM

> [...] but using @ema's new systemtap stuff (which is way better than the sniffer-based solution) [...]

Is this code published anywhere?

> In T96848#2230942, @ori wrote:

>> [...] but using @ema's new systemtap stuff (which is way better than the sniffer-based solution) [...]

> Is this code published anywhere?

The systemtap script is at https://github.com/wikimedia/operations-puppet/blob/production/modules/tlsproxy/files/utils/h2_spdy_stats.stp - we were semi-blocked on "finish upgrading the kernels on the caches so we can deploy the same systemtap compiled blob to them all and grab stats", and then it turned out for other blocker reasons we're not going to finish those kernel upgrades quickly. Probably we'll have to build for both kernels and handle running it on both too for taking the extended snapshots. That stuff's all to resume next week, at least that's the plan.

ori added a comment. Apr 22 2016, 11:46 PM
>> In T96848#2230942, @ori wrote:

>>> [...] but using @ema's new systemtap stuff (which is way better than the sniffer-based solution) [...]

>> Is this code published anywhere?

> The systemtap script is at https://github.com/wikimedia/operations-puppet/blob/production/modules/tlsproxy/files/utils/h2_spdy_stats.stp - we were semi-blocked on "finish upgrading the kernels on the caches so we can deploy the same systemtap compiled blob to them all and grab stats", and then it turned out for other blocker reasons we're not going to finish those kernel upgrades quickly. Probably we'll have to build for both kernels and handle running it on both too for taking the extended snapshots. That stuff's all to resume next week, at least that's the plan.

Ah, OK. I asked because I was interested in helping out, but it sounds like you are on top of this and that I would only be getting in the way. If there is something you'd like my help with, let me know.

If you have time and want to do it (next week!), by all means go for it, I have lots else to keep me busy indefinitely :) My basic plan was to try it for an hour on a couple of caches just as a trial run (sane stats output makes sense, nothing crashes, no strange machine perf impact), then run it across text+upload for ~24h and sum the outputs up as the first true full sample. Maybe take another shot a week later on a different DOW to confirm things are similar-ish and/or moving in the right direction, as a final step before upgrading?

ori added a comment. Apr 26 2016, 9:25 PM

> If you have time and want to do it (next week!), by all means go for it, I have lots else to keep me busy indefinitely :) My basic plan was to try it for an hour on a couple of caches just as a trial run (sane stats output makes sense, nothing crashes, no strange machine perf impact), then run it across text+upload for ~24h and sum the outputs up as the first true full sample. Maybe take another shot a week later on a different DOW to confirm things are similar-ish and/or moving in the right direction, as a final step before upgrading?

I don't have time after all, since I'm in the process of moving. But the plan you sketch out sounds good to me.

So, I had intended to do the quick test and start the 24H test today, but I've run into some issues. Running the kernel object with staprun caused cp1065 to get into some crazy state on my first manual test. My ssh session stayed open, but cp1065 basically stopped responding to icinga checks and public traffic, and the background stapio process became unkillable too. I ended up rebooting (see also T131961#2245145 ). I may try another test with cp1008 downgraded onto the 3.19.0-2 kernel and see how that flies, but there's no way I'm running this on the fleet till I understand the issue.

Continuing on the saga above: building the ko for various kernels either doesn't work at all or requires different versions of the systemtap building tools than what's available in jessie. The one I built for 3.19, I had to hack the systemtap sources with a patch from https://sourceware.org/ml/systemtap/2014-q4/msg00257.html , and that one still caused harm on cp1065 in the end.

What I do have that works on some hosts, is the last binary @ema built, which appears to only load and work on our previous 4.4.0-1-amd64 kernel (4.4.0-1-amd64 #1 SMP Debian 4.4-1~wmf1 (2016-01-26)), not the newer 4.4 kernel and not the older 3.19 kernel.

The prod hosts we have now that are capable of running this object (upgraded to the first 4.4 test kernel, but not the latest 4.4 kernel) are:
cp1067 - eqiad text
cp1071 - eqiad upload
cp3048 - esams upload
cp4006 - ulsfo upload

Probably the best reasonable/conservative plan at this point is to gather stats on just these and process it both as an overall total and include the per-host results, so we can see any regional diff in upload, and how eqiad-text differs from eqiad-upload. Within each site+cluster, using a single host is a reasonable sub-sample, as it should be pseudo-random which IPs land on which hosts via source hashing.
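
As a rough illustration of why one host per site+cluster is a fair sub-sample under source hashing, the Python sketch below hashes synthetic client IPs onto a set of hosts and counts the spread. Everything in it is an assumption made for the example: the host names are hypothetical, the IPs are synthetic, and the hash-then-modulo scheme is a stand-in rather than the actual load-balancer algorithm.

```python
import hashlib
import ipaddress
from collections import Counter

# Hypothetical host names; not the real cache hosts.
HOSTS = [f"cache-host-{i}" for i in range(8)]


def pick_host(client_ip, hosts=HOSTS):
    """Deterministically map a client IP to one host (illustrative
    hash-modulo scheme, not the production scheduler)."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


def distribution(n_clients=20000):
    """Count how many of n sequential synthetic IPs land on each host."""
    base = int(ipaddress.IPv4Address("10.0.0.0"))
    return Counter(
        pick_host(str(ipaddress.IPv4Address(base + i))) for i in range(n_clients)
    )


counts = distribution()
for host in HOSTS:
    print(host, counts[host])
```

Because the mapping depends only on the client address and spreads addresses roughly evenly, the clients seen by any single host should resemble the clients seen by the whole cluster at that site, at least under these simplified assumptions.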

And... the output looks like that object was from a previous version of the source, the one in P2719 which lacks detection of an ALPN negotiation which specifies SPDY but not HTTP/2.

Searched all the cache nodes for other builds that might have been left behind. Found a working compile of the latest source for 3.19.0-2 in @ema's homedir on cp1048 (which he probably mentioned to me on IRC a few weeks ago!). Tested and doesn't seem crashy and has the right outputs. I'm still in a more-conservative mindset, though, so I think I'll just select 1x host per site+cluster combination to do the sampling on (8 total), out of the much larger set still running 3.19.

So, target hosts now:

text:
cp1053
cp2001
cp3030
cp4009

upload:
cp1048
cp2002
cp3034
cp4005

Mentioned in SAL [2016-04-28T14:25:59Z] <bblack> started SPDY stats sample on 8x caches - T96848#2248582

The initial 1-hour run is done, and there didn't seem to be any adverse effects.

For the record, this is how the raw results format looks per-host:

total: 3603944
both: 1099183
npn_http1: 252756
h2: 230561
alpn_spdy: 135580
npn_spdy: 98353

And the way we interpret the raw fields is:

  • total - total count of TLS reqs
  • sum(everything_else) - count of TLS reqs that sent npn and/or alpn at all, and each stat is mutually exclusive of the others
  • both - ALPN indicates H/2+SPDY support
  • h2 - ALPN indicates H/2 (but not SPDY) support
  • alpn_spdy - ALPN indicates SPDY (but not H/2) support
  • npn_spdy - NPN negotiated SPDY
  • npn_http1 - NPN negotiated H/1

So the way we summarize this into useful categories:

  • total - (both + h2 + alpn_spdy + npn_spdy) = H/1 -only client
  • both = H/2 + SPDY client
  • h2 = H/2 -only client
  • alpn_spdy + npn_spdy = SPDY -only client

For posterity, this is the raw input data from the systemtap outputs:

bblack-mba:results bblack$ for x in data/*; do echo == $x ==; cat $x; done
== data/codfw-text ==
npn_spdy: 11503
total: 352108
both: 132412
npn_http1: 16838
h2: 27524
alpn_spdy: 15348
== data/codfw-upload ==
total: 515998
both: 213510
h2: 10967
alpn_spdy: 13434
npn_spdy: 9492
npn_http1: 13194
== data/eqiad-text ==
total: 2371468
both: 695392
npn_http1: 208858
alpn_spdy: 64813
h2: 108680
npn_spdy: 57747
== data/eqiad-upload ==
total: 1776024
both: 803344
npn_http1: 151205
npn_spdy: 40178
h2: 34020
alpn_spdy: 57187
== data/esams-text ==
total: 3603944
both: 1099183
npn_http1: 252756
h2: 230561
alpn_spdy: 135580
npn_spdy: 98353
== data/esams-upload ==
total: 2601065
both: 1236225
npn_spdy: 81747
alpn_spdy: 140342
h2: 92966
npn_http1: 204499
== data/ulsfo-text ==
total: 3781743
both: 906763
h2: 247087
npn_spdy: 118687
npn_http1: 414406
alpn_spdy: 166198
== data/ulsfo-upload ==
total: 2360306
h2: 59818
both: 687778
npn_http1: 204378
alpn_spdy: 91328
npn_spdy: 66768

And this is the script I'm using to process them:

bblack-mba:results bblack$ cat h2-proc.pl
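#!/usr/bin/perl -w
# Sum the systemtap counters from all input files and print the share of
# TLS connections in each client-capability category.

use strict;

# Accumulate raw counters (total, both, h2, alpn_spdy, npn_spdy, npn_http1)
# across every "key: value" line of input.
my %raw;
while(<>) {
    chomp;
    my ($k, $v) = split(/:\s*/, $_);
    next unless defined $v;    # skip blank or malformed lines
    $raw{$k} += $v;
}

# Derive the four mutually exclusive categories as fractions of the total.
my %out;
$out{'h1'} = ($raw{'total'} - ($raw{'both'} + $raw{'h2'} + $raw{'alpn_spdy'} + $raw{'npn_spdy'})) / $raw{'total'};
$out{'both'} = $raw{'both'} / $raw{'total'};
$out{'h2'} = $raw{'h2'} / $raw{'total'};
$out{'spdy'} = ($raw{'alpn_spdy'} + $raw{'npn_spdy'}) / $raw{'total'};

# Print one table row per category, largest share first.
foreach my $k (sort { $out{$b} <=> $out{$a} } keys %out) {
    printf("| %s | %.02f%% |\n", $k, 100 * $out{$k});
}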

And this is the results broken down by-cluster (for all DCs combined) and by-DC (for both clusters combined):

All:

bblack-mba:results bblack$ cat data/* | ./h2-proc.pl

| h1 | 55.34% |
| both | 33.26% |
| spdy | 6.73% |
| h2 | 4.67% |

Text:

bblack-mba:results bblack$ cat data/*-text | ./h2-proc.pl

| h1 | 59.29% |
| both | 28.03% |
| spdy | 6.61% |
| h2 | 6.07% |

Upload:

bblack-mba:results bblack$ cat data/*-upload | ./h2-proc.pl

| h1 | 49.83% |
| both | 40.54% |
| spdy | 6.90% |
| h2 | 2.73% |

Eqiad:

bblack-mba:results bblack$ cat data/eqiad-* | ./h2-proc.pl

| h1 | 55.12% |
| both | 36.14% |
| spdy | 5.30% |
| h2 | 3.44% |

Codfw:

bblack-mba:results bblack$ cat data/codfw-* | ./h2-proc.pl

| h1 | 49.98% |
| both | 39.85% |
| spdy | 5.73% |
| h2 | 4.43% |

Ulsfo:

bblack-mba:results bblack$ cat data/ulsfo-* | ./h2-proc.pl

| h1 | 61.83% |
| both | 25.96% |
| spdy | 7.21% |
| h2 | 5.00% |

Esams:

bblack-mba:results bblack$ cat data/esams-* | ./h2-proc.pl

| h1 | 49.80% |
| both | 37.64% |
| spdy | 7.35% |
| h2 | 5.21% |

In the overall 1H test data, the net result is that when we make the switch:

  • ~33% of our client connections will upgrade from SPDY to H/2 (which is a very minor improvement)
  • ~7% of our client connections will revert from SPDY to H/1
  • ~5% of our client connections will upgrade from H/1 to H/2

I'd say this looks like a fine tradeoff to me. 2% net drop back to H/1, vs moving forward on standards and slightly-improving things for most clients (and increasingly more as the deprecated SPDY phases out and more clients move to H/2 -capable browsers).
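
The three bullets and the ~2% net figure come straight from the "All" row of the 1H results. A short Python sketch of that mapping, with an illustrative function name and dictionary keys that are not part of any tooling here:

```python
def switch_impact(both, spdy, h2):
    """Effect of moving a SPDY-only server to HTTP/2-only, given the
    percentage of connections in each client category."""
    return {
        "spdy_to_h2": both,  # dual-capable clients swap one multiplexed protocol for the other
        "spdy_to_h1": spdy,  # SPDY-only clients lose multiplexing
        "h1_to_h2": h2,      # H/2-only clients gain multiplexing
        "net_multiplexed_change": h2 - spdy,
    }


# "All" row from the 1H sample above.
impact = switch_impact(both=33.26, spdy=6.73, h2=4.67)
for name, value in impact.items():
    print(f"{name}: {value:+.2f} percentage points")
```

The H/1-only share does not appear in the calculation because those clients are unaffected by the switch either way.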

Will report back again after 24H data is in.

(FTR: 24H sample started at 16:17 UTC, but stashbot didn't log it here)

> I'd say this looks like a fine tradeoff to me. 2% net drop back to H/1, vs moving forward on standards and slightly-improving things for most clients (and increasingly more as the deprecated SPDY phases out and more clients move to H/2 -capable browsers).

Fully agreed.

ori added a comment. Apr 28 2016, 5:44 PM

>> I'd say this looks like a fine tradeoff to me. 2% net drop back to H/1, vs moving forward on standards and slightly-improving things for most clients (and increasingly more as the deprecated SPDY phases out and more clients move to H/2 -capable browsers).

> Fully agreed.

+1

BBlack added a comment. Edited · Apr 29 2016, 4:40 PM

24H Results:

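| Set | H/1 | Both | SPDY | H/2 |
| All | 54.75% | 33.17% | 7.07% | 5.01% |
| Text | 57.58% | 28.72% | 6.99% | 6.72% |
| Upload | 50.92% | 39.20% | 7.17% | 2.71% |
| Eqiad | 50.79% | 38.80% | 6.08% | 4.33% |
| Codfw | 53.46% | 36.12% | 5.75% | 4.67% |
| Esams | 48.94% | 37.76% | 7.91% | 5.39% |
| Ulsfo | 61.93% | 25.77% | 7.15% | 5.15% |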

24H Raw Data:

bblack-mba:results bblack$ for x in data/*; do echo == $x ==; cat $x; done
== data/codfw-text ==
total: 7281449
both: 2926542
h2: 618950
npn_spdy: 281796
npn_http1: 401922
alpn_spdy: 269616
== data/codfw-upload ==
total: 10279854
both: 3416927
npn_spdy: 200915
alpn_spdy: 257386
npn_http1: 311183
h2: 201732
== data/eqiad-text ==
total: 40636664
npn_http1: 4094824
both: 13008303
alpn_spdy: 1192650
npn_spdy: 1148605
h2: 2319139
== data/eqiad-upload ==
total: 29285850
both: 14120203
alpn_spdy: 1076487
npn_http1: 2940853
npn_spdy: 834752
h2: 707442
== data/esams-text ==
total: 57042644
both: 17563745
alpn_spdy: 2296790
h2: 3973355
npn_http1: 4486148
npn_spdy: 1737478
== data/esams-upload ==
total: 41003094
both: 19462914
npn_http1: 3615906
npn_spdy: 1458771
alpn_spdy: 2259090
h2: 1313245
== data/ulsfo-text ==
total: 71180940
both: 17082379
npn_spdy: 2270869
npn_http1: 7046627
alpn_spdy: 3108626
h2: 4927614
== data/ulsfo-upload ==
total: 49756271
both: 14087558
npn_http1: 3611621
alpn_spdy: 1965456
npn_spdy: 1296549
h2: 1303043
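
One detail worth spelling out about how the per-DC and per-cluster rows are produced: the counters from each file are summed first and only then divided, which weights each host by its connection count. The Python sketch below contrasts that with naively averaging per-host percentages, using the two codfw samples above; the function names are illustrative.

```python
from collections import Counter

# 24H raw counters for the two codfw hosts, copied from the data above.
CODFW_TEXT = {"total": 7281449, "both": 2926542, "h2": 618950,
              "npn_spdy": 281796, "npn_http1": 401922, "alpn_spdy": 269616}
CODFW_UPLOAD = {"total": 10279854, "both": 3416927, "npn_spdy": 200915,
                "alpn_spdy": 257386, "npn_http1": 311183, "h2": 201732}


def aggregate(*samples):
    """Sum counters key by key across hosts."""
    totals = Counter()
    for sample in samples:
        totals.update(sample)
    return totals


def both_share(raw):
    """Percent of connections whose ALPN offered both H/2 and SPDY."""
    return 100.0 * raw["both"] / raw["total"]


pooled = both_share(aggregate(CODFW_TEXT, CODFW_UPLOAD))
naive = (both_share(CODFW_TEXT) + both_share(CODFW_UPLOAD)) / 2
print(f"pooled: {pooled:.2f}%, naive mean of host percentages: {naive:.2f}%")
```

The pooled figure matches the Codfw row in the results table, while the unweighted mean drifts away from it because the two hosts saw different connection volumes.
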
BBlack added a comment. Edited · Apr 29 2016, 4:53 PM

While the 24H data is much better quality (not so subject to daily regional highs and lows), the overall picture is still basically the same.

There are a lot of interesting side-correlations and insights in this data too. Notably to me: we know from existing LVS stats that ulsfo has ballpark 50% of the traffic rate (in bytes) that esams has, yet when counting raw TLS connection-starts in this data, ulsfo has 23% more connections-per-day than esams. When we look at the protocol data, we also see that ulsfo skews more in the H/1 direction, probably due to a more-outdated mix of clients, since most of Asia maps there. So the disparity is probably indicative of how much SPDY and H/2's connection-coalescing helps us in raw connection-count terms. I wasn't expecting that effect to be so big relative to the protocol-stats differential there...
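
The 23% figure can be checked against the 24H raw totals. A minimal Python sketch, using only the `total` counters of the sampled hosts (one text and one upload host per site), so it compares the samples rather than full-site traffic:

```python
# 24H "total" counters for the sampled hosts, copied from the raw data above.
TOTALS = {
    "esams": {"text": 57042644, "upload": 41003094},
    "ulsfo": {"text": 71180940, "upload": 49756271},
}


def site_connections(site):
    """Total sampled TLS connection starts for a site (text + upload)."""
    return sum(TOTALS[site].values())


esams = site_connections("esams")
ulsfo = site_connections("ulsfo")
extra_pct = 100.0 * (ulsfo - esams) / esams
print(f"esams: {esams}, ulsfo: {ulsfo}, ulsfo has {extra_pct:.1f}% more connections")
```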

Planning:

The next two work-weeks (May 2-6 and 9-13) are the last ones left before the Chrome SPDY cutoff on the 15th. At this point I'm tentatively planning to upgrade the nginxes for H/2 early next week, probably Tuesday May 3. This should give us some breathing room in case of any last minute complications or delays.

Change 286506 had a related patch set uploaded (by BBlack):
multicert + libssl1.0.2 patches for 1.10.0

https://gerrit.wikimedia.org/r/286506

Change 286507 had a related patch set uploaded (by BBlack):
Remove --automatic-dbgsym on dyn mod dh_strip

https://gerrit.wikimedia.org/r/286507

Change 286508 had a related patch set uploaded (by BBlack):
nginx (1.10.0-1+wmf1) jessie-wikimedia; urgency=medium

https://gerrit.wikimedia.org/r/286508

Change 284077 abandoned by BBlack:
nginx (1.9.15-1+wmf1) jessie-wikimedia; urgency=medium

https://gerrit.wikimedia.org/r/284077

Change 284920 abandoned by BBlack:
Remove --automatic-dbgsym on dyn mod dh_strip

https://gerrit.wikimedia.org/r/284920

Change 284075 abandoned by BBlack:
multicert + libssl1.0.2 patches for 1.9.15

https://gerrit.wikimedia.org/r/284075

Change 286506 merged by BBlack:
multicert + libssl1.0.2 patches for 1.10.0

https://gerrit.wikimedia.org/r/286506

Change 286507 merged by BBlack:
Remove --automatic-dbgsym on dyn mod dh_strip

https://gerrit.wikimedia.org/r/286507

Change 286508 merged by BBlack:
nginx (1.10.0-1+wmf1) jessie-wikimedia; urgency=medium

https://gerrit.wikimedia.org/r/286508

BBlack added a comment. May 2 2016, 8:28 PM

I've built new 1.10.0-1+wmf1 packages and uploaded those to carbon and upgraded cp1008. These have no true code changes from the last 1.9.15-1+wmf1 test package, just version-related churn. Still planning to start upgrading clusters during the US daytime on Tues, May 3.

Mentioned in SAL [2016-05-03T17:11:07Z] <bblack> HTTP/2 enable for cache_maps (nginx upgrade) - T96848

Change 286700 had a related patch set uploaded (by BBlack):
cache_misc: HTTP/2 T96848

https://gerrit.wikimedia.org/r/286700

Change 286700 merged by BBlack:
cache_misc: HTTP/2 T96848

https://gerrit.wikimedia.org/r/286700

Mentioned in SAL [2016-05-03T18:17:31Z] <bblack> HTTP/2 enable for cache_misc (nginx upgrade - T96848)

Change 286816 had a related patch set uploaded (by BBlack):
cache_upload HTTP/2 switch

https://gerrit.wikimedia.org/r/286816

Change 286817 had a related patch set uploaded (by BBlack):
cache_text HTTP/2 switch

https://gerrit.wikimedia.org/r/286817

Change 286818 had a related patch set uploaded (by BBlack):
remove do_spdy conditional, all h2, 1/2

https://gerrit.wikimedia.org/r/286818

Change 286819 had a related patch set uploaded (by BBlack):
remove do_spdy hieradata, all h2, 2/2

https://gerrit.wikimedia.org/r/286819

Change 286816 merged by BBlack:
cache_upload HTTP/2 switch

https://gerrit.wikimedia.org/r/286816

Change 286817 merged by BBlack:
cache_text HTTP/2 switch

https://gerrit.wikimedia.org/r/286817

BBlack added a comment. May 4 2016, 1:34 PM

text and upload have been converted now as well, so all cache clusters have made the HTTP/2 switch. I've also done a last-minute fixup to the varnishxcps script (it didn't support the key h2 in a regex), and now https://grafana.wikimedia.org/dashboard/db/client-connections shows combined spdy and h2 percentages, so that we can see the diff.

There's some noise around the transition + late varnishxcps fixup (where it goes to 100% and drops way too low, then levels back out), but ignoring that things look sane. The drop does seem around 2% vs last week so far (which seems strange to me - I would've expected more diff from our systemtap stats due to per-connection vs per-request).

ori added a comment. May 4 2016, 4:36 PM

Thanks so much, @BBlack. Rolling this out at this pace with no significant interruptions is very impressive.

Change 286818 merged by BBlack:
remove do_spdy conditional, all h2, 1/2

https://gerrit.wikimedia.org/r/286818

Change 286819 merged by BBlack:
remove do_spdy hieradata, all h2, 2/2

https://gerrit.wikimedia.org/r/286819

BBlack closed this task as Resolved. May 4 2016, 5:15 PM
BBlack claimed this task.
BBlack added a comment. Edited · Jul 11 2016, 3:11 PM

Graph of SPDY/3 and H/2 connection percentages for 1 month before and after the May 4th transition date: