
HTTPS performance tuning
Closed, Resolved · Public

Description

As part of our HTTPS scalability efforts, we should work towards reducing the performance hit that HTTPS users incur right now.

More specifically:

  • SNI, so that we can send smaller, targeted certificates to users. Rolled out in Nov/Dec 2014.
  • ECDSA Hybrid certificates (tracked separately, T86654)
  • Enable ALPN/NPN, without SPDY. This will signal UAs to use TLS False Start.
  • Tune to smaller TLS record sizes, potentially sized dynamically
  • OCSP stapling
  • Session cache tuning (check for hit ratio, increase cache, rollovers)

SPDY, although related, is not on-topic for this. There's T35890 tracking progress for that one.

For the most part, all of the above depend on moving to a newer platform, cf. T86648.
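
To make the "smaller TLS record sizes, potentially dynamic" item concrete, here is a minimal sketch of one common policy: start each connection with records that fit in about one TCP segment, grow to the TLS maximum once the connection has moved enough data, and fall back to small records after an idle gap. The class name and every threshold are illustrative assumptions, not values taken from this task or from our nginx configuration.

```python
from dataclasses import dataclass

# All thresholds are illustrative assumptions, not values from this task.
SMALL_RECORD = 1400              # roughly one TCP segment of payload
LARGE_RECORD = 16 * 1024         # TLS maximum plaintext record size (2^14 bytes)
GROW_AFTER_BYTES = 1024 * 1024   # switch to large records after this much data
IDLE_RESET_SECONDS = 1.0         # fall back to small records after an idle gap


@dataclass
class RecordSizer:
    """Hypothetical per-connection policy that picks a TLS record size per write."""
    bytes_sent: int = 0
    last_write: float = 0.0

    def next_size(self, now: float) -> int:
        # After an idle gap, treat the connection as fresh and start small again.
        if now - self.last_write > IDLE_RESET_SECONDS:
            self.bytes_sent = 0
        if self.bytes_sent < GROW_AFTER_BYTES:
            size = SMALL_RECORD
        else:
            size = LARGE_RECORD
        self.bytes_sent += size
        self.last_write = now
        return size
```

Small records let the client decrypt and use the first bytes of a response without waiting for a full 16 KB record to arrive; large records reduce per-record overhead once throughput matters more than latency.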

Event Timeline

faidon created this task. Jan 13 2015, 2:51 PM
faidon raised the priority of this task from to High.
faidon updated the task description.
faidon added subscribers: Aklapper, faidon, mark, BBlack.
Seb35 added a subscriber: Seb35. Jan 22 2015, 3:02 PM
BBlack added a comment. Edited · Feb 9 2015, 9:00 PM

NPN is working with the current jessie-based stack.
ALPN is apparently supported by the nginx codebase we're running, but will require an upgrade to OpenSSL 1.0.2 (jessie's on 1.0.1k) and a rebuild of nginx against that.
OpenSSL 1.0.2 is currently in Debian experimental ( https://packages.debian.org/experimental/openssl )
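
As a rough way to check the version requirement described above, the sketch below compares an OpenSSL version tuple against 1.0.2 (the release that added ALPN) and reports what the local Python build sees. It is only a proxy: it reflects the OpenSSL that Python is linked against, which is an assumption about the host and says nothing definitive about how nginx was built. The function names are made up for illustration.

```python
import ssl


def openssl_supports_alpn(version_info):
    """True if an OpenSSL version tuple is at least 1.0.2, where ALPN was added."""
    return tuple(version_info[:3]) >= (1, 0, 2)


def describe_local_tls():
    """Summarise what the local Python/OpenSSL build reports (not nginx's build)."""
    return {
        "openssl": ssl.OPENSSL_VERSION,
        "alpn": ssl.HAS_ALPN,
        # HAS_NPN may be absent or False on newer builds, so read it defensively.
        "npn": getattr(ssl, "HAS_NPN", False),
    }


if __name__ == "__main__":
    print(describe_local_tls())
    print("ALPN-capable OpenSSL:", openssl_supports_alpn(ssl.OPENSSL_VERSION_INFO))
```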

faidon updated the task description. Mar 9 2015, 4:27 PM
faidon set Security to None.

Congrats on getting jessie into prod! Can we project an ETA yet for the remaining tuning tasks? Thanks.

OCSP Stapling should happen by the end of the week, I think; worst case, mid-next-week. I've been merging up some of Faidon's previous puppet cert refactoring today, which I think will form the basis of how we puppetize this for production. That is the only hard part remaining; I've already sorted out the technicalities of doing it manually on our test instance.
Session cache tuning is a bit of an unknown, except that we know our current tuning is not totally unreasonable.
ECDSA is all about waiting on our vendor at this point, as Faidon has noted in T86654.

Nemo_bis updated the task description. Mar 18 2015, 12:14 PM

OCSP Stapling commits are here now: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:ocsp-stapling,n,z

Holding for the weekend, will get some Faidon-review and then merge->test->deploy early next week.

Change 198110 had a related patch set uploaded (by Faidon Liambotis):
protoproxy/sslcert/cache: nginx ssl_stapling_file support

https://gerrit.wikimedia.org/r/198110

Change 198110 merged by BBlack:
protoproxy/sslcert/cache: nginx ssl_stapling_file support

https://gerrit.wikimedia.org/r/198110

OCSP Stapling testing on cp1008 looks good so far. I want to leave it on just the test host for a day or two first, though, to observe at least the medium-term behavior of the updater script + icinga checks before deploying wider. So probably Thursday for the prod clusters.
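
For readers wondering what a check on the stapled response might look at, here is a minimal sketch of a freshness test in the style of an Icinga/Nagios plugin. It is not the actual updater script or check from the commits above; the function name, the 24-hour warning margin, and the idea of feeding it the response's thisUpdate/nextUpdate times are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios/Icinga plugin exit codes


def staple_state(this_update, next_update, now, warn_margin=timedelta(hours=24)):
    """Classify an OCSP response by its validity window (hypothetical policy)."""
    # Not yet valid, or already expired: the staple should not be served.
    if now < this_update or now >= next_update:
        return CRITICAL
    # Still valid, but the updater should have refreshed it by now.
    if next_update - now <= warn_margin:
        return WARNING
    return OK


if __name__ == "__main__":
    issued = datetime(2015, 3, 24, tzinfo=timezone.utc)
    expires = issued + timedelta(days=7)
    print(staple_state(issued, expires, datetime.now(timezone.utc)))
```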

BBlack updated the task description. Mar 25 2015, 6:42 PM
BBlack added a subscriber: ori. Mar 25 2015, 6:45 PM

OCSP went out to all clusters today, as the testing over the past ~24H looked pretty good. Filed a future task to improve the robustness and adaptability of the updater: T93927.

Also, @ori found a TCP optimization we could use, which was merged last night as well: https://gerrit.wikimedia.org/r/#/c/199556/

akosiaris lowered the priority of this task from High to Normal. Aug 7 2015, 1:37 PM
akosiaris added a subscriber: akosiaris.
Restricted Application added a subscriber: Matanya. Aug 7 2015, 1:37 PM
BBlack updated the task description. Aug 7 2015, 1:39 PM
BBlack closed this task as Resolved. Aug 7 2015, 1:42 PM
BBlack claimed this task.

Basically everything we talked about here is already well-handled and/or we moved in a different direction and this ticket's text is simply outdated. Of course, perf is always something we'll keep improving on. The only real outstanding thing here is about session caching: we're doing about as well as we can with raw session IDs and ipvs-sh, and the next step is RFC 5077 really. We can re-open the old ticket for that (T86671) or make a new one...
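
To illustrate why source hashing matters for session-ID caching, here is a minimal sketch of mapping a client IP to a backend deterministically, so a returning client tends to land on the server that still holds its cached session. This is not the kernel's actual ipvs sh hash; the hash choice, function name, and backend names are assumptions for illustration only.

```python
import hashlib
import ipaddress


def pick_backend(client_ip, backends):
    """Map a client IP to one backend deterministically (illustrative hash only)."""
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    pool = ["cp-a", "cp-b", "cp-c"]  # hypothetical backend names
    # The same client address always selects the same backend for a fixed pool.
    print(pick_backend("198.51.100.7", pool))
    print(pick_backend("2001:db8::1", pool))
```

With RFC 5077 session tickets, the session state is carried by the client, encrypted under a key the server holds, so resumption no longer depends on reaching the backend that has the session cached, provided the backends share the ticket key.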

BBlack moved this task from Triage to Done on the Traffic board. Aug 7 2015, 1:44 PM