Task to track our eventual TLSv1.3 support. Currently we're blocked on deploying a stable OpenSSL-1.1.1 release, but there's some prep work to be done on the ciphersuite and nginx sides as well.
See also T205378: Support ECH on Wikimedia servers
Event Timeline
Change 571985 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ssl_ciphersuite: Enable TLSv1.3 where available
Change 571988 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050
Change 571988 merged by Vgutierrez:
[operations/puppet@production] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050
Mentioned in SAL (#wikimedia-operations) [2020-02-13T14:51:09Z] <vgutierrez> test TLSv1.3 between ats-be and applayer in cp3050 - T170567
Change 572004 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Use TLSv1.3 on ats-be <--> applayer on esams
Change 572004 merged by Vgutierrez:
[operations/puppet@production] ATS: Use TLSv1.3 on ats-be <--> applayer on esams
Mentioned in SAL (#wikimedia-operations) [2020-02-13T15:27:19Z] <vgutierrez> turning on TLSv1.3 between ats-be and applayer in cp30[51-52] - T170567
Mentioned in SAL (#wikimedia-operations) [2020-02-13T15:42:20Z] <vgutierrez> rolling restart of ats-be on esams - T170567
Change 571978 merged by Vgutierrez:
[operations/puppet@production] ssl_ciphersuite: Fix TLSv1.3 ciphersuites names
Change 571985 merged by Vgutierrez:
[operations/puppet@production] ssl_ciphersuite: Enable TLSv1.3 where available
Sites running nginx or apache outside the caching cluster that have been upgraded to buster are now offering TLSv1.3. A few examples:
$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect gerrit.wikimedia.org:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_CHACHA20_POLY1305_SHA256
$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect netbox.wikimedia.org:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_CHACHA20_POLY1305_SHA256
$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect en.wikipedia.com:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
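For anyone who wants to script the same check, here is a rough Python sketch. It is not what was run above: `parse_s_client_summary` and `probe` are hypothetical helper names, and the result of `probe` depends on the TLS versions supported by the OpenSSL build that the local Python is linked against.

```python
import re
import socket
import ssl

# Matches the s_client summary line, e.g. "New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384".
SUMMARY_RE = re.compile(r"^New, (?P<protocol>[^,]+), Cipher is (?P<cipher>\S+)$")


def parse_s_client_summary(line):
    """Return (protocol, cipher) from an s_client summary line, or None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    return match.group("protocol"), match.group("cipher")


def probe(host, port=443, timeout=5.0):
    """Connect to host:port and return the negotiated (protocol, cipher) pair."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            # cipher() returns (name, protocol, secret_bits); only the name is kept here.
            return tls_sock.version(), tls_sock.cipher()[0]
```

Calling `probe("gerrit.wikimedia.org")` needs network access; `parse_s_client_summary` can be fed the grep output shown above.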
Change 571976 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable TLSv1.3 for ats-be <--> applayer communication
Mentioned in SAL (#wikimedia-operations) [2020-02-14T10:14:42Z] <vgutierrez> rolling restart of ats-be to enable TLSv1.3 against origin servers - T170567
Change 580174 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Add session_ticket_number to Inbound_TLS_settings
Change 580288 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Consider TLSv1.3 on tls.lua
Change 580174 merged by Vgutierrez:
[operations/puppet@production] ATS: Add session_ticket_number to Inbound_TLS_settings
Change 580326 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] varnish: Consider TLSv1.3 on log_xcps_info
Change 580288 merged by Vgutierrez:
[operations/puppet@production] ATS: Consider TLSv1.3 on tls.lua
Change 580326 merged by Vgutierrez:
[operations/puppet@production] varnish: Consider TLSv1.3 on log_xcps_info
Change 580742 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 for upload@ulsfo
Mentioned in SAL (#wikimedia-operations) [2020-03-18T08:14:32Z] <vgutierrez> upgrade ATS to 8.0.6-1wm3 in ulsfo - T170567
Change 580742 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 for upload@ulsfo
Mentioned in SAL (#wikimedia-operations) [2020-03-18T09:18:15Z] <vgutierrez> enabling inbound TLSv1.3 in cp4026 - T170567
Change 580868 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] prometheus: Add TLSv1.3 ciphersuites on ATS exporter
Change 580868 merged by Vgutierrez:
[operations/puppet@production] prometheus: Add TLSv1.3 ciphersuites on ATS exporter
Mentioned in SAL (#wikimedia-operations) [2020-03-18T09:43:57Z] <vgutierrez> enabling inbound TLSv1.3 in upload@ulsfo - T170567
Change 580951 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable TLS Session tickets in ulsfo
Change 580951 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable TLS Session tickets in ulsfo
Mentioned in SAL (#wikimedia-operations) [2020-03-18T14:41:32Z] <vgutierrez> disable TLS session tickets in ulsfo - T245616 T170567
Change 583292 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on upload@eqsin
Mentioned in SAL (#wikimedia-operations) [2020-03-25T09:23:34Z] <vgutierrez> upgrade ATS to 8.0.6-1wm3 on upload@eqsin - T170567
Change 583292 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on upload@eqsin
Mentioned in SAL (#wikimedia-operations) [2020-03-25T09:54:36Z] <vgutierrez> Enable inbound TLSv1.3 on upload@eqsin - T170567
Change 583715 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.6-1wm4
Change 583715 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.6-1wm4
Mentioned in SAL (#wikimedia-operations) [2020-03-27T10:04:31Z] <vgutierrez> upload trafficserver 8.0.6-1wm4 to apt.wm.o (buster) - T245616 T170567
Change 585426 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@esams
Change 585426 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@esams
Mentioned in SAL (#wikimedia-operations) [2020-04-02T08:22:19Z] <vgutierrez> Enable inbound TLSv1.3 in upload@esams - T170567
Change 585492 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@codfw
Change 585492 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@codfw
Mentioned in SAL (#wikimedia-operations) [2020-04-02T14:33:56Z] <vgutierrez> Enable inbound TLSv1.3 in upload@codfw - T170567
Change 585697 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on the upload cluster
Change 585697 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on the upload cluster
Mentioned in SAL (#wikimedia-operations) [2020-04-06T05:16:42Z] <vgutierrez> Enable inbound TLSv1.3 in upload@eqiad - T170567
Change 587423 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@ulsfo
Change 587423 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@ulsfo
Mentioned in SAL (#wikimedia-operations) [2020-04-08T13:22:14Z] <vgutierrez> enable inbound TLSv1.3 in text@ulsfo - T170567
Change 588678 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@eqsin
Change 588678 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@eqsin
Mentioned in SAL (#wikimedia-operations) [2020-04-14T12:50:39Z] <vgutierrez> Enable inbound TLSv1.3 in text@eqsin - T170567
Change 589030 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 globally
Change 589030 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 globally
Mentioned in SAL (#wikimedia-operations) [2020-04-16T10:44:08Z] <vgutierrez> rolling restart of ats-tls to enable TLSv1.3 globally and disable the old TLS session cache - T170567
This, and/or the ATS version upgrade, appears to have caused an extra regression in response time:
Still a 6.5% regression on the week-over-week p75 22 hours after this deployment.
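As a rough illustration of how a week-over-week p75 figure like the one above can be computed, here is a minimal Python sketch. The function names are hypothetical and the percentile method is an assumption; the actual dashboards may aggregate differently.

```python
import statistics


def p75(samples):
    """75th percentile of a list of timings (assumes at least two samples)."""
    # With n=4, quantiles() returns the three quartile cut points; index 2 is the 75th percentile.
    return statistics.quantiles(samples, n=4, method="inclusive")[2]


def week_over_week_change(this_week, last_week):
    """Percent change of the p75 versus the same window one week earlier (positive = slower)."""
    before = p75(last_week)
    after = p75(this_week)
    return (after - before) / before * 100.0
```

A positive result means the metric got slower compared with the previous week.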
In theory, with features like 0-RTT resumption, of course, but that doesn't mean that implementation, configuration and the real world follow suit. I don't know if we've enabled 0-RTT in this deployment. If we didn't, then I don't know if there are other areas of 1.3 that are supposed to bring improvements. I imagine there might be new ciphers involved, which might actually perform worse on the client or ATS alike.
If the regression is related to this TLS 1.3 rollout (one very easy way to find out is to undo it temporarily), it's probably not with the handshake part, as the "ssl" section of navigation timing doesn't seem to change significantly around the time of that deployment. It was already doing better week-over-week prior to April 16:
I've since found something interesting, which is that while the response start time (when the first bytes from the server arrive) is delayed, the response time (between first and last byte) has reduced at the same time:
This might suggest that TLS 1.3's nature leads the client to measure the first byte received later (maybe it actually received it on the wire at the same time), but this is partially offset by a faster transfer after that.
That being said, it's not enough to offset the regression completely, as seen through loadEventEnd later down the line, which regresses exactly at the same time:
To sum up, it seems like a regression overall, with some changes in the intermediary timeline that look like gains but aren't enough to offset the overall regression.
We haven't deployed 0-RTT at this time, but even without it a full TLSv1.3 handshake requires one RTT fewer than a full TLSv1.2 handshake. Thanks for the detailed report @Gilles, I'll hunt this down ASAP.
No, even without 0-RTT, the handshakes are faster (twice as fast, or in general 100ms faster): https://kinsta.com/blog/tls-1-3/
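To put numbers on the round-trip argument above, here is a small Python sketch. The round-trip counts are the protocol-level ones and exclude the TCP handshake; `handshake_delay_ms` and the mode labels are hypothetical names, and real-world timings also depend on implementation and configuration.

```python
# Network round trips spent on the TLS handshake before the client can send its request.
HANDSHAKE_RTTS = {
    ("TLSv1.2", "full"): 2,
    ("TLSv1.2", "resumed"): 1,
    ("TLSv1.3", "full"): 1,
    ("TLSv1.3", "resumed"): 1,
    ("TLSv1.3", "0-rtt"): 0,
}


def handshake_delay_ms(protocol, mode, rtt_ms):
    """Handshake-induced delay in milliseconds for a given client round-trip time."""
    return HANDSHAKE_RTTS[(protocol, mode)] * rtt_ms


def full_handshake_saving_ms(rtt_ms):
    """Time saved by a full TLSv1.3 handshake compared with a full TLSv1.2 one."""
    return handshake_delay_ms("TLSv1.2", "full", rtt_ms) - handshake_delay_ms("TLSv1.3", "full", rtt_ms)
```

With this model the saving equals one client RTT, so a "100ms faster" figure would correspond to a client with a round-trip time of about 100 ms.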
Re-opening and tracking as on-going perf incident per the above. As @Gilles mentioned, it would help if we can at least isolate/validate the correlation by undoing this for a few hours (if that's feasible.) If the correlation holds up, I would suggest we keep it rolled back for now as this would otherwise make two major on-going perf incidents – noting that we haven't fixed the other one yet (T238494), and there's also numerous non-perf related incidents being worked on right now with higher priority.
We really need to have per-host performance metrics (T238086) to evaluate the impact of changes like this on a single host, rather than having to spend hours to do/undo things fleet-wide.
Change 593192 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Re-enable the session ID based cache
Change 593192 merged by Vgutierrez:
[operations/puppet@production] ATS: Re-enable the TLS session ID based cache
Mentioned in SAL (#wikimedia-operations) [2020-04-29T09:10:18Z] <vgutierrez> starting rolling restart of ats-tls to enable the TLS session ID based cache - T170567
(See also Navigation Timing metrics spec.)
| Metric | Definition |
|---|---|
| request | (after dns+tcp+tls) delta from requestStart to responseStart |
| response | delta from responseStart to responseEnd |
| responseStart | overall time to first byte from the beginning (includes dns+tcp+tls+request) |
| domInteractive | overall time to end of parsed HTML response (includes dns+tcp+tls+request+response+client-side HTML parsing) |
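For reference, a minimal Python sketch of how these four metrics can be derived from raw Navigation Timing timestamps. The field names follow the browser's Navigation Timing API; taking `navigationStart` as "the beginning" is an assumption, and `navtiming_metrics` is a hypothetical helper, not the actual collection code.

```python
def navtiming_metrics(timing):
    """Derive the four metrics from a dict of Navigation Timing timestamps (milliseconds)."""
    start = timing["navigationStart"]  # assumed to be the "beginning" of the page load
    return {
        # Request sent until first byte received (after dns+tcp+tls).
        "request": timing["responseStart"] - timing["requestStart"],
        # First byte until last byte of the response.
        "response": timing["responseEnd"] - timing["responseStart"],
        # Overall time to first byte from the beginning.
        "responseStart": timing["responseStart"] - start,
        # Overall time until the HTML has been parsed.
        "domInteractive": timing["domInteractive"] - start,
    }
```

Under these definitions a later first byte raises both `request` and `responseStart`, while a faster transfer afterwards only shows up in `response`.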
Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)
This is done, isn't it? The performance issues are being mitigated by migrating to nginx light I think (someone needs to double check)
TLSv1.3 is up & running, performance issues are being mitigated by replacing ats-tls with envoy or haproxy in the short term :)
TLSv1.3 has been working for quite some time! Any other issues should be in other tickets (and are, in some cases!).