
Support TLSv1.3
Closed, Resolved (Public)

Assigned To
Authored By
BBlack
Jul 13 2017, 1:59 PM

Description

Task to track our eventual TLSv1.3 support. Currently we're blocked on deploying a stable OpenSSL-1.1.1 release, but there's some prep work to be done on the ciphersuite and nginx sides as well. See also T205378: Support ECH on Wikimedia servers.

Event Timeline

Change 571985 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ssl_ciphersuite: Enable TLSv1.3 where available

https://gerrit.wikimedia.org/r/571985

Change 571988 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050

https://gerrit.wikimedia.org/r/571988

Change 571988 merged by Vgutierrez:
[operations/puppet@production] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050

https://gerrit.wikimedia.org/r/571988

Mentioned in SAL (#wikimedia-operations) [2020-02-13T14:51:09Z] <vgutierrez> test TLSv1.3 between ats-be and applayer in cp3050 - T170567

Change 572004 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Use TLSv1.3 on ats-be <--> applayer on esams

https://gerrit.wikimedia.org/r/572004

Change 572004 merged by Vgutierrez:
[operations/puppet@production] ATS: Use TLSv1.3 on ats-be <--> applayer on esams

https://gerrit.wikimedia.org/r/572004

Mentioned in SAL (#wikimedia-operations) [2020-02-13T15:27:19Z] <vgutierrez> turning on TLSv1.3 between ats-be and applayer in cp30[51-52] - T170567

Mentioned in SAL (#wikimedia-operations) [2020-02-13T15:42:20Z] <vgutierrez> rolling restart of ats-be on esams - T170567

Change 571978 merged by Vgutierrez:
[operations/puppet@production] ssl_ciphersuite: Fix TLSv1.3 ciphersuites names

https://gerrit.wikimedia.org/r/571978

Change 571985 merged by Vgutierrez:
[operations/puppet@production] ssl_ciphersuite: Enable TLSv1.3 where available

https://gerrit.wikimedia.org/r/571985

Sites running nginx or apache outside the caching cluster that have been upgraded to buster are now offering TLSv1.3. A few examples:

$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect gerrit.wikimedia.org:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_CHACHA20_POLY1305_SHA256
$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect netbox.wikimedia.org:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_CHACHA20_POLY1305_SHA256
$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect en.wikipedia.com:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
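
For anyone wanting to script the same check, here is a minimal sketch. It assumes an `openssl` binary from OpenSSL 1.1.1 or later is on the PATH; `tls_summary` and `check_tls13` are hypothetical helper names, not anything deployed as part of this task. `-tls1_3` restricts `s_client` to TLSv1.3 only, so a server that can't negotiate it won't silently fall back to TLSv1.2.

```shell
# tls_summary (hypothetical helper): read `openssl s_client` output on stdin
# and print "<protocol> <cipher>" from the "New, ..., Cipher is ..." line.
tls_summary() {
    sed -n 's/^New, \([^,]*\), Cipher is \(.*\)$/\1 \2/p'
}

# check_tls13 (hypothetical helper): attempt a TLSv1.3-only handshake with
# the host given as $1. Needs network access and OpenSSL 1.1.1 or later.
check_tls13() {
    openssl s_client -connect "$1:443" -servername "$1" -tls1_3 2>&1 </dev/null | tls_summary
}
```

Usage would be something like `check_tls13 gerrit.wikimedia.org`, which prints the negotiated protocol and cipher on one line.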

Change 571976 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable TLSv1.3 for ats-be <--> applayer communication

https://gerrit.wikimedia.org/r/571976

Mentioned in SAL (#wikimedia-operations) [2020-02-14T10:14:42Z] <vgutierrez> rolling restart of ats-be to enable TLSv1.3 against origin servers - T170567

Change 580174 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Add session_ticket_number to Inbound_TLS_settings

https://gerrit.wikimedia.org/r/580174

Change 580288 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Consider TLSv1.3 on tls.lua

https://gerrit.wikimedia.org/r/580288

Change 580174 merged by Vgutierrez:
[operations/puppet@production] ATS: Add session_ticket_number to Inbound_TLS_settings

https://gerrit.wikimedia.org/r/580174

Change 580326 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] varnish: Consider TLSv1.3 on log_xcps_info

https://gerrit.wikimedia.org/r/580326

Change 580288 merged by Vgutierrez:
[operations/puppet@production] ATS: Consider TLSv1.3 on tls.lua

https://gerrit.wikimedia.org/r/580288

Change 580326 merged by Vgutierrez:
[operations/puppet@production] varnish: Consider TLSv1.3 on log_xcps_info

https://gerrit.wikimedia.org/r/580326

Change 580742 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 for upload@ulsfo

https://gerrit.wikimedia.org/r/580742

Mentioned in SAL (#wikimedia-operations) [2020-03-18T08:14:32Z] <vgutierrez> upgrade ATS to 8.0.6-1wm3 in ulsfo - T170567

Change 580742 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 for upload@ulsfo

https://gerrit.wikimedia.org/r/580742

Mentioned in SAL (#wikimedia-operations) [2020-03-18T09:18:15Z] <vgutierrez> enabling inbound TLSv1.3 in cp4026 - T170567

Change 580868 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] prometheus: Add TLSv1.3 ciphersuites on ATS exporter

https://gerrit.wikimedia.org/r/580868

Change 580868 merged by Vgutierrez:
[operations/puppet@production] prometheus: Add TLSv1.3 ciphersuites on ATS exporter

https://gerrit.wikimedia.org/r/580868

Mentioned in SAL (#wikimedia-operations) [2020-03-18T09:43:57Z] <vgutierrez> enabling inbound TLSv1.3 in upload@ulsfo - T170567

Change 580951 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable TLS Session tickets in ulsfo

https://gerrit.wikimedia.org/r/580951

Change 580951 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable TLS Session tickets in ulsfo

https://gerrit.wikimedia.org/r/580951

Change 583292 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on upload@eqsin

https://gerrit.wikimedia.org/r/583292

Mentioned in SAL (#wikimedia-operations) [2020-03-25T09:23:34Z] <vgutierrez> upgrade ATS to 8.0.6-1wm3 on upload@eqsin - T170567

Change 583292 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on upload@eqsin

https://gerrit.wikimedia.org/r/583292

Mentioned in SAL (#wikimedia-operations) [2020-03-25T09:54:36Z] <vgutierrez> Enable inbound TLSv1.3 on upload@eqsin - T170567

Change 583715 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.6-1wm4

https://gerrit.wikimedia.org/r/583715

Change 583715 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.6-1wm4

https://gerrit.wikimedia.org/r/583715

Mentioned in SAL (#wikimedia-operations) [2020-03-27T10:04:31Z] <vgutierrez> upload trafficserver 8.0.6-1wm4 to apt.wm.o (buster) - T245616 T170567

Change 585426 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@esams

https://gerrit.wikimedia.org/r/585426

Change 585426 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@esams

https://gerrit.wikimedia.org/r/585426

Mentioned in SAL (#wikimedia-operations) [2020-04-02T08:22:19Z] <vgutierrez> Enable inbound TLSv1.3 in upload@esams - T170567

Change 585492 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@codfw

https://gerrit.wikimedia.org/r/585492

Change 585492 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@codfw

https://gerrit.wikimedia.org/r/585492

Mentioned in SAL (#wikimedia-operations) [2020-04-02T14:33:56Z] <vgutierrez> Enable inbound TLSv1.3 in upload@codfw - T170567

Change 585697 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on the upload cluster

https://gerrit.wikimedia.org/r/585697

Change 585697 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on the upload cluster

https://gerrit.wikimedia.org/r/585697

Mentioned in SAL (#wikimedia-operations) [2020-04-06T05:16:42Z] <vgutierrez> Enable inbound TLSv1.3 in upload@eqiad - T170567

Change 587423 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@ulsfo

https://gerrit.wikimedia.org/r/587423

Change 587423 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@ulsfo

https://gerrit.wikimedia.org/r/587423

Mentioned in SAL (#wikimedia-operations) [2020-04-08T13:22:14Z] <vgutierrez> enable inbound TLSv1.3 in text@ulsfo - T170567

Change 588678 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@eqsin

https://gerrit.wikimedia.org/r/588678

Change 588678 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@eqsin

https://gerrit.wikimedia.org/r/588678

Mentioned in SAL (#wikimedia-operations) [2020-04-14T12:50:39Z] <vgutierrez> Enable inbound TLSv1.3 in text@eqsin - T170567

Change 589030 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 globally

https://gerrit.wikimedia.org/r/589030

Change 589030 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 globally

https://gerrit.wikimedia.org/r/589030

Mentioned in SAL (#wikimedia-operations) [2020-04-16T10:44:08Z] <vgutierrez> rolling restart of ats-tls to enable TLSv1.3 globally and disable the old TLS session cache - T170567

TLSv1.3 is now available on both text and upload clusters :)

This, and/or the ATS version upgrade, appears to have caused an extra regression in response time:

https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&var-source=navtiming2&var-metric=responseStart&var-percentile=p50&from=1587009985039&to=1587111099058

Screenshot 2020-04-17 at 10.15.46.png (378×1 px, 78 KB)

Still a 6.5% regression on the week-over-week p75 22 hours after this deployment.
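
For reference, a week-over-week figure like this is just the relative change between the two p75 values. A minimal sketch, assuming the values are in milliseconds; `wow_change_pct` is a hypothetical helper and the sample numbers are made up for illustration, not read from the dashboard:

```shell
# wow_change_pct (hypothetical helper): percent change of this week's value
# ($2) relative to last week's ($1); positive means slower (a regression).
wow_change_pct() {
    awk -v prev="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (cur - prev) / prev * 100 }'
}

# Made-up sample values: last week's p75 = 400 ms, this week's = 426 ms.
wow_change_pct 400 426   # prints 6.5
```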

That's weird, TLSv1.3 is famous for being faster than v1.2.

In theory, with features like 0-RTT resumption, of course, but that doesn't mean that implementation, configuration and the real world follow suit with the theory. I don't know if we've enabled 0-RTT in this deployment. If we didn't, then I don't know if there are other areas of 1.3 that are supposed to bring improvements. I imagine there might be new ciphers involved, which might actually perform worse on the client or ATS alike.

If the regression is related to this TLS 1.3 rollout (one very easy way to find out is to undo it temporarily), it's probably not with the handshake part, as the "ssl" section of navigation timing doesn't seem to change significantly around the time of that deployment. It was already doing better week-over-week prior to April 16:

https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&refresh=5m&var-source=navtiming2&var-metric=ssl&var-percentile=p50

I've since found something interesting, which is that while the response start time (when the first bytes from the server arrive) is delayed, the response time (between first and last byte) has reduced at the same time:

https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&refresh=5m&var-source=navtiming2&var-metric=response&var-percentile=p50

Screenshot 2020-04-17 at 15.15.06.png (377×1 px, 84 KB)

This might suggest that TLS 1.3's nature leads the client to measure the first byte received later (maybe it actually received it on the wire at the same time), but this is partially offset by a faster transfer after that.

That being said, it's not enough to offset the regression completely, as seen through loadEventEnd later down the line, which regresses exactly at the same time:

Screenshot 2020-04-17 at 15.17.45.png (380×1 px, 101 KB)

To sum up, it seems like a regression overall, with some changes in the intermediary timeline that look like gains but aren't enough to offset the overall regression.

We haven't deployed 0-RTT at this time, but even without it, a full TLSv1.3 handshake requires one RTT less than a full TLSv1.2 handshake. Thanks for the detailed report @Gilles, I'll hunt this down ASAP.

No, even without 0-RTT, the handshakes are faster (twice as fast, or in general about 100ms faster): https://kinsta.com/blog/tls-1-3/
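
A back-of-the-envelope sketch of the round-trip arithmetic, under stated assumptions: full handshakes only (no resumption, no 0-RTT), one round trip for the TCP handshake, two for a full TLSv1.2 handshake and one for a full TLSv1.3 handshake. It ignores server processing and crypto cost, so it says nothing about the regression measured above. `setup_ms` is a hypothetical helper and the 100 ms RTT is an illustrative value.

```shell
# setup_ms (hypothetical helper): estimate connection setup time in ms before
# the first HTTP request can be sent, for a full handshake with no resumption.
# Args: RTT in ms, TLS version ("1.2" or "1.3").
setup_ms() {
    rtt=$1
    case "$2" in
        1.2) tls_rtts=2 ;;   # full TLSv1.2 handshake: two round trips
        1.3) tls_rtts=1 ;;   # full TLSv1.3 handshake: one round trip
        *) echo "unknown TLS version: $2" >&2; return 1 ;;
    esac
    # One extra round trip for the TCP three-way handshake.
    echo $(( rtt * (1 + tls_rtts) ))
}

setup_ms 100 1.2   # prints 300
setup_ms 100 1.3   # prints 200
```

With these assumptions the saving is exactly one RTT, so a client 100 ms away from the edge would save 100 ms per full handshake.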

Krinkle added a project: Wikimedia-Incident.
Krinkle subscribed.

Re-opening and tracking as an ongoing perf incident per the above. As @Gilles mentioned, it would help if we can at least isolate/validate the correlation by undoing this for a few hours (if that's feasible). If the correlation holds up, I would suggest we keep it rolled back for now, as this would otherwise make two major ongoing perf incidents, noting that we haven't fixed the other one yet (T238494), and there are also numerous non-perf related incidents being worked on right now with higher priority.

We really need to have per-host performance metrics (T238086) to evaluate the impact of changes like this on a single host, rather than having to spend hours to do/undo things fleet-wide.

Change 593192 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Re-enable the session ID based cache

https://gerrit.wikimedia.org/r/593192

Change 593192 merged by Vgutierrez:
[operations/puppet@production] ATS: Re-enable the TLS session ID based cache

https://gerrit.wikimedia.org/r/593192

Mentioned in SAL (#wikimedia-operations) [2020-04-29T09:10:18Z] <vgutierrez> starting rolling restart of ats-tls to enable the TLS session ID based cache - T170567

(See also Navigation Timing metrics spec.)

request
delta from requestStart to responseStart (after dns+tcp+tls).
Screenshot 2020-05-22 at 01.37.37.png (1×2 px, 291 KB)
response
delta from responseStart to responseEnd.
Screenshot 2020-05-22 at 01.38.19.png (1×2 px, 269 KB)
responseStart
overall time to first byte from the beginning (includes dns+tcp+tls+request)
Screenshot 2020-05-22 at 01.30.01.png (1×2 px, 463 KB)
domInteractive
overall time to end of parsed HTML response
(includes dns+tcp+tls+request+response+client-side HTML parsing)
Screenshot 2020-05-22 at 01.38.51.png (1×2 px, 375 KB)
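
The metrics listed above can be sketched as simple subtractions over Navigation Timing timestamps. This assumes the timestamps are in milliseconds relative to the start of navigation; `navtiming_deltas` is a hypothetical helper and the sample values are made up, not taken from the dashboards in this task.

```shell
# navtiming_deltas (hypothetical helper): derive the metrics listed above
# from Navigation Timing timestamps given in ms since navigation start.
# Args: requestStart responseStart responseEnd domInteractive
navtiming_deltas() {
    echo "request=$(( $2 - $1 ))"    # requestStart -> responseStart
    echo "response=$(( $3 - $2 ))"   # responseStart -> responseEnd
    echo "responseStart=$2"          # time to first byte from the beginning
    echo "domInteractive=$4"         # end of HTML parsing from the beginning
}

# Made-up sample values, not measurements from this task.
navtiming_deltas 120 300 340 600
```

A later first byte combined with a shorter transfer, as described earlier in this thread, would show up here as a larger `request` and a smaller `response`.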

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)

This is done, isn't it? The performance issues are being mitigated by migrating to nginx light I think (someone needs to double check)

TLSv1.3 is up & running, performance issues are being mitigated by replacing ats-tls with envoy or haproxy in the short term :)

BBlack assigned this task to Vgutierrez.

TLSv1.3 has been working for quite some time! Any other issues should be in other tickets (and are, in some cases!).