
Support TLSv1.3
Open, Medium, Public

Description

Task to track our eventual TLSv1.3 support. Currently we're blocked on deploying a stable OpenSSL-1.1.1 release, but there's some prep work to be done on the ciphersuite and nginx sides as well. See also T205378: Enable ESNI support on Wikimedia servers.

Details

Project                        Branch      Lines +/-
operations/puppet              production  +2 -2
operations/puppet              production  +22 -88
operations/puppet              production  +22 -0
operations/puppet              production  +22 -0
operations/puppet              production  +44 -187
operations/puppet              production  +42 -0
operations/puppet              production  +42 -0
operations/debs/trafficserver  master      +23 -0
operations/puppet              production  +21 -0
operations/puppet              production  +2 -4
operations/puppet              production  +6 -0
operations/puppet              production  +22 -0
operations/puppet              production  +21 -0
operations/puppet              production  +17 -0
operations/puppet              production  +10 -0
operations/puppet              production  +2 -12
operations/puppet              production  +16 -2
operations/puppet              production  +3 -3
operations/puppet              production  +11 -11
operations/puppet              production  +11 -0
operations/puppet              production  +5 -2

Related Objects

Status    Assigned
Stalled   None
Open      Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Open      Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez
Declined  Vgutierrez
Resolved  Vgutierrez
Resolved  Vgutierrez

Event Timeline


Change 571976 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable TLSv1.3 for ats-be <--> applayer communication

https://gerrit.wikimedia.org/r/571976

Change 571978 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ssl_ciphersuite: Fix TLSv1.3 ciphersuites names

https://gerrit.wikimedia.org/r/571978

Change 571976 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable TLSv1.3 for ats-be <--> applayer communication

https://gerrit.wikimedia.org/r/571976


Change 571985 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ssl_ciphersuite: Enable TLSv1.3 where available

https://gerrit.wikimedia.org/r/571985

Change 571988 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050

https://gerrit.wikimedia.org/r/571988

Change 571988 merged by Vgutierrez:
[operations/puppet@production] ATS: Test TLSv1.3 on ats-be <--> applayer communication on cp3050

https://gerrit.wikimedia.org/r/571988

Mentioned in SAL (#wikimedia-operations) [2020-02-13T14:51:09Z] <vgutierrez> test TLSv1.3 between ats-be and applayer in cp3050 - T170567

Change 572004 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Use TLSv1.3 on ats-be <--> applayer on esams

https://gerrit.wikimedia.org/r/572004

Change 572004 merged by Vgutierrez:
[operations/puppet@production] ATS: Use TLSv1.3 on ats-be <--> applayer on esams

https://gerrit.wikimedia.org/r/572004

Mentioned in SAL (#wikimedia-operations) [2020-02-13T15:27:19Z] <vgutierrez> turning on TLSv1.3 between ats-be and applayer in cp30[51-52] - T170567

Mentioned in SAL (#wikimedia-operations) [2020-02-13T15:42:20Z] <vgutierrez> rolling restart of ats-be on esams - T170567

Change 571978 merged by Vgutierrez:
[operations/puppet@production] ssl_ciphersuite: Fix TLSv1.3 ciphersuites names

https://gerrit.wikimedia.org/r/571978

Change 571985 merged by Vgutierrez:
[operations/puppet@production] ssl_ciphersuite: Enable TLSv1.3 where available

https://gerrit.wikimedia.org/r/571985

Sites running nginx or Apache outside the caching cluster that have been upgraded to buster are now offering TLSv1.3. A few examples:

$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect gerrit.wikimedia.org:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_CHACHA20_POLY1305_SHA256
$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect netbox.wikimedia.org:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_CHACHA20_POLY1305_SHA256
$ /usr/local/opt/openssl@1.1/bin/openssl s_client -connect en.wikipedia.com:443 2>&1 < /dev/null |grep -i Cipher
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
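
The same check can be scripted. The following is only a minimal sketch, not something used in this rollout: `parse_cipher_line` and `probe` are hypothetical helper names. The first parses the summary line that `openssl s_client` prints above; the second uses Python's standard `ssl` module to report what a server negotiates.

```python
import re
import socket
import ssl
from typing import Optional, Tuple

# Matches the summary line printed by `openssl s_client`, e.g.
# "New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384"
_CIPHER_LINE = re.compile(r"^New, (?P<proto>[^,]+), Cipher is (?P<cipher>\S+)$")


def parse_cipher_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (protocol, cipher) from an s_client summary line, or None."""
    match = _CIPHER_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("proto"), match.group("cipher")


def probe(host: str, port: int = 443, timeout: float = 5.0) -> Tuple[Optional[str], str]:
    """Connect to host:port and return the negotiated (protocol, cipher).

    Hypothetical helper: it needs network access and a local OpenSSL
    build that supports TLSv1.3 in order to negotiate it.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # tls.version() returns e.g. "TLSv1.3"; tls.cipher() returns a
            # (name, protocol, secret_bits) tuple.
            return tls.version(), tls.cipher()[0]
```

For example, `probe("gerrit.wikimedia.org")` would be expected to report TLSv1.3 on a client that supports it, matching the `s_client` output above.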

Change 571976 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable TLSv1.3 for ats-be <--> applayer communication

https://gerrit.wikimedia.org/r/571976

Mentioned in SAL (#wikimedia-operations) [2020-02-14T10:14:42Z] <vgutierrez> rolling restart of ats-be to enable TLSv1.3 against origin servers - T170567

CDanis added a subscriber: CDanis. Mar 2 2020, 9:32 AM

Change 580174 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Add session_ticket_number to Inbound_TLS_settings

https://gerrit.wikimedia.org/r/580174

Change 580288 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Consider TLSv1.3 on tls.lua

https://gerrit.wikimedia.org/r/580288

Change 580174 merged by Vgutierrez:
[operations/puppet@production] ATS: Add session_ticket_number to Inbound_TLS_settings

https://gerrit.wikimedia.org/r/580174

Change 580326 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] varnish: Consider TLSv1.3 on log_xcps_info

https://gerrit.wikimedia.org/r/580326

Change 580288 merged by Vgutierrez:
[operations/puppet@production] ATS: Consider TLSv1.3 on tls.lua

https://gerrit.wikimedia.org/r/580288

Change 580326 merged by Vgutierrez:
[operations/puppet@production] varnish: Consider TLSv1.3 on log_xcps_info

https://gerrit.wikimedia.org/r/580326

Change 580742 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 for upload@ulsfo

https://gerrit.wikimedia.org/r/580742

Mentioned in SAL (#wikimedia-operations) [2020-03-18T08:14:32Z] <vgutierrez> upgrade ATS to 8.0.6-1wm3 in ulsfo - T170567

Change 580742 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 for upload@ulsfo

https://gerrit.wikimedia.org/r/580742

Mentioned in SAL (#wikimedia-operations) [2020-03-18T09:18:15Z] <vgutierrez> enabling inbound TLSv1.3 in cp4026 - T170567

Change 580868 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] prometheus: Add TLSv1.3 ciphersuites on ATS exporter

https://gerrit.wikimedia.org/r/580868

Change 580868 merged by Vgutierrez:
[operations/puppet@production] prometheus: Add TLSv1.3 ciphersuites on ATS exporter

https://gerrit.wikimedia.org/r/580868

Mentioned in SAL (#wikimedia-operations) [2020-03-18T09:43:57Z] <vgutierrez> enabling inbound TLSv1.3 in upload@ulsfo - T170567

Change 580951 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable TLS Session tickets in ulsfo

https://gerrit.wikimedia.org/r/580951

Change 580951 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable TLS Session tickets in ulsfo

https://gerrit.wikimedia.org/r/580951

Change 583292 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on upload@eqsin

https://gerrit.wikimedia.org/r/583292

Mentioned in SAL (#wikimedia-operations) [2020-03-25T09:23:34Z] <vgutierrez> upgrade ATS to 8.0.6-1wm3 on upload@eqsin - T170567

Change 583292 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on upload@eqsin

https://gerrit.wikimedia.org/r/583292

Mentioned in SAL (#wikimedia-operations) [2020-03-25T09:54:36Z] <vgutierrez> Enable inbound TLSv1.3 on upload@eqsin - T170567

Change 583715 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.6-1wm4

https://gerrit.wikimedia.org/r/583715

Change 583715 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.6-1wm4

https://gerrit.wikimedia.org/r/583715

Mentioned in SAL (#wikimedia-operations) [2020-03-27T10:04:31Z] <vgutierrez> upload trafficserver 8.0.6-1wm4 to apt.wm.o (buster) - T245616 T170567

Change 585426 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@esams

https://gerrit.wikimedia.org/r/585426

Change 585426 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@esams

https://gerrit.wikimedia.org/r/585426

Mentioned in SAL (#wikimedia-operations) [2020-04-02T08:22:19Z] <vgutierrez> Enable inbound TLSv1.3 in upload@esams - T170567

Change 585492 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@codfw

https://gerrit.wikimedia.org/r/585492

Change 585492 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in upload@codfw

https://gerrit.wikimedia.org/r/585492

Mentioned in SAL (#wikimedia-operations) [2020-04-02T14:33:56Z] <vgutierrez> Enable inbound TLSv1.3 in upload@codfw - T170567

Change 585697 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on the upload cluster

https://gerrit.wikimedia.org/r/585697

Change 585697 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 on the upload cluster

https://gerrit.wikimedia.org/r/585697

Mentioned in SAL (#wikimedia-operations) [2020-04-06T05:16:42Z] <vgutierrez> Enable inbound TLSv1.3 in upload@eqiad - T170567

Change 587423 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@ulsfo

https://gerrit.wikimedia.org/r/587423

Change 587423 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@ulsfo

https://gerrit.wikimedia.org/r/587423

Mentioned in SAL (#wikimedia-operations) [2020-04-08T13:22:14Z] <vgutierrez> enable inbound TLSv1.3 in text@ulsfo - T170567

Change 588678 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@eqsin

https://gerrit.wikimedia.org/r/588678

Change 588678 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 in text@eqsin

https://gerrit.wikimedia.org/r/588678

Mentioned in SAL (#wikimedia-operations) [2020-04-14T12:50:39Z] <vgutierrez> Enable inbound TLSv1.3 in text@eqsin - T170567

Change 589030 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable inbound TLSv1.3 globally

https://gerrit.wikimedia.org/r/589030

Change 589030 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable inbound TLSv1.3 globally

https://gerrit.wikimedia.org/r/589030

Mentioned in SAL (#wikimedia-operations) [2020-04-16T10:44:08Z] <vgutierrez> rolling restart of ats-tls to enable TLSv1.3 globally and disable the old TLS session cache - T170567

Vgutierrez closed this task as Resolved. Apr 16 2020, 1:02 PM

TLSv1.3 is now available on both text and upload clusters :)

Gilles added a subscriber: Gilles. Apr 17 2020, 8:17 AM

> Mentioned in SAL (#wikimedia-operations) [2020-04-16T10:44:08Z] <vgutierrez> rolling restart of ats-tls to enable TLSv1.3 globally and disable the old TLS session cache - T170567

This, and/or the ATS version upgrade, appears to have caused an extra regression in response time:

https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&var-source=navtiming2&var-metric=responseStart&var-percentile=p50&from=1587009985039&to=1587111099058

Still a 6.5% regression on the week-over-week p75 22 hours after this deployment.
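
For reference, a week-over-week percentile comparison like the one quoted above can be sketched as follows. This is an illustration only: the function names and the sample values in the usage note are made up, and the Grafana dashboard may compute its percentiles differently.

```python
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) with inclusive interpolation."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]


def week_over_week_change(
    previous: Sequence[float], current: Sequence[float], pct: int = 75
) -> float:
    """Relative change of the chosen percentile, in percent.

    A positive value means the metric got slower (a regression).
    """
    before = percentile(previous, pct)
    after = percentile(current, pct)
    return (after - before) / before * 100.0
```

With hypothetical responseStart samples whose p75 moves from 400 ms to 426 ms, `week_over_week_change` returns roughly 6.5.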

That's weird, TLSv1.3 is famous for being faster than v1.2.

In theory, with features like 0-RTT resumption, of course, but that doesn't mean that implementation, configuration and the real world follow suit with the theory. I don't know if we've enabled 0-RTT in this deployment. If we didn't, then I don't know if there are other areas of 1.3 that are supposed to bring improvements. I imagine there might be new ciphers involved, which might actually perform worse on the client or on ATS.

If the regression is related to this TLS 1.3 rollout (one very easy way to find out is to undo it temporarily), it's probably not with the handshake part, as the "ssl" section of navigation timing doesn't seem to change significantly around the time of that deployment. It was already doing better week-over-week prior to April 16:

https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&refresh=5m&var-source=navtiming2&var-metric=ssl&var-percentile=p50

I've since found something interesting, which is that while the response start time (when the first bytes from the server arrive) is delayed, the response time (between first and last byte) has reduced at the same time:

https://grafana.wikimedia.org/d/000000143/navigation-timing?orgId=1&refresh=5m&var-source=navtiming2&var-metric=response&var-percentile=p50

This might suggest that TLS 1.3's nature leads the client to measure the first byte received later (maybe it actually received it on the wire at the same time), but this is partially offset by a faster transfer after that.

That being said, it's not enough to offset the regression completely, as seen through loadEventEnd later down the line, which regresses exactly at the same time:

To sum up, it seems like a regression overall, with some changes in the intermediary timeline that look like gains but aren't enough to offset the overall regression.

We haven't deployed 0-RTT at this time, but even without it, a full TLSv1.3 handshake requires one fewer RTT than a full TLSv1.2 handshake. Thanks for the detailed report @Gilles, I'll hunt this down ASAP.
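
To make the round-trip argument concrete, here is a simplified model, offered as a sketch rather than a measurement. It assumes one RTT for the TCP handshake, the textbook TLS handshake costs (2 RTTs for a full TLSv1.2 handshake, 1 for a full TLSv1.3 handshake, 0 for TLSv1.3 0-RTT resumption), and one RTT for the request and the first response byte. It ignores packet loss, TCP Fast Open, and CPU cost; the function name and parameters are hypothetical.

```python
# Round trips consumed by the TLS handshake before the first HTTP request
# can be sent. These are textbook values, not measurements.
TLS_HANDSHAKE_RTTS = {
    ("TLSv1.2", "full"): 2,
    ("TLSv1.3", "full"): 1,
    ("TLSv1.3", "0-rtt"): 0,
}


def time_to_first_byte_ms(
    rtt_ms: float,
    protocol: str,
    handshake: str = "full",
    server_ms: float = 0.0,
) -> float:
    """Estimate time to first byte on a fresh connection, in milliseconds."""
    tcp_rtts = 1  # SYN / SYN-ACK
    tls_rtts = TLS_HANDSHAKE_RTTS[(protocol, handshake)]
    request_rtts = 1  # request out, first response byte back
    return (tcp_rtts + tls_rtts + request_rtts) * rtt_ms + server_ms
```

Under this model, a client with a 100 ms RTT would see its first byte after 400 ms with TLSv1.2 and after 300 ms with TLSv1.3, which is why a slower responseStart after the rollout is surprising.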

> In theory, with features like 0-RTT resumption, of course, but that doesn't mean that implementation, configuration and the real world follow suit with the theory. I don't know if we've enabled 0-RTT in this deployment. If we didn't, then I don't know if there are other areas of 1.3 that are supposed to bring improvements. I imagine there might be new ciphers involved, which might actually perform worse on the client or on ATS.

No, even without 0-RTT, the handshakes are faster (twice as fast, or in general 100ms faster): https://kinsta.com/blog/tls-1-3/

ssingh added a subscriber: ssingh. Apr 17 2020, 3:56 PM
Krinkle reopened this task as Open. Apr 20 2020, 12:47 AM
Krinkle added a project: Wikimedia-Incident.
Krinkle added a subscriber: Krinkle.

Re-opening and tracking as an ongoing perf incident per the above. As @Gilles mentioned, it would help if we can at least isolate/validate the correlation by undoing this for a few hours (if that's feasible). If the correlation holds up, I would suggest we keep it rolled back for now, as this would otherwise make two major ongoing perf incidents, noting that we haven't fixed the other one yet (T238494), and there are also numerous non-perf-related incidents being worked on right now with higher priority.

ema added a subscriber: ema. Apr 20 2020, 2:23 PM

> Re-opening and tracking as an ongoing perf incident per the above. As @Gilles mentioned, it would help if we can at least isolate/validate the correlation by undoing this for a few hours (if that's feasible). If the correlation holds up, I would suggest we keep it rolled back for now, as this would otherwise make two major ongoing perf incidents, noting that we haven't fixed the other one yet (T238494), and there are also numerous non-perf-related incidents being worked on right now with higher priority.

We really need to have per-host performance metrics (T238086) to evaluate the impact of changes like this on a single host, rather than having to spend hours to do/undo things fleet-wide.

Change 593192 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Re-enable the session ID based cache

https://gerrit.wikimedia.org/r/593192

Change 593192 merged by Vgutierrez:
[operations/puppet@production] ATS: Re-enable the TLS session ID based cache

https://gerrit.wikimedia.org/r/593192

Mentioned in SAL (#wikimedia-operations) [2020-04-29T09:10:18Z] <vgutierrez> starting rolling restart of ats-tls to enable the TLS session ID based cache - T170567

(See also Navigation Timing metrics spec.)

request
(after dns+tcp+tls) delta from requestStart to responseStart.
response
delta from responseStart to responseEnd.
responseStart
overall time to first byte from the beginning (includes dns+tcp+tls+request)
domInteractive
overall time to end of parsed HTML response
(includes dns+tcp+tls+request+response+client-side HTML parsing)
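
As an illustration of how these metrics relate to the raw Navigation Timing attributes, the sketch below derives them from a dictionary of timestamps. It assumes that "request" is the delta from requestStart to responseStart and that "from the beginning" means navigationStart; the function name and the sample values are hypothetical and are not taken from the navtiming code.

```python
from typing import Dict


def navtiming_metrics(timing: Dict[str, float]) -> Dict[str, float]:
    """Derive the metrics described above from raw timestamps (in ms)."""
    start = timing["navigationStart"]
    return {
        # Time waiting for the first byte once the request has been sent.
        "request": timing["responseStart"] - timing["requestStart"],
        # Time between the first and the last byte of the response.
        "response": timing["responseEnd"] - timing["responseStart"],
        # Overall time to first byte (dns+tcp+tls+request).
        "responseStart": timing["responseStart"] - start,
        # Overall time until the HTML has been parsed.
        "domInteractive": timing["domInteractive"] - start,
    }


# Made-up sample, for illustration only.
sample = {
    "navigationStart": 0.0,
    "requestStart": 120.0,
    "responseStart": 300.0,
    "responseEnd": 340.0,
    "domInteractive": 700.0,
}
metrics = navtiming_metrics(sample)
```

Because "responseStart" and "response" are adjacent segments of the same timeline, a later first byte paired with a shorter transfer, as observed above, only partially cancels out in downstream metrics such as domInteractive.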