TLS handshake issues with ATS 8.0.5-1wm2
Closed, Resolved · Public

Description

After replacing nginx with ats-tls on cp5001, the host ran smoothly for 1 hour before the following Icinga alert was triggered:

PROBLEM - HTTPS Unified ECDSA on cp5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer

Further inspection on cp5001 showed TLS handshake errors when running openssl s_client -connect 127.0.0.1:443 -status

After depooling the host, I could not reproduce the issue with synthetic requests using curl.

traffic_server showed the following metrics regarding SSL errors:

proxy.process.ssl.ssl_error_want_write 1249276
proxy.process.ssl.ssl_error_want_read 8512210
proxy.process.ssl.ssl_error_want_x509_lookup 0
proxy.process.ssl.ssl_error_syscall 483415
proxy.process.ssl.ssl_error_read_eos 0
proxy.process.ssl.ssl_error_zero_return 57254
proxy.process.ssl.ssl_error_ssl 12721

Event Timeline

Vgutierrez triaged this task as Medium priority. Aug 27 2019, 3:40 AM
Vgutierrez created this task.
Vgutierrez moved this task from Triage to TLS on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2019-08-27T03:53:44Z] <vgutierrez> repooling cp5001 - T231262

Mentioned in SAL (#wikimedia-operations) [2019-08-27T03:59:13Z] <vgutierrez> depooling cp5001 - T231262

Further testing shows that the issue is apparently not related to OCSP stapling:

vgutierrez@cp5001:~$ openssl s_client -connect 127.0.0.1:443 < /dev/null
CONNECTED(00000003)
write:errno=104
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 176 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID:
    Session-ID-ctx:
    Master-Key:
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1566878252
    Timeout   : 7200 (sec)
    Verify return code: 0 (ok)
    Extended master secret: no
---
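The failure mode above (errno=104 with 0 bytes read) can also be probed from a script. The following is a minimal sketch, assuming only Python's standard `socket` and `ssl` modules; `classify_handshake_error` and `probe_tls` are hypothetical helper names, not part of ATS or the Icinga check. Certificate verification is disabled on the assumption that the probe targets loopback only.

```python
import errno
import socket
import ssl


def classify_handshake_error(exc):
    """Map an exception raised while connecting or handshaking to a label."""
    if isinstance(exc, ConnectionResetError):
        # errno 104 (ECONNRESET) on Linux, as seen in the s_client output
        return "reset"
    if isinstance(exc, ssl.SSLError):
        return "tls-error"
    if isinstance(exc, socket.timeout):
        return "timeout"
    return "other"


def probe_tls(host="127.0.0.1", port=443, timeout=5.0):
    """Attempt one TLS handshake and return "ok" or a failure label."""
    context = ssl.create_default_context()
    # Assumption: loopback probe, so the certificate is not verified.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw):
                return "ok"
    except OSError as exc:
        return classify_handshake_error(exc)


if __name__ == "__main__":
    print(probe_tls())
```

Running the probe in a loop would give a rough failure rate, which a single s_client call does not.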

Change 532508 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow logging specific debug tags to diags.log

https://gerrit.wikimedia.org/r/532508

Change 532513 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] hiera: Enable ssl.error and ssl-diag logging on cp5001

https://gerrit.wikimedia.org/r/532513

Mentioned in SAL (#wikimedia-traffic) [2019-08-27T07:04:45Z] <vgutierrez> repooling cp5001 - T231262

Mentioned in SAL (#wikimedia-traffic) [2019-08-27T07:07:31Z] <vgutierrez> depooling cp5001 - T231262

From a tcpdump capture, it looks like ATS is actually dropping connections, sending a RST right after acknowledging the Client Hello:

1007 153.027859    127.0.0.1 → 127.0.0.1    TCP 74 60211 → 443 [SYN] Seq=0 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=3584220 TSecr=0 WS=512
1008 153.027874    127.0.0.1 → 127.0.0.1    TCP 74 443 → 60211 [SYN, ACK] Seq=0 Ack=1 Win=43690 Len=0 MSS=65495 SACK_PERM=1 TSval=3584220 TSecr=3584220 WS=512
1009 153.027881    127.0.0.1 → 127.0.0.1    TCP 66 60211 → 443 [ACK] Seq=1 Ack=1 Win=44032 Len=0 TSval=3584220 TSecr=3584220
1010 153.027983    127.0.0.1 → 127.0.0.1    TLSv1 242 Client Hello
1011 153.027995    127.0.0.1 → 127.0.0.1    TCP 66 443 → 60211 [ACK] Seq=1 Ack=177 Win=45056 Len=0 TSval=3584220 TSecr=3584220
1012 153.028014    127.0.0.1 → 127.0.0.1    TCP 66 443 → 60211 [RST, ACK] Seq=1 Ack=177 Win=45056 Len=0 TSval=3584220 TSecr=3584220
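For longer captures, the server-side resets can be picked out of the text output mechanically. This is a rough sketch that assumes the one-line-per-packet layout shown above; `server_resets` and the regular expression are illustrative names, and other capture formats would need a different pattern.

```python
import re

# Matches e.g. "TCP 66 443 → 60211 [RST, ACK]" and captures
# source port, destination port and the flag list.
PACKET = re.compile(r"\bTCP\s+\d+\s+(\d+)\s+\S+\s+(\d+)\s+\[([^\]]*)\]")


def server_resets(lines, server_port=443):
    """Return the client ports that received a RST from server_port."""
    ports = []
    for line in lines:
        match = PACKET.search(line)
        if match is None:
            continue  # not a plain TCP summary line (e.g. the Client Hello)
        src, dst = int(match.group(1)), int(match.group(2))
        flags = {flag.strip() for flag in match.group(3).split(",")}
        if src == server_port and "RST" in flags:
            ports.append(dst)
    return ports


if __name__ == "__main__":
    import sys
    # Usage: pipe the capture's text output into this script.
    print(server_resets(sys.stdin))
```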

Further analysis of ats-tls metrics shows that connections were actually being dropped without being logged:

vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/srv/trafficserver/tls metric match throttle
proxy.process.http.throttled_proxy_only 0
proxy.process.http.origin_connections_throttled_out 0
proxy.process.net.connections_throttled_in 313323
proxy.process.net.connections_throttled_out 0
vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/srv/trafficserver/tls config get proxy.config.net.connections_throttle
proxy.config.net.connections_throttle: 30000
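Since these drops were not logged, the throttle counter is the only signal, so it may be worth checking automatically. The sketch below parses text in the format printed by the two traffic_ctl commands above and flags inbound throttling; `parse_metrics` and `inbound_throttled` are hypothetical helpers, and treating any non-zero counter as a problem is an assumption, not an ATS recommendation.

```python
def parse_metrics(text):
    """Parse 'name value' and 'name: value' lines into a dict of ints."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        # Shell prompts and other lines do not have exactly two fields.
        if len(parts) == 2 and parts[1].isdigit():
            metrics[parts[0].rstrip(":")] = int(parts[1])
    return metrics


def inbound_throttled(metrics):
    """True if ATS has throttled at least one inbound connection."""
    return metrics.get("proxy.process.net.connections_throttled_in", 0) > 0


if __name__ == "__main__":
    import sys
    # Usage: pipe the traffic_ctl output into this script.
    stats = parse_metrics(sys.stdin.read())
    print("inbound throttling seen:", inbound_throttled(stats))
```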

Change 532513 abandoned by Vgutierrez:
hiera: Enable ssl debugging for ATS on cp5001

https://gerrit.wikimedia.org/r/532513

Change 532508 abandoned by Vgutierrez:
ATS: Allow logging specific debug tags to diags.log for localhost requests

https://gerrit.wikimedia.org/r/532508

Change 532555 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Allow configure connections_throttle

https://gerrit.wikimedia.org/r/532555

Change 532555 merged by Vgutierrez:
[operations/puppet@production] ATS: Allow configuring connections_throttle

https://gerrit.wikimedia.org/r/532555

Mentioned in SAL (#wikimedia-operations) [2019-08-27T08:36:37Z] <vgutierrez> repooling cp5001 - T231262

Vgutierrez closed this task as Resolved. Aug 27 2019, 8:53 AM

After disabling connection throttling, cp5001 behaves as expected and no longer drops connections:

vgutierrez@cp5001:~$ sudo -i traffic_ctl --run-root=/srv/trafficserver/tls metric match throttle
proxy.process.http.throttled_proxy_only 0
proxy.process.http.origin_connections_throttled_out 0
proxy.process.net.connections_throttled_in 0
proxy.process.net.connections_throttled_out 0