ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled
Closed, Resolved, Public

Description

As spotted by @BBlack, cp3030 showed 34k ESTABLISHED sockets on port 443 (ats-tls) even though the server had been depooled for a few hours.

Comparing cp5007 with the rest of the text nodes (running nginx) shows a huge difference in socket counts as well: Grafana reports ~100k sockets in use for cp5007 and ~37k for the rest of the servers.
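For anyone wanting to reproduce the ESTABLISHED count on a host without Grafana, here is a minimal sketch that counts ESTABLISHED sockets on local port 443 from text in the Linux `/proc/net/tcp` layout. The function name `count_established` and the `SAMPLE` data are hypothetical and only for illustration; this is not the tooling that was used on the cp hosts.

```python
# Sketch: count ESTABLISHED sockets on a local port from /proc/net/tcp-style text.
# In /proc/net/tcp the "st" column is a hex state code; 01 means ESTABLISHED.
TCP_ESTABLISHED = "01"

# Hypothetical sample in the /proc/net/tcp layout (not real data from cp3030).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:01BB 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1000
   1: 0A000001:01BB 0A000002:C350 01 00000000:00000000 00:00000000 00000000     0        0 1001
   2: 0A000001:01BB 0A000003:C351 01 00000000:00000000 00:00000000 00000000     0        0 1002
   3: 0A000001:0050 0A000004:C352 01 00000000:00000000 00:00000000 00000000     0        0 1003
   4: 0A000001:01BB 0A000005:C353 06 00000000:00000000 00:00000000 00000000     0        0 1004
"""


def count_established(proc_net_tcp_text: str, local_port: int = 443) -> int:
    """Return the number of ESTABLISHED sockets bound to local_port."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        local_address, state = fields[1], fields[3]
        # local_address is "<hex ip>:<hex port>"; the same split works for tcp6.
        port = int(local_address.rsplit(":", 1)[1], 16)
        if port == local_port and state == TCP_ESTABLISHED:
            count += 1
    return count


print(count_established(SAMPLE))
```

On a real host the same function could be fed the contents of `/proc/net/tcp` and `/proc/net/tcp6` and the two results summed, since ats-tls may accept connections over both address families.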

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-10-25T05:03:15Z] <vgutierrez> Applying a SSL handshake timeout of 60 secs on ats-tls/cp5007 - T236458

I'm tracking used TCP sockets on the eqsin text nodes in https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=now-1h&to=now. I manually applied an SSL handshake timeout on cp5007 at 05:03 UTC.

Vgutierrez triaged this task as Medium priority. Oct 25 2019, 5:31 AM
Vgutierrez moved this task from Backlog to TLS on the Traffic board.

Change 546014 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable the SSL handshake timeout and set it to 60 seconds

https://gerrit.wikimedia.org/r/546014

Change 546014 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable the SSL handshake timeout and set it to 60 seconds

https://gerrit.wikimedia.org/r/546014

Mentioned in SAL (#wikimedia-operations) [2019-10-25T08:02:22Z] <vgutierrez> rolling restart of ats-tls to introduce a SSL handshake timeout of 60 secs - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-29T08:06:05Z] <vgutierrez> restarting ats-tls on cp5007 with TCP FASTOPEN disabled - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-29T09:27:27Z] <vgutierrez> restart ats-tls on cp5007 disabling TCP SO_LINGER - T236458

Change 546957 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Track total client connections for HTTP/2 clients

https://gerrit.wikimedia.org/r/546957

Change 546957 merged by Vgutierrez:
[operations/puppet@production] ATS: Track total client connections for HTTP/2 clients

https://gerrit.wikimedia.org/r/546957

Mentioned in SAL (#wikimedia-operations) [2019-10-29T15:25:45Z] <vgutierrez> restarting ats-tls on cp5007 with a default inactivity timeout of 5 minutes and half open disabled - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-30T02:40:04Z] <vgutierrez> restarting ats-tls on cp3050 with half open disabled - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-30T03:09:06Z] <vgutierrez> Rolling restart of prometheus-exporter-trafficserver-tls - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-30T04:24:52Z] <vgutierrez> restarting ats-tls on cp4027 with half open disabled - T236458

Change 547073 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Set the default activity timeout to 300 seconds on ats-tls

https://gerrit.wikimedia.org/r/547073

Change 547073 merged by Vgutierrez:
[operations/puppet@production] ATS: Set the default activity timeout to 300 seconds on ats-tls

https://gerrit.wikimedia.org/r/547073

Mentioned in SAL (#wikimedia-operations) [2019-10-30T05:58:10Z] <vgutierrez> Rolling restart of ats-tls to get rid of leaked sockets and benefit from the lower inactivity timeout - T236458

Mentioned in SAL (#wikimedia-operations) [2020-02-13T15:38:31Z] <vgutierrez> disable allow_half_open on ats-tls @ cp4031 - T236458

Change 572016 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Puppetize allow_half_open HTTP setting

https://gerrit.wikimedia.org/r/572016

Change 572017 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable allow_half_open in cp4031

https://gerrit.wikimedia.org/r/572017

Change 572016 merged by Vgutierrez:
[operations/puppet@production] ATS: Puppetize allow_half_open HTTP setting

https://gerrit.wikimedia.org/r/572016

Change 572017 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable allow_half_open in cp4031

https://gerrit.wikimedia.org/r/572017

Change 572149 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: Disable allow_half_open in cp4031"

https://gerrit.wikimedia.org/r/572149

Change 572149 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: Disable allow_half_open in cp4031"

https://gerrit.wikimedia.org/r/572149