ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled
Closed, Resolved (Public)

Description

As spotted by @BBlack, cp3030 showed 34k ESTABLISHED sockets on port 443 (ats-tls) even after the server had been depooled for a few hours.

Comparing cp5007 with the rest of the text nodes (running nginx) shows a huge difference in socket counts as well: Grafana reports ~100k sockets in use for cp5007 and ~37k for the other servers.
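
For anyone wanting to reproduce the per-host count outside Grafana, here is a minimal sketch that counts ESTABLISHED sockets on a local port by parsing text in the Linux /proc/net/tcp format. The function name `count_sockets` and the `SAMPLE` data are hypothetical and made up for illustration; this is not the tooling used in this task.

```python
# Sketch only: count sockets in a given TCP state on a given local port,
# from text in the Linux /proc/net/tcp (or /proc/net/tcp6) format.
# `count_sockets` and `SAMPLE` are illustrative names, not part of any real tool.

TCP_ESTABLISHED = 0x01  # kernel state code for ESTABLISHED in /proc/net/tcp


def count_sockets(proc_net_tcp_text, local_port=443, state=TCP_ESTABLISHED):
    """Return the number of sockets on `local_port` that are in `state`."""
    count = 0
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with an index such as "0:"; skip the header and blanks.
        if len(fields) < 4 or not fields[0].rstrip(":").isdigit():
            continue
        # local_address is HEXIP:HEXPORT; the state column is hexadecimal too.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        sock_state = int(fields[3], 16)
        if port == local_port and sock_state == state:
            count += 1
    return count


# Fabricated sample rows (0x01BB == 443, 0x0050 == 80).
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100007F:01BB 0200007F:C350 01
   1: 0100007F:01BB 0300007F:C351 01
   2: 0100007F:01BB 0400007F:C352 06
   3: 0100007F:0050 0500007F:C353 01
   4: 00000000:01BB 00000000:0000 0A
"""
```

On a Linux host, the same function can be fed the contents of /proc/net/tcp and /proc/net/tcp6, with the two results summed to cover both IPv4 and IPv6 clients.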

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-10-25T05:03:15Z] <vgutierrez> Applying a SSL handshake timeout of 60 secs on ats-tls/cp5007 - T236458

I'm tracking used TCP sockets on eqsin text nodes in https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=now-1h&to=now. I manually applied an SSL handshake timeout on cp5007 at 05:03 UTC.

ayounsi removed a subscriber: ayounsi. Fri, Oct 25, 5:16 AM
Vgutierrez triaged this task as Normal priority. Fri, Oct 25, 5:31 AM
Vgutierrez moved this task from Triage to TLS on the Traffic board.

Change 546014 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable the SSL handshake timeout and set it to 60 seconds

https://gerrit.wikimedia.org/r/546014

Change 546014 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable the SSL handshake timeout and set it to 60 seconds

https://gerrit.wikimedia.org/r/546014

Mentioned in SAL (#wikimedia-operations) [2019-10-25T08:02:22Z] <vgutierrez> rolling restart of ats-tls to introduce a SSL handshake timeout of 60 secs - T236458

Vgutierrez closed this task as Resolved. Fri, Oct 25, 9:46 AM
Vgutierrez reopened this task as Open. Tue, Oct 29, 7:44 AM

Reopening because the issue hasn't been solved, as can be seen here: https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=1571989146430&to=now

Mentioned in SAL (#wikimedia-operations) [2019-10-29T08:06:05Z] <vgutierrez> restarting ats-tls on cp5007 with TCP FASTOPEN disabled - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-29T09:27:27Z] <vgutierrez> restart ats-tls on cp5007 disabling TCP SO_LINGER - T236458

Change 546957 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Track total client connections for HTTP/2 clients

https://gerrit.wikimedia.org/r/546957

Change 546957 merged by Vgutierrez:
[operations/puppet@production] ATS: Track total client connections for HTTP/2 clients

https://gerrit.wikimedia.org/r/546957

Mentioned in SAL (#wikimedia-operations) [2019-10-29T15:25:45Z] <vgutierrez> restarting ats-tls on cp5007 with a default inactivity timeout of 5 minutes and half open disabled - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-30T02:40:04Z] <vgutierrez> restarting ats-tls on cp3050 with half open disabled - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-30T03:09:06Z] <vgutierrez> Rolling restart of prometheus-exporter-trafficserver-tls - T236458

Mentioned in SAL (#wikimedia-operations) [2019-10-30T04:24:52Z] <vgutierrez> restarting ats-tls on cp4027 with half open disabled - T236458

Change 547073 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Set the default activity timeout to 300 seconds on ats-tls

https://gerrit.wikimedia.org/r/547073

Change 547073 merged by Vgutierrez:
[operations/puppet@production] ATS: Set the default activity timeout to 300 seconds on ats-tls

https://gerrit.wikimedia.org/r/547073

Mentioned in SAL (#wikimedia-operations) [2019-10-30T05:58:10Z] <vgutierrez> Rolling restart of ats-tls to get rid of leaked sockets and benefit from the lower inactivity timeout - T236458

Vgutierrez closed this task as Resolved. Thu, Oct 31, 1:35 AM

This has been successfully mitigated by commit 9002f6dfc959fccf527b7c7a3778947496858695.