ats-tls ran out of FDs on cp1089
Closed, ResolvedPublic

Assigned To
Authored By
Vgutierrez
Mar 28 2020, 8:40 AM
Referenced Files
F31707693: image.png
Mar 28 2020, 8:54 AM
F31707691: image.png
Mar 28 2020, 8:54 AM
F31707674: image.png
Mar 28 2020, 8:43 AM

Description

It looks like cp1089 ran out of FDs:

Mar 27 23:00:14 cp1089 traffic_manager[27021]: [Mar 27 23:00:14.216] [LOG_FLUSH] ERROR: Error opening logging directory /srv/trafficserver/tls/var/log to perform a space check: Too many open files.

and on the next reload to refresh OCSP data, ats-tls showed errors initializing the SSL context, accessing host.db, and reloading the tls.lua script:

Mar 27 23:35:53 cp1089 traffic_manager[27021]: [Mar 27 17:50:07.069] [ET_TASK 1] NOTE: ssl_multicert.config done reloading!
Mar 27 23:35:53 cp1089 traffic_manager[27021]: [Mar 27 22:56:32.353] [ACCEPT 0:443] WARNING: accept thread received transient error: errno = 24
Mar 27 23:35:53 cp1089 traffic_manager[27021]: [Mar 27 23:05:52.570] [ET_TASK 0] WARNING: Unable to finalize sync of cache to disk /srv/trafficserver/tls/var/run/host.db: Bad file descriptor
[...]
Mar 28 05:50:03 cp1089 traffic_manager[27021]: [Mar 28 05:50:03.226] [ET_TASK 1] ERROR: [ts_lua][ts_lua_reload_module] luaL_loadfile /srv/trafficserver/tls/etc/lua/tls.lua failed: cannot open /srv/trafficserver/tls/etc/lua/tls.lua: Too many open files
[...]
Mar 28 06:57:53 cp1089 traffic_manager[27021]: [Mar 28 06:57:52.916] [ET_NET 7] ERROR: failed to create SSL server session
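
For anyone checking other hosts for the same symptom, here is a minimal sketch in Python of how FD usage can be inspected on Linux. It decodes the `errno = 24` from the accept thread warning above (EMFILE, the same "Too many open files" condition), then counts a process's open descriptors by type via `/proc/<pid>/fd` and compares the total with the soft limit in `/proc/<pid>/limits`. The function names are hypothetical helpers, not part of ATS, and the script assumes a Linux `/proc` filesystem and permission to read the target process's entries.

```python
import errno
import os
import re
import sys
from collections import Counter


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value."""
    return f"{errno.errorcode.get(code, 'UNKNOWN')}: {os.strerror(code)}"


def parse_soft_nofile(limits_text: str):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text."""
    match = re.search(r"^Max open files\s+(\S+)\s+(\S+)", limits_text, re.MULTILINE)
    if match is None:
        return None
    soft = match.group(1)
    return None if soft == "unlimited" else int(soft)


def classify_fd_target(target: str) -> str:
    """Bucket a /proc/<pid>/fd symlink target into a coarse category."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def fd_usage(pid: int):
    """Count open FDs of a process by category (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The FD was closed between listdir() and readlink().
            continue
        counts[classify_fd_target(target)] += 1
    with open(f"/proc/{pid}/limits") as fh:
        soft = parse_soft_nofile(fh.read())
    return counts, soft


if __name__ == "__main__":
    # Inspect the given PID, or this script's own process if none is given.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(describe_errno(24))
    counts, soft = fd_usage(pid)
    total = sum(counts.values())
    print(f"pid {pid}: {total} open FDs (soft limit {soft}): {dict(counts)}")
```

Run it with the PID of the process to inspect as the only argument. A "socket" count that keeps growing towards the soft limit would be consistent with a socket FD leak, although this sketch alone does not show what is holding the sockets open.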

Event Timeline

Vgutierrez triaged this task as Medium priority. Mar 28 2020, 8:54 AM
Vgutierrez moved this task from Backlog to TLS on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2020-03-28T11:32:31Z] <vgutierrez> restart ats-tls on cp1077 - T248736

Change 584110 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "ATS: disable transaction_active_timeout_in for EventStreams"

https://gerrit.wikimedia.org/r/584110

Mentioned in SAL (#wikimedia-operations) [2020-03-28T12:05:29Z] <vgutierrez> preemptive restart of ats-tls on cp1081 and cp3062 - T248736

Change 584110 merged by Vgutierrez:
[operations/puppet@production] Revert "ATS: disable transaction_active_timeout_in for EventStreams"

https://gerrit.wikimedia.org/r/584110

Would love to get this fixed ASAP! Let me know if I can help!

Is there a more permanent fix? Any idea why ATS was leaking the socket FDs?

Change 593517 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disabling transaction_active_timeout_in for test eventstreams

https://gerrit.wikimedia.org/r/593517

Is there a more permanent fix? Any idea why ATS was leaking the socket FDs?

Nope, we'll try to reproduce next week in isolation with https://gerrit.wikimedia.org/r/593517

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

Vgutierrez claimed this task.

I believe that we can safely close this one now, as we have moved away from ats-tls.