Page MenuHomePhabricator

ATS memory leak upon removing healthchecks.so from configuration
Closed, DeclinedPublic

Description

As part of T255015 we merged the following patch yesterday, 2020-06-10 at 13:35 UTC: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/

The change boils down to removing a plugin (healthchecks.so) from /etc/trafficserver/plugin.config and adding some code to do_global_read_request to implement the same functionality offered by healthchecks.so in Lua. Adding/removing global plugins is an operation that in theory requires a trafficserver restart to take effect. On every node, as soon as puppet finished running (and thus triggering a config reload after changing the files - but of course not restarting the process) memory usage for trafficserver.service unexpectedly started ramping up:

Screenshot from 2020-06-11 12-22-27.png (1×2 px, 486 KB)

Here is what happened on cp3052 as an example, similar timeline on other nodes:

Jun 10 13:55:20 cp3052 systemd[1]: Reloaded Apache Traffic Server is a fast, scalable and extensible caching proxy server..

After some minutes the leak resulted in traffic_manager (the supervisor process responsible among other things for starting traffic_server if needed) crashing due to lack of memory.

Jun 10 14:11:11 cp3052 traffic_manager[23642]: Fatal: couldn't allocate 1048576 bytes at alignment 4096 - insufficient memory

At the same time, the process responsible for serving traffic (traffic_server) also crashed and wasn't restarted by traffic_manager which was dead too as explained above.

Jun 10 14:11:11 cp3052 traffic_server[23680]: Fatal: couldn't allocate 1048576 bytes at alignment 4096 - insufficient memory

Part of this change, namely removing healthchecks.so from plugin.config (but not the Lua code change) was applied to trafficserver-tls.service too. No memory leak, or any other sort of problem, happened in the case of trafficserver-tls. An important distinction when it comes to healthchecks between trafficserver.service and trafficserver-tls.service is that the former handles thousands of requests per second from varnish-fe (so many due to varnish bug T236754), while the latter gets zero.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.Jun 11 2020, 10:23 AM
ema updated the task description. (Show Details)
ema moved this task from Backlog to Caching on the Traffic board.
BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

BCornwall subscribed.

Is this a ticket that we'd want any action on? Removal of healthcheck.so is unlikely to ever happen again (we're not dynamically loading/unloading the .so with a service reload!); I don't see this being relevant to WMF. I could certainly see it being valuable as an upstream bug report but this ticket is so old that we'd have to spend time testing the latest version. This embodies a relic of an awkward migration from 2020.