As part of T255015 we merged the following patch yesterday, 2020-06-10 at 13:35 UTC: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/
The change boils down to removing a plugin (healthchecks.so) from /etc/trafficserver/plugin.config and adding some code to do_global_read_request to implement the same functionality offered by healthchecks.so in Lua. Adding/removing global plugins is an operation that in theory requires a trafficserver restart to take effect. On every node, as soon as puppet finished running (and thus triggering a config reload after changing the files - but of course not restarting the process) memory usage for trafficserver.service unexpectedly started ramping up:
Here is what happened on cp3052 as an example, similar timeline on other nodes:
Jun 10 13:55:20 cp3052 systemd[1]: Reloaded Apache Traffic Server is a fast, scalable and extensible caching proxy server..
After some minutes the leak resulted in traffic_manager (the supervisor process responsible among other things for starting traffic_server if needed) crashing due to lack of memory.
Jun 10 14:11:11 cp3052 traffic_manager[23642]: Fatal: couldn't allocate 1048576 bytes at alignment 4096 - insufficient memory
At the same time, the process responsible for serving traffic (traffic_server) also crashed and wasn't restarted by traffic_manager which was dead too as explained above.
Jun 10 14:11:11 cp3052 traffic_server[23680]: Fatal: couldn't allocate 1048576 bytes at alignment 4096 - insufficient memory
Part of this change, namely removing healthchecks.so from plugin.config (but not the Lua code change) was applied to trafficserver-tls.service too. No memory leak, or any other sort of problem, happened in the case of trafficserver-tls. An important distinction when it comes to healthchecks between trafficserver.service and trafficserver-tls.service is that the former handles thousands of requests per second from varnish-fe (so many due to varnish bug T236754), while the latter gets zero.