As part of T255015 we merged the following patch yesterday, 2020-06-10 at 13:35 UTC: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/604305/
The change boils down to removing a plugin (`healthchecks.so`) from `/etc/trafficserver/plugin.config` and adding some Lua code to `do_global_read_request` that implements the same functionality. Adding or removing global plugins is an operation that, in theory, requires a trafficserver restart to take effect. On every node, as soon as puppet finished running (which changed the files and triggered a config reload, but of course did **not** restart the process), memory usage for `trafficserver.service` unexpectedly started [[ https://grafana.wikimedia.org/explore?orgId=1&left=%5B%221591795113351%22,%221591802309457%22,%22esams%20prometheus%2Fops%22,%7B%22expr%22:%22container_memory_usage_bytes%7Bid%3D%5C%22%2Fsystem.slice%2Ftrafficserver.service%5C%22%7D%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D| ramping up ]]:
{F31861256}
Here is what happened on cp3052 as an example; the timeline was similar on other nodes:
```
Jun 10 13:55:20 cp3052 systemd[1]: Reloaded Apache Traffic Server is a fast, scalable and extensible caching proxy server..
```
About 16 minutes after the reload, the leak resulted in `traffic_manager` (the supervisor process responsible, among other things, for starting `traffic_server` if needed) crashing due to lack of memory:
```
Jun 10 14:11:11 cp3052 traffic_manager[23642]: Fatal: couldn't allocate 1048576 bytes at alignment 4096 - insufficient memory
```
At the same time, the process responsible for serving traffic (`traffic_server`) also crashed. It was not restarted by `traffic_manager`, which was dead too, as explained above.
```
Jun 10 14:11:11 cp3052 traffic_server[23680]: Fatal: couldn't allocate 1048576 bytes at alignment 4096 - insufficient memory
```
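As a sanity check of the cp3052 timeline, the three journal lines quoted above can be parsed to get the interval between the reload and the two crashes. This is a minimal Python sketch, not part of the original investigation: `parse_line` is a hypothetical helper, and the year is assumed to be 2020 (taken from the merge date, since short journal timestamps omit it).

```python
from datetime import datetime

# Assumption: short journal timestamps carry no year; 2020 comes from the merge date.
YEAR = 2020

# Journal lines copied verbatim from the cp3052 excerpts above.
LINES = [
    "Jun 10 13:55:20 cp3052 systemd[1]: Reloaded Apache Traffic Server is a fast, "
    "scalable and extensible caching proxy server..",
    "Jun 10 14:11:11 cp3052 traffic_manager[23642]: Fatal: couldn't allocate "
    "1048576 bytes at alignment 4096 - insufficient memory",
    "Jun 10 14:11:11 cp3052 traffic_server[23680]: Fatal: couldn't allocate "
    "1048576 bytes at alignment 4096 - insufficient memory",
]


def parse_line(line):
    """Split a short-format journal line into (timestamp, host, unit, message)."""
    month, day, clock, host, rest = line.split(" ", 4)
    unit, message = rest.split(": ", 1)
    stamp = datetime.strptime(f"{YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    # Drop the "[pid]" suffix so the unit name can be used as a key.
    return stamp, host, unit.split("[")[0], message


events = [parse_line(line) for line in LINES]

# Time of the config reload reported by systemd.
reload_time = next(
    stamp for stamp, _, unit, msg in events
    if unit == "systemd" and msg.startswith("Reloaded")
)

# Interval between the reload and each fatal allocation failure.
crashes = {
    unit: stamp - reload_time
    for stamp, _, unit, msg in events
    if msg.startswith("Fatal:")
}

for unit, delta in crashes.items():
    print(f"{unit}: crashed {int(delta.total_seconds())} s after reload")
```

For the lines quoted here this gives 951 seconds (15 min 51 s) for both `traffic_manager` and `traffic_server`.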
Part of this change, namely removing `healthchecks.so` from `plugin.config` (but not the Lua code change), was applied to `trafficserver-tls.service` too. No memory leak or any other problem occurred in the case of `trafficserver-tls`. An important distinction between `trafficserver.service` and `trafficserver-tls.service` when it comes to healthchecks is that the former handles thousands of requests per second from varnish-fe (so many due to varnish bug T236754), while the latter gets zero.