I spent some time reviewing our logstash JVMs and noticed that each time we see lag on the kafka-logging topic, we also see a corresponding spike in logstash JVM GC time.
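(For anyone who wants to check this themselves: heap and GC stats are exposed by the logstash node stats API. A quick sketch, assuming the default API port 9600 and that jq is installed on the host:)

```
# Sample logstash JVM heap usage and cumulative GC collector stats
curl -s localhost:9600/_node/stats/jvm | jq '.jvm.mem.heap_used_percent, .jvm.gc.collectors'
```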
Currently we're running each logstash with a 1GB JVM heap, and the underlying hosts are 8GB Ganeti VMs. However, the logstash 7 and 8 performance tuning docs suggest sizing the logstash heap "no less than 4GB and no more than 8GB", so I think it's worth revisiting our JVM sizing. In theory, with larger heaps we should see less severe GC under the loads that cause lag today.
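For concreteness, bumping the heap would be a small change to logstash's jvm.options; a minimal sketch, assuming the default packaging layout and the 4GB lower bound from the docs (not a tested value):

```
# /etc/logstash/jvm.options (sketch)
# Set min and max heap to the same value to avoid heap resize pauses
-Xms4g
-Xmx4g
```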
In addition to the logstash JVM, these logstash hosts also run a 4GB JVM for opensearch.
Overall I think we should tune logstash to better cope with logging spikes, so that we can do more rate limiting in logstash itself and absorb spikes without incurring lag.
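As a rough illustration of the kind of rate limiting I mean, something like the throttle filter could tag and drop events from a single noisy producer during a spike. The key, thresholds, and period below are placeholders, not a concrete proposal:

```
filter {
  # Sketch: tag events from any single host exceeding ~1000 events per 60s window
  throttle {
    key         => "%{host}"
    after_count => 1000
    period      => "60"
    max_age     => 120
    add_tag     => ["throttled"]
  }
  # ...and drop the excess so one noisy producer can't back up the whole pipeline
  if "throttled" in [tags] {
    drop { }
  }
}
```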
Re: next steps, a few ideas/options come to mind (not mutually exclusive):
- Upgrade the underlying logstash collector hosts to something like 12GB, and increase the logstash JVM to 4GB
- Move away from colocating opensearch on the logstash collector hosts, freeing up 4GB per VM. Increase the logstash JVM to 4GB
- Provision more logstash collector nodes / scale out (see the kafka input sketch after this list)
- Explore dedicated bare metal collector hosts as the next logical step for a greater CPU/RAM commitment (logstash hosts at 12GB+ would be a large consumer of Ganeti resources, and AIUI they are already slow to live migrate due to their size/utilization)
- Explore splitting logstash off and hosting it in k8s
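Re: scaling out (and as context for the k8s option): since all collectors consume kafka-logging as one consumer group, adding nodes or consumer threads just spreads topic partitions across more consumers, with total parallelism capped by the partition count. A minimal sketch of the relevant kafka input config; topic pattern, group name, and thread count are assumptions, not our actual pipeline values:

```
input {
  kafka {
    # All collectors share a consumer group, so partitions are balanced
    # across whatever nodes/threads exist at any given time
    topics_pattern   => "kafka-logging.*"      # placeholder pattern
    group_id         => "logstash-collectors"  # placeholder group name
    consumer_threads => 4                      # per-node parallelism
  }
}
```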