I spent some time reviewing our logstash JVMs and noticed that each time we see lag on the kafka-logging topic, we also see a corresponding spike in logstash JVM GC time.
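(For anyone who wants to check this themselves: heap and GC stats are exposed by the logstash node stats API. A quick sketch, assuming the default API port 9600 and that jq is installed on the host:)

```
# Sample logstash JVM heap usage and cumulative GC collector stats
curl -s localhost:9600/_node/stats/jvm | jq '.jvm.mem.heap_used_percent, .jvm.gc.collectors'
```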
Currently we're running each logstash with a 1GB JVM heap, and the underlying hosts are 8GB Ganeti VMs. However, the logstash 7 and 8 performance tuning docs suggest sizing the logstash heap "no less than 4GB and no more than 8GB", so I think it's worth revisiting our JVM sizing. In theory, with larger heaps we should see less severe GC under the loads that cause lag today.
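For concreteness, bumping the heap would be a small change to logstash's jvm.options; a minimal sketch, assuming the default packaging layout and the 4GB lower bound from the docs (not a tested value):

```
# /etc/logstash/jvm.options (sketch)
# Set min and max heap to the same value to avoid heap resize pauses
-Xms4g
-Xmx4g
```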
In addition to the logstash JVM, these logstash hosts also run a 4GB JVM for opensearch.
Overall I think we should tune logstash to better cope with logging spikes, so that we can do more rate limiting in logstash itself and absorb spikes without incurring lag.
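As a rough illustration of the kind of rate limiting I mean, something like the throttle filter could tag and drop events from a single noisy producer during a spike. The key, thresholds, and period below are placeholders, not a concrete proposal:

```
filter {
  # Sketch: tag events from any single host exceeding ~1000 events per 60s window
  throttle {
    key         => "%{host}"
    after_count => 1000
    period      => "60"
    max_age     => 120
    add_tag     => ["throttled"]
  }
  # ...and drop the excess so one noisy producer can't back up the whole pipeline
  if "throttled" in [tags] {
    drop { }
  }
}
```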
Re: next steps, a few ideas/options come to mind (not mutually exclusive):
- Upgrade the underlying logstash collector hosts to something like 12GB, and increase the logstash JVM to 4GB
- Move away from colocating opensearch on the logstash collector hosts, freeing up 4GB per VM. Increase the logstash JVM to 4GB
- Provision more logstash collector nodes / scale out (see the kafka input sketch after this list)
- Explore dedicated bare metal collector hosts as the next logical step for a greater CPU/RAM commitment (logstash hosts at 12GB+ would be a large consumer of Ganeti resources, and AIUI they are already slow to live migrate due to their size/utilization)
- Explore splitting logstash off and hosting it in k8s
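Re: scaling out (and as context for the k8s option): since all collectors consume kafka-logging as one consumer group, adding nodes or consumer threads just spreads topic partitions across more consumers, with total parallelism capped by the partition count. A minimal sketch of the relevant kafka input config; topic pattern, group name, and thread count are assumptions, not our actual pipeline values:

```
input {
  kafka {
    # All collectors share a consumer group, so partitions are balanced
    # across whatever nodes/threads exist at any given time
    topics_pattern   => "kafka-logging.*"      # placeholder pattern
    group_id         => "logstash-collectors"  # placeholder group name
    consumer_threads => 4                      # per-node parallelism
  }
}
```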