
Increase logstash collector heap size
Closed, Resolved · Public

Description

I spent some time reviewing our logstash JVMs and noticed that each time we see kafka-logging topic lag, we also see a spike in logstash JVM GC time.

Screen Shot 2023-11-02 at 2.54.54 PM.png (440 KB)

At present we're running each logstash with a 1G JVM, and the underlying hosts are 8GB Ganeti VMs. However, the logstash 7 and 8 performance tuning docs suggest sizing the logstash JVM "no less than 4GB and no more than 8GB", so I think it's worth revisiting our JVM sizing. In theory, with larger JVMs we should see less severe GC under the loads that cause lag today.
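For reference, the heap sizing under discussion is the Xms/Xmx pair in logstash's jvm.options; a minimal sketch of the 4GB target (file path assumed from the package defaults, not our puppetized config):

  # /etc/logstash/jvm.options (path assumed)
  # Set initial and max heap to the same value to avoid resize pauses
  -Xms4g
  -Xmx4g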

In addition to the logstash JVM, these logstash hosts also run a 4GB JVM for opensearch.

Overall I think we should tune logstash to better cope with logging spikes, so that we can do more rate limiting in logstash itself and handle spikes without incurring lag.
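As a sketch of what in-logstash rate limiting could look like (the key, thresholds, and placement below are illustrative assumptions, not our current pipeline config), the throttle filter can tag events beyond a threshold so they can be dropped:

  filter {
    throttle {
      key         => "%{program}"   # grouping key is an assumption; could be host, channel, etc.
      after_count => 1000           # tag events beyond 1000 per period
      period      => 60             # seconds
      max_age     => 120
      add_tag     => "throttled"
    }
    if "throttled" in [tags] {
      drop { }                      # or route to a sampled output instead of dropping
    }
  }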

Re: next steps, a few ideas/options come to mind (not mutually exclusive):

  1. Upgrade the underlying logstash collector hosts to something like 12GB, and increase the logstash JVM to 4GB
  2. Move away from colocating opensearch on the logstash collector hosts, freeing up 4GB per VM. Increase the logstash JVM to 4GB
  3. Provision more logstash collector nodes / scale out
  4. Explore dedicated bare metal collector hosts as a next logical step for greater cpu/ram commitment (logstash hosts at 12GB+ are a large consumer of Ganeti resources, and AIUI they are already slow to live-migrate due to their size/utilization)
  5. Explore splitting logstash off and hosting it in k8s

Event Timeline

herron triaged this task as Medium priority. Nov 2 2023, 7:45 PM
herron created this task.

Personally I'm for doing option 1 right away, seeing how we handle the next log spike(s), and going from there. One of the main upsides from my view is that this option changes the fewest things, essentially only the logstash JVM size, as opposed to shuffling services around or introducing new host variants.

Longer term I'm for options 4 and 5, since the collector footprint is growing to a size where Ganeti VMs seem like less of a good fit.

Would we have enough Ganeti resources to bump all the VMs to 12GB? If not, as a test we could bump one and see how it fares compared to the rest. Alternatively, option 2 seems attractive to me too, since we wouldn't need to resize anything.

> Would we have enough Ganeti resources to bump all the VMs to 12GB?

Yes, according to https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management we have resources for the additional 24G (6x4G) in each site.

Trying option 1 seems like a good way to start addressing the memory sizing issues. Note that we may want to adjust logstash tuning as well afterwards.
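For later reference, the tuning knobs that pair with a larger heap mostly live in logstash.yml; the values below are illustrative placeholders, not recommendations:

  # logstash.yml (illustrative values only)
  pipeline.workers: 8        # defaults to the number of CPU cores
  pipeline.batch.size: 250   # larger batches trade heap for throughput
  pipeline.batch.delay: 50   # ms to wait while filling a batch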

I wish there was a good way to do some local stress tests.
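(A rough local approximation might be the generator input feeding whatever filters we want to exercise, watching GC while it runs; the event count, message, and invocation below are arbitrary:)

  # stress.conf -- run with something like: /usr/share/logstash/bin/logstash -f stress.conf
  input {
    generator {
      count   => 5000000                     # number of synthetic events (arbitrary)
      message => "synthetic log line for load testing"
    }
  }
  # real filters of interest would go here
  output {
    stdout { codec => dots }                 # one dot per event; cheap output for throughput tests
  }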

> Would we have enough Ganeti resources to bump all the VMs to 12GB?
>
> Yes, according to https://grafana.wikimedia.org/d/508r1Jz4z/ganeti-capacity-management we have resources for the additional 24G (6x4G) in each site.

SGTM! cc @Muehlenhoff as heads up

> SGTM! cc @Muehlenhoff as heads up

SGTM. If we expect further growth beyond the 12G, let's consider moving these to dedicated machines.

Thanks for the input everyone! Sounds like we have a consensus on option 1. I'll get started with rolling reboots of the collector VMs to bring them to 12GB of memory, then upload a patch for the JVMs and go from there.
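(For the record, the per-VM bump is essentially a Ganeti backend-parameter change plus a restart; the hostname is a placeholder and the exact parameter names/units depend on the Ganeti version, and in practice this goes through the usual downtime/reboot procedure:)

  # on the ganeti master for the cluster (illustrative)
  gnt-instance modify -B maxmem=12288,minmem=12288 logstash1xxx.eqiad.wmnet   # values in MiB
  gnt-instance reboot logstash1xxx.eqiad.wmnet                                # takes effect after restart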

Mentioned in SAL (#wikimedia-operations) [2023-11-07T18:30:21Z] <herron> performing rolling memory increase on logstash collector VMs T350434

Change 972456 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: increase heap to 4g

https://gerrit.wikimedia.org/r/972456

Change 972456 merged by Herron:

[operations/puppet@production] logstash: increase heap to 4g

https://gerrit.wikimedia.org/r/972456

lmata changed the task status from Open to Stalled. Nov 8 2023, 3:41 PM
lmata moved this task from Inbox to Prioritized on the Observability-Logging board.
lmata subscribed.

Stalling as we're monitoring.

herron claimed this task.

We're in a stable state with option 1 outlined in the description (increase heap to 4g) completed. Transitioning to resolved.

herron renamed this task from "Logstash collector tuning" to "Increase logstash collector heap size". Dec 7 2023, 5:56 PM