Following the theme of the parent task, I spent some time today experimenting with a performance row in the logstash cluster health dashboard at https://grafana-rw.wikimedia.org/d/oXH_v3rWk/logstash-cluster-health?forceLogin&orgId=1&refresh=30s&from=now-30d&to=now
One of the metrics added there is merge pressure, which seems worth a closer look. At present the logging clusters appear to spend between 30% and 50% of their total merge time in a throttled state.
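For reference, a minimal sketch of how a throttled-merge fraction like this could be derived outside the dashboard. It assumes two samples of the `GET /_stats/merge` response body, taken some time apart, and uses the cumulative `total_time_in_millis` and `total_throttled_time_in_millis` counters from the `merges` section. The function name and the sample values are hypothetical, and the dashboard panel itself may compute the figure differently.

```python
def throttled_merge_fraction(earlier: dict, later: dict) -> float:
    """Fraction of merge time spent throttled between two stats samples.

    Both arguments are assumed to be parsed `GET /_stats/merge` bodies.
    The counters are cumulative, so the window is the difference between
    the later and the earlier sample.
    """
    before = earlier["_all"]["total"]["merges"]
    after = later["_all"]["total"]["merges"]

    merge_ms = after["total_time_in_millis"] - before["total_time_in_millis"]
    throttled_ms = (
        after["total_throttled_time_in_millis"]
        - before["total_throttled_time_in_millis"]
    )

    # No merge activity in the window (or counters reset): report zero
    # rather than dividing by zero or returning a negative ratio.
    if merge_ms <= 0:
        return 0.0
    return throttled_ms / merge_ms


# Hypothetical sample data, not real cluster values.
sample_t0 = {"_all": {"total": {"merges": {
    "total_time_in_millis": 1000,
    "total_throttled_time_in_millis": 300,
}}}}
sample_t1 = {"_all": {"total": {"merges": {
    "total_time_in_millis": 2000,
    "total_throttled_time_in_millis": 700,
}}}}

fraction = throttled_merge_fraction(sample_t0, sample_t1)
```

Working from deltas rather than the raw counters keeps the result comparable to a windowed dashboard view, since the raw counters accumulate over the lifetime of the shards.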
Creating this task to look into what can be done to reduce the merge throttling/pressure.
Some ideas offhand:
- T392092: Review logging index refresh_intervals (tested -- did not change performance under current load)
- Tune translog.durability (async)
- Increase merge threads
- Tune index default merge policies
- Spread out the ingest load more (T391687)
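As a rough illustration of where the translog, merge thread, and merge policy ideas above would be applied, the sketch below builds a settings body that could be sent to an index `_settings` endpoint or placed in an index template. The setting names (`index.translog.durability`, `index.merge.scheduler.max_thread_count`, `index.merge.policy.segments_per_tier`) are existing index settings; the function name and the numeric values are placeholders for illustration only, not tested recommendations for these clusters.

```python
import json


def build_merge_tuning_settings(merge_threads: int, segments_per_tier: int) -> dict:
    """Build an index settings body covering the tuning ideas listed above.

    The argument values are supplied by the caller; nothing here is a
    recommended value.
    """
    if merge_threads < 1:
        raise ValueError("merge_threads must be at least 1")
    if segments_per_tier < 2:
        raise ValueError("segments_per_tier must be at least 2")

    return {
        "index": {
            # Acknowledge writes before the translog is fsynced. This trades
            # durability of the most recent writes for less fsync pressure.
            "translog": {"durability": "async"},
            "merge": {
                # Number of concurrent merge threads per shard.
                "scheduler": {"max_thread_count": merge_threads},
                # Allowed segments per tier; higher values mean fewer merges
                # but more segments to search.
                "policy": {"segments_per_tier": segments_per_tier},
            },
        }
    }


# Placeholder values, chosen only to show the shape of the payload.
settings_body = build_merge_tuning_settings(merge_threads=2, segments_per_tier=12)
payload = json.dumps(settings_body, indent=2)
```

Each setting can be changed independently, so it is probably easiest to apply one at a time and watch the merge pressure panel before moving to the next.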
