Yesterday during an incident the logging pipeline became severely backlogged: input on Kafka was on the order of ~25k logs/s, while Logstash ingestion maxed out at ~4k/s on Logstash 5 and ~7.5k/s on Logstash 7. Normal volume is ~1.5k logs/s; for capacity purposes I think we should have at least 10x normal volume (i.e. ~15k/s) available.
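For reference, a back-of-envelope capacity calculation based on the figures above. This treats the measured rates as per-instance throughput, which is an assumption; if they were aggregate figures the instance counts would scale accordingly:

```python
import math

NORMAL_RATE = 1_500        # logs/s, typical volume
TARGET_HEADROOM = 10       # we want 10x normal volume available
INCIDENT_RATE = 25_000     # logs/s observed on Kafka during the incident

# Measured ingestion ceilings; assumed here to be per-instance figures.
PER_INSTANCE = {"logstash5": 4_000, "logstash7": 7_500}

target = NORMAL_RATE * TARGET_HEADROOM  # 15,000 logs/s
for version, rate in PER_INSTANCE.items():
    print(f"{version}: {math.ceil(target / rate)} instances for 10x normal, "
          f"{math.ceil(INCIDENT_RATE / rate)} to absorb the incident rate")
```

Under those assumptions that works out to 2 Logstash 7 (or 4 Logstash 5) instances for the 10x target, and 4 (or 7) to absorb yesterday's incident rate.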
Possible solutions and their tradeoffs (non-exhaustive list):
- Add more Logstash+ES (i.e. "frontend") instances to ingest data. Easy in the short term (more VMs, or bare metal).
- Investigate the general performance of Logstash ingestion (cf. T215904: Better understanding of Logstash performance), including Kafka consumption and producing to ES. Unknown complexity, but something to be done eventually (see the Kafka consumption sketch after this list).
- Investigate the ingestion performance of ES in isolation; IOW, writing to ES could be the bottleneck too (see the bulk-indexing sketch after this list).
- Throttle messages first, then ingest. MW is the biggest producer of logs and the one most prone to spamming. Right now we throttle within Logstash and that has helped; however, I suspect it kicks in too late in cases like log spamming. Applying throttling before ingestion would yield better results. There's a bunch of unknown unknowns (e.g. does a Logstash pipeline doing only throttling, with Kafka input and output, have sufficient performance?), but it could be a reasonable stopgap and is generally applicable to logspam situations (a minimal token-bucket sketch follows this list).
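To put numbers behind the Kafka-consumption side of the investigation, a minimal throughput probe could look like the sketch below, measuring raw consumption with no filters and no ES output. The topic name and broker address are placeholders, not our actual config:

```python
import time
from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder topic/brokers; substitute the real logging topic and Kafka brokers.
consumer = KafkaConsumer(
    "logstash-logs",
    bootstrap_servers=["kafka1001:9092"],
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,  # stop iterating after 10s of silence
)

count = 0
start = time.monotonic()
for _ in consumer:        # raw consumption only: no parsing, no filtering
    count += 1
    if count >= 1_000_000:  # sample a fixed number of messages
        break
elapsed = time.monotonic() - start
print(f"consumed {count} messages in {elapsed:.1f}s ({count / elapsed:,.0f} msg/s)")
```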
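Similarly, ES write performance can be probed in isolation by bulk-indexing synthetic documents, bypassing Logstash entirely. A sketch using elasticsearch-py; the host, index name, and document shape are assumptions (and ES 5 would additionally need a `_type` field in each action):

```python
import time
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk  # pip install elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def synthetic_docs(n):
    """Generate n log-like documents; shape should mimic real pipeline output."""
    for i in range(n):
        yield {
            "_index": "logstash-benchmark",  # throwaway index
            "_source": {"message": f"benchmark event {i}", "level": "INFO"},
        }

N = 100_000
start = time.monotonic()
ok, errors = bulk(es, synthetic_docs(N), chunk_size=1_000)
elapsed = time.monotonic() - start
print(f"indexed {ok} docs in {elapsed:.1f}s ({ok / elapsed:,.0f} docs/s)")
```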
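As for throttling before ingestion, the general shape would be a small Kafka-to-Kafka relay that rate-limits per producer channel before Logstash ever sees the messages. A minimal token-bucket sketch; the topic names, broker address, rate values, and the "channel" field are all placeholders/assumptions (in practice this could also be a Logstash pipeline with only a throttle filter, which is exactly the performance question raised above):

```python
import json
import time
from collections import defaultdict
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

RATE = 500     # messages per second allowed per channel (assumed value)
BURST = 1_000  # bucket capacity, i.e. tolerated burst size (assumed value)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(channel):
    """Classic token bucket: refill proportionally to elapsed time,
    spend one token per message."""
    b = buckets[channel]
    now = time.monotonic()
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False  # over the limit: drop (or divert to an overflow topic)

consumer = KafkaConsumer("logstash-raw", bootstrap_servers=["kafka1001:9092"])
producer = KafkaProducer(bootstrap_servers=["kafka1001:9092"])

for record in consumer:
    event = json.loads(record.value)
    # "channel" is an assumed field identifying the producer, e.g. a MW log channel.
    if allow(event.get("channel", "unknown")):
        producer.send("logstash-throttled", record.value)
```

The point of the bucket is that well-behaved channels pass through untouched while a spamming channel gets clamped to RATE msg/s after its burst allowance, long before it can backlog Logstash.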