During the investigation of https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20190208-logstash-mediawiki it became clear that the performance of a single Logstash instance isn't well understood (e.g. how many logs/s can it process with acceptable performance?). We'll need to investigate Logstash performance further and tune it if needed; this information will also drive the rate-limit discussion in T215900: Introduce logging pipeline input ratelimit
|operations/puppet : production||hieradata: set kafka-logging default to 6 partitions|
|operations/puppet : production||hieradata: bump kafka-logging default partitions|
It has been observed that during times of high Kafka traffic (i.e. when a backlog develops because Logstash can't keep up) the CPU on logstash hosts isn't maxed out: typically only one thread uses close to one core and the rest are mostly idle. That suggests to me that one limiting factor at the moment might be Kafka consumer parallelism: we have three partitions per topic by default and three logstash ingester hosts per site, so under normal circumstances each host consumes a single partition.
I think we should bump Kafka partitions for a subset of topics to, say, 32 or 64. This way we'll guarantee that each logstash host gets multiple partitions, and in case of overload we can add more logstash hosts, which will also receive multiple partitions each.
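To illustrate the reasoning, here is a minimal sketch (not Kafka's actual assignor; host names are hypothetical) of distributing a topic's partitions round-robin across the consumer instances in a group:

```python
# Sketch of round-robin partition assignment across a consumer group,
# to show why more partitions per topic gives each host more parallelism.
def assign_partitions(num_partitions, consumers):
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

hosts = ["logstash1007", "logstash1008", "logstash1009"]  # hypothetical names
# With the current default of 3 partitions, each host consumes exactly one:
print({h: len(ps) for h, ps in assign_partitions(3, hosts).items()})
# With 32 partitions, each host consumes 10-11, and adding a fourth host
# would still leave every consumer with multiple partitions:
print({h: len(ps) for h, ps in assign_partitions(32, hosts).items()})
```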
I did a bunch of tests today in codfw to assess the limiting factors for ingestion speed.
TL;DR is that it seems to me we're essentially limited by GC (and thus CPU) speed in this case, at a maximum of ~1.1k logs/s per instance.
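Back-of-the-envelope, assuming one instance per ingester host (three hosts per site, as noted above), this measured ceiling translates into a rough per-site ingestion limit:

```python
# Rough per-site ingestion ceiling from the measured per-instance limit.
per_instance = 1100          # logs/s, measured ceiling from the tests
instances_per_site = 3       # assumption: one instance per ingester host
print(per_instance * instances_per_site)  # logs/s per site, roughly
```

This is the kind of number that should feed into the rate-limit discussion in T215900.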
Conditions and tests performed:
- Left only one Logstash instance running, to make effects easier to gauge
- Increased the heap from 1G to 2G and then 3G, with no noticeable effect
- Changed the GC from ParNew to openjdk 8's default, with no noticeable effect
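For reference, the heap and GC variations above correspond roughly to the following in Logstash's config/jvm.options (a sketch; the exact defaults shipped on these hosts may differ):

```
# Heap sizing tried in the tests above (1G default, then 2G and 3G):
-Xms2g
-Xmx2g

# Logstash ships with CMS, which uses ParNew for the young generation:
-XX:+UseConcMarkSweepGC
# Commenting the flag above out falls back to openjdk 8's default
# collector (Parallel GC); no extra flag is required for that.
```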
Unsurprisingly, the heap contains for the most part char and byte arrays. Output from jhat's heap histogram, sorted by total size:
Class                                                                  Instance Count  Total Size
class [C                                                                       447023   423314508
class [B                                                                       399817   184785494
class [Lorg.jruby.ext.thread_safe.jsr166e.ConcurrentHashMapV8$Node;            136624    39258368
class [Ljava.lang.Object;                                                      380702    20658344
class [Lorg.jruby.runtime.builtin.IRubyObject;                                 266886    18537408
class [I                                                                        80343    15253556
class org.jruby.RubyFixnum                                                     416644    13332608
class org.jruby.RubyString                                                     357036    12853296
class [Ljava.util.HashMap$Node;                                                 15354    11845120
class org.jruby.util.ByteList                                                  311235    11204460
class org.jruby.ext.atomic.AtomicReferenceLibrary$JRubyReference               283150     9060800
class org.jruby.ext.thread_safe.jsr166e.ConcurrentHashMapV8$Node               283150     7928200
class org.apache.kafka.clients.consumer.ConsumerRecord                         106792     7261856
class org.jruby.ext.thread_safe.jsr166e.ConcurrentHashMapV8                    136624     7104448
class org.joni.ByteCodeMachine                                                  26322     4843248
class org.jruby.MetaClass                                                       15072     4657248
class java.lang.String                                                         386101     4633212
class org.jruby.RubyHash$RubyHashEntry                                         104248     4586912
class org.jruby.ext.thread_safe.JRubyCacheBackendLibrary$JRubyCacheBackend     136624     4371968
class org.jruby.RubyArray                                                      105327     4318407
class org.jruby.IncludedModuleWrapper                                           13521     4177989
class java.util.WeakHashMap$Entry                                               76205     3962660
class [Lorg.jruby.internal.runtime.methods.DynamicMethod$NativeCall;            38709     3716064
class org.jruby.runtime.Binding                                                 49257     3546504
class org.jruby.ext.thread_safe.jsr166e.LongAdder                              136624     2732480
class org.jruby.RubyRange                                                       64604     2648764
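To put the histogram in perspective, a quick check (sizes copied verbatim from the jhat output above) shows the two array classes alone dominate the listed heap:

```python
# Sanity check on the jhat histogram: char[] ([C) and byte[] ([B)
# together account for the bulk of the bytes listed.
char_arrays = 423314508   # class [C, Total Size in bytes
byte_arrays = 184785494   # class [B, Total Size in bytes
total = char_arrays + byte_arrays
print(f"{total} bytes (~{total / 2**20:.0f} MiB) in char[]/byte[] alone")
# → 608100002 bytes (~580 MiB) in char[]/byte[] alone
```

That is over half a gigabyte of string/byte data, which explains why bumping the heap from 1G to 2G or 3G doesn't help much: the garbage is produced (and collected) at the same rate regardless.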
With the Elastic 7 upgrade coming up, we'll need to perform these tests again to see whether there have been significant improvements. Other options to increase ingestion capacity:
- add more logstash instances (either on same hosts or additional hosts)
- run on faster CPUs
- dig deeper into the logstash configuration/heap to understand where the objects/garbage come from, and optimize to create less garbage