
Better understanding of Logstash performance
Open, Stalled · Needs Triage · Public

Description

During the investigation of https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20190208-logstash-mediawiki it became clear that the performance of a single Logstash instance isn't well understood (e.g. how many logs/s can be processed with acceptable performance?). We'll need to investigate Logstash performance further and tune it if needed; this information will also drive the rate limit discussion in T215900: Introduce logging pipeline input ratelimit.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 12 2019, 2:28 PM

It has been observed that during times of high Kafka traffic (i.e. when a backlog develops because Logstash can't keep up) the CPU on the logstash hosts isn't maxed out: typically only one thread uses close to one core and the rest are mostly idle. That suggests to me that one limiting factor at the moment might be Kafka consumer parallelism. Specifically, we have three partitions per topic by default and three logstash ingester hosts per site, so under normal circumstances each host consumes a single partition.

I think we should bump the Kafka partitions for a subset of topics to, say, 32 or 64. This way each logstash host is guaranteed to get multiple partitions, and in case of overload we can add more logstash hosts, which will also receive multiple partitions.

Prior art for bumping topic partitions is T213081: Consider increasing kafka logging topic partitions
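
For illustration, this is roughly what the partition bump looks like with Kafka's stock admin CLI. A sketch only: the broker address and exact topic name are hypothetical, older Kafka versions need --zookeeper instead of --bootstrap-server, and the actual change was applied via puppet/hieradata (see the patches below).

  # Increase one of the udp_localhost topics to 6 partitions (counts can only grow, never shrink)
  kafka-topics.sh --bootstrap-server BROKER:9092 --alter \
      --topic udp_localhost-info --partitions 6
  # Verify the new partition count and assignment
  kafka-topics.sh --bootstrap-server BROKER:9092 --describe --topic udp_localhost-info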

fgiunchedi moved this task from Backlog to Up next on the User-fgiunchedi board. · Oct 9 2019, 11:31 PM

Change 543873 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: bump kafka-logging default partitions

https://gerrit.wikimedia.org/r/543873

fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board. · Fri, Oct 18, 1:40 PM

Change 543873 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: bump kafka-logging default partitions

https://gerrit.wikimedia.org/r/543873

Mentioned in SAL (#wikimedia-operations) [2019-10-28T08:48:13Z] <godog> bump udp_localhost kafka-logging topics to 6 partitions and roll-restart logstash and rsyslog - T215904

Mentioned in SAL (#wikimedia-operations) [2019-10-31T09:23:03Z] <godog> temporarily stop logstash on logstash2006 to test performance with two ingesters only - T215904

Mentioned in SAL (#wikimedia-operations) [2019-10-31T09:37:52Z] <godog> temporarily stop logstash on logstash2005 to test performance with two ingesters only - T215904

fgiunchedi added a comment. · Edited · Thu, Oct 31, 11:43 AM

I did a bunch of tests today in codfw to assess the limiting factors for ingestion speed.

TL;DR: it seems to me we're essentially limited by GC (and thus CPU) speed in this case, topping out at ~1.1k logs/s per instance.

Conditions and tests performed (the JVM changes are sketched after this list):

  • Left only one Logstash instance up, so the effects are easier to gauge
  • Increased the heap from 1G to 2G and then 3G, with no noticeable effect
  • Changed the GC from ParNew to openjdk 8's default, again with no noticeable effect
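
For reference, a sketch of what the heap/GC changes above look like in Logstash's config/jvm.options (the heap values are the ones tested; the CMS/ParNew flags are the stock entries, and commenting them out falls back to openjdk 8's default parallel collector):

  -Xms2g
  -Xmx2g
  ## commented out to fall back to the JDK 8 default collector:
  # -XX:+UseConcMarkSweepGC
  # -XX:CMSInitiatingOccupancyFraction=75
  # -XX:+UseCMSInitiatingOccupancyOnly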

Unsurprisingly, the heap consists mostly of char and byte arrays; sorted by total size (output from jhat's heap histogram):

Class	Instance Count	Total Size
class [C	447023	423314508
class [B	399817	184785494
class [Lorg.jruby.ext.thread_safe.jsr166e.ConcurrentHashMapV8$Node;	136624	39258368
class [Ljava.lang.Object;	380702	20658344
class [Lorg.jruby.runtime.builtin.IRubyObject;	266886	18537408
class [I	80343	15253556
class org.jruby.RubyFixnum	416644	13332608
class org.jruby.RubyString	357036	12853296
class [Ljava.util.HashMap$Node;	15354	11845120
class org.jruby.util.ByteList	311235	11204460
class org.jruby.ext.atomic.AtomicReferenceLibrary$JRubyReference	283150	9060800
class org.jruby.ext.thread_safe.jsr166e.ConcurrentHashMapV8$Node	283150	7928200
class org.apache.kafka.clients.consumer.ConsumerRecord	106792	7261856
class org.jruby.ext.thread_safe.jsr166e.ConcurrentHashMapV8	136624	7104448
class org.joni.ByteCodeMachine	26322	4843248
class org.jruby.MetaClass	15072	4657248
class java.lang.String	386101	4633212
class org.jruby.RubyHash$RubyHashEntry	104248	4586912
class org.jruby.ext.thread_safe.JRubyCacheBackendLibrary$JRubyCacheBackend	136624	4371968
class org.jruby.RubyArray	105327	4318407
class org.jruby.IncludedModuleWrapper	13521	4177989
class java.util.WeakHashMap$Entry	76205	3962660
class [Lorg.jruby.internal.runtime.methods.DynamicMethod$NativeCall;	38709	3716064
class org.jruby.runtime.Binding	49257	3546504
class org.jruby.ext.thread_safe.jsr166e.LongAdder	136624	2732480
class org.jruby.RubyRange	64604	2648764
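
For completeness, the histogram above can be reproduced with the standard JDK 8 tooling along these lines (a sketch; the pid and file path are illustrative):

  # dump the live heap of the running Logstash JVM
  jmap -dump:live,format=b,file=/tmp/logstash.hprof <logstash-pid>
  # browse the dump on http://localhost:7000/ ; the class histogram is behind "Show heap histogram"
  jhat -port 7000 /tmp/logstash.hprof
  # quicker, dump-free alternative:
  jmap -histo:live <logstash-pid> | head -30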

With the Elastic 7 upgrade coming up, we'll need to repeat these tests and see whether there have been significant improvements.

Solutions include:

  • add more Logstash instances (either on the same hosts or on additional hosts)
  • run on faster CPUs
  • dig deeper into the Logstash configuration/heap to understand where the objects/garbage come from, and optimize to create less garbage

Change 547519 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: set kafka-logging default to 6 partitions

https://gerrit.wikimedia.org/r/547519

fgiunchedi changed the task status from Open to Stalled. · Mon, Nov 4, 1:42 PM

I'm stalling this until the Elastic 7 cluster is online in T235891: Ingest production logs with ELK7 and we can test Logstash 7 as well.

fgiunchedi moved this task from Doing to Blocked on the User-fgiunchedi board. · Mon, Nov 4, 1:42 PM

Change 547519 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: set kafka-logging default to 6 partitions

https://gerrit.wikimedia.org/r/547519