
kafka / logstash / elasticsearch lag monitoring and alerting
Open, Needs Triage · Public

Description

Today we experienced an issue where, in Kibana, it looked like periods of (recent) time had only very few logs; upon waiting and reloading, the levels came back to normal, which seems to indicate some sort of indexing lag.

We should monitor such lag (really, lag across the whole logging pipeline) and alert when it goes outside acceptable levels.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jan 21 2019, 3:01 PM

It was characterized by "holes" in MW log events:

[screenshot: Kibana histogram with gaps in recent log volume]

At some point (20 minutes later?) these holes disappeared.

Also of note: at the moment we're using the same logstash group_id for the two kafka inputs we have into logstash (namely input/kafka/rsyslog-shipper and input/kafka/rsyslog-udp-localhost), with one consumer thread each. Consuming from kafka might be one of the bottlenecks too; at the moment we're effectively consuming from a single logstash host (I've added a new panel to the logstash grafana dashboard to show this). A sketch of such an input stanza is below.
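
As an illustration only (broker, topic and thread count are placeholder values, not the production ones; the actual change is applied through puppet), a minimal kafka input stanza with an explicit consumer_threads setting looks like this:

  input {
    kafka {
      bootstrap_servers => "kafka-broker1001:9092"  # placeholder broker
      topics            => ["rsyslog-shipper"]      # placeholder topic
      group_id          => "logstash"               # group_id shared across inputs
      consumer_threads  => 3                        # ideally one thread per partition
      codec             => "json"
    }
  }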

Change 485812 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: set consumer_threads for kafka input

https://gerrit.wikimedia.org/r/485812

Mentioned in SAL (#wikimedia-operations) [2019-01-22T13:55:09Z] <godog> bump logstash kafka consumer threads - T214309

Change 485812 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: set consumer_threads for kafka input

https://gerrit.wikimedia.org/r/485812

Also of note: despite having three partitions per topic, only one of the brokers is receiving most of the messages. This is a problem in itself, and it also causes only one of the logstash hosts to consume all the messages while the other two consume none. I believe this imbalance is also what is causing the lag, since messages are processed by a single logstash instance only. Per-partition end offsets can confirm the skew, see the sketch below.
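
For example (broker address and topic name are placeholders), the latest offset of each partition shows whether almost all messages are landing on a single partition:

  # --time -1 requests the latest offset of each partition; a heavily skewed
  # distribution means the producer keeps writing to the same partition
  kafka-run-class.sh kafka.tools.GetOffsetShell \
    --broker-list kafka-broker1001:9092 \
    --topic rsyslog-shipper \
    --time -1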

Change 485833 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: enable auto partitions when producing to kafka

https://gerrit.wikimedia.org/r/485833

Change 485833 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: enable auto partitions when producing to kafka

https://gerrit.wikimedia.org/r/485833
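
For reference, a minimal sketch of an rsyslog omkafka action with automatic partitioning enabled (broker, topic and template names are placeholders; the real configuration is managed via puppet):

  module(load="omkafka")
  action(
    type="omkafka"
    broker=["kafka-broker1001:9092"]   # placeholder broker
    topic="rsyslog-shipper"            # placeholder topic
    partitions.auto="on"               # spread messages across partitions
    template="syslog_json"             # placeholder template
  )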

Mentioned in SAL (#wikimedia-operations) [2019-01-22T15:14:53Z] <godog> turn on partitions.auto for rsyslog output to kafka - T214309

herron added a subscriber: herron. · Jan 22 2019, 3:49 PM

Messages are now spread amongst the brokers as expected after https://gerrit.wikimedia.org/r/485833 and I believe the immediate issue (i.e. the lag) has been resolved. However, I'm leaving this task open for the general issue of monitoring / alerting on both kafka consumer lag and ingestion/indexing lag; one way to inspect consumer lag by hand is sketched below.
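
As a starting point (the group name and broker address are assumptions, not necessarily what production uses), the standard Kafka CLI reports per-partition lag for a consumer group:

  # the LAG column is the latest offset minus the committed offset, per partition
  kafka-consumer-groups.sh \
    --bootstrap-server kafka-broker1001:9092 \
    --describe \
    --group logstash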

fgiunchedi renamed this task from "logstash / elasticsearch indexing lag" to "kafka / logstash / elasticsearch lag monitoring and alerting". · Jan 22 2019, 4:52 PM