
Logstash missing most messages from mediawiki (Aug 2019)
Closed, ResolvedPublic


image.png (445×1 px, 66 KB)

01:24 <+icinga-wm> PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic=udp_localhost-info
01:24 <+icinga-wm> gging-eqiad&var-topic=All&var-consumer_group=All
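The alert above fires when Burrow observes the consumer groups falling behind on the `udp_localhost-info` topic. As a rough illustration (the function names and numbers here are invented, not from the monitoring stack), consumer lag per partition is simply the gap between the broker's latest offset and the group's last committed offset:

```python
# Sketch of how per-partition consumer lag is derived. All names and
# offset values below are illustrative, not taken from the incident.

def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for one partition; 0 when the consumer is fully caught up."""
    return max(0, log_end_offset - committed_offset)

def total_lag(offsets: dict) -> int:
    """Sum lag across partitions; offsets maps partition -> (end, committed)."""
    return sum(partition_lag(end, committed) for end, committed in offsets.values())

# Hypothetical snapshot of partitions 0-2 for the "logstash" group.
offsets = {
    0: (1_500_000, 1_200_000),  # 300000 behind
    1: (1_480_000, 1_480_000),  # caught up
    2: (900_000, 650_000),      # 250000 behind
}
print(total_lag(offsets))  # 550000
```

A lag that keeps growing across checks (rather than a single large snapshot) is what distinguishes a stalled consumer from a temporary burst of traffic.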

See T228089 for a previous occurrence of this.

Event Timeline

Urbanecm triaged this task as Unbreak Now! priority.Aug 20 2019, 11:29 PM

Mentioned in SAL (#wikimedia-operations) [2019-08-20T23:42:54Z] <Urbanecm> Evening SWAT aborted due to no logs logged for some period of time (T230847), no patches were reverted

fgiunchedi lowered the priority of this task from Unbreak Now! to Medium.Aug 21 2019, 8:40 AM
fgiunchedi subscribed.

Downgrading to Medium since the situation has stabilized and returned to normal at 23:40 UTC; the cause is still pending investigation.

Looks like a Logstash consumer failed, according to the Kafka logs on logstash1010:

[2019-08-20 23:09:36,042] INFO [GroupCoordinator 1004]: Member logstash-0-b1f8c6cc-31aa-44bb-98df-349ff911179e in group logstash has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:36,042] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash with old generation 4127 (__consumer_offsets-49) (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:39,151] INFO [GroupMetadataManager brokerId=1004] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2019-08-20 23:09:46,496] INFO [GroupCoordinator 1004]: Member logstash-0-acddeab4-e90a-40f7-a4ff-0ae27c863d89 in group logstash has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:47,215] INFO [GroupCoordinator 1004]: Stabilized group logstash generation 4128 (__consumer_offsets-49) (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:47,218] INFO [GroupCoordinator 1004]: Assignment received from leader for group logstash for generation 4128 (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:58,916] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash with old generation 4128 (__consumer_offsets-49) (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:59,674] INFO [GroupCoordinator 1004]: Member logstash-0-9f2023d4-d5fb-49d1-b9e2-d5425fdf2d90 in group logstash-codfw has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:59,674] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash-codfw with old generation 2450 (__consumer_offsets-37) (kafka.coordinator.group.GroupCoordinator)
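The excerpt shows the group coordinator repeatedly removing members and kicking off rebalances, which is the signature of a consumer that keeps dying or timing out rather than a broker problem. A small sketch of spotting that churn from such log lines (the regexes assume the exact phrasing in the excerpt above; the sample lines are abbreviated copies of it):

```python
import re
from collections import Counter

# Count "member failed" and "preparing to rebalance" events per consumer
# group from GroupCoordinator log lines. The patterns assume the phrasing
# shown in this task's log excerpt.
FAILED = re.compile(r"Member \S+ in group (\S+) has failed")
REBALANCE = re.compile(r"Preparing to rebalance group (\S+) with old generation (\d+)")

def rebalance_churn(lines):
    failed, rebalances = Counter(), Counter()
    for line in lines:
        m = FAILED.search(line)
        if m:
            failed[m.group(1)] += 1
            continue
        m = REBALANCE.search(line)
        if m:
            rebalances[m.group(1)] += 1
    return failed, rebalances

sample = [
    "INFO [GroupCoordinator 1004]: Member logstash-0-b1f8 in group logstash has failed, removing it from the group",
    "INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash with old generation 4127 (__consumer_offsets-49)",
    "INFO [GroupCoordinator 1004]: Member logstash-0-9f20 in group logstash-codfw has failed, removing it from the group",
]
failed, rebalances = rebalance_churn(sample)
print(failed["logstash"], rebalances["logstash"], failed["logstash-codfw"])  # 1 1 1
```

A few member failures per day are routine; the same group failing and rebalancing several times within a minute, as above, is worth alerting on.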
Krinkle renamed this task from Logstash gets significantly lower number of messages from mediawiki to Logstash missing most messages from mediawiki (Aug 2019).Dec 17 2019, 10:18 PM

I've created an initial draft at

Some open questions still:

  • Did we lose the logs from these 30 minutes, or were they recoverable (and recovered) from Kafka?
  • Could this "Kafka failure" happen again? Did we detect it early enough?
  • Was the mitigation trivial / automatable?

Thanks @Krinkle, to answer your questions:

  • Yes, the logs were in Kafka and were fully recovered once the consumers caught up
  • We did detect it early enough, I think; note it was a Logstash failure, not a Kafka failure
  • Mitigation was to restart Logstash IIRC; it could be automated, but we haven't seen another occurrence and we're moving to Logstash 7 this quarter, so IMHO this is good to close
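The "could be automated" mitigation might look roughly like the sketch below: restart Logstash only when lag is sustained across several consecutive checks. The threshold, sample window, and `systemctl` invocation are all assumptions for illustration; nothing like this was actually deployed.

```python
import subprocess

# Hypothetical automation sketch. LAG_THRESHOLD and the service name are
# invented for illustration, not values from the incident.
LAG_THRESHOLD = 1_000_000

def should_restart(lag_samples, threshold=LAG_THRESHOLD):
    """Restart only when every recent sample exceeds the threshold,
    i.e. the lag is sustained rather than a momentary spike."""
    return bool(lag_samples) and all(lag > threshold for lag in lag_samples)

def restart_logstash():
    # Guarded by should_restart(); shown only to illustrate the action.
    subprocess.run(["systemctl", "restart", "logstash"], check=True)

print(should_restart([1_200_000, 1_500_000, 2_000_000]))  # True: sustained lag
print(should_restart([1_200_000, 50_000]))                # False: recovering
```

Requiring every sample to exceed the threshold avoids restart loops during normal traffic bursts, at the cost of reacting a few check intervals later.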

I'm not even convinced an incident report is in order in this case, but I'm willing to be convinced otherwise.

Thx, I've updated the incident page. I'll close this now then.